This page is taken from the examples/live/ directory and refers to the code there.
Live Commentator Example đī¸
The Live Commentator is a dynamic conversational agent that actively drives the interaction, continuously generating commentary on input video and audio, asking questions, and guiding the user towards accomplishing tasks. The user can interact by asking questions, sharing a screen, or providing a camera feed. The Live Commentator seamlessly transitions between these different scenarios, adapting its commentary to the changing context. It utilizes Gemini's multimodal capabilities and asynchronous function calls to provide an engaging and responsive experience.
đ Running the commentator
We recommend the AI Studio version of the agent as it utilizes the echo cancellation built into the browser. Please see commentator_ais.py for instructions.
đŖī¸ Conversation Logic
The Live Commentator agent logic is fully contained in the LiveCommentator
class in commentator.py and is developed as a GenAI Processor. This makes it
easier to combine with other processors such as EventDetection or
RateLimitAudio.
The Live Commentator acts as a state machine, transitioning between different states based on user input, detected events in the video stream, and the model's responses. It orchestrates the flow of data between the user, the event detection processor, and the Gemini Live API, ensuring timely and relevant commentary.
The key to the LiveCommentator's behavior is its ability to interrupt its own commentary when notable new input is observed. This is achieved through a combination of:
- Event Detection: A separate
EventDetectionprocessor analyzes the video stream for significant events (e.g., a person appearing, a new object being shown). - Voice Activity Detection (VAD): Built into the Gemini Live API, VAD detects when the user is speaking, allowing the user to interrupt the commentary.
- Async Function Calls: The Gemini API supports non-blocking function
calls. The
LiveCommentatoruses this to schedule future commentary without interrupting the current utterance. This allows for a more natural and responsive experience. One async function call,wait_for_user, is returned by Gemini when the Live Commentator needs to wait for the user to answer a question or to do something. It waits for a set duration (currently 5 seconds) and resumes the commentary if no relevant input is received.
đ State Machine
The LiveCommentator operates based on the following states and transitions.
EVENT_DETECTION_(START|STOP) are only for this schema (you will not find them
in code) and signify a request from the event detection processor to start or
stop the commentary. All the other states and actions are present in the code.
---
title: Live Commentary State Machine
---
graph TD
A{OFF} -->|EVENT_DETECTION_START| B{TALKING};
B --> |EVENT_DETECTION_STOP| A
B -->|INTERRUPT|C{USER_IS_TALKING}
B -->|REQUEST_INTERRUPT|D(REQUESTING_INTERRUPTION)
B -->|WAIT_FOR_USER| H{WAITING_FOR_USER}
B -->|REQUEST_FROM_COMMENTATOR| E(REQUESTING_COMMENT);
B -->|REQUEST_FROM_USER| F(REQUESTING_RESPONSE);
D -->|INTERRUPT| G(INTERRUPTED_FROM_DETECTION);
D -->|STREAM_MEDIA_PART| B;
E -->|STREAM_MEDIA_PART| B;
F -->|STREAM_MEDIA_PART| B;
G -->|STREAM_MEDIA_PART| B;
E --> |REQUEST_INTERRUPT|D;
H --> |REQUEST_INTERRUPT|D;
H -->|REQUEST_FROM_COMMENTATOR| E;
C -->|STREAM_MEDIA_PART| B
subgraph legend
I{State}
J(Short-lived State)
end
style A fill:#888,stroke:#333,stroke-width:2px
style B fill:#888,stroke:#333,stroke-width:2px
style C fill:#888,stroke:#333,stroke-width:2px
style D fill:#888,stroke:#333,stroke-width:2px
style E fill:#888,stroke:#333,stroke-width:2px
style F fill:#888,stroke:#333,stroke-width:2px
style G fill:#888,stroke:#333,stroke-width:2px
style H fill:#888,stroke:#333,stroke-width:2px
âšī¸ States
- OFF: The commentator is inactive. No commentary is generated. The user can still ask questions, but the commentator is not automatically generating output.
- TALKING: The commentator is actively generating commentary or responding to the user. This is the "steady state" of operation.
- USER_IS_TALKING: The user has started speaking (detected by VAD). The commentator is waiting for the user to finish. The commentator stopped talking.
- REQUESTING_INTERRUPTION: An event has been detected that requires interrupting the current commentary. The commentator is about to send a request to the model to generate an interruptive comment. The commentator will keep talking while the request is sent.
- REQUESTING_COMMENT: The commentator is requesting a new comment from the model. This happens when the commentator is turned on or when a scheduled comment is due. This happens when the commentator will soon finish its current comment.
- REQUESTING_RESPONSE: The commentator is requesting a response to a user's question or input. This input is text-based. This is not used in the AI Studio version of the commentator (no text entry).
- INTERRUPTED_FROM_DETECTION: The commentary should be interrupted by an event and the system is waiting for the model to generate the new audio. The commentator keeps talking until it receives the response of the model generating the interruption.
- WAITING_FOR_USER: The commentator has finished speaking (perhaps it
asked a question) and is waiting for the user to respond or for an event to
occur. The commentator will only wait for a fixed amount of time; if the
user doesn't respond within
MAX_SILENCE_WAIT_FOR_USER_SEC, the commentator will resume commentating.
âļī¸ Actions
The CommentatorStateMachine uses the following Action enum to trigger state
transitions:
- TURN_ON: Activates the commentator. The model will begin generating
commentary. This action is triggered by the
start_commentatingasync function call. The async nature of the function is what enables the flexible scheduling of the commentary: it can interrupt, wait until the current comment is done, or just be silent and let the user talk. - TURN_OFF: Deactivates the current commentator. The user can still ask questions, but the commentator will not automatically generate commentaries.
- STREAM_MEDIA_PART: New audio has been received from the model. This
action updates the internal state related to the current generation request
(latency, audio duration, etc.) and typically transitions the state to
TALKING. - REQUEST_FROM_USER: A request has been received from the user (e.g., a question or command). This triggers a request to the model.
- REQUEST_INTERRUPT: The
EventDetectionprocessor has detected a significant event. This prepares for an interruption, but doesn't immediately stop the current audio. The state machine waits for the model to confirm the interruption. - REQUEST_FROM_COMMENTATOR: A scheduled comment is due. This triggers a request to the model to generate a new comment.
- INTERRUPT: The user has interrupted the commentator (via VAD) or an event has been detected and the model has confirmed the interruption. This immediately stops the current audio output.
- WAIT_FOR_USER: The model has decided to pause the commentary and wait
for the user to respond or provide visual input. This is triggered by the
wait_for_userasync function call.
đ State Transitions and Logic
The CommentatorStateMachine's update method defines the state transitions.
Here's a breakdown of the key transitions, illustrating how the commentator
responds to various events:
-
Starting Commentary (
OFF->TALKING):- The event detection detects people in front of the camera: it sends a "start commentating" request to the model.
- The model triggers the
start_commentatingfunction. - The
LiveCommentatorreceives the function call. - The state transitions to
TALKING. - A request is sent to the model to generate commentary
(
REQUEST_FROM_COMMENTATOR).
-
User Interrupts (
TALKING->USER_IS_TALKING->REQUESTING_RESPONSE):- The Gemini Live API's VAD detects the user speaking.
- The
LiveCommentatorreceives aninterruptedsignal. - The state transitions to
USER_IS_TALKING. - The current audio output is stopped.
- After a short delay (waiting for the user to finish their utterance),
the state transitions to
REQUESTING_RESPONSE, and a request is sent to the model to generate a response to the user.
-
Event Interrupts (
TALKING->REQUESTING_INTERRUPTION->INTERRUPTED_FROM_DETECTION->TALKING):- The
EventDetectionprocessor detects a significant event. - The
LiveCommentatorreceives aninterrupt_requestsignal. - The state transitions to
REQUESTING_INTERRUPTION. - A request is sent to the model to generate a comment about the detected
event (
REQUEST_INTERRUPT). The model's response confirms the interruption. - When the model confirms the interruption (by sending an
interruptedresponse), the state transitions toINTERRUPTED_FROM_DETECTION. - When the first audio part of the interrupting comment is received, the
LiveCommentatoryields theinterruptedresponse to the rate limit audio processor, stopping the previous audio stream. It starts playing the new audio straightaway and transitions toTALKING.
- The
-
Scheduled Comments (
TALKING->REQUESTING_COMMENT->TALKING):- The
LiveCommentatortracks the duration of the current commentary. - Before the current commentary finishes, a new comment is scheduled at a
time that would make the model output the content of the new comment
just after the current one is fully played. This scheduling is based on
an estimate of TTFT (see âąī¸ timing and latency section below). The
state is still in
TALKING. - When the scheduled time arrives, the state transitions to
REQUESTING_COMMENT. - A request is sent to the model to generate the next comment
(
REQUEST_FROM_COMMENTATOR). - When the first audio part of the new comment is received, the state
transitions to
TALKING.
- The
-
Waiting for User (
TALKING->WAITING_FOR_USER):- The
LiveCommentatorreceived await_for_userasync tool call from the model. - The state transitions to
WAITING_FOR_USER. - The current audio output is not interrupted. It is played until the end.
- Then a timer is started. If the user doesn't interrupt within
MAX_SILENCE_WAIT_FOR_USER_SEC, a new comment is scheduled. - If the user does interrupt by responding, the state transitions to
USER_IS_TALKINGand the logic from step 2 applies. - If an event is detected, the state transitions as in step 3.
- The
-
Stopping Commentary (
*->OFF):- The
EventDetectionprocessor detects that there is no one in front of the camera. It sends a "stop commentating" request to the model. - The
LiveCommentatorreceives a tool call cancellation from the model - The state transitions to
OFF. - Any scheduled comments are cancelled.
- The
âąī¸ Timing and Latency
The LiveCommentator keeps track of the latency of each generation request,
measured as the time from when the request is sent to the model, to when the
first audio part is received. The LiveCommentator uses a history of recent
latency values (also called "TTFT," or "time to first token," in the code) to
predict the latency of the next comment. This prediction is used to schedule
the next comment just before the current comment finishes, minimizing the gap
between utterances while still allowing for interruptions. The
predict_next_ttft() method calculates this predicted TTFT (using a mean-std
approach). The tentative_trigger_time() method calculates the time at which
the next comment should be triggered, based on the predicted TTFT and the
duration of the current audio.
âšī¸ Usage Notes
- The
LiveCommentatorexpects audio and video input on therealtimesubstream. Other substreams are used for control signals. - The
EventDetectionprocessor is crucial for responsive interruptions. Its configuration (prompt, sensitivity) significantly impacts the commentator's behavior. - The
RateLimitAudioprocessor is typically used after theLiveCommentatorto ensure the audio output is played back at the correct speed and can be interrupted properly.
đ License
This example is licensed under the Apache License, Version 2.0.