Real-time Processing
⭐ Real-time processing enables live audio/video streaming with bidirectional communication.
Overview
GenAI Processors supports two approaches for real-time processing:
- Gemini Live API (live_model.LiveProcessor): native bidirectional streaming with the Gemini Live API. It is efficient but less flexible, and it is Gemini-specific because it relies on a server-side implementation.
- Turn-based Real-time (realtime.LiveProcessor): a client-side, hackable alternative to the Gemini Live API that wraps any turn-based non-streaming model into a bidirectional streaming API.
This document focuses on Turn-based Real-time with realtime.LiveProcessor.
Turn-Based Real-time with realtime.LiveProcessor
When you want to build a voice agent using standard models (non-Live API), you
can use realtime.LiveProcessor to convert a turn-based model into a real-time
processor. It takes an infinite input stream, creates a rolling prompt from it
by cutting it at given times (e.g. when the user is done talking), and feeds
this prompt to the turn_processor to generate a response.
from genai_processors.core import genai_model
from genai_processors.core import realtime
model = genai_model.GenaiModel("gemini-2.0-flash")
realtime_proc = realtime.LiveProcessor(model)
How it Works
realtime.LiveProcessor manages a conversation loop:
- It uses window.RollingPrompt to maintain conversation history within a sliding window (by default it keeps parts up to duration_prompt_sec).
- It listens for signals like speech_to_text.StartOfSpeech and speech_to_text.EndOfSpeech (typically from a VAD or STT processor) to detect user speech and silence.
- It triggers a call to the turn_processor when the user finishes speaking or when a final transcription is available (depending on AudioTriggerMode), or when the client sends a content_api.end_of_turn() part.
- It supports interruption: if the user starts speaking while the model is generating a response, the generation is cancelled.
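The interruption behavior described above can be illustrated with plain asyncio, independently of the library. This is a minimal sketch and not the actual implementation of realtime.LiveProcessor: generate_response and conversation_loop are hypothetical names, events are simplified to (kind, payload) tuples, and the model is replaced by a coroutine that emits words slowly.

```python
import asyncio
from typing import Optional


async def generate_response(prompt: str, out: list) -> None:
    """Hypothetical stand-in for a turn-based model that streams words slowly."""
    for word in f"reply to {prompt}".split():
        await asyncio.sleep(0.1)
        out.append(word)


async def conversation_loop(events: list) -> list:
    """Starts a generation on 'end_of_speech' and cancels it on 'start_of_speech'."""
    log: list = []
    current: Optional[asyncio.Task] = None
    for kind, payload in events:
        if kind == "start_of_speech":
            if current is not None and not current.done():
                current.cancel()  # The user interrupted: drop the pending response.
                log.append("<interrupted>")
        elif kind == "end_of_speech":
            current = asyncio.create_task(generate_response(payload, log))
        elif kind == "wait":
            # Simulates time passing between two input events.
            await asyncio.sleep(float(payload))
    if current is not None:
        try:
            await current  # Let the last response finish.
        except asyncio.CancelledError:
            pass
    return log


if __name__ == "__main__":
    demo_events = [
        ("end_of_speech", "hello"),
        ("wait", "0.25"),
        ("start_of_speech", ""),
        ("end_of_speech", "again"),
    ]
    print(asyncio.run(conversation_loop(demo_events)))
```

In this sketch the first response is cut off before its last word because the user starts speaking again, while the second response runs to completion.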
Triggering Model Turns from Voice Signals
You can configure when to trigger a model call using trigger_model_mode:
- AudioTriggerMode.END_OF_SPEECH: Trigger the model when the user stops talking. This is faster and suitable for audio-based models.
- AudioTriggerMode.FINAL_TRANSCRIPTION: Trigger the model when the final transcription is available. This is more suitable for text-based models but adds slight latency.
The default is FINAL_TRANSCRIPTION.
realtime_proc = realtime.LiveProcessor(
    model,
    trigger_model_mode=realtime.AudioTriggerMode.END_OF_SPEECH,
)
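To make the two modes concrete, the decision they imply can be written as a small pure function. This is only a sketch of the rule as described in this document, not the library's code: TriggerMode, should_trigger, and the string event names are hypothetical stand-ins for AudioTriggerMode and the real signal parts.

```python
import enum


class TriggerMode(enum.Enum):
    """Hypothetical mirror of the two trigger modes described above."""
    END_OF_SPEECH = "end_of_speech"
    FINAL_TRANSCRIPTION = "final_transcription"


def should_trigger(mode: TriggerMode, event: str) -> bool:
    """Returns True if `event` should start a model turn under `mode`."""
    if event == "end_of_turn":
        # An explicit end-of-turn from the client triggers in either mode.
        return True
    if mode is TriggerMode.END_OF_SPEECH:
        return event == "end_of_speech"
    return event == "final_transcription"


if __name__ == "__main__":
    for mode in TriggerMode:
        for event in ("end_of_speech", "final_transcription", "end_of_turn"):
            print(mode.name, event, should_trigger(mode, event))
```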
Voice Activity Detection (VAD)
To generate StartOfSpeech and EndOfSpeech signals, you can use the
speech_to_text module, which uses the Cloud Speech API. You can also use your
own VAD logic; the only requirement is that it outputs
speech_to_text.StartOfSpeech and speech_to_text.EndOfSpeech.
from genai_processors.core import speech_to_text
stt_processor = speech_to_text.SpeechToText(...)
# Chain: STT -> realtime processor
pipeline = stt_processor + realtime_proc
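If you write your own VAD logic, one simple approach is an energy threshold with a short "hangover" so that brief pauses do not end the utterance. The sketch below is an assumption-heavy illustration, not part of the library: detect_speech, the threshold, and the frame format (lists of 16-bit samples) are all hypothetical, and the events are plain tuples. In a real pipeline they would have to be emitted as speech_to_text.StartOfSpeech and speech_to_text.EndOfSpeech, which is not shown here.

```python
import math
from typing import Iterable, Iterator, List, Tuple


def rms(frame: List[int]) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    frames: Iterable[List[int]],
    threshold: float = 500.0,
    hangover_frames: int = 3,
) -> Iterator[Tuple[str, int]]:
    """Yields ('start_of_speech', i) and ('end_of_speech', i) frame events."""
    speaking = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= threshold:
            silent_run = 0
            if not speaking:
                speaking = True
                yield ("start_of_speech", index)
        elif speaking:
            silent_run += 1
            # Only end the utterance after enough consecutive quiet frames.
            if silent_run >= hangover_frames:
                speaking = False
                yield ("end_of_speech", index)


if __name__ == "__main__":
    quiet, loud = [10] * 160, [2000] * 160
    stream = [quiet, quiet, loud, loud, quiet, loud, quiet, quiet, quiet, quiet]
    print(list(detect_speech(stream)))
```

The single quiet frame between the loud ones is absorbed by the hangover, so only one start/end pair is produced for the whole utterance.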
RollingPrompt and Windowing
realtime.LiveProcessor uses RollingPrompt to manage conversation history
efficiently for long-running sessions. A custom context compression policy can
be supplied, but by default it keeps the prompt within a certain duration by
dropping old parts. RollingPrompt is part of the window module.
from genai_processors.core import window
rolling = window.RollingPrompt(
    duration_prompt_sec=300,  # Keep 5 minutes of history
)
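The default duration-based behavior can be pictured as a queue that evicts parts older than the window each time a new part arrives. The following is a minimal sketch of that idea under simplifying assumptions, not RollingPrompt itself: TimedPart and DurationWindow are hypothetical names, and parts carry an explicit timestamp in seconds.

```python
import collections
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class TimedPart:
    """Hypothetical part with an arrival time in seconds."""
    timestamp_sec: float
    text: str


class DurationWindow:
    """Keeps only the parts that fall within the last `duration_sec` seconds."""

    def __init__(self, duration_sec: float) -> None:
        self._duration_sec = duration_sec
        self._parts: Deque[TimedPart] = collections.deque()

    def add(self, part: TimedPart) -> None:
        self._parts.append(part)
        cutoff = part.timestamp_sec - self._duration_sec
        # Drop parts that are strictly older than the window.
        while self._parts and self._parts[0].timestamp_sec < cutoff:
            self._parts.popleft()

    def prompt(self) -> List[str]:
        return [p.text for p in self._parts]


if __name__ == "__main__":
    win = DurationWindow(duration_sec=300)
    for t, text in [(0, "a"), (100, "b"), (350, "c"), (400, "d")]:
        win.add(TimedPart(t, text))
    print(win.prompt())
```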
For more control over windowing behavior, you can use window.Window to invoke
a processor on a sliding window of conversation turns.
from genai_processors.core import window
rolling = window.Window(
    window_processor=turn_processor,
    compress_history=window.keep_last_n_turns(5),
)
The compress_history argument defines how the history is compressed when
calling the window_processor:
- drop_old_parts(age_sec): Remove parts older than the specified age.
- keep_last_n_turns(n): Keep only the last n conversation turns.
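A turn-based policy can be sketched as a factory that returns a function from history to compressed history. The code below is an illustration only and makes assumptions that the library may not share: history is a list of (role, text) tuples, a turn starts at the first user part after a non-user part, and last_n_turns_policy is a hypothetical name rather than the library's keep_last_n_turns.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def last_n_turns_policy(n: int) -> Callable[[History], History]:
    """Returns a function that keeps only the last `n` turns of a history."""
    if n <= 0:
        raise ValueError("n must be positive")

    def compress(history: History) -> History:
        # A turn starts where a user part follows a non-user part (or at index 0).
        starts = [
            i
            for i, (role, _) in enumerate(history)
            if role == "user" and (i == 0 or history[i - 1][0] != "user")
        ]
        if len(starts) <= n:
            return list(history)
        return history[starts[-n]:]

    return compress


if __name__ == "__main__":
    history = [
        ("user", "hi"), ("model", "hello"),
        ("user", "how are"), ("user", "you"), ("model", "fine"),
        ("user", "bye"), ("model", "ciao"),
    ]
    print(last_n_turns_policy(2)(history))
```

Expressing the policy as a plain callable keeps it easy to swap: a duration-based policy would have the same shape and only differ in which parts it discards.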
Generating Audio from Text Outputs
When using a text-based LLM, you can use the text_to_speech module to generate
audio output from text. It is based on the Google Text-To-Speech API, but here
again you can define your own processor to create audio parts: just replace the
tts_processor below with your own implementation.
from genai_processors.core import speech_to_text
from genai_processors.core import text_to_speech
stt_processor = speech_to_text.SpeechToText(...)
tts_processor = text_to_speech.TextToSpeech(...)
# Chain: STT -> realtime processor -> TTS
pipeline = stt_processor + realtime_proc + tts_processor
Models usually generate audio much faster than they can be played back. This
creates a challenge when a user tries to interrupt the model: once audio hits
the playback buffer, it can't be "recalled." To fix this, the RateLimitAudio
processor (from the rate_limit_audio module) buffers the output and throttles
it to real-time speed. This ensures the
model's output stays synced with the audio the user actually hears, making
interruptions feel natural.
from genai_processors.core import rate_limit_audio
from genai_processors.core import speech_to_text
from genai_processors.core import text_to_speech
stt_processor = speech_to_text.SpeechToText(...)
tts_processor = text_to_speech.TextToSpeech(...)
rate_limiter = rate_limit_audio.RateLimitAudio(sample_rate=24000)
# Chain: STT -> realtime processor -> TTS -> Rate Limiter
pipeline = stt_processor + realtime_proc + tts_processor + rate_limiter
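The throttling idea can be sketched with the standard library alone: compute how long each chunk takes to play, and never emit a chunk before the previously emitted audio would have finished playing. This is a simplified illustration and not how RateLimitAudio is implemented: throttle and chunk_duration_sec are hypothetical names, and the audio is assumed to be 16-bit mono PCM.

```python
import asyncio
import time
from typing import AsyncIterator, Iterable

BYTES_PER_SAMPLE = 2  # Assumption: 16-bit mono PCM.


def chunk_duration_sec(chunk: bytes, sample_rate: int) -> float:
    """Playback duration of one PCM chunk, in seconds."""
    return len(chunk) / (BYTES_PER_SAMPLE * sample_rate)


async def throttle(
    chunks: Iterable[bytes], sample_rate: int
) -> AsyncIterator[bytes]:
    """Yields chunks no faster than they would be played back."""
    start = time.monotonic()
    emitted_sec = 0.0  # Total playback time of the chunks emitted so far.
    for chunk in chunks:
        delay = start + emitted_sec - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        yield chunk
        emitted_sec += chunk_duration_sec(chunk, sample_rate)


async def _demo() -> None:
    chunk = b"\x00" * 2400  # 50 ms of audio at 24 kHz.
    t0 = time.monotonic()
    async for _ in throttle([chunk] * 4, sample_rate=24000):
        print(f"chunk emitted at {time.monotonic() - t0:.2f}s")


if __name__ == "__main__":
    asyncio.run(_demo())
```

Because chunks that have not been emitted yet are still held by the throttler, cancelling the consumer on an interruption discards them before they reach the playback buffer.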
Audio/Video I/O
GenAI Processors provides convenient processors to capture or render multimodal inputs and outputs.
Microphone Input
The audio_io module defines an audio source processor that can be used in any pipeline.
from genai_processors import streams
from genai_processors.core import audio_io
from genai_processors.core import speech_to_text
import pyaudio
pya = pyaudio.PyAudio()
pipeline = audio_io.PyAudioIn(pya) + speech_to_text.SpeechToText(...) + ...
async for part in pipeline(streams.endless_stream()):
    # The pipeline listens to the mic and generates audio parts.
    ...
Speaker Output
When the model outputs raw audio, you can use the audio_io.PyAudioOut
processor to play the audio parts on the default speaker.
from genai_processors import streams
from genai_processors.core import audio_io
from genai_processors.core import speech_to_text
import pyaudio
pya = pyaudio.PyAudio()
# The model wrapped by realtime_proc is a text-in, raw audio-out LLM.
pipeline = audio_io.PyAudioIn(pya) + speech_to_text.SpeechToText(...) + realtime_proc
pipeline = pipeline + audio_io.PyAudioOut(pya)
async for part in pipeline(streams.endless_stream()):
    # The pipeline listens to the mic and sends any audio output of the LLM to
    # the default speaker.
    pass
Note that the library doesn't include built-in echo cancellation, so the model may "hear" and respond to its own output. The simplest fix is to rely on your web browser's native echo cancellation; this is why we typically run our agent UIs in AI Studio apps. If you aren't using a browser-based UI, we recommend using headphones to keep the model's audio separate from the microphone.
Video Input
The video module contains source processors that capture images from a camera or from your computer screen. They are used the same way as audio inputs:
from genai_processors.core import realtime
from genai_processors.core import video
from genai_processors import streams
realtime_proc = realtime.LiveProcessor(
    model,
    trigger_model_mode=realtime.AudioTriggerMode.END_OF_SPEECH,
)
pipeline = video.VideoIn() + realtime_proc
async for part in pipeline(streams.endless_stream()):
    # The pipeline receives frames from the default camera (default 1 FPS).
    ...
Complete Example: Live Voice Agent
See the real-time simple CLI example to explore how to define a straightforward real-time agent (audio only) as a chain of processors that handles interruptions and text entries smoothly.