Speech-to-Speech
Speech-to-Speech (S2S) routes the entire call through a single
realtime model instead of the classic STT → LLM → TTS pipeline. The
model listens to the caller, reasons, and speaks back in one tightly-
coupled loop. The result is much lower turn-around latency and a more
natural conversational feel — at the cost of fewer per-component knobs.
Two provider families are first-class:
- Google Gemini Live —
gemini-2.5-flash-native-audio-preview and gemini-3.1-flash-live-preview.
- OpenAI Realtime —
gpt-realtime (GA) and gpt-realtime-mini.
When to use it
| Use Speech-to-Speech | Stay on Cascaded |
|---|---|
| You want the lowest latency turn-taking experience available | You need a specific STT or TTS provider for accuracy / language reasons |
| The call is a relatively focused conversation (booking, FAQ, lead capture) | You rely on heavy mid-call prompt updates or rich tool-call workflows |
| You’re happy with the provider’s built-in voices | You want voice cloning or a specific brand voice |
Switching modes
The Voice Configuration screen has a Cascaded ↔ Speech-to-Speech
toggle in the top-right of the header. Toggling Speech-to-Speech:
- Hides the cascaded STT and TTS tabs
- Shows a new S2S tab (between TTS and Detection in the cascaded layout)
- Auto-routes you to the S2S tab so all settings are immediately visible
Switching back to Cascaded restores STT and TTS and drops the S2S tab.
The S2S tab
The S2S tab uses a sidebar layout that mirrors the STT and TTS tabs:
- Left column — the realtime model list, grouped by provider.
- Right column — settings cards for the selected model. The cards
reshape themselves based on the active provider (Vertex AI hosting
and conversational tuning are Gemini-only; the voice catalog swaps
in automatically).
Model
Four realtime models are user-selectable:
| Provider | Model | When to pick it |
|---|---|---|
| Google Gemini Live | Gemini 2.5 Flash (recommended) | Production. Full feature support — agent handoffs, mid-session prompt updates, parallel tool calls, conversational tuning. |
| Google Gemini Live | Gemini 3.1 Flash (experimental) | Lowest latency. Single-turn / stateless voice tasks only. Agent handoffs, mid-session prompt edits, and conversational tuning are unavailable; tool calls block the model. |
| OpenAI Realtime | GPT Realtime (GA) | OpenAI’s production realtime model. Semantic-VAD turn detection, full tool calling, the new marin / cedar GA voices. ~$32/$64 per 1M audio input/output tokens. |
| OpenAI Realtime | GPT Realtime Mini | ~60% cheaper than gpt-realtime. Use for high-volume or cost-sensitive voice agents. ~$13/$26 per 1M audio tokens. |
Selecting a model from the sidebar reshapes the settings panel for that
provider — voice catalog, temperature range, hosting options, and the
conversational-tuning card all swap in or out automatically.
Voice
Gemini Live (30 prebuilt voices)
A searchable picker grouped by character so the list stays navigable:
- Popular — Puck (Upbeat), Kore (Firm), Charon (Informative), Fenrir (Excitable), Aoede (Breezy)
- Bright & Lively — Zephyr, Autonoe, Laomedeia, Sadachbia, Pulcherrima
- Warm & Friendly — Sulafat, Achird, Vindemiatrix, Achernar, Leda
- Smooth & Even — Algieba, Despina, Schedar, Callirrhoe, Umbriel, Zubenelgenubi
- Clear & Informative — Iapetus, Erinome, Rasalgethi, Sadaltager
- Firm & Mature — Orus, Alnilam, Algenib, Gacrux, Enceladus
OpenAI Realtime (10 voices)
- GA voices (gpt-realtime) —
marin (natural, conversational — default), cedar (warm, grounded). Tuned for gpt-realtime’s native-audio pipeline.
- Legacy voices —
alloy, ash, ballad, coral, echo, sage, shimmer, verse. Inherited from earlier Realtime previews.
Switching providers via the sidebar automatically resets the voice to
the new provider’s default (Puck for Gemini, marin for OpenAI) if
the previously-selected voice doesn’t exist in the new catalog.
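The reset rule can be sketched as a small pure function. The catalogs below are abbreviated and the function name is illustrative, not the platform's actual code:

```python
# Illustrative sketch of the voice-reset rule on provider switch.
DEFAULT_VOICE = {"gemini": "Puck", "openai": "marin"}
CATALOG = {
    "gemini": {"Puck", "Kore", "Charon", "Fenrir", "Aoede"},  # abbreviated; 30 in total
    "openai": {"marin", "cedar", "alloy", "ash", "ballad",
               "coral", "echo", "sage", "shimmer", "verse"},
}

def voice_after_provider_switch(current_voice: str, new_provider: str) -> str:
    """Keep the voice if the new provider's catalog has it; otherwise
    fall back to the new provider's default."""
    if current_voice in CATALOG[new_provider]:
        return current_voice
    return DEFAULT_VOICE[new_provider]
```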
Sampling
- Temperature
- Gemini Live — 0.0–2.0, default 0.8. Higher = more creative, lower = more deterministic.
- OpenAI Realtime — 0.6–1.2, default 0.8. The OpenAI API rejects values outside this band; the slider clamps automatically when you switch providers so the saved value always stays valid.
- Max Output Tokens (default 0 = no cap) — limits reply length to
keep turns snappy. Applied on Gemini Live at construction; on OpenAI
Realtime the upstream plugin doesn’t expose a constructor-level cap,
so this setting is currently advisory on OpenAI.
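The clamping behaviour on provider switch can be illustrated with a small helper. The ranges come from the bullets above; the function itself is a sketch, not platform code:

```python
# Valid temperature bands per provider (from the settings above).
TEMP_RANGE = {"gemini": (0.0, 2.0), "openai": (0.6, 1.2)}

def clamp_temperature(saved: float, provider: str) -> float:
    """Clamp a saved temperature into the active provider's valid band,
    mirroring what the slider does when you switch providers."""
    lo, hi = TEMP_RANGE[provider]
    return max(lo, min(hi, saved))
```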
Conversational Tuning (Gemini-only)
- Affective Dialog — adjusts tone to match the caller’s emotional state. Best for support / empathy use cases.
- Proactivity — allows the model to stay silent when no response is appropriate (background chatter, the caller talking to someone else).
This card is hidden entirely when an OpenAI model is selected. Within
Gemini, both toggles are silently ignored on Gemini 3.1.
Turn Detection (OpenAI only)
How GPT Realtime decides the caller has finished speaking. Two modes:
- Semantic VAD (default) — uses a classifier over the caller’s words.
Less likely to chunk mid-sentence; friendlier to hesitant callers.
Configurable eagerness:
auto (≈ medium, tuned default), low
(lets callers take their time), medium, or high (chunks audio as
soon as possible).
- Server VAD — silence-based. Configurable sliders:
- Threshold (0–1, default 0.5) — the energy level above which audio counts as speech.
Raise for noisy environments; lower for soft-spoken callers.
- Prefix padding (default 300 ms) — audio kept before detected
speech, so plosives and soft starts aren’t clipped.
- Silence duration (default 500 ms) — required silence to mark
end-of-turn. Lower = snappier; higher = safer.
- Plugin default — don’t send a turn-detection config at all; the
LiveKit plugin’s built-in default applies.
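As a rough illustration of how the Server VAD knobs interact, here is a toy silence-based end-of-turn check over per-frame energy values. This is a simplification of a real VAD (which works on raw audio, not pre-computed energies), and the 100 ms frame size is an assumption for the example:

```python
def end_of_turn(frame_energies, threshold=0.5, silence_ms=500, frame_ms=100):
    """Return True once the trailing run of sub-threshold frames
    covers at least silence_ms of audio."""
    silent_frames_needed = silence_ms // frame_ms  # 5 frames at the defaults
    trailing_silence = 0
    for energy in frame_energies:
        # A frame below the threshold counts as silence; any speech frame
        # resets the trailing-silence counter.
        trailing_silence = trailing_silence + 1 if energy < threshold else 0
    return trailing_silence >= silent_frames_needed
```

Raising `threshold` makes background noise less likely to reset the counter (noisy rooms); lowering `silence_ms` ends turns sooner at the cost of occasionally cutting off slow speakers.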
Two response toggles apply to both modes:
- Auto-generate reply (default on) — model starts replying as soon as turn detection fires.
- Allow interruption (default on) — caller speech during the agent’s reply interrupts the current response.
The card has a reset button that restores all knobs to the LiveKit /
OpenAI recommended defaults.
Thinking (Gemini native-audio only)
Gemini’s native-audio models think before responding. By default this
chain-of-thought stays internal; toggling Include thoughts in
transcript on forwards the model’s thoughts to your transcript alongside
the normal output. Useful for debugging reasoning, noisy for production
logs. Off by default; reset button restores the default.
Custom Turn Detection (Gemini only)
Disables Gemini’s built-in VAD and routes input audio through LiveKit’s
MultilingualModel turn detector + a separate STT plugin you choose.
Gemini Live still speaks the reply — only the input path changes.
When this toggle is on, the STT tab re-appears in the tab bar (it’s
normally hidden in S2S mode). Configure your input STT there, then jump
to the Detection tab to pick the turn-detection mode (turn_detector
/ stt_endpointing / vad_only) and endpointing thresholds.
Useful when:
- Callers are in noisy environments and you trust a dedicated STT (e.g. Deepgram Nova-3) more than Gemini’s built-in VAD.
- You need an STT-specific feature on the input side (custom vocabulary, language detection, etc.).
- You want LiveKit’s MultilingualModel to do end-of-utterance prediction across 100+ languages.
Off by default.
Hosting
The hosting card adapts to the selected provider:
| Option | Available on | Who pays the provider | Who pays the platform fee |
|---|---|---|---|
| Platform (default) | Both | Platform-managed key, billed via your thinnestAI plan | Yes (your trial / PAYG) |
| BYO Gemini API key | Gemini only | You — paste a key from aistudio.google.com/app/apikey. Bills directly to your Google account. | Yes — only the platform fee |
| BYO OpenAI API key | OpenAI only | You — paste a key from platform.openai.com/api-keys. Bills Realtime API usage to your OpenAI account. | Yes — only the platform fee |
| Vertex AI | Gemini only | You — routes through your own GCP project. Requires GOOGLE_APPLICATION_CREDENTIALS (service account JSON) on the deployment. Project + region are optional overrides. | Yes — only the platform fee |
Keys are stored encrypted at rest. The key prefix is validated in-browser
(AIza… for Gemini, sk-… for OpenAI) and a warning surfaces if the
pasted value doesn’t match the expected provider — this prevents
accidentally sending a Gemini key to OpenAI or vice versa.
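The in-browser check amounts to a prefix match per provider. A minimal sketch (the function name is illustrative):

```python
# Expected key prefixes, as described above.
KEY_PREFIX = {"gemini": "AIza", "openai": "sk-"}

def key_matches_provider(api_key: str, provider: str) -> bool:
    """True if the pasted key looks like it belongs to the selected provider."""
    return api_key.startswith(KEY_PREFIX[provider])
```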
When you toggle Vertex AI on, the BYO key field is auto-cleared, and
vice versa. Switching from a Gemini model to an OpenAI model also
automatically disables Vertex AI (which is Gemini-specific).
Half-cascade — use your own TTS
S2S has an optional Use custom TTS toggle inside the S2S tab. When
enabled:
- The realtime model runs in TEXT modality — it listens and reasons,
but emits text instead of audio.
- Your selected TTS plugin (Cartesia, ElevenLabs, Sarvam, Aero, etc.)
speaks the reply.
- Click the gear icon next to the toggle to open an embedded TTS picker
modal — same provider/model/voice options as the cascaded TTS tab.
Half-cascade is useful when you want the realtime model’s listening +
reasoning quality but need a specific brand voice or language coverage
the provider doesn’t ship.
Half-cascade is incompatible with Gemini’s native-audio models —
they reject TEXT modality at the API level. The toggle is automatically
disabled and forced off when a native-audio model (the current Gemini
2.5 default) is selected. Switch to Gemini 3.1 Flash or either
OpenAI Realtime model to enable half-cascade.
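The enablement rule reduces to a predicate over the four models. The model identifiers below match the ones listed at the top of this page; the function itself is a sketch, not platform code:

```python
# Native-audio Gemini models reject TEXT modality, so half-cascade is
# unavailable on them.
NATIVE_AUDIO_MODELS = {"gemini-2.5-flash-native-audio-preview"}

def half_cascade_available(model: str) -> bool:
    """True if the 'Use custom TTS' toggle can be enabled for this model."""
    return model not in NATIVE_AUDIO_MODELS
```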
Cost breakdown in S2S
When S2S mode is active, the cost-breakdown popup collapses STT + TTS +
LLM into a single row whose label tracks the selected provider:
- Pure S2S (native audio) —
Hosting / <Gemini Live | GPT Realtime | GPT Realtime Mini> / Telephony / Krisp NC.
- Half-cascade — the TTS row is restored to reflect the cascaded TTS plugin you picked.
- BYOK (Vertex AI, BYO Gemini key, or BYO OpenAI key) — the model row shows as Free because you pay the provider directly. The platform only charges its fee.
Per-minute list-price estimates:
| Model | Pricing (per 1M audio tokens, in/out) | Approx. ₹/min (non-BYOK) |
|---|---|---|
| Gemini 2.5 Flash (live) | $3 / $12 | ~₹6.80/min |
| Gemini 3.1 Flash (live) | $3 / $12 | ~₹6.80/min |
| GPT Realtime | $32 / $64 | ~₹9.05/min |
| GPT Realtime Mini | $13 / $26 | ~₹3.65/min |
Estimates assume typical 1-min audio token counts; actual usage varies
with how much the caller and agent each speak.
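To make the arithmetic behind those estimates concrete, here is the general formula. The per-minute token counts and the USD→INR rate in the example are hypothetical placeholders, not the platform's billing constants:

```python
def cost_per_min_inr(in_price_usd, out_price_usd,
                     in_tokens, out_tokens, usd_to_inr=84.0):
    """Estimate per-minute cost in INR.

    in_price_usd / out_price_usd: list price per 1M audio tokens.
    in_tokens / out_tokens: audio tokens consumed in one minute of call
    (depends on how much each side speaks).
    """
    usd = (in_tokens * in_price_usd + out_tokens * out_price_usd) / 1_000_000
    return usd * usd_to_inr
```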
Knowledge Base in S2S mode
In Cascaded mode, knowledge retrieval is automatic: every user turn triggers a hybrid search and the top passages are injected into the model’s chat context before it responds. The realtime models used in S2S mode don’t expose that hook — Gemini Live and OpenAI Realtime handle turn detection internally and skip the cascaded STT → LLM step where injection would normally happen.
To keep KB grounding working under S2S, thinnestAI registers a search_knowledge_base function tool whenever knowledge is attached to the agent. This is the standard pattern documented by both providers.
The realtime model decides when to call the tool. Pricing, features, hours, policy, product specs — anything documented in your knowledge base routes through it; chitchat (“How are you?”) does not.
```json
{
  "name": "search_knowledge_base",
  "description": "Search the company's internal knowledge base for facts about products, services, pricing, plans, features, policies, hours, contact info, technical specs, or any company-specific information. Always call this tool instead of guessing whenever the user asks about anything the company would have documented. Pass the user's question (or a key phrase from it) as the query.",
  "parameters": {
    "query": "string — the user's question or a key phrase from it"
  }
}
```
The tool returns up to six concatenated passages (1500 chars each, separated by \n\n[KB N] … markers). The model uses them to compose a grounded reply in audio.
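A sketch of that result format. The truncation length, passage cap, and `[KB N]` markers come from the description above; the exact marker layout and the helper name are assumptions:

```python
def format_kb_passages(passages, max_passages=6, max_chars=1500):
    """Concatenate up to max_passages passages, each truncated to
    max_chars, separated by blank lines and [KB N] markers."""
    chunks = [
        f"[KB {i}] {p[:max_chars]}"
        for i, p in enumerate(passages[:max_passages], start=1)
    ]
    return "\n\n".join(chunks)
```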
Latency profile
| Scenario | Round-trip cost |
|---|---|
| User asks a documented question | ~300-600 ms (embedding + vector search + tool round-trip) |
| User asks chitchat | 0 ms — model doesn’t call the tool |
| Multi-turn follow-up on the same topic | 0 ms if the model can reuse the prior tool result from its context window |
The tool runs against the same hybrid pgvector index that powers cascaded retrieval; tenant-isolation guards (per-agent source_ids) apply identically.
Tenant isolation
The tool is bound to the specific knowledge sources attached to the agent. When the model invokes search_knowledge_base(query=…), the underlying search is filtered to those source IDs — the tool cannot leak cross-tenant content even if invoked with a misleading query. If no sources are configured, the tool returns “Knowledge base is not configured for this agent.” instead of falling back to an unfiltered search.
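The guard can be sketched as a thin wrapper around the search call. The wrapper shape, the injected `search_fn`, and the fallback message wording follow the description above but are illustrative, not platform source:

```python
def search_knowledge_base(query, agent_source_ids, search_fn):
    """Run a KB search scoped to the agent's attached sources.

    search_fn(query, source_ids=...) is assumed to filter at the
    index level, so results can never include other tenants' content.
    """
    if not agent_source_ids:
        return "Knowledge base is not configured for this agent."
    return search_fn(query, source_ids=agent_source_ids)
```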
What you need to do
Nothing. Attach knowledge to your agent the usual way (Knowledge tab → pick or upload sources) and switch to S2S — the tool is registered automatically when the agent boots. Confirm registration in your logs:
```
[KB Tool] Registered search_knowledge_base tool (sources=3) — usable by S2S realtime models and as a cascaded-mode fallback
```
The same tool is also registered in Cascaded mode as a fallback for multi-turn cases where the prior turn’s auto-injection didn’t cover the new topic. In practice the cascaded model rarely calls it because context is already in chat_ctx.
If you previously saw S2S agents hallucinate company-specific information (e.g. inventing pricing or policies), make sure your knowledge sources are actually attached to the agent. The search_knowledge_base tool is only registered when knowledge is present — without it, the realtime model falls back to its training data.
Greeting behaviour
| Configuration | Greeting trigger |
|---|---|
| Pure S2S, Gemini 2.5 | Gemini Live speaks the configured greeting on its first turn — baked into the model’s system instructions as an opening-line block. |
| Pure S2S, OpenAI Realtime | The realtime model speaks the configured greeting on its first turn — passed inline through the agent’s initial generate-reply trigger. |
| Pure S2S, Gemini 3.1 | Gemini 3.1 doesn’t support agent-initiated turns. The user must speak first to start the conversation. Switch to Gemini 2.5, OpenAI Realtime, or enable Custom TTS for an agent-initiated greeting. |
| S2S half-cascade (any model) | Your cascaded TTS plugin speaks the greeting via session.say(). |
Quick start
- Open your agent in Agent Studio → Voice Configuration.
- In the header, click Speech-to-Speech.
- The S2S tab opens with the model sidebar on the left. Pick Gemini 2.5 Flash (recommended default), or click GPT Realtime to try OpenAI.
- Leave Voice on the model’s default (
Puck for Gemini, marin for OpenAI), or pick another from the dropdown.
- (Optional) Open the Hosting card and paste your own Gemini or OpenAI key.
- Save and click Try Voice Call.
Next Steps