Speech-to-Speech
Speech-to-Speech (S2S) routes the entire call through a single
realtime model instead of the classic STT → LLM → TTS pipeline. The
model listens to the caller, reasons, and speaks back in one tightly-
coupled loop. The result is much lower turn-around latency and a more
natural conversational feel — at the cost of fewer per-component knobs.
Two provider families are first-class:
- Google Gemini Live —
gemini-2.5-flash-native-audio-preview and gemini-3.1-flash-live-preview.
- OpenAI Realtime —
gpt-realtime (GA) and gpt-realtime-mini.
When to use it
| Use Speech-to-Speech | Stay on Cascaded |
|---|---|
| You want the lowest latency turn-taking experience available | You need a specific STT or TTS provider for accuracy / language reasons |
| The call is a relatively focused conversation (booking, FAQ, lead capture) | You rely on heavy mid-call prompt updates or rich tool-call workflows |
| You’re happy with the provider’s built-in voices | You want voice cloning or a specific brand voice |
Switching modes
The Voice Configuration screen has a Cascaded ↔ Speech-to-Speech
toggle in the top-right of the header. Toggling Speech-to-Speech:
- Hides the cascaded STT and TTS tabs
- Shows a new S2S tab (between TTS and Detection in the cascaded layout)
- Auto-routes you to the S2S tab so all settings are immediately visible
Switching back to Cascaded restores STT and TTS and drops the S2S tab.
The S2S tab
The S2S tab uses a sidebar layout that mirrors the STT and TTS tabs:
- Left column — the realtime model list, grouped by provider.
- Right column — settings cards for the selected model. The cards
reshape themselves based on the active provider (Vertex AI hosting
and conversational tuning are Gemini-only; the voice catalog swaps
in automatically).
Model
Four realtime models are user-selectable:
| Provider | Model | When to pick it |
|---|---|---|
| Google Gemini Live | Gemini 2.5 Flash (recommended) | Production. Full feature support — agent handoffs, mid-session prompt updates, parallel tool calls, conversational tuning. |
| Google Gemini Live | Gemini 3.1 Flash (experimental) | Lowest latency. Single-turn / stateless voice tasks only. Agent handoffs, mid-session prompt edits, and conversational tuning are unavailable; tool calls block the model. |
| OpenAI Realtime | GPT Realtime (GA) | OpenAI’s production realtime model. Semantic-VAD turn detection, full tool calling, the new marin / cedar GA voices. ~$32/$64 per 1M audio input/output tokens. |
| OpenAI Realtime | GPT Realtime Mini | ~60% cheaper than gpt-realtime. Use for high-volume or cost-sensitive voice agents. ~$13/$26 per 1M audio tokens. |
Selecting a model from the sidebar reshapes the settings panel for that
provider — voice catalog, temperature range, hosting options, and the
conversational-tuning card all swap in or out automatically.
Voice
Gemini Live (30 prebuilt voices)
A searchable picker grouped by character so the list stays navigable:
- Popular — Puck (Upbeat), Kore (Firm), Charon (Informative), Fenrir (Excitable), Aoede (Breezy)
- Bright & Lively — Zephyr, Autonoe, Laomedeia, Sadachbia, Pulcherrima
- Warm & Friendly — Sulafat, Achird, Vindemiatrix, Achernar, Leda
- Smooth & Even — Algieba, Despina, Schedar, Callirrhoe, Umbriel, Zubenelgenubi
- Clear & Informative — Iapetus, Erinome, Rasalgethi, Sadaltager
- Firm & Mature — Orus, Alnilam, Algenib, Gacrux, Enceladus
OpenAI Realtime (10 voices)
- GA voices (gpt-realtime) —
marin (natural, conversational — default), cedar (warm, grounded). Tuned for gpt-realtime’s native-audio pipeline.
- Legacy voices —
alloy, ash, ballad, coral, echo, sage, shimmer, verse. Inherited from earlier Realtime previews.
Switching providers via the sidebar automatically resets the voice to
the new provider’s default (Puck for Gemini, marin for OpenAI) if
the previously-selected voice doesn’t exist in the new catalog.
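The reset rule can be sketched as a small pure function. The catalogs below are abbreviated and the function name is illustrative, not the platform's actual code:

```python
# Illustrative sketch of the voice-reset rule on provider switch.
DEFAULT_VOICE = {"gemini": "Puck", "openai": "marin"}
CATALOG = {
    "gemini": {"Puck", "Kore", "Charon", "Fenrir", "Aoede"},  # abbreviated; 30 in total
    "openai": {"marin", "cedar", "alloy", "ash", "ballad",
               "coral", "echo", "sage", "shimmer", "verse"},
}

def voice_after_provider_switch(current_voice: str, new_provider: str) -> str:
    """Keep the voice if the new provider's catalog has it; otherwise
    fall back to the new provider's default."""
    if current_voice in CATALOG[new_provider]:
        return current_voice
    return DEFAULT_VOICE[new_provider]
```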
Sampling
- Temperature
- Gemini Live — 0.0–2.0, default 0.8. Higher = more creative, lower = more deterministic.
- OpenAI Realtime — 0.6–1.2, default 0.8. The OpenAI API rejects values outside this band; the slider clamps automatically when you switch providers so the saved value always stays valid.
- Max Output Tokens (default 0 = no cap) — limits reply length to
keep turns snappy. Applied on Gemini Live at construction; on OpenAI
Realtime the upstream plugin doesn’t expose a constructor-level cap,
so this setting is currently advisory on OpenAI.
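The clamping behaviour on provider switch can be illustrated with a small helper. The ranges come from the bullets above; the function itself is a sketch, not platform code:

```python
# Valid temperature bands per provider (from the settings above).
TEMP_RANGE = {"gemini": (0.0, 2.0), "openai": (0.6, 1.2)}

def clamp_temperature(saved: float, provider: str) -> float:
    """Clamp a saved temperature into the active provider's valid band,
    mirroring what the slider does when you switch providers."""
    lo, hi = TEMP_RANGE[provider]
    return max(lo, min(hi, saved))
```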
Conversational Tuning (Gemini-only)
- Affective Dialog — adjusts tone to match the caller’s emotional state. Best for support / empathy use cases.
- Proactivity — allows the model to stay silent when no response is appropriate (background chatter, the caller talking to someone else).
This card is hidden entirely when an OpenAI model is selected. Within
Gemini, both toggles are silently ignored on Gemini 3.1.
Turn Detection (OpenAI only)
How GPT Realtime decides the caller has finished speaking. Two modes:
- Semantic VAD (default) — uses a classifier over the caller’s words.
Less likely to chunk mid-sentence; friendlier to hesitant callers.
Configurable eagerness:
auto (≈ medium, tuned default), low
(lets callers take their time), medium, or high (chunks audio as
soon as possible).
- Server VAD — silence-based. Configurable sliders:
- Threshold (0–1, default 0.5) — the energy level above which audio counts as speech.
Raise for noisy environments; lower for soft-spoken callers.
- Prefix padding (default 300 ms) — audio kept before detected
speech, so plosives and soft starts aren’t clipped.
- Silence duration (default 500 ms) — required silence to mark
end-of-turn. Lower = snappier; higher = safer.
- Plugin default — don’t send a turn-detection config at all; the
LiveKit plugin’s built-in default applies.
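As a rough illustration of how the Server VAD knobs interact, here is a toy silence-based end-of-turn check over per-frame energy values. This is a simplification of a real VAD (which works on raw audio, not pre-computed energies), and the 100 ms frame size is an assumption for the example:

```python
def end_of_turn(frame_energies, threshold=0.5, silence_ms=500, frame_ms=100):
    """Return True once the trailing run of sub-threshold frames
    covers at least silence_ms of audio."""
    silent_frames_needed = silence_ms // frame_ms  # 5 frames at the defaults
    trailing_silence = 0
    for energy in frame_energies:
        # A frame below the threshold counts as silence; any speech frame
        # resets the trailing-silence counter.
        trailing_silence = trailing_silence + 1 if energy < threshold else 0
    return trailing_silence >= silent_frames_needed
```

Raising `threshold` makes background noise less likely to reset the counter (noisy rooms); lowering `silence_ms` ends turns sooner at the cost of occasionally cutting off slow speakers.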
Two response toggles apply to both modes:
- Auto-generate reply (default on) — model starts replying as soon as turn detection fires.
- Allow interruption (default on) — caller speech during the agent’s reply interrupts the current response.
The card has a reset button that restores all knobs to the LiveKit /
OpenAI recommended defaults.
Thinking (Gemini native-audio only)
Gemini’s native-audio models think before responding. By default this
chain-of-thought stays internal; toggling Include thoughts in
transcript on forwards the model’s thoughts to your transcript alongside
the normal output. Useful for debugging reasoning, noisy for production
logs. Off by default; reset button restores the default.
Custom Turn Detection (Gemini only)
Disables Gemini’s built-in VAD and routes input audio through LiveKit’s
MultilingualModel turn detector + a separate STT plugin you choose.
Gemini Live still speaks the reply — only the input path changes.
When this toggle is on, the STT tab re-appears in the tab bar (it’s
normally hidden in S2S mode). Configure your input STT there, then jump
to the Detection tab to pick the turn-detection mode (turn_detector
/ stt_endpointing / vad_only) and endpointing thresholds.
Useful when:
- Callers are in noisy environments and you trust a dedicated STT (e.g. Deepgram Nova-3) more than Gemini’s built-in VAD.
- You need an STT-specific feature on the input side (custom vocabulary, language detection, etc.).
- You want LiveKit’s MultilingualModel to do end-of-utterance prediction across 100+ languages.
Off by default.
Hosting
The hosting card adapts to the selected provider:
| Option | Available on | Who pays the provider | Who pays the platform fee |
|---|---|---|---|
| Platform (default) | Both | Platform-managed key, billed via your thinnestAI plan | Yes (your trial / PAYG) |
| BYO Gemini API key | Gemini only | You — paste a key from aistudio.google.com/app/apikey. Bills directly to your Google account. | Yes — only the platform fee |
| BYO OpenAI API key | OpenAI only | You — paste a key from platform.openai.com/api-keys. Bills Realtime API usage to your OpenAI account. | Yes — only the platform fee |
| Vertex AI | Gemini only | You — routes through your own GCP project. Requires GOOGLE_APPLICATION_CREDENTIALS (service account JSON) on the deployment. Project + region are optional overrides. | Yes — only the platform fee |
Keys are stored encrypted at rest. The key prefix is validated in-browser
(AIza… for Gemini, sk-… for OpenAI) and a warning surfaces if the
pasted value doesn’t match the expected provider — this prevents
accidentally sending a Gemini key to OpenAI or vice versa.
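The in-browser check amounts to a prefix match per provider. A minimal sketch (the function name is illustrative):

```python
# Expected key prefixes, as described above.
KEY_PREFIX = {"gemini": "AIza", "openai": "sk-"}

def key_matches_provider(api_key: str, provider: str) -> bool:
    """True if the pasted key looks like it belongs to the selected provider."""
    return api_key.startswith(KEY_PREFIX[provider])
```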
When you toggle Vertex AI on, the BYO key field is auto-cleared, and
vice versa. Switching from a Gemini model to an OpenAI model also
automatically disables Vertex AI (which is Gemini-specific).
Half-cascade — use your own TTS
S2S has an optional Use custom TTS toggle inside the S2S tab. When
enabled:
- The realtime model runs in TEXT modality — it listens and reasons,
but emits text instead of audio.
- Your selected TTS plugin (Cartesia, ElevenLabs, Sarvam, Aero, etc.)
speaks the reply.
- Click the gear icon next to the toggle to open an embedded TTS picker
modal — same provider/model/voice options as the cascaded TTS tab.
Half-cascade is useful when you want the realtime model’s listening +
reasoning quality but need a specific brand voice or language coverage
the provider doesn’t ship.
Half-cascade is incompatible with Gemini’s native-audio models —
they reject TEXT modality at the API level. The toggle is automatically
disabled and forced off when a native-audio model (the current Gemini
2.5 default) is selected. Switch to Gemini 3.1 Flash or either
OpenAI Realtime model to enable half-cascade.
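The enablement rule reduces to a predicate over the four models. The model identifiers below match the ones listed at the top of this page; the function itself is a sketch, not platform code:

```python
# Native-audio Gemini models reject TEXT modality, so half-cascade is
# unavailable on them.
NATIVE_AUDIO_MODELS = {"gemini-2.5-flash-native-audio-preview"}

def half_cascade_available(model: str) -> bool:
    """True if the 'Use custom TTS' toggle can be enabled for this model."""
    return model not in NATIVE_AUDIO_MODELS
```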
Cost breakdown in S2S
When S2S mode is active, the cost-breakdown popup collapses STT + TTS +
LLM into a single row whose label tracks the selected provider:
- Pure S2S (native audio) —
Hosting / <Gemini Live | GPT Realtime | GPT Realtime Mini> / Telephony / Krisp NC.
- Half-cascade — the TTS row is restored to reflect the cascaded TTS plugin you picked.
- BYOK (Vertex AI, BYO Gemini key, or BYO OpenAI key) — the model row shows as Free because you pay the provider directly. The platform only charges its fee.
Per-minute list-price estimates:
| Model | Pricing (per 1M audio tokens, in/out) | Approx. ₹/min (non-BYOK) |
|---|---|---|
| Gemini 2.5 Flash (live) | $3 / $12 | ~₹6.80/min |
| Gemini 3.1 Flash (live) | $3 / $12 | ~₹6.80/min |
| GPT Realtime | $32 / $64 | ~₹9.05/min |
| GPT Realtime Mini | $13 / $26 | ~₹3.65/min |
Estimates assume typical 1-min audio token counts; actual usage varies
with how much the caller and agent each speak.
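To make the arithmetic behind those estimates concrete, here is the general formula. The per-minute token counts and the USD→INR rate in the example are hypothetical placeholders, not the platform's billing constants:

```python
def cost_per_min_inr(in_price_usd, out_price_usd,
                     in_tokens, out_tokens, usd_to_inr=84.0):
    """Estimate per-minute cost in INR.

    in_price_usd / out_price_usd: list price per 1M audio tokens.
    in_tokens / out_tokens: audio tokens consumed in one minute of call
    (depends on how much each side speaks).
    """
    usd = (in_tokens * in_price_usd + out_tokens * out_price_usd) / 1_000_000
    return usd * usd_to_inr
```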
Knowledge Base in S2S mode
In Cascaded mode, knowledge retrieval is automatic: every user turn triggers a hybrid search and the top passages are injected into the model’s chat context before it responds. The realtime models used in S2S mode don’t expose that hook — Gemini Live and OpenAI Realtime handle turn detection internally and skip the cascaded STT → LLM step where injection would normally happen.
To keep KB grounding working under S2S, thinnestAI registers a search_knowledge_base function tool whenever knowledge is attached to the agent. This is the standard pattern documented by both providers.
The realtime model decides when to call the tool. Pricing, features, hours, policy, product specs — anything documented in your knowledge base routes through it; chitchat (“How are you?”) does not.
```json
{
  "name": "search_knowledge_base",
  "description": "Search the company's internal knowledge base for facts about products, services, pricing, plans, features, policies, hours, contact info, technical specs, or any company-specific information. Always call this tool instead of guessing whenever the user asks about anything the company would have documented. Pass the user's question (or a key phrase from it) as the query.",
  "parameters": {
    "query": "string — the user's question or a key phrase from it"
  }
}
```
The tool returns up to six concatenated passages (1500 chars each, separated by \n\n[KB N] … markers). The model uses them to compose a grounded reply in audio.
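A sketch of that result format. The truncation length, passage cap, and `[KB N]` markers come from the description above; the exact marker layout and the helper name are assumptions:

```python
def format_kb_passages(passages, max_passages=6, max_chars=1500):
    """Concatenate up to max_passages passages, each truncated to
    max_chars, separated by blank lines and [KB N] markers."""
    chunks = [
        f"[KB {i}] {p[:max_chars]}"
        for i, p in enumerate(passages[:max_passages], start=1)
    ]
    return "\n\n".join(chunks)
```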
Latency profile
| Scenario | Round-trip cost |
|---|---|
| User asks a documented question | ~300-600 ms (embedding + vector search + tool round-trip) |
| User asks chitchat | 0 ms — model doesn’t call the tool |
| Multi-turn follow-up on the same topic | 0 ms if the model can reuse the prior tool result from its context window |
The tool runs against the same hybrid pgvector index that powers cascaded retrieval; tenant-isolation guards (per-agent source_ids) apply identically.
Tenant isolation
The tool is bound to the specific knowledge sources attached to the agent. When the model invokes search_knowledge_base(query=…), the underlying search is filtered to those source IDs — the tool cannot leak cross-tenant content even if invoked with a misleading query. If no sources are configured, the tool returns “Knowledge base is not configured for this agent.” instead of falling back to an unfiltered search.
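The guard can be sketched as a thin wrapper around the search call. The wrapper shape, the injected `search_fn`, and the fallback message wording follow the description above but are illustrative, not platform source:

```python
def search_knowledge_base(query, agent_source_ids, search_fn):
    """Run a KB search scoped to the agent's attached sources.

    search_fn(query, source_ids=...) is assumed to filter at the
    index level, so results can never include other tenants' content.
    """
    if not agent_source_ids:
        return "Knowledge base is not configured for this agent."
    return search_fn(query, source_ids=agent_source_ids)
```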
What you need to do
Nothing. Attach knowledge to your agent the usual way (Knowledge tab → pick or upload sources) and switch to S2S — the tool is registered automatically when the agent boots. Confirm registration in your logs:
```
[KB Tool] Registered search_knowledge_base tool (sources=3) — usable by S2S realtime models and as a cascaded-mode fallback
```
The same tool is also registered in Cascaded mode as a fallback for multi-turn cases where the prior turn’s auto-injection didn’t cover the new topic. In practice the cascaded model rarely calls it because context is already in chat_ctx.
If you previously saw S2S agents hallucinate company-specific information (e.g. inventing pricing or policies), make sure your knowledge sources are actually attached to the agent. The search_knowledge_base tool is only registered when knowledge is present — without it, the realtime model falls back to its training data.
Greeting behaviour
| Configuration | Greeting trigger |
|---|---|
| Pure S2S, Gemini 2.5 | Gemini Live speaks the configured greeting on its first turn — baked into the model’s system instructions as an opening-line block. |
| Pure S2S, OpenAI Realtime | The realtime model speaks the configured greeting on its first turn — passed inline through the agent’s initial generate-reply trigger. |
| Pure S2S, Gemini 3.1 | Gemini 3.1 doesn’t support agent-initiated turns. The user must speak first to start the conversation. Switch to Gemini 2.5, OpenAI Realtime, or enable Custom TTS for an agent-initiated greeting. |
| S2S half-cascade (any model) | Your cascaded TTS plugin speaks the greeting via session.say(). |
Quick start
- Open your agent in Agent Studio → Voice Configuration.
- In the header, click Speech-to-Speech.
- The S2S tab opens with the model sidebar on the left. Pick Gemini 2.5 Flash (recommended default), or click GPT Realtime to try OpenAI.
- Leave Voice on the model’s default (
Puck for Gemini, marin for OpenAI), or pick another from the dropdown.
- (Optional) Open the Hosting card and paste your own Gemini or OpenAI key.
- Save and click Try Voice Call.
Next Steps