The voice object on the v2 Agent API (and voice_config dict on v1)
configures every voice-related setting on an agent. This page documents
every field as of 2026-05-11 — including Speech-to-Speech (Gemini
Live + OpenAI Realtime), noise-cancellation, and audio-ambience surfaces.
Top-level shape (v2)
{
"voice": {
"provider": "cartesia",
"model": "sonic-3",
"voice_id": "f786b574-daa5-4673-aa0c-cbe3e8534c02",
"speed": 1.0,
"fallback_provider": "deepgram",
"fallback_model": "aura-2",
"noise_cancellation": { "engine": "gtcrn", "strength": 0.3 },
"ambience": {
"ambient_sound": "office",
"ambient_sound_volume": 0.3,
"thinking_sound": "typing",
"thinking_sound_volume": 0.5
},
"s2s": {
"mode": "cascaded",
"provider": "google_realtime",
"model": "gemini-2.5-flash-native-audio-preview-12-2025",
"voice": "Puck",
"temperature": 0.8
}
}
}
For the v1 PUT /agents/update endpoint, send the same fields as a
flat dict on voice_config (snake_case OR camelCase — both are accepted
by the runtime reader).
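One way to picture why both spellings can be treated as equivalent is a reader that normalizes keys before looking them up. The Python sketch below is illustrative only: the helper names and the camelCase spellings (`voiceId`, `fallbackProvider`) are assumptions, and the runtime reader's actual implementation is not described on this page.

```python
import re


def to_snake_case(key: str) -> str:
    """Convert a camelCase key such as 'voiceId' to 'voice_id'."""
    # Insert an underscore before any capital that follows a lowercase
    # letter or digit, then lowercase the whole key.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()


def normalize_keys(config: dict) -> dict:
    """Return a copy of config with every key rewritten to snake_case."""
    normalized = {}
    for key, value in config.items():
        if isinstance(value, dict):
            value = normalize_keys(value)  # nested dicts are handled the same way
        normalized[to_snake_case(key)] = value
    return normalized


# Hypothetical camelCase payload and its snake_case equivalent.
camel = {"voiceId": "f786b574-daa5-4673-aa0c-cbe3e8534c02", "fallbackProvider": "deepgram"}
snake = {"voice_id": "f786b574-daa5-4673-aa0c-cbe3e8534c02", "fallback_provider": "deepgram"}
print(normalize_keys(camel) == snake)
```

Keys that are already snake_case pass through unchanged, so the same helper can be applied to either spelling.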
Cascaded TTS (default)
| Field | Type | Default | Notes |
|---|---|---|---|
| provider | string | cartesia | Cartesia / ElevenLabs / Sarvam / Aero / Inworld / Rime / Deepgram |
| model | string | provider default | e.g. sonic-3, eleven_multilingual_v2, bulbul:v3 |
| voice_id | string | — | Cartesia/ElevenLabs voice ID; Sarvam voice name (e.g. shubh); etc. |
| speed | float | 1.0 | TTS playback speed |
| fallback_provider | string | — | Failover provider when the primary errors |
| fallback_model | string | — | Failover model |
noise_cancellation
"noise_cancellation": { "engine": "gtcrn", "strength": 0.3 }
| Field | Type | Default | Values |
|---|---|---|---|
| engine | string | gtcrn | gtcrn (recommended, MIT) · hush (DeepFilterNet3 + speaker separation) · rnnoise (lightest CPU) · dtln (legacy) · none |
| strength | float (0.0-1.0) | 0.3 | Wet/dry blend. 0 = bypassed, 1 = full denoise. 0.3 is the validated multi-turn sweet spot. |
krisp and bvc are accepted for backward-compat but silently fall
back to gtcrn on self-hosted deployments.
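To make the strength setting concrete, the sketch below treats a wet/dry blend as a linear mix of the original (dry) and denoised (wet) samples, and maps the legacy engine names to gtcrn as described above. Both functions are hypothetical illustrations: the linear mix is the conventional meaning of "wet/dry", not a confirmed detail of the platform's audio pipeline, and rejecting unknown engine names is an assumption.

```python
LEGACY_ENGINES = {"krisp", "bvc"}  # accepted for backward compatibility
KNOWN_ENGINES = {"gtcrn", "hush", "rnnoise", "dtln", "none"}


def resolve_engine(engine: str = "gtcrn") -> str:
    """Return the engine that would run on a self-hosted deployment."""
    if engine in LEGACY_ENGINES:
        return "gtcrn"  # silent fallback described in the docs
    if engine not in KNOWN_ENGINES:
        # Assumption: how unknown names are handled is not documented.
        raise ValueError(f"unknown noise-cancellation engine: {engine!r}")
    return engine


def blend(dry, wet, strength=0.3):
    """Mix dry (original) and wet (denoised) samples of equal length."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    if len(dry) != len(wet):
        raise ValueError("dry and wet must have the same number of samples")
    # strength = 0 returns the dry signal; strength = 1 returns the wet signal.
    return [(1.0 - strength) * d + strength * w for d, w in zip(dry, wet)]


print(resolve_engine("krisp"))
print(blend([0.5, -0.5], [0.1, -0.1], strength=0.0))
```

With the default strength of 0.3, 70% of each output sample comes from the original audio and 30% from the denoised audio under this model.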
ambience
"ambience": {
"ambient_sound": "office",
"ambient_sound_volume": 0.3,
"thinking_sound": "typing",
"thinking_sound_volume": 0.5
}
| Field | Type | Default | Values |
|---|---|---|---|
| ambient_sound | string | none | none · office (built-in chatter) · custom |
| ambient_sound_url | string | "" | URL to a .mp3 / .wav (≤ 25 MB). Required when ambient_sound = custom. |
| ambient_sound_volume | float (0.0-1.0) | 0.3 | |
| thinking_sound | string | none | none · typing (built-in keyboard) · custom |
| thinking_sound_url | string | "" | Required when thinking_sound = custom |
| thinking_sound_volume | float (0.0-1.0) | 0.5 | |
The thinking sound is auto-triggered whenever the agent is in its
“thinking” state — useful for masking LLM latency on tool calls and
handoffs. Custom uploads can be made via the POST /upload endpoint.
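The rules in the ambience table (allowed values, a URL required for custom sounds, volumes between 0.0 and 1.0) can be checked client-side before sending a request. The following Python sketch is a hypothetical pre-flight validator; `validate_ambience` is not part of the API, and the server's own validation messages may differ.

```python
BUILT_IN = {"ambient_sound": "office", "thinking_sound": "typing"}
DEFAULT_VOLUME = {"ambient_sound": 0.3, "thinking_sound": 0.5}


def validate_ambience(ambience: dict) -> list:
    """Return a list of problems found in an ambience dict (empty if none)."""
    errors = []
    for field, built_in in BUILT_IN.items():
        choice = ambience.get(field, "none")
        if choice not in ("none", built_in, "custom"):
            errors.append(f"{field} must be none, {built_in}, or custom")
        # A custom sound needs a non-empty URL in the matching *_url field.
        if choice == "custom" and not ambience.get(f"{field}_url"):
            errors.append(f"{field}_url is required when {field} is custom")
        volume = ambience.get(f"{field}_volume", DEFAULT_VOLUME[field])
        if not 0.0 <= volume <= 1.0:
            errors.append(f"{field}_volume must be between 0.0 and 1.0")
    return errors


print(validate_ambience({"ambient_sound": "custom", "thinking_sound": "typing"}))
```

The 25 MB size limit and the file type are not checked here, since they depend on the uploaded file rather than on the config dict.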
s2s — Speech-to-Speech (Gemini Live or OpenAI Realtime)
"s2s": {
"mode": "s2s",
"provider": "google_realtime",
"model": "gemini-2.5-flash-native-audio-preview-12-2025",
"voice": "Puck",
"temperature": 0.8,
"max_output_tokens": 0,
"modality": "AUDIO",
"affective_dialog": false,
"proactivity": false,
"vertexai": false,
"use_byok_key": false,
"use_custom_tts": false
}
Or for OpenAI Realtime:
"s2s": {
"mode": "s2s",
"provider": "openai_realtime",
"model": "gpt-realtime",
"voice": "marin",
"temperature": 0.8,
"modality": "AUDIO",
"use_byok_key": false,
"use_custom_tts": false
}
Mode
| Field | Type | Default | Notes |
|---|---|---|---|
| mode | string | cascaded | cascaded (classic STT → LLM → TTS) or s2s (single realtime model). |
Provider + model + voice
| Field | Type | Default | Notes |
|---|---|---|---|
| provider | string | google_realtime | google_realtime (Gemini Live) or openai_realtime (OpenAI Realtime). The backend also infers this from the model id when omitted (gpt-* → OpenAI; everything else → Google). |
| model | string | gemini-2.5-flash-native-audio-preview-12-2025 | Gemini: gemini-2.5-flash-native-audio-preview-12-2025 (recommended) or gemini-3.1-flash-live-preview (experimental, single-turn). OpenAI: gpt-realtime (GA) or gpt-realtime-mini. |
| voice | string | Puck | Gemini: any of 30 voices (Puck, Kore, Charon, Sulafat, Achird, Despina, …). OpenAI: marin (default), cedar, or any legacy voice: alloy, ash, ballad, coral, echo, sage, shimmer, verse. |
| temperature | float | 0.8 | Gemini accepts 0.0–2.0. OpenAI Realtime accepts 0.6–1.2 and the backend clamps out-of-range values defensively (the upstream plugin treats this field as deprecated on Realtime v1; set it for forward compatibility). |
| max_output_tokens | int | 0 (no cap) | Applied at construction on Gemini Live. The OpenAI Realtime plugin doesn’t expose a constructor cap; this setting is currently advisory on OpenAI. |
| modality | string | AUDIO | AUDIO (native S2S) or TEXT (auto-set when use_custom_tts is true). |
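The provider inference and temperature clamping described in the table can be sketched as follows. This is a minimal illustration, not the backend's code: the function names are hypothetical, and the sketch clamps only on OpenAI because that is the only clamping the table describes.

```python
OPENAI_TEMPERATURE_RANGE = (0.6, 1.2)


def infer_provider(model: str) -> str:
    """Infer the S2S provider from the model id when provider is omitted."""
    # gpt-* maps to OpenAI; everything else maps to Google.
    return "openai_realtime" if model.startswith("gpt-") else "google_realtime"


def effective_temperature(provider: str, temperature: float = 0.8) -> float:
    """Clamp temperature into the OpenAI Realtime range; leave Gemini as-is."""
    if provider == "openai_realtime":
        low, high = OPENAI_TEMPERATURE_RANGE
        return min(max(temperature, low), high)
    return temperature


provider = infer_provider("gpt-realtime-mini")
print(provider, effective_temperature(provider, 2.0))
```

Setting provider explicitly avoids relying on the inference, which matters if a future model id does not follow the gpt-* pattern.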
Conversational tuning (Gemini 2.5 only — silently ignored on Gemini 3.1 and all OpenAI models)
| Field | Type | Default | Notes |
|---|---|---|---|
| affective_dialog | bool | false | Adjusts tone to caller emotion. |
| proactivity | bool | false | Model may stay silent on background chatter. |
Hosting (mutually exclusive)
| Field | Type | Default | Notes |
|---|---|---|---|
| vertexai | bool | false | Gemini only. Routes through your GCP project. Requires GOOGLE_APPLICATION_CREDENTIALS set on the deployment. Auto-disabled when an OpenAI model is selected. |
| project | string | "" | Gemini Vertex AI only. Optional override. Falls back to GOOGLE_CLOUD_PROJECT env var or service-account file. |
| location | string | "" | Gemini Vertex AI only. Optional override. Falls back to GOOGLE_CLOUD_LOCATION (default us-central1). |
| use_byok_key | bool | false | When true with api_key set, bills the chosen provider directly via your own key. Mutually exclusive with vertexai. |
| api_key | string | "" | Provider-specific BYO key. Gemini keys start with AIza… (aistudio.google.com/app/apikey). OpenAI keys start with sk-… (platform.openai.com/api-keys). Stored encrypted. |
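The hosting constraints (vertexai and use_byok_key are mutually exclusive, and BYOK only takes effect with a key set) lend themselves to a client-side check. The sketch below is a hypothetical helper, not part of the API; the key-prefix check in particular is a convenience heuristic based on the prefixes listed in the table, and the server may validate differently.

```python
KEY_PREFIX = {"google_realtime": "AIza", "openai_realtime": "sk-"}


def validate_hosting(s2s: dict) -> list:
    """Return a list of hosting-related problems in an s2s dict."""
    errors = []
    provider = s2s.get("provider", "google_realtime")
    vertexai = s2s.get("vertexai", False)
    byok = s2s.get("use_byok_key", False)
    api_key = s2s.get("api_key", "")

    if vertexai and byok:
        errors.append("vertexai and use_byok_key are mutually exclusive")
    if byok and not api_key:
        errors.append("use_byok_key has no effect without api_key")
    expected = KEY_PREFIX.get(provider)
    if byok and api_key and expected and not api_key.startswith(expected):
        errors.append(f"api_key for {provider} is expected to start with {expected}")
    return errors


print(validate_hosting({"vertexai": True, "use_byok_key": True, "api_key": "AIza-example"}))
```

The sketch does not flag vertexai combined with an OpenAI model, because the table states that combination is auto-disabled rather than rejected.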
Half-cascade
| Field | Type | Default | Notes |
|---|---|---|---|
| use_custom_tts | bool | false | Forces modality = TEXT and routes the realtime model’s text output through the cascaded TTS plugin (provider / model / voice_id at the top level). Incompatible with Gemini native-audio models: the API rejects TEXT modality there. Works on Gemini 3.1 and both OpenAI models. |
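The interaction between use_custom_tts, modality, and the model can be summarized in a few lines. The sketch below is a hypothetical helper that assumes a native-audio Gemini model can be recognized by the substring `native-audio` in its id, as in the model names listed on this page; the backend may detect this differently.

```python
DEFAULT_MODEL = "gemini-2.5-flash-native-audio-preview-12-2025"


def resolve_modality(s2s: dict) -> str:
    """Return the modality that would apply for a given s2s dict."""
    if not s2s.get("use_custom_tts", False):
        return s2s.get("modality", "AUDIO")
    model = s2s.get("model", DEFAULT_MODEL)
    # Assumption: native-audio models are identified by their model id.
    if "native-audio" in model:
        raise ValueError(
            f"use_custom_tts is incompatible with native-audio model {model!r}"
        )
    return "TEXT"  # half-cascade: realtime text output goes to the TTS plugin


print(resolve_modality({"model": "gpt-realtime", "use_custom_tts": True}))
```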
Gemini-only — thinking + custom turn detection
| Field | Type | Default | Notes |
|---|---|---|---|
| thinking_include_thoughts | bool | false | Native-audio Gemini models think before responding. When false (default), the chain-of-thought stays internal; true forwards thoughts as transcripts. |
| custom_turn_detection | bool | false | Disables Gemini’s built-in VAD and routes input audio through LiveKit’s MultilingualModel + the cascaded STT picker (stt_provider / stt_model at the agent top-level). Gemini Live still speaks the reply. Turn-detection mode + endpointing come from the cascaded turn_detection_mode field. |
OpenAI-only — turn detection
| Field | Type | Default | Notes |
|---|---|---|---|
| oa_turn_detection_mode | string | semantic_vad | semantic_vad (classifier on the caller’s words; the default), server_vad (silence-based), or none (defer to the plugin’s built-in default). |
| oa_semantic_eagerness | string | auto | Semantic VAD only. auto (≈ medium), low, medium, or high. Controls how aggressively the model commits to end-of-turn. |
| oa_server_threshold | float | 0.5 | Server VAD only. 0.0–1.0 energy threshold for considering audio as speech. Higher = requires louder audio (noise-tolerant). |
| oa_server_prefix_padding_ms | int | 300 | Server VAD only. Milliseconds of audio kept before detected speech onset (captures plosives / soft starts). |
| oa_server_silence_duration_ms | int | 500 | Server VAD only. Milliseconds of silence required to mark end-of-turn. Lower = snappier replies; higher = safer for hesitant callers. |
| oa_create_response | bool | true | Auto-generate a reply when turn detection fires. Set false to call generate_reply() manually from custom code. |
| oa_interrupt_response | bool | true | Caller speech during the agent’s reply interrupts the current response. |
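Because several of these fields only apply in one turn-detection mode, it can help to see which settings are actually in play for a given config. The sketch below fills in the documented defaults and drops fields marked as belonging to another mode. It is a hypothetical helper; treating oa_create_response and oa_interrupt_response as applying in every mode is an assumption drawn from the table not restricting them.

```python
DEFAULTS = {
    "oa_turn_detection_mode": "semantic_vad",
    "oa_semantic_eagerness": "auto",
    "oa_server_threshold": 0.5,
    "oa_server_prefix_padding_ms": 300,
    "oa_server_silence_duration_ms": 500,
    "oa_create_response": True,
    "oa_interrupt_response": True,
}
MODE_FIELDS = {
    "semantic_vad": ["oa_semantic_eagerness"],
    "server_vad": [
        "oa_server_threshold",
        "oa_server_prefix_padding_ms",
        "oa_server_silence_duration_ms",
    ],
    "none": [],
}
COMMON_FIELDS = ["oa_create_response", "oa_interrupt_response"]


def effective_turn_detection(s2s: dict) -> dict:
    """Return only the oa_* settings relevant to the selected mode."""
    merged = {**DEFAULTS, **{k: v for k, v in s2s.items() if k in DEFAULTS}}
    mode = merged["oa_turn_detection_mode"]
    if mode not in MODE_FIELDS:
        raise ValueError(f"unknown oa_turn_detection_mode: {mode!r}")
    keys = ["oa_turn_detection_mode"] + MODE_FIELDS[mode] + COMMON_FIELDS
    return {key: merged[key] for key in keys}


print(effective_turn_detection({"oa_turn_detection_mode": "server_vad",
                                "oa_server_silence_duration_ms": 300}))
```

In this example the silence window is shortened for snappier replies while the threshold and prefix padding keep their defaults.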
Greeting behaviour by mode
| Configuration | What happens at call start |
|---|---|
| Cascaded | TTS plugin speaks the configured greeting via session.say(). |
| S2S half-cascade (any model) | TTS plugin speaks the greeting (same as Cascaded). |
| S2S native-audio + Gemini 2.5 | Gemini Live speaks the greeting on its first turn (greeting text is baked into its system instructions). |
| S2S native-audio + OpenAI Realtime | The realtime model speaks the greeting on its first turn (greeting text is passed inline through the agent’s initial generate_reply trigger). |
| S2S native-audio + Gemini 3.1 | No agent-initiated greeting; the user must speak first (Gemini 3.1 doesn’t support generate_reply). |
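The greeting table can be read as a small decision function over mode, use_custom_tts, and model. The sketch below encodes it in Python; the function name and return labels are hypothetical, and matching model families by id prefix (`gpt-`, `gemini-2.5`, `gemini-3.1`) is an assumption based on the model names listed on this page.

```python
DEFAULT_MODEL = "gemini-2.5-flash-native-audio-preview-12-2025"


def greeting_behaviour(mode: str = "cascaded",
                       model: str = DEFAULT_MODEL,
                       use_custom_tts: bool = False) -> str:
    """Return a short label for who speaks the greeting at call start."""
    if mode == "cascaded" or use_custom_tts:
        return "tts_plugin"  # cascaded and half-cascade both use the TTS plugin
    if model.startswith("gpt-"):
        return "realtime_model_via_generate_reply"
    if model.startswith("gemini-3.1"):
        return "no_greeting_user_speaks_first"
    if model.startswith("gemini-2.5"):
        return "gemini_live_via_system_instructions"
    raise ValueError(f"greeting behaviour is not documented for model {model!r}")


print(greeting_behaviour(mode="s2s", model="gemini-3.1-flash-live-preview"))
```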
Billing
The cost-breakdown widget in the dashboard reflects the active mode and
the selected S2S provider. The collapsed row is labelled Gemini Live
for Gemini models and GPT Realtime / GPT Realtime Mini for OpenAI:
- Cascaded: Hosting + STT + TTS + LLM + Telephony
- S2S native: Hosting + <realtime row> + Telephony (STT/TTS collapsed into the realtime row)
- S2S half-cascade: Hosting + <realtime row> + TTS + Telephony
- S2S BYOK (Vertex AI, BYO Gemini key, or BYO OpenAI key): the realtime row is Free (you pay the provider directly); only the platform fee is billed.
Per-minute list-price estimates (non-BYOK): Gemini Live ≈ ₹6.80/min,
GPT Realtime ≈ ₹9.05/min, GPT Realtime Mini ≈ ₹3.65/min.
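For a quick budget check, the per-minute estimates above can be multiplied by expected call minutes. The sketch below does only that arithmetic. The function name is hypothetical, the rates are the approximate non-BYOK figures quoted above, and this page does not say which breakdown rows those figures include, so the result should be read as a rough estimate rather than an invoice amount.

```python
# Approximate non-BYOK list-price estimates in INR per minute, from this page.
RATE_INR_PER_MIN = {
    "Gemini Live": 6.80,
    "GPT Realtime": 9.05,
    "GPT Realtime Mini": 3.65,
}


def estimate_cost_inr(option: str, minutes: float) -> float:
    """Estimate cost in INR for a number of call minutes at the listed rate."""
    if minutes < 0:
        raise ValueError("minutes must not be negative")
    return round(RATE_INR_PER_MIN[option] * minutes, 2)


for option in RATE_INR_PER_MIN:
    print(option, estimate_cost_inr(option, 10))
```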
Endpoints
- PATCH /v2/agents/{public_id}: partial update; deep-merges into agent_config.
- POST /v2/agents: create a new agent with full voice config.
- PUT /agents/update (v1): accepts the same fields as a flat voice_config dict (snake_case or camelCase).
See Update Agent and
Create Agent for example requests.
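To illustrate what a deep-merging partial update implies, the Python sketch below merges a small PATCH body into a stored config and builds (without sending) the corresponding request. Several details are assumptions for illustration only: the base URL `https://api.example.com`, the Bearer authorization header, the agent id, and the exact merge rules (for example, how lists or null values are treated) are not specified on this page.

```python
import json
import urllib.request


def deep_merge(base: dict, patch: dict) -> dict:
    """Return base with patch merged in; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and new keys overwrite or extend
    return merged


def build_patch_request(base_url: str, public_id: str, token: str, patch: dict):
    """Build a PATCH request object; the caller decides whether to send it."""
    return urllib.request.Request(
        url=f"{base_url}/v2/agents/{public_id}",
        data=json.dumps(patch).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


stored = {"voice": {"provider": "cartesia",
                    "noise_cancellation": {"engine": "gtcrn", "strength": 0.3}}}
patch = {"voice": {"noise_cancellation": {"strength": 0.5}}}

print(deep_merge(stored, patch))
request = build_patch_request("https://api.example.com", "agent_123", "YOUR_TOKEN", patch)
print(request.get_method(), request.full_url)
```

Under this merge model, only noise_cancellation.strength changes; provider and engine are preserved because the patch does not mention them.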