The voice object on the v2 Agent API (and voice_config dict on v1) configures every voice-related setting on an agent. This page documents every field as of 2026-05-11 — including Speech-to-Speech (Gemini Live + OpenAI Realtime), noise-cancellation, and audio-ambience surfaces.

Top-level shape (v2)

{
  "voice": {
    "provider": "cartesia",
    "model": "sonic-3",
    "voice_id": "f786b574-daa5-4673-aa0c-cbe3e8534c02",
    "speed": 1.0,
    "fallback_provider": "deepgram",
    "fallback_model": "aura-2",
    "noise_cancellation": { "engine": "gtcrn", "strength": 0.3 },
    "ambience": {
      "ambient_sound": "office",
      "ambient_sound_volume": 0.3,
      "thinking_sound": "typing",
      "thinking_sound_volume": 0.5
    },
    "s2s": {
      "mode": "cascaded",
      "provider": "google_realtime",
      "model": "gemini-2.5-flash-native-audio-preview-12-2025",
      "voice": "Puck",
      "temperature": 0.8
    }
  }
}
For the v1 PUT /agents/update endpoint, send the same fields as a flat dict on voice_config (snake_case OR camelCase — both are accepted by the runtime reader).
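
Since both key casings are accepted on v1, a client that stores configs in camelCase may want to normalise them before comparing or diffing. The following is a minimal sketch of such a normaliser; `to_snake_case` and `normalize_voice_config` are hypothetical helper names, and the recursion into nested dicts is an assumption rather than documented runtime behaviour.

```python
import re


def to_snake_case(key: str) -> str:
    """Convert a camelCase key such as 'voiceId' to 'voice_id'."""
    # Insert an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole key.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", key).lower()


def normalize_voice_config(voice_config: dict) -> dict:
    """Return a copy of the dict with every key in snake_case.

    Nested dicts are normalised too (an assumption made for this sketch).
    """
    normalized = {}
    for key, value in voice_config.items():
        if isinstance(value, dict):
            value = normalize_voice_config(value)
        normalized[to_snake_case(key)] = value
    return normalized
```

Keys that are already snake_case pass through unchanged, so the helper can be applied to either form.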

Cascaded TTS (default)

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| provider | string | cartesia | Cartesia / ElevenLabs / Sarvam / Aero / Inworld / Rime / Deepgram |
| model | string | provider default | e.g. sonic-3, eleven_multilingual_v2, bulbul:v3 |
| voice_id | string | | Cartesia/ElevenLabs voice ID; Sarvam voice name (e.g. shubh); etc. |
| speed | float | 1.0 | TTS playback speed |
| fallback_provider | string | | Failover provider when the primary errors |
| fallback_model | string | | Failover model |
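
The platform performs failover itself, so the sketch below only illustrates the semantics of fallback_provider and fallback_model: try the primary, and on error retry once with the fallback pair. `speak_with_fallback` and the `synthesize` callable are hypothetical names, not part of any documented SDK.

```python
from typing import Callable, Optional


def speak_with_fallback(
    text: str,
    voice: dict,
    synthesize: Callable[[str, Optional[str], str], str],
) -> str:
    """Try the primary TTS provider; on any error, retry with the fallback.

    `synthesize(provider, model, text)` is a stand-in for a real TTS call.
    """
    try:
        return synthesize(voice["provider"], voice.get("model"), text)
    except Exception:
        fallback_provider = voice.get("fallback_provider")
        if not fallback_provider:
            # No failover configured: surface the original error.
            raise
        return synthesize(fallback_provider, voice.get("fallback_model"), text)
```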

noise_cancellation

"noise_cancellation": { "engine": "gtcrn", "strength": 0.3 }
| Field | Type | Default | Values |
| --- | --- | --- | --- |
| engine | string | gtcrn | gtcrn (recommended, MIT) · hush (DeepFilterNet3 + speaker separation) · rnnoise (lightest CPU) · dtln (legacy) · none |
| strength | float (0.0-1.0) | 0.3 | Wet/dry blend. 0 = bypassed, 1 = full denoise. 0.3 is the validated multi-turn sweet spot. |
krisp and bvc are accepted for backward-compat but silently fall back to gtcrn on self-hosted deployments.
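
To make the strength setting concrete, the sketch below treats the wet/dry blend as a per-sample linear mix of the original ("dry") and denoised ("wet") signals, and models the legacy-engine fallback described above. The linear-mix formula is an assumption about what the blend means, and `resolve_engine` and `blend` are hypothetical helper names.

```python
KNOWN_ENGINES = {"gtcrn", "hush", "rnnoise", "dtln", "none"}
LEGACY_ENGINES = {"krisp", "bvc"}  # accepted for backward compatibility


def resolve_engine(engine: str = "gtcrn") -> str:
    """Map a requested engine to the one used on a self-hosted deployment."""
    if engine in LEGACY_ENGINES:
        return "gtcrn"  # silent fallback, as described for self-hosted setups
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown noise-cancellation engine: {engine!r}")
    return engine


def blend(dry: list, wet: list, strength: float = 0.3) -> list:
    """Linear wet/dry mix: 0 returns the dry signal, 1 the fully denoised one."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be within 0.0-1.0")
    if len(dry) != len(wet):
        raise ValueError("dry and wet frames must have the same length")
    return [(1.0 - strength) * d + strength * w for d, w in zip(dry, wet)]
```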

ambience

"ambience": {
  "ambient_sound": "office",
  "ambient_sound_volume": 0.3,
  "thinking_sound": "typing",
  "thinking_sound_volume": 0.5
}
| Field | Type | Default | Values |
| --- | --- | --- | --- |
| ambient_sound | string | none | none · office (built-in chatter) · custom |
| ambient_sound_url | string | "" | URL to a .mp3 / .wav (≤ 25 MB). Required when ambient_sound = custom. |
| ambient_sound_volume | float (0.0-1.0) | 0.3 | |
| thinking_sound | string | none | none · typing (built-in keyboard) · custom |
| thinking_sound_url | string | "" | Required when thinking_sound = custom |
| thinking_sound_volume | float (0.0-1.0) | 0.5 | |
The thinking sound is auto-triggered whenever the agent is in its “thinking” state — useful for masking LLM latency on tool calls and handoffs. Custom uploads can be made via the POST /upload endpoint.
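
The field rules above (allowed values, the URL requirement for custom sounds, and the 0.0-1.0 volume range) can be checked client-side before sending a request. This is a minimal sketch; `validate_ambience` is a hypothetical helper, and the API's own validation and error messages may differ.

```python
def validate_ambience(ambience: dict) -> list[str]:
    """Return human-readable problems found in an ambience block."""
    errors = []
    # Each sound kind has exactly one built-in option besides none/custom.
    for kind, builtin in (("ambient_sound", "office"), ("thinking_sound", "typing")):
        choice = ambience.get(kind, "none")
        if choice not in ("none", builtin, "custom"):
            errors.append(f"{kind} must be none, {builtin}, or custom")
        if choice == "custom" and not ambience.get(f"{kind}_url"):
            errors.append(f"{kind}_url is required when {kind} = custom")
        volume = ambience.get(f"{kind}_volume")
        if volume is not None and not 0.0 <= volume <= 1.0:
            errors.append(f"{kind}_volume must be within 0.0-1.0")
    return errors
```

An empty list means the block passed these checks.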

s2s — Speech-to-Speech (Gemini Live or OpenAI Realtime)

"s2s": {
  "mode": "s2s",
  "provider": "google_realtime",
  "model": "gemini-2.5-flash-native-audio-preview-12-2025",
  "voice": "Puck",
  "temperature": 0.8,
  "max_output_tokens": 0,
  "modality": "AUDIO",
  "affective_dialog": false,
  "proactivity": false,
  "vertexai": false,
  "use_byok_key": false,
  "use_custom_tts": false
}
Or for OpenAI Realtime:
"s2s": {
  "mode": "s2s",
  "provider": "openai_realtime",
  "model": "gpt-realtime",
  "voice": "marin",
  "temperature": 0.8,
  "modality": "AUDIO",
  "use_byok_key": false,
  "use_custom_tts": false
}

Mode

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| mode | string | cascaded | cascaded (classic STT → LLM → TTS) or s2s (single realtime model). |

Provider + model + voice

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| provider | string | google_realtime | google_realtime (Gemini Live) or openai_realtime (OpenAI Realtime). The backend also infers this from the model id when omitted (gpt-* → OpenAI; everything else → Google). |
| model | string | gemini-2.5-flash-native-audio-preview-12-2025 | Gemini: gemini-2.5-flash-native-audio-preview-12-2025 (recommended) or gemini-3.1-flash-live-preview (experimental, single-turn). OpenAI: gpt-realtime (GA) or gpt-realtime-mini. |
| voice | string | Puck | Gemini: any of 30 voices (Puck, Kore, Charon, Sulafat, Achird, Despina, …). OpenAI: marin (default), cedar, or any legacy voice: alloy, ash, ballad, coral, echo, sage, shimmer, verse. |
| temperature | float | 0.8 | Gemini accepts 0.0-2.0. OpenAI Realtime accepts 0.6-1.2, and the backend clamps out-of-range values defensively (the upstream plugin treats this field as deprecated on Realtime v1; set it for forward compatibility). |
| max_output_tokens | int | 0 (no cap) | Applied at construction on Gemini Live. The OpenAI Realtime plugin doesn’t expose a constructor cap; this setting is currently advisory on OpenAI. |
| modality | string | AUDIO | AUDIO (native S2S) or TEXT (auto-set when use_custom_tts is true). |
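
The provider inference rule (gpt-* → OpenAI, everything else → Google) and the temperature limits can be sketched as follows, reading the ranges as 0.0 to 2.0 for Gemini and 0.6 to 1.2 for OpenAI Realtime. `infer_provider` and `clamp_temperature` are hypothetical names; clamping is described only for OpenAI, so applying it to Gemini as well is an assumption of this sketch.

```python
from typing import Optional

# Accepted temperature range per S2S provider.
TEMPERATURE_RANGES = {
    "google_realtime": (0.0, 2.0),
    "openai_realtime": (0.6, 1.2),
}


def infer_provider(model: str, provider: Optional[str] = None) -> str:
    """Use the explicit provider if given, otherwise infer it from the model id."""
    if provider:
        return provider
    return "openai_realtime" if model.startswith("gpt-") else "google_realtime"


def clamp_temperature(provider: str, temperature: float = 0.8) -> float:
    """Clamp temperature into the provider's accepted range."""
    low, high = TEMPERATURE_RANGES[provider]
    return min(max(temperature, low), high)
```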

Conversational tuning (Gemini 2.5 only — silently ignored on Gemini 3.1 and all OpenAI models)

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| affective_dialog | bool | false | Adjusts tone to caller emotion. |
| proactivity | bool | false | Model may stay silent on background chatter. |

Hosting (mutually exclusive)

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| vertexai | bool | false | Gemini only. Routes through your GCP project. Requires GOOGLE_APPLICATION_CREDENTIALS set on the deployment. Auto-disabled when an OpenAI model is selected. |
| project | string | "" | Gemini Vertex AI only. Optional override. Falls back to GOOGLE_CLOUD_PROJECT env var or service-account file. |
| location | string | "" | Gemini Vertex AI only. Optional override. Falls back to GOOGLE_CLOUD_LOCATION (default us-central1). |
| use_byok_key | bool | false | When true with api_key set, bills the chosen provider directly via your own key. Mutually exclusive with vertexai. |
| api_key | string | "" | Provider-specific BYO key. Gemini keys start with AIza… (aistudio.google.com/app/apikey). OpenAI keys start with sk-… (platform.openai.com/api-keys). Stored encrypted. |
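
A client-side check of the hosting rules might look like the sketch below. `validate_hosting` is a hypothetical helper; treating use_byok_key without an api_key as a problem, and checking key prefixes, are assumptions drawn from the notes above rather than documented API errors.

```python
def validate_hosting(s2s: dict) -> list[str]:
    """Return problems found in the hosting-related fields of an s2s block."""
    errors = []
    provider = s2s.get("provider", "google_realtime")
    vertexai = bool(s2s.get("vertexai", False))
    byok = bool(s2s.get("use_byok_key", False))
    api_key = s2s.get("api_key", "")

    if vertexai and byok:
        errors.append("vertexai and use_byok_key are mutually exclusive")
    if byok and not api_key:
        # Assumption: BYOK only takes effect when a key is supplied.
        errors.append("use_byok_key has no effect without api_key")
    if byok and api_key:
        expected = "sk-" if provider == "openai_realtime" else "AIza"
        if not api_key.startswith(expected):
            errors.append(f"api_key for {provider} should start with {expected}")
    return errors
```

The vertexai flag combined with an OpenAI model is not flagged here, because the backend is described as auto-disabling it in that case.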

Half-cascade

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| use_custom_tts | bool | false | Forces modality = TEXT and routes the realtime model’s text output through the cascaded TTS plugin (provider / model / voice_id at the top level). Incompatible with Gemini native-audio models, where the API rejects TEXT modality. Works on Gemini 3.1 and both OpenAI models. |
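
The interaction between use_custom_tts and modality can be sketched as a small resolver. `resolve_modality` is a hypothetical name, and detecting a native-audio model by the substring "native-audio" in the model id is an assumption made for illustration.

```python
def resolve_modality(s2s: dict) -> str:
    """Return the effective modality for an s2s block."""
    if not s2s.get("use_custom_tts", False):
        return s2s.get("modality", "AUDIO")
    model = s2s.get("model", "")
    # Assumption: native-audio models are recognisable from their id.
    if "native-audio" in model:
        raise ValueError(
            "use_custom_tts is incompatible with Gemini native-audio models"
        )
    # Half-cascade: the realtime model emits text, the TTS plugin speaks it.
    return "TEXT"
```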

Gemini-only — thinking + custom turn detection

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| thinking_include_thoughts | bool | false | Native-audio Gemini models think before responding. When false (default), the chain-of-thought stays internal; true forwards thoughts as transcripts. |
| custom_turn_detection | bool | false | Disables Gemini’s built-in VAD and routes input audio through LiveKit’s MultilingualModel + the cascaded STT picker (stt_provider / stt_model at the agent top-level). Gemini Live still speaks the reply. Turn-detection mode + endpointing come from the cascaded turn_detection_mode field. |

OpenAI-only — turn detection

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| oa_turn_detection_mode | string | semantic_vad | semantic_vad (classifier on the caller’s words; the default), server_vad (silence-based), or none (defer to the plugin’s built-in default). |
| oa_semantic_eagerness | string | auto | Semantic VAD only. auto (≈ medium), low, medium, or high. Controls how aggressively the model commits to end-of-turn. |
| oa_server_threshold | float | 0.5 | Server VAD only. 0.0-1.0 energy threshold for considering audio as speech. Higher = requires louder audio (noise-tolerant). |
| oa_server_prefix_padding_ms | int | 300 | Server VAD only. Milliseconds of audio kept before detected speech onset (captures plosives / soft starts). |
| oa_server_silence_duration_ms | int | 500 | Server VAD only. Milliseconds of silence required to mark end-of-turn. Lower = snappier replies; higher = safer for hesitant callers. |
| oa_create_response | bool | true | Auto-generate a reply when turn detection fires. Set false to call generate_reply() manually from custom code. |
| oa_interrupt_response | bool | true | Caller speech during the agent’s reply interrupts the current response. |
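
One way to picture how the oa_* fields group together is to collapse them into a single turn-detection object per mode. The sketch below is illustrative only: `build_turn_detection` is a hypothetical helper, and the output key names are loosely modelled on a realtime-session turn_detection object, not taken from this platform's backend.

```python
from typing import Optional


def build_turn_detection(s2s: dict) -> Optional[dict]:
    """Collapse the oa_* fields into one turn-detection dict, or None."""
    mode = s2s.get("oa_turn_detection_mode", "semantic_vad")
    if mode == "none":
        return None  # defer to the plugin's built-in default

    # Fields that apply to both VAD modes.
    common = {
        "create_response": s2s.get("oa_create_response", True),
        "interrupt_response": s2s.get("oa_interrupt_response", True),
    }
    if mode == "semantic_vad":
        return {
            "type": "semantic_vad",
            "eagerness": s2s.get("oa_semantic_eagerness", "auto"),
            **common,
        }
    if mode == "server_vad":
        return {
            "type": "server_vad",
            "threshold": s2s.get("oa_server_threshold", 0.5),
            "prefix_padding_ms": s2s.get("oa_server_prefix_padding_ms", 300),
            "silence_duration_ms": s2s.get("oa_server_silence_duration_ms", 500),
            **common,
        }
    raise ValueError(f"unknown oa_turn_detection_mode: {mode!r}")
```

Server-VAD fields are ignored in semantic mode and vice versa, mirroring the "Semantic VAD only" and "Server VAD only" notes in the table.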

Greeting behaviour by mode

| Configuration | What happens at call start |
| --- | --- |
| Cascaded | TTS plugin speaks the configured greeting via session.say(). |
| S2S half-cascade (any model) | TTS plugin speaks the greeting (same as Cascaded). |
| S2S native-audio + Gemini 2.5 | Gemini Live speaks the greeting on its first turn (greeting text is baked into its system instructions). |
| S2S native-audio + OpenAI Realtime | The realtime model speaks the greeting on its first turn (greeting text is passed inline through the agent’s initial generate_reply trigger). |
| S2S native-audio + Gemini 3.1 | No agent-initiated greeting; the user must speak first (Gemini 3.1 doesn’t support generate_reply). |
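
The greeting table reads as a small decision function over mode, use_custom_tts, and model. In the sketch below, `greeting_path` and its return labels are hypothetical, and matching models by id prefix ("gpt-", "gemini-3.1") is an assumption.

```python
def greeting_path(s2s: dict) -> str:
    """Return a label describing how the greeting is delivered at call start."""
    if s2s.get("mode", "cascaded") != "s2s":
        return "tts_plugin"  # Cascaded: spoken via session.say()
    if s2s.get("use_custom_tts", False):
        return "tts_plugin"  # S2S half-cascade behaves like Cascaded
    model = s2s.get("model", "gemini-2.5-flash-native-audio-preview-12-2025")
    if model.startswith("gpt-"):
        return "openai_generate_reply"  # greeting passed through generate_reply
    if model.startswith("gemini-3.1"):
        return "user_speaks_first"  # no agent-initiated greeting
    return "gemini_system_instructions"  # greeting baked into the instructions
```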

Billing

The cost-breakdown widget in the dashboard reflects the active mode and the selected S2S provider. The collapsed row is labelled Gemini Live for Gemini models and GPT Realtime / GPT Realtime Mini for OpenAI:
  • Cascaded: Hosting + STT + TTS + LLM + Telephony
  • S2S native: Hosting + <realtime row> + Telephony (STT/TTS collapsed into the realtime row)
  • S2S half-cascade: Hosting + <realtime row> + TTS + Telephony
  • S2S BYOK (Vertex AI, BYO Gemini key, or BYO OpenAI key): the realtime row is Free (you pay the provider directly); only the platform fee is billed.
Per-minute list-price estimates (non-BYOK): Gemini Live ≈ ₹6.80/min, GPT Realtime ≈ ₹9.05/min, GPT Realtime Mini ≈ ₹3.65/min.
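
As a rough worked example of the per-minute estimates, the sketch below computes only the realtime row of the breakdown; hosting, telephony, and TTS rates are not listed on this page and are left out. `realtime_row_label` and `estimate_realtime_cost` are hypothetical names, and mapping model ids to row labels by prefix and suffix is an assumption.

```python
# List-price estimates in INR per minute (non-BYOK), as quoted above.
REALTIME_RATE_INR_PER_MIN = {
    "Gemini Live": 6.80,
    "GPT Realtime": 9.05,
    "GPT Realtime Mini": 3.65,
}


def realtime_row_label(model: str) -> str:
    """Map a model id to the label of the collapsed realtime row."""
    if model.startswith("gpt-"):
        return "GPT Realtime Mini" if model.endswith("-mini") else "GPT Realtime"
    return "Gemini Live"


def estimate_realtime_cost(model: str, minutes: float, byok: bool = False) -> float:
    """Estimate the realtime row in INR; BYOK calls show this row as free."""
    if byok:
        return 0.0
    rate = REALTIME_RATE_INR_PER_MIN[realtime_row_label(model)]
    return round(rate * minutes, 2)
```

Under these estimates a 10-minute GPT Realtime call puts about ₹90.50 in the realtime row, against about ₹36.50 for GPT Realtime Mini.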

Endpoints

  • PATCH /v2/agents/{public_id} — partial update; deep-merges into agent_config.
  • POST /v2/agents — create a new agent with full voice config.
  • PUT /agents/update (v1) — accepts the same fields as a flat voice_config dict (snake_case or camelCase).
See Update Agent and Create Agent for example requests.
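
The deep-merge behaviour of the PATCH endpoint means a partial voice object only overwrites the keys it names. A minimal sketch of that semantics follows; `deep_merge` is a hypothetical helper, and replacing scalars and lists wholesale is an assumption about how the server merges, as is leaving null handling out of scope.

```python
import copy


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a new dict with `patch` merged recursively into `base`."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys replace the existing value.
            merged[key] = copy.deepcopy(value)
    return merged
```

For example, patching `{"voice": {"noise_cancellation": {"strength": 0.5}}}` onto an existing config would change only the strength and leave the engine, provider, and other voice fields as they were.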