Voice Config Reference

The voice object on the v2 Agent API (and the voice_config dict on v1) configures every voice-related setting on an agent. This page documents every field as of 2026-05-11, including the Speech-to-Speech (Gemini Live and OpenAI Realtime), noise-cancellation, and audio-ambience settings.

Top-level shape (v2)

{
  "voice": {
    "provider": "cartesia",
    "model": "sonic-3",
    "voice_id": "f786b574-daa5-4673-aa0c-cbe3e8534c02",
    "speed": 1.0,
    "fallback_provider": "deepgram",
    "fallback_model": "aura-2",
    "noise_cancellation": { "engine": "gtcrn", "strength": 0.3 },
    "ambience": {
      "ambient_sound": "office",
      "ambient_sound_volume": 0.3,
      "thinking_sound": "typing",
      "thinking_sound_volume": 0.5
    },
    "s2s": {
      "mode": "cascaded",
      "provider": "google_realtime",
      "model": "gemini-2.5-flash-native-audio-preview-12-2025",
      "voice": "Puck",
      "temperature": 0.8
    }
  }
}

For the v1 PUT /agents/update endpoint, send the same fields as a flat dict on voice_config. Keys may be snake_case or camelCase; the runtime reader accepts both.
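As a rough illustration of the two key styles, the sketch below converts snake_case keys to lowerCamelCase for a flat dict. It assumes the conventional conversion (`voice_id` becomes `voiceId`); the exact camelCase spellings the runtime reader accepts are not listed on this page, and `to_camel` and `camelize_keys` are hypothetical helper names.

```python
def to_camel(key: str) -> str:
    """Convert a snake_case key to lowerCamelCase (voice_id -> voiceId)."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(voice_config: dict) -> dict:
    """Return a copy of a flat voice_config dict with camelCase keys.

    Assumption: only top-level keys are renamed; values are left untouched.
    """
    return {to_camel(key): value for key, value in voice_config.items()}


flat = {"provider": "cartesia", "voice_id": "shubh", "fallback_provider": "deepgram"}
camel = camelize_keys(flat)
```

Since both styles are accepted, a conversion like this is only needed when a client codebase standardises on one of them.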

Cascaded TTS (default)

Field	Type	Default	Notes
`provider`	string	`cartesia`	Cartesia / ElevenLabs / Sarvam / Aero / Inworld / Rime / Deepgram
`model`	string	provider default	e.g. `sonic-3`, `eleven_multilingual_v2`, `bulbul:v3`
`voice_id`	string	—	Cartesia/ElevenLabs voice ID; Sarvam voice name (e.g. `shubh`); etc.
`speed`	float	`1.0`	TTS playback speed
`fallback_provider`	string	—	Failover provider when the primary errors
`fallback_model`	string	—	Failover model

`noise_cancellation`

"noise_cancellation": { "engine": "gtcrn", "strength": 0.3 }

Field	Type	Default	Values
`engine`	string	`gtcrn`	`gtcrn` (recommended, MIT) · `hush` (DeepFilterNet3 + speaker separation) · `rnnoise` (lightest CPU) · `dtln` (legacy) · `none`
`strength`	float (0.0-1.0)	`0.3`	Wet/dry blend. 0 = bypassed, 1 = full denoise. 0.3 is the validated multi-turn sweet spot.

krisp and bvc are still accepted for backward compatibility, but on self-hosted deployments they silently fall back to gtcrn.
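The sketch below shows one way a client could normalise this object before sending it, plus what a wet/dry blend conventionally means. Both functions are hypothetical helpers, not part of the API. Mapping `krisp` and `bvc` to `gtcrn` mirrors the self-hosted fallback described above, and the linear blend formula is the usual definition of a wet/dry mix, assumed here rather than confirmed for the backend.

```python
from typing import Optional

KNOWN_ENGINES = {"gtcrn", "hush", "rnnoise", "dtln", "none"}
LEGACY_ALIASES = {"krisp", "bvc"}  # accepted, but fall back to gtcrn when self-hosted


def normalize_noise_cancellation(cfg: Optional[dict]) -> dict:
    """Apply the documented defaults and reject out-of-range values."""
    cfg = cfg or {}
    engine = cfg.get("engine", "gtcrn")
    if engine in LEGACY_ALIASES:
        engine = "gtcrn"
    elif engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown noise_cancellation engine: {engine!r}")
    strength = float(cfg.get("strength", 0.3))
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return {"engine": engine, "strength": strength}


def wet_dry_blend(dry: list, wet: list, strength: float) -> list:
    """Linear mix: strength 0 returns the raw (dry) samples, 1 the denoised (wet) ones."""
    return [(1.0 - strength) * d + strength * w for d, w in zip(dry, wet)]
```

For example, `normalize_noise_cancellation({"engine": "krisp"})` yields the `gtcrn` engine at the default strength of 0.3.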

`ambience`

"ambience": {
  "ambient_sound": "office",
  "ambient_sound_volume": 0.3,
  "thinking_sound": "typing",
  "thinking_sound_volume": 0.5
}

Field	Type	Default	Values
`ambient_sound`	string	`none`	`none` · `office` (built-in chatter) · `custom`
`ambient_sound_url`	string	`""`	URL to a `.mp3` / `.wav` (≤ 25 MB). Required when `ambient_sound` = `custom`.
`ambient_sound_volume`	float (0.0-1.0)	`0.3`	Playback volume of the ambient sound.
`thinking_sound`	string	`none`	`none` · `typing` (built-in keyboard) · `custom`
`thinking_sound_url`	string	`""`	Required when `thinking_sound` = `custom`
`thinking_sound_volume`	float (0.0-1.0)	`0.5`	Playback volume of the thinking sound.

The thinking sound plays automatically whenever the agent enters its “thinking” state, which helps mask LLM latency during tool calls and handoffs. Upload custom audio through the POST /upload endpoint.
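The rules in the table above (allowed values, the URL required for `custom`, and the volume range) can be checked client-side before a request is sent. The following is a minimal sketch; `validate_ambience` is a hypothetical helper, and the server may enforce these rules differently.

```python
def validate_ambience(ambience: dict) -> list:
    """Return a list of human-readable problems; an empty list means the object looks valid."""
    errors = []
    # Each sound kind has one built-in option besides "none" and "custom".
    for kind, builtin in (("ambient_sound", "office"), ("thinking_sound", "typing")):
        choice = ambience.get(kind, "none")
        if choice not in ("none", builtin, "custom"):
            errors.append(f"{kind} must be 'none', '{builtin}', or 'custom'")
        if choice == "custom" and not ambience.get(f"{kind}_url"):
            errors.append(f"{kind}_url is required when {kind} is 'custom'")
        volume = ambience.get(f"{kind}_volume")
        if volume is not None and not 0.0 <= volume <= 1.0:
            errors.append(f"{kind}_volume must be between 0.0 and 1.0")
    return errors
```

Calling it on `{"ambient_sound": "custom"}` reports the missing `ambient_sound_url`.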

`s2s` — Speech-to-Speech (Gemini Live or OpenAI Realtime)

"s2s": {
  "mode": "s2s",
  "provider": "google_realtime",
  "model": "gemini-2.5-flash-native-audio-preview-12-2025",
  "voice": "Puck",
  "temperature": 0.8,
  "max_output_tokens": 0,
  "modality": "AUDIO",
  "affective_dialog": false,
  "proactivity": false,
  "vertexai": false,
  "use_byok_key": false,
  "use_custom_tts": false
}

Or for OpenAI Realtime:

"s2s": {
  "mode": "s2s",
  "provider": "openai_realtime",
  "model": "gpt-realtime",
  "voice": "marin",
  "temperature": 0.8,
  "modality": "AUDIO",
  "use_byok_key": false,
  "use_custom_tts": false
}

Mode

Field	Type	Default	Notes
`mode`	string	`cascaded`	`cascaded` (classic STT → LLM → TTS) or `s2s` (single realtime model).

Provider + model + voice

Field	Type	Default	Notes
`provider`	string	`google_realtime`	`google_realtime` (Gemini Live) or `openai_realtime` (OpenAI Realtime). The backend also infers this from the `model` id when omitted (`gpt-*` → OpenAI; everything else → Google).
`model`	string	`gemini-2.5-flash-native-audio-preview-12-2025`	Gemini: `gemini-2.5-flash-native-audio-preview-12-2025` (recommended) or `gemini-3.1-flash-live-preview` (experimental, single-turn). OpenAI: `gpt-realtime` (GA) or `gpt-realtime-mini`.
`voice`	string	`Puck`	Gemini: any of 30 voices (Puck, Kore, Charon, Sulafat, Achird, Despina, …). OpenAI: `marin` (default), `cedar`, or any legacy voice — `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`.
`temperature`	float	`0.8`	Gemini accepts `0.0`–`2.0`. OpenAI Realtime accepts `0.6`–`1.2`, and the backend clamps out-of-range values defensively. The upstream plugin treats this field as deprecated on Realtime v1; set it anyway for forward compatibility.
`max_output_tokens`	int	`0` (no cap)	Applied at construction on Gemini Live. The OpenAI Realtime plugin doesn’t expose a constructor cap; this setting is currently advisory on OpenAI.
`modality`	string	`AUDIO`	`AUDIO` (native S2S) or `TEXT` (auto-set when `use_custom_tts` is true).
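The provider inference and temperature ranges described in this table can be mirrored on the client. The sketch below is an approximation under stated assumptions: the function names are hypothetical, and clamping on Gemini is an assumption (the page only says the backend clamps for OpenAI Realtime).

```python
# Accepted temperature ranges per provider, as listed in the table above.
TEMPERATURE_RANGES = {
    "google_realtime": (0.0, 2.0),
    "openai_realtime": (0.6, 1.2),
}


def infer_provider(model: str, provider: str = "") -> str:
    """Use the explicit provider if given; otherwise gpt-* means OpenAI, anything else Google."""
    if provider:
        return provider
    return "openai_realtime" if model.startswith("gpt-") else "google_realtime"


def clamp_temperature(provider: str, temperature: float = 0.8) -> float:
    """Pull the temperature into the provider's accepted range."""
    low, high = TEMPERATURE_RANGES[provider]
    return min(max(temperature, low), high)


provider = infer_provider("gpt-realtime-mini")
temperature = clamp_temperature(provider, 1.5)
```

Here a requested temperature of 1.5 on `gpt-realtime-mini` ends up at the OpenAI upper bound of 1.2.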

Conversational tuning (Gemini 2.5 only — silently ignored on Gemini 3.1 and all OpenAI models)

Field	Type	Default	Notes
`affective_dialog`	bool	`false`	Adjusts tone to caller emotion.
`proactivity`	bool	`false`	Model may stay silent on background chatter.

Hosting (mutually exclusive)

Field	Type	Default	Notes
`vertexai`	bool	`false`	Gemini only. Routes through your GCP project. Requires `GOOGLE_APPLICATION_CREDENTIALS` set on the deployment. Auto-disabled when an OpenAI model is selected.
`project`	string	`""`	Gemini Vertex AI only. Optional override. Falls back to `GOOGLE_CLOUD_PROJECT` env var or service-account file.
`location`	string	`""`	Gemini Vertex AI only. Optional override. Falls back to `GOOGLE_CLOUD_LOCATION` (default `us-central1`).
`use_byok_key`	bool	`false`	When true with `api_key` set, bills the chosen provider directly via your own key. Mutually exclusive with `vertexai`.
`api_key`	string	`""`	Provider-specific BYO key. Gemini keys start with `AIza…` (aistudio.google.com/app/apikey). OpenAI keys start with `sk-…` (platform.openai.com/api-keys). Stored encrypted.
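Because these hosting options interact, a pre-flight check can catch conflicting combinations early. The following sketch encodes the constraints from the table; `check_hosting` is a hypothetical helper, and treating `use_byok_key` without `api_key` as a problem is an assumption drawn from the "when true with `api_key` set" wording.

```python
def check_hosting(s2s: dict) -> list:
    """Return a list of conflicts in the hosting-related s2s fields."""
    problems = []
    provider = s2s.get("provider", "google_realtime")
    vertexai = bool(s2s.get("vertexai", False))
    byok = bool(s2s.get("use_byok_key", False))
    if vertexai and byok:
        problems.append("vertexai and use_byok_key are mutually exclusive")
    if vertexai and provider == "openai_realtime":
        problems.append("vertexai is Gemini only and is auto-disabled for OpenAI models")
    if byok and not s2s.get("api_key"):
        # Assumption: BYOK billing only takes effect when a key is supplied.
        problems.append("use_byok_key is set but api_key is empty")
    return problems
```

An empty list means none of the documented conflicts were found; it does not guarantee that the credentials themselves are valid.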

Half-cascade

Field	Type	Default	Notes
`use_custom_tts`	bool	`false`	Forces `modality = TEXT` and routes the realtime model’s text output through the cascaded TTS plugin (`provider` / `model` / `voice_id` at the top level). Incompatible with Gemini native-audio models — the API rejects TEXT modality there. Works on Gemini 3.1 and both OpenAI models.
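The interaction between `use_custom_tts`, `modality`, and the model can be sketched as follows. Detecting a native-audio model by the substring `native-audio` in its id is an assumption based on the model names on this page, and `resolve_modality` is a hypothetical helper.

```python
DEFAULT_MODEL = "gemini-2.5-flash-native-audio-preview-12-2025"


def resolve_modality(s2s: dict) -> str:
    """Return the effective modality, rejecting half-cascade on native-audio models."""
    model = s2s.get("model", DEFAULT_MODEL)
    if s2s.get("use_custom_tts", False):
        # Assumption: native-audio Gemini models carry "native-audio" in their id.
        if "native-audio" in model:
            raise ValueError(f"use_custom_tts needs TEXT modality, which {model} rejects")
        return "TEXT"  # half-cascade always forces TEXT
    return s2s.get("modality", "AUDIO")
```

With `use_custom_tts` enabled on `gpt-realtime`, the function returns `TEXT` regardless of the `modality` value in the config.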

Gemini-only — thinking + custom turn detection

Field	Type	Default	Notes
`thinking_include_thoughts`	bool	`false`	Native-audio Gemini models think before responding. When `false` (default), the chain-of-thought stays internal; `true` forwards thoughts as transcripts.
`custom_turn_detection`	bool	`false`	Disables Gemini’s built-in VAD and routes input audio through LiveKit’s MultilingualModel + the cascaded STT picker (`stt_provider` / `stt_model` at the agent top-level). Gemini Live still speaks the reply. Turn-detection mode + endpointing come from the cascaded `turn_detection_mode` field.

OpenAI-only — turn detection

Field	Type	Default	Notes
`oa_turn_detection_mode`	string	`semantic_vad`	`semantic_vad` (classifier on the caller’s words — default), `server_vad` (silence-based), or `none` (defer to the plugin’s built-in default).
`oa_semantic_eagerness`	string	`auto`	Semantic VAD only. `auto` (≈ medium), `low`, `medium`, or `high`. Controls how aggressively the model commits to end-of-turn.
`oa_server_threshold`	float	`0.5`	Server VAD only. `0.0`–`1.0` energy threshold for considering audio as speech. Higher = requires louder audio (noise-tolerant).
`oa_server_prefix_padding_ms`	int	`300`	Server VAD only. Milliseconds of audio kept before detected speech onset (captures plosives / soft starts).
`oa_server_silence_duration_ms`	int	`500`	Server VAD only. Milliseconds of silence required to mark end-of-turn. Lower = snappier replies; higher = safer for hesitant callers.
`oa_create_response`	bool	`true`	Auto-generate a reply when turn detection fires. Set `false` to call `generate_reply()` manually from custom code.
`oa_interrupt_response`	bool	`true`	Caller speech during the agent’s reply interrupts the current response.
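Since several of these fields apply to only one turn-detection mode, it can help to build the set from the mode outward. The sketch below starts from the documented defaults and rejects overrides that do not belong to the chosen mode. `openai_turn_detection` is a hypothetical helper, and the assumption is that these keys sit alongside the other `s2s` fields.

```python
def openai_turn_detection(mode: str = "semantic_vad", **overrides) -> dict:
    """Return the oa_* fields relevant to one mode, with documented defaults."""
    common = {
        "oa_turn_detection_mode": mode,
        "oa_create_response": True,
        "oa_interrupt_response": True,
    }
    by_mode = {
        "semantic_vad": {"oa_semantic_eagerness": "auto"},
        "server_vad": {
            "oa_server_threshold": 0.5,
            "oa_server_prefix_padding_ms": 300,
            "oa_server_silence_duration_ms": 500,
        },
        "none": {},
    }
    if mode not in by_mode:
        raise ValueError(f"unknown turn detection mode: {mode!r}")
    fields = {**common, **by_mode[mode]}
    unknown = set(overrides) - set(fields)
    if unknown:
        raise ValueError(f"fields not applicable to {mode}: {sorted(unknown)}")
    fields.update(overrides)
    return fields


# Silence-based detection with a longer pause for hesitant callers.
patient = openai_turn_detection("server_vad", oa_server_silence_duration_ms=800)
```

Merging `patient` into the `s2s` object keeps the semantic-VAD-only field out of a server-VAD configuration.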

Greeting behaviour by mode

Configuration	What happens at call start
Cascaded	TTS plugin speaks the configured greeting via `session.say()`.
S2S half-cascade (any model)	TTS plugin speaks the greeting (same as Cascaded).
S2S native-audio + Gemini 2.5	Gemini Live speaks the greeting on its first turn (greeting text is baked into its system instructions).
S2S native-audio + OpenAI Realtime	The realtime model speaks the greeting on its first turn (greeting text is passed inline through the agent’s initial `generate_reply` trigger).
S2S native-audio + Gemini 3.1	No agent-initiated greeting — the user must speak first (Gemini 3.1 doesn’t support `generate_reply`).
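The table above can be read as a small decision function. The sketch below returns a descriptive label for each row. The labels are invented for illustration, and the model-family checks (`gpt-` and `gemini-3.1` prefixes) are assumptions based on the model ids listed on this page.

```python
def greeting_behaviour(s2s: dict) -> str:
    """Map an s2s config to who speaks the greeting at call start."""
    if s2s.get("mode", "cascaded") != "s2s":
        return "tts_plugin"  # Cascaded: session.say() through the TTS plugin
    if s2s.get("use_custom_tts", False):
        return "tts_plugin"  # Half-cascade behaves like Cascaded
    model = s2s.get("model", "gemini-2.5-flash-native-audio-preview-12-2025")
    if model.startswith("gpt-"):
        return "realtime_model_inline_greeting"  # passed via the initial generate_reply
    if model.startswith("gemini-3.1"):
        return "user_speaks_first"  # no agent-initiated greeting
    return "realtime_model_system_instructions"  # Gemini 2.5 native audio
```

A dashboard or test harness could use such a label to warn that a configured greeting will never be spoken on Gemini 3.1 native audio.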

Billing

The cost-breakdown widget in the dashboard reflects the active mode and the selected S2S provider. The collapsed realtime row is labelled Gemini Live for Gemini models and GPT Realtime or GPT Realtime Mini for OpenAI models. The rows shown for each configuration are:

Cascaded — Hosting + STT + TTS + LLM + Telephony
S2S native — Hosting + <realtime row> + Telephony (STT/TTS collapsed into the realtime row)
S2S half-cascade — Hosting + <realtime row> + TTS + Telephony
S2S BYOK (Vertex AI, BYO Gemini key, or BYO OpenAI key) — the realtime row is Free (you pay the provider directly); only the platform fee is billed.

Per-minute list-price estimates (non-BYOK): Gemini Live ≈ ₹6.80/min, GPT Realtime ≈ ₹9.05/min, GPT Realtime Mini ≈ ₹3.65/min.
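For a quick back-of-the-envelope figure, the list prices above can be multiplied by call duration. The sketch below treats the quoted numbers as flat per-minute rates for the realtime row only; it ignores Hosting, Telephony, and TTS rows, whose prices are not given here. The helper names and the model-to-label mapping are assumptions.

```python
from decimal import Decimal

# Approximate list prices in INR per minute, as quoted above (non-BYOK).
RATES_INR_PER_MIN = {
    "Gemini Live": Decimal("6.80"),
    "GPT Realtime": Decimal("9.05"),
    "GPT Realtime Mini": Decimal("3.65"),
}


def realtime_row_label(model: str) -> str:
    """Pick the dashboard row label for a realtime model id."""
    if model == "gpt-realtime-mini":
        return "GPT Realtime Mini"
    if model.startswith("gpt-"):
        return "GPT Realtime"
    return "Gemini Live"


def estimate_realtime_cost(model: str, minutes: int, byok: bool = False) -> Decimal:
    """Estimated realtime-row cost in INR; zero under BYOK, where the provider bills directly."""
    if byok:
        return Decimal("0")
    return RATES_INR_PER_MIN[realtime_row_label(model)] * Decimal(minutes)


ten_minutes = estimate_realtime_cost("gpt-realtime", 10)  # 9.05 * 10
```

`Decimal` is used instead of `float` so that currency arithmetic does not pick up binary rounding noise.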

Endpoints

PATCH /v2/agents/{public_id} — partial update; deep-merges into agent_config.
POST /v2/agents — create a new agent with full voice config.
PUT /agents/update (v1) — accepts the same fields as a flat voice_config dict (snake_case or camelCase).

See Update Agent and Create Agent for example requests.
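To make the deep-merge behaviour of PATCH concrete, here is a small local sketch of how a partial body might combine with a stored config. It assumes the conventional semantics (nested dicts merge recursively, every other value is replaced); the server's exact rules, for example for lists or nulls, are not specified on this page, and `deep_merge` is a hypothetical helper.

```python
import copy


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a new dict: nested dicts are merged key by key, other values are overwritten."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


stored = {"voice": {"speed": 1.0, "noise_cancellation": {"engine": "gtcrn", "strength": 0.3}}}
patch_body = {"voice": {"noise_cancellation": {"strength": 0.5}}}
result = deep_merge(stored, patch_body)
```

Under these assumptions, a PATCH body only needs the fields being changed: `speed` and `engine` survive untouched while `strength` moves to 0.5.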
