Speech-to-Text

Transcribe Audio

Transcribe audio files using Vega or Dhara STT engines.

POST /api/stt/transcribe

Transcribe an audio file. Supports WAV, MP3, FLAC, OGG, and WebM formats.

No query or path parameters for this endpoint; all inputs are sent in the multipart request body.


Request (multipart/form-data)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio | file | Yes | | Audio file (WAV, MP3, FLAC, OGG, WebM). Max 40s per request. |
| language | string | Yes | | Language code (e.g. hi, en, ta, bn). See Languages. |
| model | string | No | vega | STT engine: vega (~100ms, live + file) or dhara (~400ms, file only, word timestamps). |
| punctuate | boolean | No | true | Add punctuation and capitalization automatically. |
| kenlm | boolean | No | true | Use KenLM language model for higher accuracy. Adds ~50ms. Available for 25 Indian languages. |
| timestamps | boolean | No | false | Include word-level timestamps (Dhara only). |
| diarize | boolean | No | false | Speaker diarization via pyannote. File only. |
| sentiment | boolean | No | false | Detect overall sentiment (positive/negative/neutral). |
| format_numbers | boolean | No | false | Auto-format numbers, currency, percentages in transcript. |
| vad_filter | boolean | No | true | Voice Activity Detection — skip silence segments (Dhara only). Auto-disabled for audio < 10s. |
| bgm_filter | boolean | No | false | Background music removal via spectral subtraction (Dhara only). |
| enhance | boolean | No | true | Auto Gain Control — normalize volume for speakerphone/far-field audio. Critical for Indian conditions. |
| denoise | boolean | No | false | DeepFilterNet3 neural noise suppression — removes traffic, crowds, TV, music background. ~10ms overhead. |
| redact | boolean | No | false | Redact Indian PII: Aadhaar, PAN, phone, credit card, email, UPI. |
| keywords | string | No | | Comma-separated vocabulary boosting terms (company names, jargon). |
| utterances | boolean | No | false | Speaker-segmented utterances (requires diarize=true). |
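As a client-side sketch, the form fields above can be assembled and sanity-checked before upload. The helper below is hypothetical (not part of any official SDK); the allowed formats and the 40s cap come from this page, and booleans are assumed to be sent as lowercase strings in form data.

```python
# Hypothetical helper for assembling form fields for POST /api/stt/transcribe.
# Field names mirror the request table; validation limits are from this page.

ALLOWED_EXTENSIONS = {"wav", "mp3", "flac", "ogg", "webm"}
MAX_DURATION_S = 40.0

def build_transcribe_fields(filename, language, duration_s,
                            model="vega", **options):
    """Return a dict of form fields, raising on obvious client-side errors."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext}")
    if duration_s > MAX_DURATION_S:
        raise ValueError("audio exceeds the 40s per-request limit")
    fields = {"language": language, "model": model}
    # Boolean flags (punctuate, kenlm, ...) are assumed to be sent
    # as lowercase strings in multipart form data.
    for key, value in options.items():
        fields[key] = str(value).lower() if isinstance(value, bool) else str(value)
    return fields
```

The audio file itself would be attached separately as the `audio` part of the multipart body.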

Audio Enhancement Pipeline

All audio is processed through an enhancement pipeline before ASR inference. This dramatically improves accuracy in real-world Indian conditions.

Auto Gain Control (enhance=true, default ON)

Normalizes audio volume to a target RMS level. Essential for:

  • Speakerphone calls — far-field mic picks up very quiet audio
  • Volume variation — different devices, network conditions
  • Indian call centers — agents with headsets vs customers on speakerphone

Without AGC, quiet audio produces garbage: "नत ेरा न आ श". With AGC: "नमस्ते, मेरा नाम आशुतोष है".
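Conceptually, AGC amounts to scaling samples toward a target RMS level. The sketch below illustrates the idea only; the server's actual target level and gain cap are not documented here, so the -20 dBFS target and 30x cap are assumed placeholders.

```python
# Illustrative sketch of Auto Gain Control as RMS normalization.
# Target level (-20 dBFS) and gain cap (30x) are assumptions, not
# the server's documented values.
import math

TARGET_RMS = 10 ** (-20 / 20)  # -20 dBFS, assumed target

def apply_agc(samples, target_rms=TARGET_RMS, max_gain=30.0):
    """Scale float samples in [-1, 1] toward a target RMS level."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)
    gain = min(target_rms / rms, max_gain)  # cap gain so near-silence isn't blown up
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```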

Neural Noise Suppression (denoise=true, opt-in)

DeepFilterNet3 removes background noise while preserving speech:

  • Traffic noise — auto-rickshaws, honking, construction
  • Crowd noise — markets, multi-speaker households
  • TV/music — Bollywood, news channels running in background
  • Wind — outdoor/agricultural recordings

Adds ~10ms latency. Enable for noisy environments.

Telephony Detection (automatic)

Auto-detects 8kHz narrowband audio (2G/GSM calls from Airtel, BSNL, Vi) and optimizes processing. No configuration needed — the server detects sample rate and adjusts automatically.
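The server performs this detection itself, but a client can replicate the sample-rate check for logging or routing, e.g. with the Python stdlib `wave` module. The 8 kHz threshold matches the GSM/2G telephony rate mentioned above; this helper is illustrative, not part of any SDK.

```python
# Client-side replica of narrowband detection: read the WAV header
# and compare the sample rate against the 8 kHz telephony rate.
import io
import wave

def is_narrowband_wav(wav_bytes, threshold_hz=8000):
    """True if the WAV file's sample rate is at or below telephony rates."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getframerate() <= threshold_hz
```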

Echo Cancellation

AEC (Acoustic Echo Cancellation) is handled at the transport layer by LiveKit/WebRTC before audio reaches the STT server. This is the correct architecture — the media pipeline removes echo before transcription.


Response

{
  "text": "नमस्ते, मेरा नाम अशुतोष है।",
  "language": "hi",
  "duration_seconds": 5.2,
  "processing_ms": 372.1,
  "model": "vega",
  "provider": "thinnestai",
  "confidence": 0.94,
  "words": [
    { "word": "नमस्ते", "start": 0.0, "end": 0.52, "confidence": 0.97 }
  ],
  "speakers": [
    { "speaker": "A", "start": 0.0, "end": 2.1, "text": "नमस्ते" }
  ],
  "sentiment": { "label": "positive", "score": 0.82 },
  "kenlm_used": true,
  "text_raw": "namaste mera naam ashutosh hai"
}
| Field | Type | Description |
| --- | --- | --- |
| text | string | Final transcribed text with punctuation and formatting applied. |
| language | string | Detected or requested language code. |
| duration_seconds | number | Audio duration in seconds. |
| processing_ms | number | Server-side processing time in milliseconds. |
| model | string | Engine used (vega or dhara). |
| provider | string | Provider identifier. |
| confidence | number | Overall confidence score (0-1). Dhara only. |
| words | array | Word-level timestamps. Only when timestamps=true (Dhara). |
| speakers | array | Speaker diarization segments. Only when diarize=true. |
| sentiment | object | Sentiment analysis result. Only when sentiment=true. |
| kenlm_used | boolean | Whether the KenLM language model was applied. |
| text_raw | string | Raw transcript before PII redaction. Only when redact=true. |
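Since several response fields are conditional on request flags, a client should treat them as optional when unpacking the payload. A minimal sketch, assuming the field names above:

```python
# Unpack a transcription response, keeping conditional sections
# (words, speakers, sentiment, text_raw, confidence) only when present.
import json

def parse_transcription(payload):
    data = json.loads(payload)
    result = {
        "text": data["text"],
        "language": data["language"],
        "model": data["model"],
        "duration_seconds": data["duration_seconds"],
    }
    # Optional sections: present only when the matching request flag
    # was set (or, for confidence, when the Dhara engine was used).
    for optional in ("words", "speakers", "sentiment", "text_raw", "confidence"):
        if optional in data:
            result[optional] = data[optional]
    return result
```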

Models

| Model | Latency | Best For | Languages |
| --- | --- | --- | --- |
| Vega | ~100ms | Live voice agents, real-time streaming | 200+ (22 Indian + international) |
| Dhara | ~400ms | File transcription, meetings, word timestamps | 200+ |
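The trade-off above can be encoded as a simple selection rule: Vega for live, low-latency use; Dhara when word timestamps (or other file-only features) are needed. This helper is a hypothetical illustration, not part of any SDK.

```python
# Pick an STT engine from the constraints documented in the Models
# table: Dhara provides word timestamps but is file-only.
def choose_model(live=False, need_timestamps=False):
    if need_timestamps:
        if live:
            raise ValueError("word timestamps require dhara, which is file-only")
        return "dhara"
    return "vega"
```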

Authentication

Include your API key in the Authorization header:

curl -X POST https://api.thinnest.ai/api/stt/transcribe \
  -H "Authorization: Bearer thns_sk_your_key_here" \
  -F "audio=@meeting.wav" \
  -F "language=hi" \
  -F "model=vega" \
  -F "punctuate=true" \
  -F "kenlm=true"

Pricing

| Component | Cost |
| --- | --- |
| Transcription | ₹0.25/min |
| All features (timestamps, diarization, sentiment, PII redaction) | Included |
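At the listed rate, cost scales linearly with audio duration. Whether billing is prorated per second or rounded up per minute is not stated on this page; the estimator below assumes per-second proration.

```python
# Cost estimate at the listed ₹0.25/min rate. Per-second proration
# is an assumption; actual billing granularity is not documented here.
def estimate_cost_inr(duration_seconds, rate_per_min=0.25):
    return round(duration_seconds / 60 * rate_per_min, 4)
```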
