Building a Real-Time AI Voice Assistant for Indian Languages

Author:

Ignite Solutions

June 9, 2026

The fastest voice architecture we tested was not the one we shipped. And that decision, choosing a slower pipeline over a faster one, turned out to be the most important architectural call in the entire project.

Introduction

Building a real-time AI voice assistant that understands Hindi, Marathi, Hinglish, and English in the same conversation is a harder problem than it looks. Most voice AI benchmarks measure English latency. Most production voice agents are English-first. The moment you add multilingual support for Indian languages, especially with code-switching, the rules change entirely.

This post is about what we learned during the Ignite Voice Agent POC: the architecture decisions we made, the ones we reversed, the real latency and cost numbers from our benchmarks, and the novel patterns we developed to make voice feel instantaneous even when the AI is still thinking.

What We Set Out to Build

We set out to build our very own Jarvis: a voice agent that hears the tasks I want to get done today. But hearing is only the first step. Because the agent is connected to an intelligence layer, it can ask clarifying questions, suggest approaches, and break down complex tasks.

The inspiration came from Take5, a command-line application we use to set up reminders for the five tasks we need to work on. But the real trigger, or rather the functional need to experiment with voice, came from an internal conversation about how we foresee the masses in India communicating with AI. They will not care whether it is AI or not, a real person or not.

Interactive Voice Response systems and public service announcements already exist, but the pipeline still needs intelligence. Intelligence allows real conversation to happen, whether in interviews, customer support, or any general conversation. The solutions exist; we wanted to understand at what cost, how many options there are, and how they compare with each other.

Choosing a Voice Architecture: Cascade vs. S2S

The first question was architectural: should the voice layer be a cascade pipeline or a speech-to-speech model?

Cascade Pipeline

A cascade pipeline chains three separate services: a Speech-to-Text model converts audio to text, an LLM processes that text and generates a response, and a Text-to-Speech model converts the response back to audio.

S2S Pipeline

A speech-to-speech (S2S) model — like OpenAI’s Realtime API — handles all three in a single model pass. It hears the user, reasons over the audio directly, and speaks back. No intermediate text. Roughly 3–5x faster on paper.

We tested both. Here is what we actually measured:

Factor	Cascade	Realtime (OpenAI gpt-realtime-2)
Latency	3–4 seconds	2–3 seconds
Answer accuracy	High, context understood correctly	Sometimes shallow or incorrect
Same-language reply	Yes, consistently	Switches language mid-reply
Hinglish handling	Correctly interpreted	Weak, with grammatical mistakes
Hindi / Marathi grammar	Correct	Noticeable errors
Reliability for tasks	Strong	Inconsistent
Cost per minute	~₹2.17	~₹12.49

The Realtime model feels faster. It is not, however, more reliable.

When users spoke first in English and then switched to Hindi mid-sentence — a completely natural pattern in Indian conversational speech — the S2S model failed. It responded in the wrong language, misinterpreted intent, or generated grammatically broken Hindi output. For a voice agent whose job is to create structured tasks from natural speech, a mistranscribed or misunderstood command is worse than a slow one. You cannot undo a wrong task by speaking faster.

We chose cascade. The roughly one-second speed gap is perceptible but acceptable. Incorrect task creation is not.

The Indian Language Problem

The hardest part of building an Indian-language voice agent is not the intelligence layer; it is the speech layer.

Our first hypothesis was logical: if cloud STT adds 2–3 seconds of latency, use on-device STT and eliminate the delay. We ran a full experiment using Android’s SpeechRecognizer API via a native bridge. The results were clear, and not in the direction we hoped:

Approach	Speed	Accuracy	Hinglish	Verdict
React Native STT libraries	Slow	Acceptable	Acceptable	No benefit over cloud
Android SpeechRecognizer (native)	Very fast	Poor	Fails	Not usable
Cloud STT — Deepgram nova-3	Very fast	Excellent (English only)	Fails	Indic not supported
Cloud STT — Sarvam saaras:v3	2.7s avg	Excellent	Excellent	Chosen

On-device STT is fast because it is optimized for command-style English. It cannot handle Hinglish code-switching, it misrecognizes Hindi words in natural speech, and Marathi accuracy degrades significantly in conversational input. For a system that turns transcripts into structured task plans, a wrong transcript means a wrong task. Speed without accuracy is worthless.

Deepgram was also evaluated and eliminated. Its English latency (0.175–1.5s) is exceptional, but its Hindi and Marathi support is unreliable. For an English-only product, Deepgram would be an excellent choice. For this one, it was eliminated in the first round.

Sarvam AI was chosen not because it is the fastest option, but because it is the only option with the multilingual accuracy this product requires. Trained on over 25,000 hours of Indian audio, with native support for code-switching, it consistently produced usable transcripts across all four languages tested.

Inside the Cascade Pipeline

The cascade pipeline is not a simple chain. It is a carefully tuned sequence of decisions.

The production stack:

STT: Sarvam AI saaras:v3
LLM: Google Gemini 3.1 Flash Lite via OpenRouter
TTS: Sarvam AI bulbul:v3, voice: shubh
Transport: LiveKit WebRTC
Orchestration framework: Pipecat v1.1.0

The LLM benchmark was decisive:

Provider	English	Hindi	Hinglish	Marathi
Sarvam AI LLM	2.3s	3.3s	4.7s	3.9s
OpenAI GPT-4o mini	1.6–3.0s	1.2s	1.7s	1.2s
Gemini 3.1 Flash Lite (chosen)	0.96s	1.09s	0.7s	1.13s

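Gemini 3.1 Flash Lite was the fastest in every language category. It was at least twice as fast as Sarvam's LLM everywhere, and ahead of GPT-4o mini in all four languages: narrowly in Hindi and Marathi (about 0.1s) and by a full second in Hinglish. Its consistency across languages mattered as much as raw speed: the user experience should not degrade when the user switches from English to Hindi.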

One critical production detail that benchmarks never capture: Sarvam’s TTS service has a variable Time-To-First-Byte (TTFB). Under light load, it sits at 0.5–0.9 seconds. Under heavy load, we observed spikes to 5–7 seconds. The Pipecat library’s default TTS timeout is 3 seconds. The result: under load, the TTS response was dropped silently, and the assistant went quiet with no error, no exception, and no log entry. Setting the TTS timeout to 10 seconds resolved the issue entirely. Monitor your TTS TTFB independently in production.
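One way to make that kind of failure loud is to time the first audio chunk explicitly and log when it never arrives. The sketch below is a generic asyncio illustration, not Pipecat's or Sarvam's API: `fake_tts` is a hypothetical stand-in for a streaming TTS client, and `first_chunk_with_ttfb` is an illustrative helper name.

```python
import asyncio
import logging
import time
from typing import AsyncIterator, Optional

log = logging.getLogger("tts")

TTS_TIMEOUT_SECS = 10.0  # the timeout value that resolved the issue described above


async def first_chunk_with_ttfb(
    stream: AsyncIterator[bytes], timeout: float = TTS_TIMEOUT_SECS
) -> tuple[Optional[bytes], float]:
    """Wait for the first audio chunk; return (chunk or None, seconds waited)."""
    started = time.monotonic()
    try:
        chunk = await asyncio.wait_for(stream.__anext__(), timeout)
    except asyncio.TimeoutError:
        waited = time.monotonic() - started
        # Log the timeout so a silent assistant always leaves a trace.
        log.error("TTS produced no audio within %.1fs", waited)
        return None, waited
    ttfb = time.monotonic() - started
    log.info("tts_ttfb_seconds=%.3f", ttfb)
    return chunk, ttfb


async def fake_tts(delay: float) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client."""
    await asyncio.sleep(delay)
    yield b"\x00\x01"
```

Calling `asyncio.run(first_chunk_with_ttfb(fake_tts(0.01)))` returns the first chunk together with its measured TTFB; on a timeout the chunk is `None`, so the caller can fall back to a spoken apology or a retry instead of silence.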

The Voice Activity Detection layer also required non-obvious tuning. LiveKit’s WebRTC audio transport applies compression that changes the amplitude envelope VAD models use to detect speech boundaries. Default VAD parameters produce false positives on compressed audio or cut the user off mid-sentence. We calibrated to confidence=0.7, start_secs=0.2, stop_secs=0.2, min_volume=0.6 for Indian conversational speech over LiveKit. Plan for calibration time in any WebRTC deployment.
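To make the four parameters concrete, here is a simplified debounce sketch of what they could control. It is a generic illustration written for this post, not Pipecat's implementation: the `VadSettings` class, the `speaking_timeline` function, and the 20 ms frame size are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VadSettings:
    confidence: float = 0.7  # minimum speech probability from the VAD model
    start_secs: float = 0.2  # speech must persist this long before "speaking"
    stop_secs: float = 0.2   # silence must persist this long before "quiet"
    min_volume: float = 0.6  # minimum normalized volume to count as speech


def speaking_timeline(frames, settings=VadSettings(), frame_secs=0.02):
    """Return one bool per frame: is the user considered to be speaking?

    `frames` is an iterable of (speech_probability, volume) pairs.
    """
    start_frames = round(settings.start_secs / frame_secs)
    stop_frames = round(settings.stop_secs / frame_secs)
    speaking = False
    run = 0  # consecutive frames that disagree with the current state
    timeline = []
    for probability, volume in frames:
        is_speech = (
            probability >= settings.confidence and volume >= settings.min_volume
        )
        if is_speech != speaking:
            run += 1
            needed = start_frames if is_speech else stop_frames
            if run >= needed:
                speaking = is_speech
                run = 0
        else:
            run = 0
        timeline.append(speaking)
    return timeline
```

In this model, a short `stop_secs` ends the turn quickly but risks cutting the user off during a natural pause, while a `min_volume` set too high for compressed audio means quiet speech never registers at all. That is the trade-off calibration has to settle.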

The Intent Protocol: UI Before the Conversation Ends

The most unusual part of this system is not the voice pipeline — it is how the UI reacts to it.

Standard voice assistants update the screen after the response is delivered. We wanted the UI card to appear the instant the assistant detects intent, before it asks a single clarifying question. We call this signal-first.

The rule is enforced in the system instruction: the LLM must call signal_intent before speaking any words. That fires an intent_queued event over the WebRTC data channel, and the frontend renders the card in milliseconds. As the conversation continues, update_intent_params fills the card progressively. When all details are gathered, the final tool call updates the same Firestore record in place, with no duplicates. The user confirms in the UI, and the action is dispatched.
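The flow above can be sketched as a small session object. The names signal_intent, update_intent_params, and intent_queued come from the description above; everything else is an assumption made for illustration: the `IntentSession` class, the `intent_updated` and `intent_ready` event names, `finalize_intent`, the `say` guard, and the in-memory dict standing in for Firestore. The post enforces the rule in the prompt, so the runtime guard shown here is an optional extra, not the project's confirmed design.

```python
import uuid
from typing import Any, Callable, Optional

Event = dict[str, Any]


class IntentSession:
    """Tracks one intent card; every write targets the same record id."""

    def __init__(self, send_event: Callable[[Event], None], store: dict[str, dict]):
        self._send = send_event  # stand-in for the WebRTC data channel
        self._store = store      # stand-in for the Firestore collection
        self.intent_id: Optional[str] = None

    def signal_intent(self, kind: str) -> str:
        self.intent_id = uuid.uuid4().hex
        self._store[self.intent_id] = {"kind": kind, "params": {}, "status": "queued"}
        self._send({"type": "intent_queued", "id": self.intent_id, "kind": kind})
        return self.intent_id

    def update_intent_params(self, **params: Any) -> None:
        record = self._require_record()
        record["params"].update(params)  # same record, updated in place
        self._send(
            {"type": "intent_updated", "id": self.intent_id,
             "params": dict(record["params"])}
        )

    def finalize_intent(self) -> None:
        record = self._require_record()
        record["status"] = "awaiting_confirmation"
        self._send({"type": "intent_ready", "id": self.intent_id})

    def say(self, text: str) -> str:
        # Signal-first guard: refuse to speak until a card has been signalled.
        self._require_record()
        return text

    def _require_record(self) -> dict:
        if self.intent_id is None:
            raise RuntimeError("signal_intent must be called first")
        return self._store[self.intent_id]
```

Because every update writes to the record created by signal_intent, a conversation that gathers a time and then a title still ends with exactly one stored record.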

The result: the card appears and fills in while the voice conversation is still happening. The user sees and hears the assistant working in parallel, which significantly reduces perceived latency even though the underlying pipeline timing is unchanged.

Server-Driven UI: Rendering Components From the LLM

Static screens cannot handle a voice agent whose output is unpredictable.

We built a server-driven UI layer instead. The backend generates a UISchema, a JSON tree of typed block nodes, and sends it to the client. The frontend renders it using a recursive Renderer backed by a registry of 27 block types: text, button, card, list, alert, datetime_picker, map, and more. The LLM returns structured JSON; it never writes React components. Adding a new capability means registering a new block and action handler; the LLM can use it without a client release.
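The registry-plus-recursion idea can be shown in a few lines. The project's renderer is a frontend component; this sketch is in Python and renders to plain strings purely to show the shape. The field names (`type`, `children`, `value`, `label`, `action`) and the fallback for unknown block types are assumptions, not the project's actual UISchema.

```python
from typing import Any, Callable

Block = dict[str, Any]
Renderer = Callable[[Block, list[str]], str]

REGISTRY: dict[str, Renderer] = {}


def register(block_type: str):
    """Decorator that adds a renderer to the registry under a block type."""
    def decorator(fn: Renderer) -> Renderer:
        REGISTRY[block_type] = fn
        return fn
    return decorator


@register("text")
def render_text(block: Block, children: list[str]) -> str:
    return block["value"]


@register("button")
def render_button(block: Block, children: list[str]) -> str:
    return f"[{block['label']} -> {block['action']}]"


@register("card")
def render_card(block: Block, children: list[str]) -> str:
    return "card(" + ", ".join(children) + ")"


def render(block: Block) -> str:
    """Render children first, then hand them to the block's own renderer."""
    renderer = REGISTRY.get(block.get("type"))
    if renderer is None:
        # Unknown types degrade to a placeholder instead of crashing the screen.
        return f"<unsupported:{block.get('type')}>"
    children = [render(child) for child in block.get("children", [])]
    return renderer(block, children)


schema = {
    "type": "card",
    "children": [
        {"type": "text", "value": "Call repair shop"},
        {"type": "button", "label": "Confirm", "action": "confirm_task"},
    ],
}
print(render(schema))  # card(Call repair shop, [Confirm -> confirm_task])
```

Adding a capability in this model is one more `@register(...)` function; the schema producer only needs to know the new type name.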

This is distinct from Micro-Frontend architecture, which solves team modularity at build time. Our SDUI layer solves runtime UI generation; the screen content is decided by the AI at runtime, based on what the user just said.

The Real Numbers: Latency and Cost

Latency breakdown by stage and language (cascade pipeline, observed medians)

Stage	English	Hindi	Hinglish	Marathi
STT (Sarvam)	1.9s	3.5s	2.5s	3.0s
LLM (Gemini)	0.96s	1.09s	0.7s	1.13s
TTS (Sarvam)	~0.7s	~0.7s	~0.7s	~0.7s
Total	~3.6s	~5.3s	~3.9s	~4.8s

STT dominates the latency budget. The LLM, even cloud-hosted, contributes about one second (0.7 to 1.13s depending on language). TTS is not the bottleneck. Any effort to reduce perceived latency must focus on the STT layer first.
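The totals and the STT share can be recomputed directly from the table. This is plain arithmetic on the published medians; the `summarize` helper is an illustrative name.

```python
# Observed medians from the table above, in seconds.
STAGE_LATENCY_SECS = {
    "English": {"stt": 1.9, "llm": 0.96, "tts": 0.7},
    "Hindi": {"stt": 3.5, "llm": 1.09, "tts": 0.7},
    "Hinglish": {"stt": 2.5, "llm": 0.7, "tts": 0.7},
    "Marathi": {"stt": 3.0, "llm": 1.13, "tts": 0.7},
}


def summarize(stages):
    """Return (total seconds, fraction of the total spent in STT)."""
    total = sum(stages.values())
    return round(total, 2), round(stages["stt"] / total, 2)


for language, stages in STAGE_LATENCY_SECS.items():
    total, stt_share = summarize(stages)
    print(f"{language}: total {total}s, STT share {stt_share:.0%}")
```

STT accounts for roughly 53% of the English total and 62–66% of the Hindi, Hinglish, and Marathi totals, which is why it is the first place to look for latency savings.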

Cost per one-minute conversation:

Service	Provider	Cost (INR)
STT — saaras:v3	Sarvam (Rs.30/hr audio)	~Rs.0.17
LLM — Gemini 3.1 Flash Lite	OpenRouter ($0.25/M input tokens)	~Rs.1.07
TTS — bulbul:v3	Sarvam (Rs.30/10K chars)	~Rs.0.93
Cascade total		~Rs.2.17 / min
S2S — gpt-realtime-2	OpenAI	~Rs.12.49 / min

At scale, the roughly 6x cost difference between cascade and S2S (Rs.12.49 vs. Rs.2.17 per minute) is significant. For a product targeting extended voice sessions or high call volumes, cascade is the financially sustainable path.
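The cost figures can be checked with a few lines of arithmetic. The per-minute numbers and the quoted rates come from the table; the implied usage figures assume those rates scale linearly, and the 10,000-minute volume in the usage note is a hypothetical example, not a measured workload.

```python
# Per-minute costs from the table above, in INR.
INR_PER_MINUTE = {"stt": 0.17, "llm": 1.07, "tts": 0.93}
S2S_INR_PER_MINUTE = 12.49

cascade_total = round(sum(INR_PER_MINUTE.values()), 2)
cost_ratio = round(S2S_INR_PER_MINUTE / cascade_total, 1)

# Usage implied by the quoted rates, assuming they scale linearly.
stt_audio_secs = round(INR_PER_MINUTE["stt"] / 30 * 3600, 1)  # Rs.30 per hour of audio
tts_chars = round(INR_PER_MINUTE["tts"] / 30 * 10_000)        # Rs.30 per 10K characters


def projected_cost_inr(minutes: float, inr_per_minute: float) -> float:
    """Total cost for a given number of conversation minutes."""
    return round(minutes * inr_per_minute, 2)


print(cascade_total, cost_ratio, stt_audio_secs, tts_chars)
```

Under these assumptions the cascade total is Rs.2.17, the S2S-to-cascade ratio is about 5.8, and one conversation minute corresponds to roughly 20 seconds of billed STT audio and 310 synthesized characters. For a hypothetical 10,000 minutes, `projected_cost_inr` gives Rs.21,700 for cascade against Rs.124,900 for S2S.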

Lessons Learned

Accuracy beats speed when your output is a task, not text. A text chatbot can recover from a misunderstood query in the next turn. A voice agent that mishears “9 AM” as “9 PM” and creates the wrong reminder cannot easily recover without breaking conversational flow. Build your accuracy floor before optimizing latency.

Silent TTS failures are harder to debug than loud ones. A wrong response is noticed. Silence is blamed on the network, the device, or the speaker. By the time it gets to engineering, the log context is gone. Instrument TTS TTFB as a separate metric with explicit alerts.

VAD calibration is not optional over WebRTC. Compressed audio changes amplitude characteristics in ways that default VAD thresholds do not account for. Budget calibration time for every new deployment environment.

Fire-and-forget async DB writes keep the pipeline clean. All transcript saves, session records, and intent writes are non-blocking asyncio.create_task calls. Database errors are logged but never allowed to surface in the live call. A DB write should never drop an audio frame.

The signal-first rule is a UX rule, not just a technical one. Enforcing intent signaling before speech changes user perception of the system’s intelligence. The timing of UI updates matters as much as their content.

Two architectures in one codebase is a feature, not complexity. Having both cascade and S2S available behind a single BOT_MODE environment variable lets you switch modes without touching client code. For a POC that needed to evaluate both paths with the same frontend, this saved weeks.
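A mode switch of this kind can be as small as a lookup table keyed by the environment variable. BOT_MODE is the variable named above; the values "cascade" and "s2s", the cascade default, and the builder functions are assumptions for illustration.

```python
import os
from typing import Callable, Mapping


def build_cascade() -> str:
    # Placeholder for wiring STT -> LLM -> TTS.
    return "cascade: STT -> LLM -> TTS"


def build_s2s() -> str:
    # Placeholder for wiring a single speech-to-speech model.
    return "s2s: single realtime model"


PIPELINE_BUILDERS: dict[str, Callable[[], str]] = {
    "cascade": build_cascade,
    "s2s": build_s2s,
}


def select_pipeline(env: Mapping[str, str] = os.environ) -> str:
    """Build the pipeline named by BOT_MODE, defaulting to cascade."""
    mode = env.get("BOT_MODE", "cascade").strip().lower()
    try:
        builder = PIPELINE_BUILDERS[mode]
    except KeyError:
        expected = sorted(PIPELINE_BUILDERS)
        raise ValueError(f"Unknown BOT_MODE {mode!r}; expected one of {expected}") from None
    return builder()
```

Failing fast on an unknown value keeps a typo in deployment config from silently falling back to the wrong architecture.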

What's Next

Observability is the next priority. Each pipeline stage needs independent latency and error-rate metrics (STT TTFB, LLM time-to-first-token, TTS TTFB) so that regressions in any layer are immediately visible. Aggregate call-level logs are not sufficient for a production voice system.
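As a starting point, per-stage metrics could be collected with a small context manager. This is a sketch under assumptions: `StageMetrics` is an illustrative name, it records wall-clock time for the whole stage rather than time to first byte, and a real deployment would export to a metrics backend instead of keeping lists in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import median


class StageMetrics:
    """Collects latency samples and error counts per pipeline stage."""

    def __init__(self):
        self.latencies = defaultdict(list)
        self.errors = defaultdict(int)

    @contextmanager
    def track(self, stage: str):
        started = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors[stage] += 1
            raise
        finally:
            # Record the duration whether the stage succeeded or failed.
            self.latencies[stage].append(time.perf_counter() - started)

    def summary(self) -> dict:
        return {
            stage: {
                "count": len(samples),
                "median_secs": median(samples),
                "errors": self.errors[stage],
            }
            for stage, samples in self.latencies.items()
        }
```

Wrapping each stage call in `with metrics.track("stt"):` (and likewise for the LLM and TTS) yields separate medians and error counts per layer, which is the minimum needed to see which stage regressed.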

RAG for session memory is the highest-impact intelligence improvement. The assistant currently has no memory across sessions. A retrieval layer over past session transcripts would enable follow-up logic like “Last week you mentioned calling the repair shop — did that get resolved?”, the kind of continuity that makes an assistant feel like a genuine productivity partner rather than a stateless query engine.

Gemini Live is under evaluation. Google’s speech-to-speech model has better Hindi support than OpenAI Realtime and may close the multilingual accuracy gap at a lower cost point. If it does, S2S becomes viable for this use case.

Platform extraction is the longer-term step. The pipeline, the intent protocol, the SDUI layer, and the tool system are all sufficiently decoupled to be extracted into reusable modules that other products can use without rebuilding the infrastructure.

Want to Build Something Like This?

Voice-first AI for Indian languages is a solvable problem, but it requires precise decisions at every layer: which STT provider, which LLM, how to tune VAD, how to design the UI interaction model, and how to trade latency against accuracy for task-oriented use cases. Generic voice AI platforms make those decisions for you, and often get them wrong for multilingual, Indic-language workloads.

If you are building a voice AI product and want to avoid the experiments we already ran, or accelerate past them, we are glad to talk through the architecture.

FAQs

What is the difference between a cascade voice pipeline and speech-to-speech?

Cascade chains three services (STT, LLM, and TTS) sequentially. S2S uses one model for the full audio-in to audio-out flow. Cascade is slower but gives more control, better multilingual accuracy, and costs roughly 6x less per minute.

Why not use on-device STT to cut latency?

On-device STT is fast but optimized for command-style English. For Hindi, Marathi, and Hinglish natural conversational speech, accuracy degrades to the point where task creation becomes unreliable. A wrong transcript produces a wrong task; speed without accuracy solves nothing.

What is the signal-first intent protocol?

A design rule requiring the AI to signal its detected intent (trigger a UI card) before speaking a word or asking a clarifying question. This fires an event over the WebRTC data channel and updates the UI in ~50ms, making the interaction feel parallel rather than sequential.

What is a server-driven UI in a voice agent context?

The backend generates a JSON block tree (UISchema) describing what to render. The frontend renders it using a registry of pre-built components. The LLM decides at runtime what UI to show without any client release cycle.

How do you handle mid-conversation language switching?

Sarvam AI’s STT is trained on code-switched audio: sentences that mix Hindi and English within a single utterance. The LLM receives the mixed-language transcript and replies in the language of the user’s most recent message. Both layers must support code-switching independently.

What does it cost to run a one-minute session?

Cascade: approximately Rs. 2.17/min (STT Rs. 0.17 + LLM Rs. 1.07 + TTS Rs. 0.93). S2S with OpenAI gpt-realtime-2: approximately Rs. 12.49/min. At scale, cascade is the financially sustainable default.

Why Did We Choose Cascade Over OpenAI Realtime?

We chose Cascade because it is more reliable for multilingual task execution.

OpenAI Realtime feels faster, but during testing, it sometimes gave incorrect answers, switched languages, and made grammatical mistakes in Hindi/Marathi. Cascade was slightly slower, but it gave more accurate, consistent, and trustworthy responses for English, Hindi, Marathi, and Hinglish. For this project, correct task creation matters more than a 1-second speed gain.

Which STT Works Best for Hindi, Hinglish, and Marathi?

Sarvam STT works best for Hindi, Hinglish, and Marathi.

Deepgram was very fast for English, but it failed or performed poorly for Indic and mixed-language speech. Sarvam was slower, but it handled Hindi, Marathi, and Hinglish much better, which makes it the better choice for this voice agent.

What Does a Production Multilingual Voice Pipeline Actually Cost?

The current preferred production pipeline is:

Sarvam STT → Gemini 3.1 Flash Lite → Sarvam TTS

The Sarvam AI cascade conversation costs around ₹2 per minute (₹2.17 in our benchmark). OpenAI Realtime came in at about ₹12.49 per minute, roughly 6x higher. So, for a multilingual Indian voice assistant, Cascade is the more cost-effective production choice.