The fastest voice architecture we tested was not the one we shipped. And that decision, choosing a slower pipeline over a faster one, turned out to be the most important architectural call in the entire project.
Building a real-time AI voice assistant that understands Hindi, Marathi, Hinglish, and English in the same conversation is a harder problem than it looks. Most voice AI benchmarks measure English latency. Most production voice agents are English-first. The moment you add multilingual support for Indian languages, especially with code-switching, the rules change entirely.
This post is about what we learned during the Ignite Voice Agent POC: the architecture decisions we made, the ones we reversed, the real latency and cost numbers from our benchmarks, and the novel patterns we developed to make voice feel instantaneous even when the AI is still thinking.
The first question was architectural: should the voice layer be a cascade pipeline or a speech-to-speech model?
A cascade pipeline chains three separate services: a Speech-to-Text model converts audio to text, an LLM processes that text and generates a response, and a Text-to-Speech model converts the response back to audio.


A speech-to-speech (S2S) model — like OpenAI’s Realtime API — handles all three in a single model pass. It hears the user, reasons over the audio directly, and speaks back. No intermediate text. Roughly 3–5x faster on paper.


We tested both. Here is what we actually measured:
| Factor | Cascade | Realtime (OpenAI gpt-realtime-2) |
| Latency | 3–4 seconds | 2–3 seconds |
| Answer accuracy | High, context understood correctly | Sometimes shallow or incorrect |
| Same-language reply | Yes, consistently | Switches language mid-reply |
| Hinglish handling | Correctly interpreted | Weak; grammatical mistakes |
| Hindi / Marathi grammar | Correct | Noticeable errors |
| Reliability for tasks | Strong | Inconsistent |
| Cost per minute | ~₹2.17 | ~₹12.49 |
The Realtime model feels faster. It is not, however, more reliable.
When users spoke first in English and then switched to Hindi mid-sentence — a completely natural pattern in Indian conversational speech — the S2S model failed. It responded in the wrong language, misinterpreted intent, or generated grammatically broken Hindi output. For a voice agent whose job is to create structured tasks from natural speech, a mistranscribed or misunderstood command is worse than a slow one. You cannot undo a wrong task by speaking faster.
We chose cascade. The 1–2 second speed gap is perceptible but acceptable. Incorrect task creation is not.
For the STT layer, we compared four approaches:
| Approach | Speed | Accuracy | Hinglish | Verdict |
| React Native STT libraries | Slow | Acceptable | Acceptable | No benefit over cloud |
| Android SpeechRecognizer (native) | Very fast | Poor | Fails | Not usable |
| Cloud STT — Deepgram nova-3 | Very fast | Excellent (English only) | Fails | Indic not supported |
| Cloud STT — Sarvam saaras:v3 | 2.7s avg | Excellent | Excellent | Chosen |
The cascade pipeline is not a simple chain. It is a carefully tuned sequence of decisions.
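The production stack: Sarvam saaras:v3 for STT, Gemini 3.1 Flash Lite (via OpenRouter) for the LLM, and Sarvam bulbul:v3 for TTS, running on Pipecat over LiveKit’s WebRTC transport.
The LLM benchmark (response latency by language) was decisive: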
| Provider | English | Hindi | Hinglish | Marathi |
| Sarvam AI LLM | 2.3s | 3.3s | 4.7s | 3.9s |
| OpenAI GPT-4o mini | 1.6–3.0s | 1.2s | 1.7s | 1.2s |
| Gemini 3.1 Flash Lite (chosen) | 0.96s | 1.09s | 0.7s | 1.13s |
Gemini 3.1 Flash Lite won across every language category, often by a factor of two or more. Its consistency across languages mattered as much as raw speed — the user experience should not degrade when the user switches from English to Hindi.
One critical production detail that benchmarks never capture: Sarvam’s TTS service has a variable time-to-first-byte (TTFB). Under light load, it sits at 0.5–0.9 seconds. Under heavy load, we observed spikes to 5–7 seconds. The Pipecat library’s default TTS timeout is 3 seconds. The result: under load, the TTS response was dropped silently, and the assistant went quiet with no error, no exception, and no log entry. Raising the TTS timeout to 10 seconds resolved the issue entirely. Monitor your TTS TTFB independently in production.
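A minimal sketch of how TTS time-to-first-byte could be measured and a timeout made loud rather than silent. It does not use Pipecat or Sarvam APIs: `fake_tts_stream` and `first_chunk_with_ttfb` are hypothetical names, and the stream is a stand-in for whatever streaming TTS call your pipeline makes. Only the 3-second and 10-second values come from the numbers above.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple

DEFAULT_TIMEOUT_SECS = 3.0   # the default timeout cited above
RAISED_TIMEOUT_SECS = 10.0   # the value that absorbed the 5-7s spikes


async def fake_tts_stream(first_byte_delay: float) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call."""
    await asyncio.sleep(first_byte_delay)  # simulated time-to-first-byte
    yield b"audio-chunk-1"
    yield b"audio-chunk-2"


async def first_chunk_with_ttfb(
    stream: AsyncIterator[bytes], timeout_secs: float
) -> Tuple[Optional[bytes], float]:
    """Wait for the first audio chunk and report how long it took.

    Returns (first_chunk, elapsed_secs). first_chunk is None on timeout.
    """
    start = time.monotonic()
    try:
        chunk = await asyncio.wait_for(stream.__anext__(), timeout=timeout_secs)
    except asyncio.TimeoutError:
        elapsed = time.monotonic() - start
        # Surface the failure explicitly instead of letting the call go quiet.
        print(f"TTS timed out after {elapsed:.2f}s (limit {timeout_secs}s)")
        return None, elapsed
    return chunk, time.monotonic() - start
```

In a real pipeline the elapsed value would be exported as its own metric with an alert threshold, so that a slow or missing first byte shows up in monitoring rather than as unexplained silence.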
The Voice Activity Detection layer also required non-obvious tuning. LiveKit’s WebRTC audio transport applies compression that changes the amplitude envelope VAD models use to detect speech boundaries. Default VAD parameters produce false positives on compressed audio or cut the user off mid-sentence. We calibrated to confidence=0.7, start_secs=0.2, stop_secs=0.2, min_volume=0.6 for Indian conversational speech over LiveKit. Plan for calibration time in any WebRTC deployment.
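To make the four parameters concrete, here is a simplified illustration of how such thresholds typically interact. The parameter names and values are the ones listed above; the state machine itself is an assumption for illustration only and is not the internal logic of any particular VAD library. It assumes one (confidence, volume) pair per 20 ms frame.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class VadParams:
    confidence: float = 0.7  # minimum model confidence for a speech frame
    start_secs: float = 0.2  # speech must persist this long to start a turn
    stop_secs: float = 0.2   # silence must persist this long to end a turn
    min_volume: float = 0.6  # minimum normalized volume for a speech frame


def detect_turns(
    frames: Iterable[Tuple[float, float]],
    params: VadParams,
    frame_ms: int = 20,  # assumed frame size
) -> List[Tuple[str, int]]:
    """Return ("started" | "stopped", frame_index) events."""
    events: List[Tuple[str, int]] = []
    speaking = False
    run_ms = 0  # consecutive time spent contradicting the current state
    for index, (confidence, volume) in enumerate(frames):
        is_speech = (
            confidence >= params.confidence and volume >= params.min_volume
        )
        if is_speech == speaking:
            run_ms = 0  # frame agrees with current state: reset the counter
            continue
        run_ms += frame_ms
        needed_secs = params.stop_secs if speaking else params.start_secs
        if run_ms >= round(needed_secs * 1000):
            speaking = not speaking
            events.append(("started" if speaking else "stopped", index))
            run_ms = 0
    return events
```

In this sketch, lowering `stop_secs` ends turns sooner but risks cutting the user off during a short pause, while raising `confidence` or `min_volume` suppresses false positives at the cost of missing quiet speech. That trade-off is what calibration tunes.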
The most unusual part of this system is not the voice pipeline — it is how the UI reacts to it.
Standard voice assistants update the screen after the response is delivered. We wanted the UI card to appear the instant the assistant detects intent, before it asks a single clarifying question. We call this signal-first.
The rule is enforced in the system instruction: the LLM must call signal_intent before speaking any words. That fires an intent_queued event over the WebRTC data channel; the frontend renders the card in milliseconds. As the conversation continues, update_intent_params fills the card progressively. When all details are gathered, the final tool call updates the same Firestore record in-place, no duplicates. The user confirms in the UI, and the action is dispatched.
The result: the card appears and fills in while the voice conversation is still happening. The user sees and hears the assistant working in parallel, which significantly reduces perceived latency even though the underlying pipeline timing is unchanged.
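The event flow described above can be sketched as follows. This is a hedged illustration, not the project's code: `signal_intent`, `update_intent_params`, and `intent_queued` are named in the text, while the `IntentSession` class, the `intent_updated` and `intent_ready` event names, the payload shape, and the in-memory `store` dict (standing in for Firestore) are all assumptions.

```python
import json
import uuid
from typing import Any, Callable, Dict, Optional


class IntentSession:
    """Tracks one intent from first detection to the final in-place update."""

    def __init__(
        self,
        send_event: Callable[[str], None],  # e.g. publishes on the data channel
        store: Dict[str, Dict[str, Any]],   # stand-in for the database
    ) -> None:
        self._send = send_event
        self._store = store
        self.intent_id: Optional[str] = None
        self.params: Dict[str, Any] = {}

    def _emit(self, event: str, **payload: Any) -> None:
        message = {"event": event, "intent_id": self.intent_id, **payload}
        self._send(json.dumps(message))

    def signal_intent(self, intent_type: str) -> str:
        """Called before any speech: the UI card appears immediately."""
        self.intent_id = uuid.uuid4().hex
        self.params = {}
        self._store[self.intent_id] = {
            "type": intent_type, "status": "queued", "params": {},
        }
        self._emit("intent_queued", intent_type=intent_type)
        return self.intent_id

    def update_intent_params(self, **params: Any) -> None:
        """Called as details arrive: the same card fills in progressively."""
        if self.intent_id is None:
            raise RuntimeError("signal_intent must be called first")
        self.params.update(params)
        self._emit("intent_updated", params=self.params)

    def finalize(self) -> None:
        """Update the existing record in place, so no duplicate is created."""
        if self.intent_id is None:
            raise RuntimeError("signal_intent must be called first")
        record = self._store[self.intent_id]
        record["status"] = "ready"
        record["params"] = dict(self.params)
        self._emit("intent_ready", params=self.params)
```

Because every event carries the same `intent_id`, the frontend can treat each message as an update to one card rather than creating a new one.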
Static screens cannot handle a voice agent whose output is unpredictable.
We built a server-driven UI layer instead. The backend generates a UISchema, a JSON tree of typed block nodes, and sends it to the client. The frontend renders it using a recursive Renderer backed by a registry of 27 block types: text, button, card, list, alert, datetime_picker, map, and more. The LLM returns structured JSON; it never writes React components. Adding a new capability means registering a new block and action handler; the LLM can use it without a client release.
This is distinct from Micro-Frontend architecture, which solves team modularity at build time. Our SDUI layer solves runtime UI generation; the screen content is decided by the AI at runtime, based on what the user just said.
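The registry-plus-recursive-renderer idea can be illustrated in a few lines. The production renderer lives in the frontend; this sketch uses Python purely to show the shape of the technique. The schema layout (`type`, `props`, `children`), the three sample block types, and the string output are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Block = Dict[str, Any]
BlockRenderer = Callable[[Block, List[str]], str]

REGISTRY: Dict[str, BlockRenderer] = {}


def register(block_type: str) -> Callable[[BlockRenderer], BlockRenderer]:
    """Decorator that adds a renderer to the registry under block_type."""
    def decorator(fn: BlockRenderer) -> BlockRenderer:
        REGISTRY[block_type] = fn
        return fn
    return decorator


@register("text")
def render_text(block: Block, children: List[str]) -> str:
    return block["props"]["value"]


@register("button")
def render_button(block: Block, children: List[str]) -> str:
    props = block["props"]
    return f"[{props['label']} -> {props['action']}]"


@register("card")
def render_card(block: Block, children: List[str]) -> str:
    return "card(" + ", ".join(children) + ")"


def render(block: Block) -> str:
    """Render a block tree depth-first using the registry."""
    renderer = REGISTRY.get(block.get("type", ""))
    if renderer is None:
        # Unknown block types degrade gracefully instead of crashing the screen.
        return f"<unsupported:{block.get('type')}>"
    children = [render(child) for child in block.get("children", [])]
    return renderer(block, children)
```

Adding a capability in this model means one more `@register(...)` function; the schema producer can then emit that block type without any change to `render` itself.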
Latency breakdown by stage and language (cascade pipeline, observed medians)
| Stage | English | Hindi | Hinglish | Marathi |
| STT (Sarvam) | 1.9s | 3.5s | 2.5s | 3.0s |
| LLM (Gemini) | 0.96s | 1.09s | 0.7s | 1.13s |
| TTS (Sarvam) | ~0.7s | ~0.7s | ~0.7s | ~0.7s |
| Total | ~3.6s | ~5.3s | ~3.9s | ~4.8s |
STT dominates the latency budget. The LLM, even cloud-hosted, contributes roughly one second (0.7–1.13s depending on language). TTS is not the bottleneck. Any effort to reduce perceived latency must focus on the STT layer first.
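The totals and the STT share can be checked directly from the per-stage medians in the table. The figures below are copied from it; the function names are illustrative.

```python
from typing import Dict

StageTable = Dict[str, Dict[str, float]]

# Observed medians in seconds, copied from the latency table.
STAGE_MEDIANS_SECS: StageTable = {
    "STT": {"English": 1.9, "Hindi": 3.5, "Hinglish": 2.5, "Marathi": 3.0},
    "LLM": {"English": 0.96, "Hindi": 1.09, "Hinglish": 0.7, "Marathi": 1.13},
    "TTS": {"English": 0.7, "Hindi": 0.7, "Hinglish": 0.7, "Marathi": 0.7},
}


def totals(stages: StageTable) -> Dict[str, float]:
    """Sum all stages per language."""
    languages = next(iter(stages.values())).keys()
    return {
        lang: sum(stage[lang] for stage in stages.values())
        for lang in languages
    }


def stage_share(stages: StageTable, stage_name: str) -> Dict[str, float]:
    """Fraction of the end-to-end total spent in one stage, per language."""
    total = totals(stages)
    return {lang: stages[stage_name][lang] / total[lang] for lang in total}


if __name__ == "__main__":
    for lang, secs in totals(STAGE_MEDIANS_SECS).items():
        share = stage_share(STAGE_MEDIANS_SECS, "STT")[lang]
        print(f"{lang}: total {secs:.1f}s, STT share {share:.0%}")
```

Rounded to one decimal, the sums reproduce the Total row, and STT accounts for more than half of the end-to-end time in every language, from roughly 53% for English to roughly 66% for Hindi.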
Cost per one-minute conversation:
| Service | Provider | Cost (INR) |
| STT — saaras:v3 | Sarvam (Rs.30/hr audio) | ~Rs.0.17 |
| LLM — Gemini 3.1 Flash Lite | OpenRouter ($0.25/M input tokens) | ~Rs.1.07 |
| TTS — bulbul:v3 | Sarvam (Rs.30/10K chars) | ~Rs.0.93 |
| Cascade total | | ~Rs.2.17 / min |
| S2S — gpt-realtime-2 | OpenAI | ~Rs.12.49 / min |
At scale, the 6x cost difference between cascade and S2S is significant. For a product targeting extended voice sessions or high call volumes, cascade is the financially sustainable path.
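The arithmetic behind the comparison, using the per-minute figures from the table. The monthly volume in the usage note is a hypothetical number chosen only to show how the gap scales.

```python
from typing import Dict

# Per-minute costs in INR, copied from the cost table.
CASCADE_COST_INR_PER_MIN: Dict[str, float] = {
    "STT (saaras:v3)": 0.17,
    "LLM (Gemini 3.1 Flash Lite)": 1.07,
    "TTS (bulbul:v3)": 0.93,
}
S2S_COST_INR_PER_MIN = 12.49


def cascade_total(costs: Dict[str, float]) -> float:
    """Per-minute cost of the full cascade pipeline."""
    return sum(costs.values())


def cost_ratio(s2s_per_min: float, cascade_per_min: float) -> float:
    """How many times more expensive S2S is than cascade."""
    return s2s_per_min / cascade_per_min


def cost_for_minutes(per_min: float, minutes: int) -> float:
    """Total cost for a given number of conversation minutes."""
    return per_min * minutes
```

The three cascade components sum to Rs. 2.17 per minute, and 12.49 / 2.17 is about 5.8, which is the "roughly 6x" figure. At a hypothetical 100,000 conversation minutes per month, that is about Rs. 217,000 for cascade against about Rs. 1,249,000 for S2S.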
Accuracy beats speed when your output is a task, not text. A text chatbot can recover from a misunderstood query in the next turn. A voice agent that mishears “9 AM” as “9 PM” and creates the wrong reminder cannot easily recover without breaking conversational flow. Build your accuracy floor before optimizing latency.
Silent TTS failures are harder to debug than loud ones. A wrong response is noticed. Silence is blamed on the network, the device, or the speaker. By the time it gets to engineering, the log context is gone. Instrument TTS TTFB as a separate metric with explicit alerts.
VAD calibration is not optional over WebRTC. Compressed audio changes amplitude characteristics in ways that default VAD thresholds do not account for. Budget calibration time for every new deployment environment.
Fire-and-forget async DB writes keep the pipeline clean. All transcript saves, session records, and intent writes are non-blocking asyncio.create_task calls. Database errors are logged but never allowed to surface in the live call. A DB write should never drop an audio frame.
The signal-first rule is a UX rule, not just a technical one. Enforcing intent signaling before speech changes user perception of the system’s intelligence. The timing of UI updates matters as much as their content.
Two architectures in one codebase is a feature, not complexity. Having both cascade and S2S available behind a single BOT_MODE environment variable lets you switch modes without touching client code. For a POC that needed to evaluate both paths with the same frontend, this saved weeks.
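One way such a switch might look, as a sketch. Only the `BOT_MODE` variable name comes from the text; the accepted values (`cascade`, `s2s`), the default, and the builder functions are assumptions, and the returned lists are placeholders for real pipeline objects.

```python
import os
from typing import Callable, Dict, List, Mapping


def build_cascade() -> List[str]:
    """Placeholder for constructing the STT -> LLM -> TTS pipeline."""
    return ["stt", "llm", "tts"]


def build_s2s() -> List[str]:
    """Placeholder for constructing the speech-to-speech pipeline."""
    return ["speech_to_speech"]


PIPELINE_BUILDERS: Dict[str, Callable[[], List[str]]] = {
    "cascade": build_cascade,
    "s2s": build_s2s,
}


def build_pipeline(env: Mapping[str, str] = os.environ) -> List[str]:
    """Pick the pipeline from BOT_MODE; default to cascade (assumed)."""
    mode = env.get("BOT_MODE", "cascade").strip().lower()
    builder = PIPELINE_BUILDERS.get(mode)
    if builder is None:
        valid = ", ".join(sorted(PIPELINE_BUILDERS))
        raise ValueError(f"Unknown BOT_MODE {mode!r}; expected one of: {valid}")
    return builder()
```

Because the client only talks to the transport layer, switching `BOT_MODE` on the server changes which pipeline answers without any client-side change. Failing fast on an unknown value avoids silently running the wrong architecture during an evaluation.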
Observability is the next priority. Each pipeline stage needs independent latency and error-rate metrics (STT TTFB, LLM time-to-first-token, TTS TTFB) so that regressions in any layer are immediately visible. Aggregate call-level logs are not sufficient for a production voice system.
RAG for session memory is the highest-impact intelligence improvement. The assistant currently has no memory across sessions. A retrieval layer over past session transcripts would enable follow-up logic like “Last week you mentioned calling the repair shop — did that get resolved?”, the kind of continuity that makes an assistant feel like a genuine productivity partner rather than a stateless query engine.
Gemini Live is under evaluation. Google’s speech-to-speech model has better Hindi support than OpenAI Realtime and may close the multilingual accuracy gap at a lower cost point. If it does, S2S becomes viable for this use case.
Platform extraction is feasible. The pipeline, the intent protocol, the SDUI layer, and the tool system are all sufficiently decoupled to be extracted into reusable modules that other products can use without rebuilding the infrastructure.
Voice-first AI for Indian languages is a solvable problem, but it requires precise decisions at every layer: which STT provider, which LLM, how to tune VAD, how to design the UI interaction model, and how to trade latency against accuracy for task-oriented use cases. Generic voice AI platforms make those decisions for you, and often get them wrong for multilingual, Indic-language workloads.
If you are building a voice AI product and want to avoid the experiments we already ran, or accelerate past them, we are glad to talk through the architecture.
Cascade chains three services, STT, LLM, and TTS, sequentially. S2S uses one model for the full audio-in to audio-out flow. Cascade is slower but gives more control, better multilingual accuracy, and costs roughly 6x less per minute.
On-device STT is fast but optimized for command-style English. For Hindi, Marathi, and Hinglish natural conversational speech, accuracy degrades to the point where task creation becomes unreliable. A wrong transcript produces a wrong task; speed without accuracy solves nothing.
A design rule requiring the AI to signal its detected intent (trigger a UI card) before speaking a word or asking a clarifying question. This fires an event over the WebRTC data channel and updates the UI in ~50ms, making the interaction feel parallel rather than sequential.
The backend generates a JSON block tree (UISchema) describing what to render. The frontend renders it using a registry of pre-built components. The LLM decides at runtime what UI to show without any client release cycle.
Sarvam AI’s STT is trained on code-switching audio, sentences that mix Hindi and English within a single utterance. The LLM receives the mixed-language transcript and replies in the language of the user’s most recent message. Both layers must support code-switching independently.
Cascade: approximately Rs. 2.17/min (STT Rs. 0.17 + LLM Rs. 1.07 + TTS Rs. 0.93). S2S with OpenAI gpt-realtime-2: approximately Rs. 12.49/min. At scale, cascade is the financially sustainable default.
We chose Cascade because it is more reliable for multilingual task execution.
OpenAI Realtime feels faster, but during testing it sometimes gave incorrect answers, switched languages, and made grammatical mistakes in Hindi and Marathi. Cascade was slightly slower, but it gave more accurate, consistent, and trustworthy responses for English, Hindi, Marathi, and Hinglish. For this project, correct task creation matters more than a 1-second speed gain.
Sarvam STT works best for Hindi, Hinglish, and Marathi.
Deepgram was very fast for English, but it failed or performed poorly for Indic and mixed-language speech. Sarvam was slower, but it handled Hindi, Marathi, and Hinglish much better, which makes it the better choice for this voice agent.
The current preferred production pipeline is:
Sarvam STT → Gemini 3.1 Flash-Lite → Sarvam TTS
A conversation on the Sarvam-based cascade costs around ₹2.17 per minute. OpenAI Realtime came in at around ₹12.49 per minute, roughly 6x higher. So, for a multilingual Indian voice assistant, cascade is the more cost-effective production choice.