Text to Speech: Best Practices
Everything you need to ship production-grade Indian voice with Bulbul V3 — our flagship TTS model natively trained on Indian speech across 11 languages.
Bulbul V3 is our flagship Text to Speech model, natively trained on Indian speech data across 11 languages. It was built from scratch for the phonology, prosody, and code-mixing patterns of Indian speech.
The model is capable. Getting the most out of it is a configuration question, and the decisions that matter most are the ones this document addresses: which API mode to use, how to tune Pace and Temperature for the right feel, which speakers perform best on which languages, and how to match format and sample rate to the pipeline receiving the audio.
Each section covers one of those decisions: what the parameter controls, why the defaults are set where they are, and where to move them for different contexts.
In this document
- Capabilities at a glance
- Choosing the right API mode
- Pace & Temperature — deep dive
- Output formats & sample rates
- Speaker selection — Tier system & CER
- Language-specific speaker mapping
- Use-case configurations
- Design boundaries & input conventions
1. Capabilities at a glance
The table below summarises every capability with its practical implication for builders.
| Capability | What it means in practice | How to use it well |
|---|---|---|
| Emotion-Rich Voices | 25+ speakers carry warmth, authority, calm, or energy | Match speaker to use-case tone deliberately; see Section 5 |
| Code-Mixing (Hinglish, Tanglish etc.) | Mixed-language input is handled natively, no preprocessing | Keep Indic words in native script and English in Roman |
| Indian Name Pronunciation | Names, places, cultural terms rendered accurately | No special handling needed |
| Automatic Text Preprocessing | Numbers, dates, ₹, abbreviations, mixed scripts handled by the model | Send text as-is; no manual normalisation required |
| Pace Control (0.5×–2.0×) | Fine-grained speaking speed adjustment | Start at 1.0; tune per context |
| 8 Output Formats | MP3, WAV, AAC, OPUS, FLAC, PCM, MULAW, ALAW | Format drives latency and compatibility |
| Telephony Support | 8kHz output, MULAW/ALAW codecs for IVR and PSTN/G.711 | Pair MULAW/ALAW with 8kHz for full telephony compatibility |
| REST + WebSocket APIs | Two distinct modes optimised for different latency profiles | REST for batch; WebSocket for real-time |
| Scale-Tested | 2B+ characters generated daily in production | Enterprise reliability is proven; rate limits apply at burst scale |
2. Choosing the right API mode
Bulbul V3 exposes two API modes, each optimised for a distinct deployment pattern. The choice between them is an architectural one with real consequences for perceived responsiveness, pipeline design, and user experience.
| | REST API | WebSocket Streaming |
|---|---|---|
| Endpoint | /text-to-speech | wss://api.sarvam.ai/v1/text-to-speech/stream |
| Char limit | 2,500 per API request | 2,500 per session |
| Latency model | Full audio returned at once — noticeable pause before playback begins | Audio chunks stream progressively — playback can begin within milliseconds |
| SDK support | Python · JavaScript | Python · JavaScript |
| Integrations | Async pipelines, batch jobs, offline generation | LiveKit · Pipecat · Custom WebSocket clients |
| Best for | Short pre-known text, notifications, batch content pipelines | Conversational agents, IVR, real-time LLM voice output |
Why streaming transforms voice agent feel
When an LLM generates a response and Bulbul V3 speaks it, the user's perception of fluency depends largely on how quickly audio begins. REST returns the complete audio file only once it has been fully generated. WebSocket streaming returns audio chunks as they are produced; the first chunk typically arrives within tens of milliseconds.
The user hears the voice begin before the model has finished generating the full audio file. In a conversational agent, that difference between audio starting immediately and audio starting after full generation is what separates a fluid conversation from one that feels paused.
✅ Best Practice
- For any LLM → TTS pipeline, use WebSocket streaming.
- The first audio chunk arrives within milliseconds, producing a natural conversational feel.
- REST is the right choice for batch jobs, pre-generated content, and notifications — not real-time voice.
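In an LLM → TTS pipeline, one option is to forward text to the streaming connection sentence by sentence instead of token by token, so that each send carries a complete unit of speech. The sketch below is SDK-independent and only handles the buffering; the name `sentences_from_tokens` and the set of sentence terminators are assumptions made for illustration, not part of the Sarvam SDK.

```python
import re
from typing import Iterable, Iterator

# Runs of sentence terminators: Latin punctuation plus the Devanagari danda.
SENTENCE_END = re.compile(r"[.!?।]+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        # Emit every complete sentence currently sitting in the buffer.
        while match:
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
            match = SENTENCE_END.search(buffer)
    # Flush whatever is left when the LLM stops generating.
    tail = buffer.strip()
    if tail:
        yield tail
```

Each yielded sentence could then be handed to the streaming connection as it becomes available. This simple splitter also breaks on decimal points and abbreviations, so a production pipeline would likely need a more careful boundary detector.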
REST API — Quick start
```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="your-api-key")

response = client.text_to_speech.convert(
    inputs=["नमस्ते! आपका order 2 दिन में deliver हो जाएगा।"],  # Native script + Roman English
    target_language_code="hi-IN",
    speaker="Priya",
    pace=1.0,
    temperature=0.6,
    model="bulbul:v3",
    output_format="mp3",
    sample_rate=22050,
)
```
WebSocket Streaming - Quick start
```python
import asyncio

from sarvamai import AsyncSarvamAI


async def stream_tts():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")
    async with client.text_to_speech.stream(
        target_language_code="hi-IN",
        speaker="Priya",
        pace=1.0,
        temperature=0.6,
        model="bulbul:v3",
        output_format="pcm",
        sample_rate=16000,
    ) as stream:
        await stream.send_text("नमस्ते! मैं आपकी कैसे मदद कर सकती हूँ?")
        async for chunk in stream:
            # audio_player is a placeholder for your own audio sink; it is not defined here.
            audio_player.write(chunk)


asyncio.run(stream_tts())
```
3. A deep dive into Pace & Temperature
These two parameters are the most impactful tuning levers available. When a voice feels off rather than natural, the cause is almost always Pace or Temperature, either misconfigured or left at defaults that don't match the use case.
They are independent controls with compounding effects. A high-Temperature voice at Pace 0.8 sounds warm and reflective. The same voice at Pace 1.4 sounds urgent and clipped. Think of them together, not in isolation.
Pace (Range: 0.5 – 2.0)
Pace controls speaking rate relative to the model's natural speed. 1.0 is native. The scale is linear in perception: 1.2 feels roughly 20% faster, not just 'a little quicker'. The practical range most applications should stay within is 0.8 to 1.2. Outside that range, you're solving for specific needs.
| Pace Value | How it sounds | Recommended for | Consider when |
|---|---|---|---|
| 0.5 – 0.7 | Very slow, deliberate | Accessibility tools, elderly users, pronunciation apps | Content where every word matters |
| 0.8 – 0.9 | Relaxed, measured | EdTech narration, meditation, wellness, tutorials | User needs time to absorb |
| 1.0 (Default) | Natural conversational speed | Conversational agents, general-purpose TTS | Starting point for any integration |
| 1.1 | Slightly brisk, professional | Notifications, IVR menus, news briefings | Professional contexts |
| 1.2 – 1.5 | Fast, energetic | Marketing audio, quick summaries | High-engagement content |
| 1.6 – 2.0 | Very fast | Screen readers for power users | User has opted into accelerated playback |
💡 Pro Tip
- Start at Pace 1.0 for any new integration and adjust based on user feedback.
- For IVR and professional contexts, 1.1 reads as efficient and confident.
- Above 1.5, prioritise contexts where the user is in control of playback speed.
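To get a feel for what a Pace value does to clip length, the sketch below validates the documented 0.5 to 2.0 range and estimates playback duration. It assumes Pace acts as a plain rate multiplier, so duration scales as 1 / pace; that is a simplification for planning purposes, and the function name `estimated_duration` is hypothetical.

```python
PACE_MIN, PACE_MAX = 0.5, 2.0  # documented Pace range


def estimated_duration(base_seconds: float, pace: float) -> float:
    """Estimate playback length at a given Pace, assuming duration scales as 1 / pace."""
    if not PACE_MIN <= pace <= PACE_MAX:
        raise ValueError(f"pace must be between {PACE_MIN} and {PACE_MAX}, got {pace}")
    return base_seconds / pace


# A clip that runs 12 s at Pace 1.0 would take roughly 10 s at Pace 1.2
# and roughly 15 s at Pace 0.8 under this assumption.
brisk = estimated_duration(12.0, 1.2)
relaxed = estimated_duration(12.0, 0.8)
```

Measured durations may differ slightly from this estimate, so treat it as a budgeting aid for IVR prompts or notification lengths rather than an exact figure.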
4. Output formats & sample rates
Format and sample rate are infrastructure decisions, made early, rarely revisited. The right choice depends on where the audio goes: a real-time voice agent pipeline has different requirements than a podcast or IVR system.
| Format | Encoding | Best for | Notes |
|---|---|---|---|
| MP3 | Lossy | Content delivery, podcasts, notifications | Good compression, universal support |
| WAV | Lossless | High-quality offline, archival | Large files |
| AAC | Lossy | Mobile apps, streaming | Better quality than MP3 at same bitrate |
| OPUS | Lossy | WebRTC, low-latency streaming | Excellent at low bitrates |
| FLAC | Lossless | Studio-quality archival | Compressed; smaller than WAV, larger than lossy formats |
| PCM | Raw | Real-time agents, direct pipeline | No encoding overhead |
| MULAW | Telephony | IVR, PSTN, G.711 | Pair with 8kHz |
| ALAW | Telephony | IVR, PSTN, G.711 | Pair with 8kHz |
Sample rate guide: 8kHz for telephony (MULAW/ALAW), 16kHz for real-time voice agents (PCM), 22kHz for general content delivery (MP3), 24kHz for high-fidelity content (WAV/AAC).
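The pairings in the guide can be encoded as a small lookup that flags mismatched configurations before a request is sent. This is a sketch, not an official validator: the function name `check_audio_config` is hypothetical, 22050 Hz for "22kHz" follows the REST quick start, and 24000 Hz for "24kHz" is an assumption.

```python
# Recommended pairings from the sample rate guide, in Hz.
# 22050 matches the REST quick start; 24000 for "24kHz" is an assumption.
RECOMMENDED_RATE_HZ = {
    "mulaw": 8000,
    "alaw": 8000,
    "pcm": 16000,
    "mp3": 22050,
    "wav": 24000,
    "aac": 24000,
}

TELEPHONY_FORMATS = {"mulaw", "alaw"}


def check_audio_config(output_format: str, sample_rate: int) -> list[str]:
    """Return advisory warnings for a format / sample-rate pair (empty if none)."""
    fmt = output_format.lower()
    warnings = []
    if fmt in TELEPHONY_FORMATS and sample_rate != 8000:
        warnings.append(f"{fmt} is a telephony codec; pair it with 8000 Hz")
    recommended = RECOMMENDED_RATE_HZ.get(fmt)
    if recommended is not None and sample_rate != recommended and not warnings:
        warnings.append(f"guide suggests {recommended} Hz for {fmt}, got {sample_rate}")
    return warnings
```

The warnings are advisory only. Formats the guide does not mention (OPUS, FLAC) pass through without comment, and a deliberate deviation such as MP3 at 24kHz for immersive content is still a valid choice.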
5. Speaker selection — Tier system & CER
Bulbul V3 includes 25+ speakers across tiers. Character Error Rate (CER) measures pronunciation accuracy — lower is better. Tier 1 speakers have CER below 0.5%, Tier 2 below 1.5%.
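The tier thresholds translate directly into a small classifier. The sketch below uses only the two cut-offs stated above; the function name `speaker_tier` is hypothetical, and anything at or above 1.5% is returned as `None` because no lower tiers are defined here.

```python
from typing import Optional


def speaker_tier(cer_percent: float) -> Optional[int]:
    """Map a CER percentage to a tier using the two stated thresholds."""
    if cer_percent < 0:
        raise ValueError("CER cannot be negative")
    if cer_percent < 0.5:
        return 1
    if cer_percent < 1.5:
        return 2
    return None  # thresholds for lower tiers are not given in this guide


# Example with CER values quoted in the top picks list.
tiers = {name: speaker_tier(cer) for name, cer in
         {"Priya": 0.13, "Shubh": 0.30, "Ratan": 0.33, "Mani": 0.00}.items()}
```

All four speakers in the example fall into Tier 1 under these thresholds.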
6. Language-specific speaker mapping
| Language | Code | Primary Male | Primary Female |
|---|---|---|---|
| English | en-IN | Ratan | Ishita |
| Hindi | hi-IN | Shubh / Ashutosh | Priya / Suhani |
| Telugu | te-IN | Shubh / Ratan | Neha / Priya |
| Kannada | kn-IN | Shubh / Ratan | Neha / Ishita |
| Bengali | bn-IN | Rehan | Roopa / Suhani |
| Tamil | ta-IN | Ratan / Rohan | Ishita / Ritu |
| Odia | od-IN | Shubh | Ritu / Pooja |
| Malayalam | ml-IN | Shubh | Pooja |
| Marathi | mr-IN | Ratan | Priya / Ritu |
| Punjabi | pa-IN | Mani | Roopa / Suhani |
| Gujarati | gu-IN | Ratan | Priya / Ritu |
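For applications that select a voice at runtime, the table above can be carried as a plain lookup. The sketch below copies the table as-is; the helper name `pick_speaker` and the choice to return the first listed speaker are assumptions for illustration.

```python
# Primary speakers per language, copied from the mapping table.
PRIMARY_SPEAKERS = {
    "en-IN": {"male": ["Ratan"], "female": ["Ishita"]},
    "hi-IN": {"male": ["Shubh", "Ashutosh"], "female": ["Priya", "Suhani"]},
    "te-IN": {"male": ["Shubh", "Ratan"], "female": ["Neha", "Priya"]},
    "kn-IN": {"male": ["Shubh", "Ratan"], "female": ["Neha", "Ishita"]},
    "bn-IN": {"male": ["Rehan"], "female": ["Roopa", "Suhani"]},
    "ta-IN": {"male": ["Ratan", "Rohan"], "female": ["Ishita", "Ritu"]},
    "od-IN": {"male": ["Shubh"], "female": ["Ritu", "Pooja"]},
    "ml-IN": {"male": ["Shubh"], "female": ["Pooja"]},
    "mr-IN": {"male": ["Ratan"], "female": ["Priya", "Ritu"]},
    "pa-IN": {"male": ["Mani"], "female": ["Roopa", "Suhani"]},
    "gu-IN": {"male": ["Ratan"], "female": ["Priya", "Ritu"]},
}


def pick_speaker(language_code: str, gender: str = "female") -> str:
    """Return the first listed primary speaker for a language and gender."""
    try:
        return PRIMARY_SPEAKERS[language_code][gender][0]
    except KeyError:
        raise ValueError(f"no primary speaker for {language_code!r} / {gender!r}") from None
```

For example, `pick_speaker("pa-IN", "male")` returns `"Mani"`. Unknown language codes raise a `ValueError` rather than silently falling back to a voice that has not been validated for that language.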
Top picks — the short version
- Best overall female voice: Priya (CER 0.13%). Works across Hindi, Telugu, Kannada, Tamil, Marathi, and Gujarati.
- Best overall female for English: Ishita (CER 0.13%). Also excellent for Kannada, Tamil, and Bengali.
- Best overall male for South Indian languages: Shubh (CER 0.30%). Covers Hindi, Telugu, Kannada, Odia, and Malayalam.
- Best overall male for East/West Indian languages: Ratan (CER 0.33%). Covers English, Telugu, Kannada, Tamil, Marathi, and Gujarati.
- Best single-language specialist: Mani (CER 0.00%) for Punjabi — near-perfect accuracy.
7. Use-case configurations
The combinations below have been validated in production across the most common Bulbul V3 deployment patterns. Treat them as starting points: every application has contextual nuances worth tuning, but starting here means starting close.
| Use Case | Language | Speaker(s) | Pace | Temp | Format | Sample Rate |
|---|---|---|---|---|---|---|
| Voice Agent (Chat) | hi-IN | Priya / Shubh | 1.0 | 0.6 | PCM | 16kHz |
| IVR / Telephony | hi-IN, en-IN | Ratan / Ishita | 1.1 | 0.4 | MULAW | 8kHz |
| EdTech Narration | hi-IN, ta-IN | Shubh / Ishita | 0.9 | 0.6 | MP3 | 22kHz |
| BFSI Notification | hi-IN, en-IN | Ashutosh / Ratan | 1.1 | 0.3 | MP3 | 22kHz |
| Wellness / Meditation | hi-IN | Priya / Suhani | 0.75 | 0.5 | MP3 | 24kHz |
| News Briefing | hi-IN, en-IN | Ratan / Ishita | 1.2 | 0.5 | MP3 | 22kHz |
| Storytelling | hi-IN, bn-IN | Shubh / Roopa | 0.9 | 0.8 | WAV | 24kHz |
| Thriller / Suspense | hi-IN | Varun | 0.9 | 0.8 | MP3 | 24kHz |
| Punjabi App | pa-IN | Mani / Roopa | 1.0 | 0.6 | MP3 | 22kHz |
| Malayalam Agent | ml-IN | Shubh / Pooja | 1.0 | 0.6 | PCM | 16kHz |
Patterns in the table
- Voice agents and conversational apps converge on Temperature 0.6, balanced expressiveness that feels natural without variability that might unsettle users.
- Storytelling and wellness both slow the Pace and raise the sample rate to 24kHz, but they differ on Temperature: storytelling goes up to 0.8, while wellness stays at 0.5. In these contexts, the audio experience is the product and every parameter should serve immersion.
- Voice agent configurations (Voice Agent, Malayalam Agent) use PCM at 16kHz, and telephony uses MULAW at 8kHz. Most content configurations use MP3 at 22kHz; the immersive ones (wellness, storytelling, thriller) move up to 24kHz.
- Thriller/Suspense is the only configuration pinned to a single speaker, Varun; his dramatic character is designed to work with the expressive latitude of Temperature 0.8.
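A few rows of the table can be kept as named presets so that every request for a given use case starts from the same configuration. This is a sketch under stated assumptions: the `TTSPreset` class, the preset keys, and `request_kwargs` are hypothetical names, each preset takes the first speaker listed in its row, and the Hz values use 22050 for "22kHz" (as in the REST quick start) and an assumed 24000 for "24kHz". The keyword names mirror the quick-start calls.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TTSPreset:
    speaker: str
    pace: float
    temperature: float
    output_format: str
    sample_rate: int


# A subset of the use-case table; first listed speaker per row.
PRESETS = {
    "voice_agent": TTSPreset("Priya", 1.0, 0.6, "pcm", 16000),
    "ivr": TTSPreset("Ratan", 1.1, 0.4, "mulaw", 8000),
    "edtech": TTSPreset("Shubh", 0.9, 0.6, "mp3", 22050),
    "wellness": TTSPreset("Priya", 0.75, 0.5, "mp3", 24000),
    "storytelling": TTSPreset("Shubh", 0.9, 0.8, "wav", 24000),
}


def request_kwargs(use_case: str, language_code: str) -> dict:
    """Build keyword arguments for a TTS call from a named preset."""
    preset = PRESETS[use_case]
    return {
        "target_language_code": language_code,
        "model": "bulbul:v3",
        **asdict(preset),
    }
```

The resulting dictionary can be unpacked into a call shaped like the quick-start examples, with individual values overridden where an application needs to deviate from the preset.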
8. Design boundaries & input conventions
Every model has a design envelope — the input conditions and configuration ranges it is optimised for. Understanding Bulbul V3's envelope helps you build applications that stay well within it, where performance is strongest.
The script input convention is the most impactful integration decision
Bulbul V3 is trained on Indian speech data with a clear script convention: Indic words in native script, English words in Roman script. Inputs that follow this convention produce the best output.
Romanised Indic input — transliterating Devanagari or Tamil or Telugu into Roman characters — puts the model outside its training distribution. Sending native-script Indic text keeps it well within it.
| Language | Outside convention ✗ | Within convention ✓ |
|---|---|---|
| Hindi | "Aapka order confirm ho gaya hai" | "आपका order confirm हो गया है" |
| Tamil | "Unga order deliver aayidum" | "உங்க order deliver ஆயிடும்" |
| Mixed | "Meri flight 6 baje hai" | "मेरी flight 6 बजे है" |
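A cheap pre-flight check can catch romanised Indic input before it reaches the model. The heuristic below flags a request when the target language is not English and the text contains Latin letters but no characters from the Indic Unicode blocks (U+0900 to U+0D7F, Devanagari through Malayalam). The function name `looks_romanised` is hypothetical, and the check cannot tell a fully romanised sentence from a legitimately all-English one, so treat a positive result as a prompt to review rather than an error.

```python
import re

# Devanagari, Bengali, Gurmukhi, Gujarati, Odia, Tamil, Telugu, Kannada, Malayalam.
INDIC_CHAR = re.compile(r"[\u0900-\u0D7F]")
LATIN_CHAR = re.compile(r"[A-Za-z]")


def looks_romanised(text: str, target_language_code: str) -> bool:
    """Heuristic: True if an Indic-language request has Latin letters but no native script."""
    if target_language_code == "en-IN":
        return False  # English input is expected to be in Roman script
    has_latin = LATIN_CHAR.search(text) is not None
    has_indic = INDIC_CHAR.search(text) is not None
    return has_latin and not has_indic
```

Applied to the table above, `"Aapka order confirm ho gaya hai"` with `hi-IN` is flagged, while `"आपका order confirm हो गया है"` passes.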
Character limits
Both REST and WebSocket modes support up to 2,500 characters per call or session. For longer content — narrations, articles, customer service scripts — this means chunking the input.
Chunk at sentence boundaries, not at character counts. A sentence boundary detector on the input ensures each chunk starts and ends at a natural speech boundary, producing smooth audio across segments.
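The chunking approach described above can be sketched as a greedy packer that splits on sentence terminators and fills each chunk up to the limit. The splitter here is deliberately simple (`.`, `!`, `?`, and the danda `।` followed by whitespace), the name `chunk_text` is hypothetical, and the limit is assumed to count Python string length (Unicode code points), which may not match how the API counts characters.

```python
import re

MAX_CHARS = 2500  # per-request limit stated above

# Split on whitespace that follows ., !, ? or the danda.
SENTENCE_END = re.compile(r"(?<=[.!?।])\s+")


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks, current = [], ""
    for sentence in SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        if len(sentence) > max_chars:
            raise ValueError("a single sentence exceeds max_chars; split it manually")
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The next sentence does not fit: close this chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be synthesised as a separate request and the audio concatenated in order. Abbreviations and decimal numbers followed by a space will still trigger a split, so long-form content may benefit from a language-aware sentence tokenizer.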
Speaker-language fit
CER is measured at the speaker level. The same speaker can perform at Tier 1 on one language and lower on another. Always validate the specific speaker-language combination you plan to ship, not just the speaker's overall CER score.
Resources
| Resource | What it covers | Link |
|---|---|---|
| API Documentation | Full reference for all endpoints, parameters, SDKs | docs.sarvam.ai |
| API Dashboard | API keys, usage tracking, free credits | dashboard.sarvam.ai |
| Python SDK | REST, WebSocket, batch, and async | pip install sarvamai |
| JavaScript SDK | Real-time STT/TTS WebSocket, translation, chat | npm install sarvamai |
| LiveKit Integration | Full-duplex voice agents using Sarvam STT + TTS | docs.sarvam.ai/guides/livekit |
| Pipecat Integration | Voice pipelines and conversational bots | docs.sarvam.ai/guides/pipecat |
| Cookbooks & Examples | Voice agents, IVR flows, EdTech bots, and more | docs.sarvam.ai/cookbooks |
| Discord Community | 80K+ developers; support, releases, use cases | discord.gg/sarvam |
| Enterprise Support | Custom pricing, SLAs, on-premises deployment | developer@sarvam.ai |
Quick reference card
API Essentials
| Setting | Value |
|---|---|
| Model | bulbul:v3 |
| REST endpoint | /text-to-speech |
| WebSocket endpoint | wss://api.sarvam.ai/v1/text-to-speech/stream |
| Character limit | 2,500 chars (both modes) |
Defaults
| Parameter | Default |
|---|---|
| Pace | 1.0 |
| Temperature | 0.6 |
| Agent format | PCM @ 16kHz |
| Telephony format | MULAW @ 8kHz |
| Content delivery | MP3 @ 22kHz |
Language → Default Speaker
| Language | Code | Male | Female |
|---|---|---|---|
| English | en-IN | Ratan | Ishita |
| Hindi | hi-IN | Shubh / Ashutosh | Priya / Suhani |
| Telugu | te-IN | Shubh / Ratan | Neha / Priya |
| Kannada | kn-IN | Shubh / Ratan | Neha / Ishita |
| Bengali | bn-IN | Rehan | Roopa / Suhani |
| Tamil | ta-IN | Ratan / Rohan | Ishita / Ritu |
| Odia | od-IN | Shubh | Ritu / Pooja |
| Malayalam | ml-IN | Shubh | Pooja |
| Marathi | mr-IN | Ratan | Priya / Ritu |
| Punjabi | pa-IN | Mani | Roopa / Suhani |
| Gujarati | gu-IN | Ratan | Priya / Ritu |