The Demo That Failed
We'd been building Aanchal — an AI learning platform for Indian children — for three weeks. 117 passing tests. A 7-agent pipeline. CSRF on every route. Clean architecture.
Then we put it in front of actual kids.
The first session — a Telugu dragon story — lasted 90 seconds before everything fell apart.
Over the next week, we tested with children aged 3 to 8 across Telugu, Hindi, and English. Every session surfaced a new failure mode. This is the story of those sessions, the migration from Gemini 2.5 to 3.1 Live, and the five audio approaches that failed before one worked.
What We Were Building
Aanchal is an AI learning platform for Indian children (ages 3-12). A parent creates an experience — a branching adventure or an interactive story — and the AI generates it. The child opens their tablet, and a warm voice tells them the story in their mother tongue. The child can talk back, ask questions, make choices, and the story adapts.
The parent controls everything. The child just plays.
Under the hood: the parent's own Google Gemini API key (BYOK model), encrypted at rest, used to create an ephemeral token for the child's device. The child's tablet connects via WebSocket to Gemini Live. Voice in, voice out. Real-time.
Act 1: The 2.5 Struggles
We launched on gemini-2.5-flash-native-audio-preview. It worked in demos. It failed with kids.
Problem 1: VAD Cuts Kids Off
Gemini's Voice Activity Detection defaults are tuned for adults in quiet rooms. A young child in an Indian apartment with TV in the background? The model heard a 100ms pause and assumed the kid was done talking. Mid-sentence.
The fix was supposed to be simple: endOfSpeechSensitivity: LOW, silenceDurationMs: 1000. But on 2.5, VAD parameters were silently ignored for ephemeral token connections. The config was accepted, the behaviour didn't change.
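For reference, the tuning we attempted looked roughly like this. The field names follow the Live API's realtimeInputConfig shape as we used it; treat the exact enum strings as our best reading of the API, not gospel:

```javascript
// Live API connection config sketch — the VAD tuning that 2.5 silently
// ignored on ephemeral-token connections (and that 3.1 later honoured).
const liveConfig = {
  responseModalities: ['AUDIO'],
  realtimeInputConfig: {
    automaticActivityDetection: {
      // Less eager end-of-speech detection: don't treat a child's
      // mid-sentence pause as the end of the turn.
      endOfSpeechSensitivity: 'END_SENSITIVITY_LOW',
      // Wait a full second of silence before closing the turn.
      silenceDurationMs: 1000,
    },
  },
};
```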
So we built an entire workaround: an 8-second nudge timer. When the AI finished talking and the kid was silent, wait 8 seconds, show a "Tap to continue" overlay, then send synthetic TTS audio to nudge the story forward. The nudge system had its own bugs — timers racing, overlays appearing over choices, TTS playing alongside AI audio.
Problem 2: Context Degradation
One long WebSocket session. Every turn adds to the context. By turn 12, the model started forgetting the plot. By turn 15, it mixed up character names. The system prompt was ~2800 tokens — so dense the model couldn't follow it while also tracking the conversation.
Problem 3: Premature End Detection
How do you know the story is over? We parsed the transcript for "THE END." Sounds simple. In practice, the phrase arrived translated into the story's language, paraphrased, or split across chunked transcript events. We built increasingly complex regex patterns to catch the variants. They broke on every new language.
Problem 4: The Model Was Deprecated
On March 29 — the same day as our kid testing disaster — we discovered that gemini-2.5-flash-native-audio-preview-12-2025 had been deprecated 10 days earlier. No migration warning we noticed. The model still responded, but with degraded quality.
Act 2: Five Audio Approaches (Four Failed)
When we migrated to Gemini 3.1 Flash Live (gemini-3.1-flash-live-preview), the first thing that broke was audio playback. 3.1 sends 24kHz PCM. Browsers default to 44.1kHz or 48kHz. The mismatch produced chipmunk voices.
Approach 1: Raw AudioBufferSource per chunk. Create an AudioBuffer for each incoming chunk, play sequentially. Result: clicking artifacts at chunk boundaries. Every 40ms, a tiny pop.
Approach 2: Force AudioContext to 24kHz. new AudioContext({ sampleRate: 24000 }). Chrome accepts this... and then lies. The actual output rate depends on the hardware. On most Android devices, you get 48kHz output with 24kHz data = 2x speed, Mickey Mouse voice.
Approach 3: AudioWorklet at 24kHz. Process PCM in a worklet thread. Same problem — the worklet's callback fires at the hardware rate, not the declared rate. Data plays at 2x.
Approach 4: AudioWorklet with manual linear interpolation. Resample 24kHz → 48kHz in the worklet by interpolating between samples. Technically correct. Audibly worse — resampling artifacts, latency, CPU spikes on low-end Android phones.
Approach 5: WAV header + decodeAudioData. Prepend a WAV header to each chunk declaring "this is 24kHz, mono, 16-bit PCM." Let decodeAudioData() handle the resampling. This worked. The browser reads the header as truth and resamples correctly. Clean audio, no artifacts, works on every device.
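A minimal sketch of Approach 5. The layout is the standard 44-byte RIFF/WAVE preamble for raw PCM; `wrapPcmInWav` is our helper name, not an SDK call:

```javascript
// Prepend a 44-byte WAV header to a raw PCM chunk so the browser's
// decodeAudioData() knows the true sample rate and resamples correctly.
function wrapPcmInWav(pcmBytes, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const blockAlign = (channels * bitsPerSample) / 8;
  const byteRate = sampleRate * blockAlign;
  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeStr = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeStr(0, 'RIFF');
  view.setUint32(4, 36 + pcmBytes.byteLength, true); // total size minus 8
  writeStr(8, 'WAVE');
  writeStr(12, 'fmt ');
  view.setUint32(16, 16, true);              // fmt chunk size
  view.setUint16(20, 1, true);               // format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);      // the lie-proof part: declared rate
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeStr(36, 'data');
  view.setUint32(40, pcmBytes.byteLength, true);
  const wav = new Uint8Array(44 + pcmBytes.byteLength);
  wav.set(new Uint8Array(header), 0);
  wav.set(pcmBytes, 44);
  return wav;
}
```

In the playback path, each incoming chunk then goes through something like `await audioCtx.decodeAudioData(wrapPcmInWav(chunk).buffer)` and plays at whatever rate the hardware actually runs.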
The lesson: don't fight the browser's audio stack. Give it metadata and let it do the work.
Act 3: The 3.1 Migration
3.1 isn't a drop-in upgrade from 2.5. Key differences we discovered the hard way:
No proactiveAudio. On 2.5, the model could spontaneously start talking. On 3.1, this doesn't exist. The model only speaks when prompted. Our entire nudge system was built around proactiveAudio — all of it was dead code.
No affectiveDialog. 2.5 had emotional tone adjustment. 3.1 doesn't. Not a big loss — the model is expressive without it.
sendClientContent crashes the constrained endpoint. We used session.sendClientContent() to send text mid-conversation. On 3.1's v1alpha BidiGenerateContentConstrained endpoint, this returns WebSocket close code 1007. The fix: session.sendRealtimeInput({ text: "..." }). Same functionality, different API.
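The swap itself is one line; a sketch, assuming a connected Live session object (`sendTextToSession` is our wrapper name, not an SDK method):

```javascript
// On 3.1's constrained endpoint, session.sendClientContent(...) closes the
// WebSocket with code 1007. sendRealtimeInput() delivers the same text safely.
function sendTextToSession(session, text) {
  session.sendRealtimeInput({ text });
}
```

We funnelled every mid-conversation text send through one wrapper like this, so the endpoint-specific quirk lives in exactly one place.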
Multi-part server events. On 2.5, audio and transcription came in separate events. On 3.1, a single event can contain audio + transcription + turnComplete. Our event parser assumed one-type-per-event and dropped data.
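The fix was a parser that collects everything from one event instead of returning on the first match. A sketch — the event shape mirrors what we observed (`serverContent` carrying model-turn parts, a transcription, and `turnComplete` together); treat the field names as illustrative:

```javascript
// Tolerate 3.1's multi-part server events: one event may carry
// audio + transcription + turnComplete all at once.
function parseServerEvent(event) {
  const result = { audioChunks: [], transcript: '', turnComplete: false };
  const content = event.serverContent;
  if (!content) return result;
  // Collect ALL audio parts — never bail after the first match.
  for (const part of content.modelTurn?.parts ?? []) {
    if (part.inlineData?.data) result.audioChunks.push(part.inlineData.data);
  }
  if (content.outputTranscription?.text) {
    result.transcript = content.outputTranscription.text;
  }
  if (content.turnComplete) result.turnComplete = true;
  return result;
}
```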
thinkingConfig uses thinkingLevel, not thinkingBudget. The parameter name changed. We set thinkingLevel: 'minimal' and includeThoughts: false — without this, the model's internal reasoning leaked into the audio. Kids heard "Let me think about what to say next..." before the actual narration.
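The fragment we ended up with, roughly (parameter names as we used them on 3.1; 2.5's thinkingBudget is gone):

```javascript
// 3.1 config sketch: keep the model's internal reasoning out of the
// audio stream so kids never hear "Let me think about what to say next..."
const thinkingTuning = {
  thinkingConfig: {
    thinkingLevel: 'minimal', // replaces 2.5's thinkingBudget
    includeThoughts: false,   // without this, reasoning leaked into narration
  },
};
```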
What 3.1 Fixed
VAD tuning works for ephemeral tokens. We set silenceDurationMs: 1000 and it actually applied. Kids got the 1 second of patience they needed.
Context window compression. contextWindowCompression: { slidingWindow: {} }. Sessions no longer degraded after 15 minutes. The model maintained coherence for the entire story.
Function calling replaced transcript parsing. Instead of parsing "THE END" from audio transcripts, we gave the model a story_complete() tool. When the story is done, it calls the function. Structured, reliable, language-independent. This one change eliminated an entire class of bugs.
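A sketch of that lifecycle tool. The declaration follows the Gemini function-declaration shape; the parameter list and the `isStoryComplete` helper are our design, not the SDK's:

```javascript
// Declare a story_complete tool so the model signals the ending
// structurally instead of saying "THE END" in some language.
const storyCompleteTool = {
  functionDeclarations: [{
    name: 'story_complete',
    description: 'Call this exactly once, when the story has reached its ending.',
    parameters: {
      type: 'OBJECT',
      properties: {
        summary: { type: 'STRING', description: 'One-line recap for the learning report' },
      },
    },
  }],
};

// Check tool-call events instead of regex-matching transcripts.
function isStoryComplete(toolCallEvent) {
  return (toolCallEvent.toolCall?.functionCalls ?? [])
    .some((fc) => fc.name === 'story_complete');
}
```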
Text nudges work mid-session. sendRealtimeInput({ text: "Continue the story." }) works on 3.1. No more synthetic TTS audio nudges. When the kid is silent for 8 seconds, we send a text message. The AI continues narrating. Clean, no audio artifacts, no timing hacks.
Act 4: The Kids That Changed Everything
We ran the same stories on 3.1 with the same group of kids. The difference was immediate.
The Firestore transcripts from those sessions showed a model that was genuinely adaptive. The compact prompt (600 tokens, down from 2800) gave it room to think. Function calling gave clean lifecycle signals. VAD tuning gave kids patience.
Young children's Telugu was transcribed as Korean, Italian, Hindi, French, Japanese — speech recognition chaos. The AI gracefully extracted the Telugu words and ignored the noise.
Sessions averaged 10-15 minutes. Multiple kids asked to play again immediately.
What We Learned
1. Test with real kids, not simulators. Every child found assumptions we didn't know we had. Pre-readers can't tap text buttons. 100ms silence thresholds cut off kids mid-thought. Beautifully architected nudge systems mean nothing if the model just needs more patience.
2. Less prompt = better. 600 tokens beats 2800. The model needs room to think. Remove stage directions (3.1 reads them aloud). Remove time references (AI can't track minutes). One clear instruction beats ten hedged ones.
3. Function calling > transcript parsing. Structured signals are the only reliable lifecycle mechanism. "THE END" in Telugu across chunked audio events is not a reliable signal. A function call is.
4. Browser audio is a minefield. Don't resample manually. Don't fight AudioContext sample rates. Prepend a WAV header and let decodeAudioData() handle it. Five approaches failed before we found the one that works on every device.
5. Duration is milestones, not minutes. The model can't track time. Give it a number of plot points to hit: "Visit 3 locations, meet 2 characters, solve 1 problem." Story ends when the function is called.
6. Indian households are noisy. Always-on microphone floods the model with TV audio, family conversations, pressure cooker whistles. Give kids a talk button they toggle on and off. Gemini's VAD handles turn-taking within the conversation; the toggle gates the conversation itself.
7. 3.1 is a different product, not an upgrade. No proactiveAudio, no affectiveDialog, different event format, different text API. But the quality improvement is dramatic — if you rebuild around 3.1's strengths instead of patching 2.5 patterns onto it.
The Architecture Today
Parent creates story → 7-agent pipeline generates content → Parent approves
↓
Kid opens tablet → Ephemeral token (30min, 4 uses) → Gemini 3.1 Flash Live
↓
WebSocket: system instruction + scene text → AI narrates + kid talks
↓
story_complete() function call → Learning report → Parent reviews

The kids portal is a dumb terminal. Zero AI orchestration. Zero analytics. Zero tracking. All intelligence lives server-side, gated by the parent.
If you're building on Gemini Live — especially for non-English speakers, children, or noisy environments — the three things that matter most: WAV headers for audio, function calling for lifecycle, and giving your users more silence patience than you think they need.