Per-event LoRA adapters for Whisper ASR

Smart glasses hear the world through your ears: conference names, speaker names, and product jargon that frontier ASR models have never seen. At Gobi I built a per-event adaptation pipeline that fine-tunes, for each conference domain, a lightweight LoRA adapter (r=16, α=32 on q_proj/v_proj) on fully synthetic speech.
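
To make the adapter settings concrete, here is a minimal pure-Python sketch of what a LoRA update does to a single projection such as q_proj. It is an illustration of the standard LoRA formulation, not the pipeline's actual training code: the class name LoRALinear, the helper matvec, and the toy dimensions are all hypothetical. With r=16 and α=32 the low-rank path is scaled by α/r = 2, and because B starts at zero a fresh adapter leaves the base output unchanged.

```python
import random


def matvec(m, x):
    """Multiply matrix m (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class LoRALinear:
    """Frozen base weight W plus an optional low-rank update (alpha / r) * B @ A.

    Hypothetical illustration only; real adapters wrap framework layers.
    """

    def __init__(self, weight, r=16, alpha=32, seed=0):
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        self.weight = weight        # frozen base projection, never modified
        self.scale = alpha / r      # 32 / 16 = 2.0 for the settings above
        # A gets small random values and B starts at zero,
        # so an untrained adapter contributes nothing.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.enabled = True         # flip to False to get pure base behaviour

    def forward(self, x):
        base = matvec(self.weight, x)
        if not self.enabled:
            return base
        # Low-rank path: d_in -> r -> d_out
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Only A and B are trained: r * (d_in + d_out) values."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

Since the base weight is never written to, disabling the adapter (or swapping A and B for another event's pair) restores or replaces the domain behaviour without touching the underlying model.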

The synthetic data pipeline generates entity-rich utterances with Gemini 2.5 and voices them with Cloud TTS Chirp 3 HD: roughly 26,000 utterances (55 hours) across five conference domains, with no human recordings required. A failure-mining and contrastive-dataset loop feeds real transcription misses back into training-data generation, so each adapter targets the exact entities the base model gets wrong. The adapters fixed 57 entity errors and cut entity-level WER from 9.91% to 8.46% raw (5.58% after canonicalization), while the swap-in adapter design leaves base-model performance untouched outside the target domain.
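
The failure-mining step can be sketched roughly as follows. This is a simplified stand-in, not the metric used for the numbers above: it assumes an entity counts as an error when it is missing from the transcript, and it assumes "canonicalized" means comparing after lowercasing and stripping accents and punctuation. The function names canonicalize and entity_misses are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def canonicalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace.

    Assumed normalization; the real canonicalization rules may differ.
    """
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def entity_misses(samples, canonical=False):
    """Return (miss_rate, Counter of missed entities).

    samples: iterable of (hypothesis_transcript, [reference entities]).
    An entity is a miss when it does not appear as whole words
    in the hypothesis (after optional canonicalization).
    """
    norm = canonicalize if canonical else (lambda s: s)
    missed, total = Counter(), 0
    for hypothesis, entities in samples:
        hyp = f" {norm(hypothesis)} "      # pad so matches respect word boundaries
        for entity in entities:
            total += 1
            if f" {norm(entity)} " not in hyp:
                missed[entity] += 1
    rate = sum(missed.values()) / total if total else 0.0
    return rate, missed
```

Under these assumptions, the entities in the returned counter (most frequent first via missed.most_common()) are the ones to feed back into utterance generation, so the next round of synthetic data over-samples exactly the names the base model gets wrong.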

Why it matters: wearables can’t ship a fine-tuned model per customer. Adapters that train on synthetic data in hours and hot-swap per event make domain adaptation operationally cheap.

Related reading: LoRA-Whisper (arXiv:2406.06619), DAS (arXiv:2501.12501), synthetic cross-accent augmentation (arXiv:2303.00802).