Fine-Tuning Whisper for Dysarthric Speech: 47% WER Reduction with 0.73% of Parameters
Whisper achieves near-human accuracy on standard speech. On dysarthric speech, it hallucinates entire sentences. I adapted less than 1% of the model to cut errors in half. Here's the approach, the results, and what I learned.
The problem
OpenAI's Whisper Small was trained on 680,000 hours of internet audio. On healthy speakers from the TORGO dysarthric speech corpus, it achieves 10.6% WER. On dysarthric speakers from the same corpus, it achieves 77.5% WER. One speaker with severe dysarthria (M04) hit 145.6% WER: the model generates more words than the speaker actually said, hallucinating fluent sentences from unintelligible input.
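WER above 100% is possible because insertions, deletions, and substitutions are all divided by the fixed reference length. A toy word-level implementation (illustrative only, not this project's evaluation code) makes that concrete:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard word-level Levenshtein distance
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(r)

# A one-word utterance "hallucinated" into a fluent sentence: six insertions
# against one reference word gives 600% WER.
print(wer("yes", "yes i would like that very much"))  # 6.0
```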
The root cause is the decoder. Whisper's encoder captures acoustic features reasonably well across speaker types. But the decoder's language model, trained on typical speech patterns, overrides uncertain acoustic evidence with fluent predictions. It "corrects" atypical speech into something a healthy speaker would have said.
The approach
Model: Whisper Small (244M parameters)
Method: LoRA on decoder attention layers only (q_proj, v_proj). Rank 16, alpha 32. This gives 1.77M trainable parameters, 0.73% of the model.
Why decoder-only? The encoder generalises across speakers. The decoder is where Whisper fails on clinical speech: its language model aggressively normalises atypical productions. Adapting only the decoder teaches the model to trust what the encoder heard rather than rewriting it.
Data: TORGO database, 5.5 hours of dysarthric speech and 8 hours of healthy controls. Speaker-independent train/test splits (no speaker leakage).
Training: 10 epochs, batch size 16 (effective), cosine LR schedule, fp16. Total training time: 21 minutes on a single NVIDIA B200. Total cloud compute cost: approximately $4.00 on RunPod.
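Decoder-only targeting of q_proj/v_proj can be expressed through PEFT's regex form of `target_modules` (when given a string, PEFT matches it with `re.fullmatch` against each module path). The pattern below is a plausible sketch, not necessarily the repo's exact config; module paths follow transformers' Whisper naming:

```python
import re

# Hypothetical decoder-only pattern; with PEFT you would pass it as
# LoraConfig(r=16, lora_alpha=32, target_modules=DECODER_QV).
DECODER_QV = r"model\.decoder\.layers\.\d+\.(self_attn|encoder_attn)\.(q_proj|v_proj)"

# Representative Whisper Small module paths (transformers naming scheme)
modules = [
    "model.encoder.layers.0.self_attn.q_proj",      # encoder: left untouched
    "model.decoder.layers.0.self_attn.q_proj",      # decoder self-attention: adapted
    "model.decoder.layers.0.self_attn.k_proj",      # k_proj: not targeted
    "model.decoder.layers.11.encoder_attn.v_proj",  # decoder cross-attention: adapted
]

targeted = [m for m in modules if re.fullmatch(DECODER_QV, m)]
print(targeted)
```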
Results
Baseline vs. LoRA
| | Baseline | + LoRA | Δ Relative |
|---|---|---|---|
| Overall | 31.6% WER | 16.9% WER | −47% |
| Dysarthric | 77.5% | 40.7% | −47% |
| Healthy | 10.6% | 6.0% | −43% |
Per speaker
| Speaker | Status | Baseline WER | LoRA WER | Δ |
|---|---|---|---|---|
| M04 | severe dysarthria | 145.6% | 65.3% | −55% |
| F03 | moderate dysarthria | 36.7% | 26.0% | −29% |
| FC02 | healthy | 9.9% | 6.0% | −39% |
| MC03 | healthy | 11.4% | 5.8% | −49% |
Semantic similarity (SemScore)
WER measures surface accuracy. SemScore (cosine similarity of sentence embeddings via all-MiniLM-L6-v2) measures meaning preservation, which is critical for clinical speech where surface form varies but intent is consistent.
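SemScore here reduces to the cosine similarity of two sentence embeddings. A minimal sketch (the embedding call is left as a comment because it assumes the sentence-transformers package; the cosine itself is plain Python):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# With sentence-transformers (assumed usage, not this repo's exact code):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   ref_emb, hyp_emb = model.encode([reference, hypothesis])
#   score = cosine(ref_emb, hyp_emb)

print(cosine([3.0, 4.0], [3.0, 4.0]))  # 1.0 for identical vectors
```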
| | Baseline | + LoRA |
|---|---|---|
| Healthy | 0.80 | 0.92 |
| Dysarthric | 0.47 | 0.64 |
| M04 (severe) | 0.32 | 0.51 |
| M04 (severe) | 0.32 | 0.51 |
Convergence speed
The model converged remarkably fast. Performance after just 2 epochs was virtually identical to the final 10-epoch result:
| | 2 epochs | 10 epochs |
|---|---|---|
| Dysarthric WER | 40.5% | 40.7% |
| Healthy WER | 5.9% | 6.0% |
| Overall WER | 16.8% | 16.9% |
| Overall WER | 16.8% | 16.9% |
The validation WER progression across all 10 epochs confirms the plateau:
| Epoch | Val WER | Train Loss |
|---|---|---|
| 1 | – | 2.98 → 0.70 |
| 2 | 0.1021 | 0.52 → 0.42 |
| 3 | 0.0766 | 0.30 → 0.26 |
| 4 | 0.0692 | 0.18 → 0.17 |
| 5 | 0.0639 | 0.11 → 0.10 |
| 6 | 0.0631 | 0.07 → 0.07 |
| 7 | 0.0639 | 0.05 → 0.05 |
| 8 | 0.0628 | 0.04 → 0.03 |
| 9 | 0.0647 | 0.03 → 0.03 |
| 10 | 0.0636 | 0.02 → 0.03 |
The model reaches its best validation WER at epoch 8, with diminishing returns after epoch 3. Train loss continues to decrease but validation WER plateaus: classic mild overfitting on the training speakers that doesn't generalise further. This has practical implications: per-speaker adaptation could be done in minutes, not hours.
Technical decisions and trade-offs
Why Whisper Small, not Large? Tractability. Small fits in 5GB VRAM, trains on any GPU, and still demonstrates the adaptation pattern. If it works here, it scales up. If it doesn't work here, throwing more parameters at it won't help.
Why LoRA instead of full fine-tuning? Three reasons: (1) it avoids catastrophic forgetting on healthy speech, because the base model's weights stay frozen; (2) the adapter is 7MB vs 1GB for a full model, so it's deployable on device; (3) it's composable: you can swap adapters per speaker or condition without retraining the base model.
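The 7MB figure is consistent with storing the 1.77M adapter weights in fp32 (an assumption about the saved dtype):

```python
params = 1_770_000       # trainable LoRA parameters reported above
bytes_per_param = 4      # assuming the adapter is saved in fp32
print(f"{params * bytes_per_param / 1e6:.1f} MB")  # ~7.1 MB
```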
Why not freeze the encoder entirely? I did. The LoRA targets are decoder-only. The encoder is completely unchanged. This was a deliberate architectural choice: the encoder's acoustic representations generalise across speakers, but the decoder's language model doesn't.
Text normalisation matters more than you'd think. Without normalisation, "dot" vs "Dot." gives 100% WER. The evaluation pipeline strips punctuation, lowercases, and applies Unicode NFKD normalisation before computing WER. SemScore uses raw text since semantic similarity benefits from surface form.
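A minimal normaliser in the spirit of that pipeline (a stdlib-only sketch; the actual evaluation code may differ in detail):

```python
import string
import unicodedata

def normalize(text: str) -> str:
    """NFKD-normalise, lowercase, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("Dot."))  # "dot" - now matches the reference "dot", so 0% WER
```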
Audio decoding on macOS is a minefield. HuggingFace datasets' default audio decoder (torchcodec) crashes on macOS due to fork() safety. The entire data pipeline uses soundfile with Audio(decode=False) and manual loading, a pattern that works on macOS, Linux, and cloud GPUs without changes.
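The repo's workaround relies on soundfile, which isn't reproduced here; but the pattern itself, decoding raw bytes to samples yourself instead of trusting the dataset's default decoder, can be shown with only the stdlib `wave` module:

```python
import io
import struct
import wave

# Build a tiny mono 16 kHz WAV in memory so the example is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 100, -100, 0))

# Manual decode: bytes in, sample rate and samples out - the same shape of
# loader used with Audio(decode=False), just with soundfile in the real pipeline.
buf.seek(0)
with wave.open(buf, "rb") as w:
    sr = w.getframerate()
    samples = struct.unpack(f"<{w.getnframes()}h", w.readframes(w.getnframes()))

print(sr, samples)  # 16000 (0, 100, -100, 0)
```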
Training hardware
Two configurations were used during development:
Debug run (2 epochs):

| Spec | Value |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition (95 GB VRAM) |
| CPU | AMD EPYC 9375F 32-Core (128 logical cores) |
| RAM | 2,267 GB |
| Throughput | 12.1 steps/s, 48 samples/s |
| GPU peak memory | 2,500 MB |
| Total time | 7.3 min |
Full run (10 epochs):

| Spec | Value |
|---|---|
| GPU | NVIDIA B200 (178 GB VRAM) |
| CPU | AMD EPYC 9555 64-Core (224 logical cores) |
| RAM | 2,267 GB |
| Throughput | 9.5 steps/s, 153 samples/s |
| GPU peak memory | 5,636 MB |
| Total time | 21.4 min |
| Best val WER | 0.0628 (epoch 8) |
The model uses under 6 GB of VRAM at peak. Either GPU is massive overkill: a T4 (16 GB, ~$0.20/hr) or even a free Colab T4 would handle this fine, just slower (~3.5 steps/s vs 9.5). The B200 and RTX PRO 6000 were used because they were considerably faster, not because the workload required them.
Both runs used PyTorch 2.8.0, CUDA 12.8, Transformers 5.3.0, PEFT 0.18.1, and fp16 mixed precision.
Infrastructure
The project runs anywhere with one command:
```bash
git clone https://github.com/DavidBarbera/whisper-clinical-speech.git
cd whisper-clinical-speech
./run_all.sh  # prepare → train → evaluate baseline → evaluate LoRA → push to HuggingFace
```
Key infrastructure choices:
- Timestamped runs: every train/eval run gets a unique directory under `outputs/`. Nothing overwrites.
- JSONL streaming: evaluation writes transcriptions line-by-line. Crash at utterance 4,000? You keep utterances 1–3,999.
- Auto-resume: restart evaluation and it picks up where it left off.
- Stable symlinks: `outputs/checkpoints/latest/final` and `outputs/results/latest_baseline_test/` always point to the most recent run.
- Platform detection: macOS gets `num_workers=0`, `OBJC_DISABLE_INITIALIZE_FORK_SAFETY`, and the soundfile fallback; Linux gets full parallelism. No flags to set.
- Hardware logging: every run logs CPU, GPU, RAM, VRAM, package versions, and per-epoch throughput and timing. When comparing runs across machines, you always know the context.
- Performance callback: a custom HuggingFace Trainer callback tracks per-step timing, per-epoch throughput stats (mean/min/max/p50), GPU peak memory, loss trajectory, and ETA.
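The JSONL streaming and auto-resume bullets combine into one simple pattern: append one JSON line per utterance, and on restart skip the ids already on disk. A sketch under assumed names (the function and field names are illustrative, not the repo's API):

```python
import json
import os
import tempfile

def evaluate_resumable(utterances, transcribe, out_path):
    """Append one JSON line per utterance; on restart, skip ids already written."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    with open(out_path, "a") as f:
        for utt_id, audio in utterances:
            if utt_id in done:
                continue
            f.write(json.dumps({"id": utt_id, "text": transcribe(audio)}) + "\n")
            f.flush()  # each line is durable even if the next utterance crashes

# Simulate a crash after two utterances, then a resumed run that adds a third.
path = os.path.join(tempfile.mkdtemp(), "eval.jsonl")
evaluate_resumable([("u1", None), ("u2", None)], lambda a: "hello", path)
evaluate_resumable([("u1", None), ("u2", None), ("u3", None)], lambda a: "hello", path)
with open(path) as f:
    lines = [json.loads(l) for l in f]
print([r["id"] for r in lines])  # ['u1', 'u2', 'u3'] - no duplicates after resume
```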
Total pipeline cost on RunPod: ~$4.00 for data prep + training + both evaluations.
What's next
The immediate next step is cross-domain evaluation on AphasiaBank: does an adapter trained on a motor speech disorder (dysarthria) transfer to a language disorder (aphasia)? The decoder hallucination problem is the same, but the speech patterns are different.
Longer term: on-device deployment via ExecuTorch, running the adapted model on a tablet used in therapy sessions, without cloud dependency. Clinical speech data is sensitive. The less of it that leaves the room, the better.