Fine-Tuning Whisper for Dysarthric Speech: 47% WER Reduction with 0.73% of Parameters

David Barbera · March 2026


Whisper achieves near-human accuracy on standard speech. On dysarthric speech, it hallucinates entire sentences. I adapted less than 1% of the model to cut errors in half. Here's the approach, the results, and what I learned.

The problem

OpenAI's Whisper Small was trained on 680,000 hours of internet audio. On healthy speakers from the TORGO dysarthric speech corpus it achieves 10.6% WER; on dysarthric speakers from the same corpus, 77.5%. One speaker with severe dysarthria (M04) hit 145.6% WER. A WER above 100% means the model generated more words than the speaker actually said, hallucinating fluent sentences from unintelligible input.

The root cause is the decoder. Whisper's encoder captures acoustic features reasonably well across speaker types. But the decoder's language model, trained on typical speech patterns, overrides uncertain acoustic evidence with fluent predictions. It "corrects" atypical speech into something a healthy speaker would have said.

The approach

Model: Whisper Small (244M parameters)

Method: LoRA on decoder attention layers only (q_proj, v_proj). Rank 16, alpha 32. This gives 1.77M trainable parameters — 0.73% of the model.
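In code, the adapter setup corresponds to a PEFT LoraConfig. The rank, alpha, and q_proj/v_proj targets come from the description above; the decoder-only regex is my reconstruction (PEFT full-matches a string target_modules against each module name, and Whisper's decoder attention modules live under model.decoder). A stdlib-only sketch that shows what the pattern selects:

```python
import re

# Adapter configuration as PEFT LoraConfig keywords. r, lora_alpha, and the
# q_proj/v_proj targets are stated above; the decoder-only regex is an
# assumption about how to restrict PEFT to decoder modules.
lora_kwargs = dict(
    r=16,
    lora_alpha=32,
    target_modules=r"model\.decoder\..*\.(q_proj|v_proj)",
)

# Representative Whisper module names, showing what the pattern selects:
names = [
    "model.encoder.layers.0.self_attn.q_proj",     # encoder: untouched
    "model.decoder.layers.0.self_attn.q_proj",     # decoder self-attention
    "model.decoder.layers.0.encoder_attn.v_proj",  # decoder cross-attention
]
adapted = [n for n in names if re.fullmatch(lora_kwargs["target_modules"], n)]
```

With peft installed, `get_peft_model(model, LoraConfig(**lora_kwargs))` would then wrap the base model and report the roughly 1.77M trainable parameters.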

Why decoder-only? The encoder generalises across speakers. The decoder is where Whisper fails on clinical speech: its language model aggressively normalises atypical productions. Adapting only the decoder teaches the model to trust what the encoder heard rather than rewriting it.

Data: TORGO database — 5.5 hours of dysarthric speech, 8 hours of healthy controls. Speaker-independent train/test splits (no speaker leakage).
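A speaker-independent split is simple to enforce: partition by speaker ID before splitting utterances, so no speaker appears in both sets. A minimal sketch (the speaker IDs and held-out set here are illustrative, not the actual split used):

```python
# Speaker-independent splitting: every utterance from a given speaker goes
# entirely to train or entirely to test, preventing speaker leakage.
def split_by_speaker(utterances, test_speakers):
    """utterances: list of (speaker_id, audio_path, transcript) tuples."""
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test

# Toy data with TORGO-style speaker IDs (illustrative only):
data = [
    ("M01", "a.wav", "the dog"), ("M01", "b.wav", "a cat"),
    ("M04", "c.wav", "hello"), ("FC02", "d.wav", "water"),
]
train, test = split_by_speaker(data, test_speakers={"M04", "FC02"})
```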

Training: 10 epochs, batch size 16 (effective), cosine LR schedule, fp16. Total training time: 21 minutes on a single NVIDIA B200. Total cloud compute cost: approximately $0.50 on RunPod.
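As Transformers Seq2SeqTrainingArguments keywords, the recipe would look roughly like this. Only the epochs, effective batch size, cosine schedule, and fp16 are stated above; the per-device/accumulation split and the learning rate are illustrative assumptions:

```python
# Hedged sketch of the training recipe. Stated above: 10 epochs, effective
# batch 16, cosine LR schedule, fp16. Assumptions (not stated): the 8 x 2
# batch/accumulation split and the learning rate.
training_kwargs = dict(
    num_train_epochs=10,
    per_device_train_batch_size=8,   # assumption
    gradient_accumulation_steps=2,   # assumption: 8 * 2 = 16 effective
    lr_scheduler_type="cosine",
    learning_rate=1e-3,              # assumption: a common LoRA starting point
    fp16=True,
)

effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
```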

Results

Baseline vs. LoRA

            Baseline WER   + LoRA WER   Δ (relative)
Overall     31.6%          16.9%        −47%
Dysarthric  77.5%          40.7%        −47%
Healthy     10.6%          6.0%         −43%

Per speaker

Speaker  Status               Baseline WER   LoRA WER   Δ
M04      severe dysarthria    145.6%         65.3%      −55%
F03      moderate dysarthria  36.7%          26.0%      −29%
FC02     healthy              9.9%           6.0%       −39%
MC03     healthy              11.4%          5.8%       −49%

Semantic similarity (SemScore)

WER measures surface accuracy. SemScore (cosine similarity of sentence embeddings via all-MiniLM-L6-v2) measures meaning preservation — critical for clinical speech where surface form varies but intent is consistent.
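Given the two sentence embeddings, SemScore reduces to a single cosine. A self-contained sketch, with toy three-dimensional vectors standing in for all-MiniLM-L6-v2's 384-dimensional output:

```python
import math

# SemScore is the cosine similarity between the embedding of the reference
# transcript and the embedding of the hypothesis. The vectors below are toy
# stand-ins so the arithmetic is visible.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

ref_emb = [0.6, 0.8, 0.0]  # embedding of the reference (toy)
hyp_emb = [0.8, 0.6, 0.0]  # embedding of the hypothesis (toy)
score = cosine(ref_emb, hyp_emb)  # 0.96
```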

              Baseline   + LoRA
Healthy       0.80       0.92
Dysarthric    0.47       0.64
M04 (severe)  0.32       0.51

Convergence speed

The model converged remarkably fast. Performance after just 2 epochs was virtually identical to the final 10-epoch result:

                2 epochs   10 epochs
Dysarthric WER  40.5%      40.7%
Healthy WER     5.9%       6.0%
Overall WER     16.8%      16.9%

The validation WER progression across all 10 epochs confirms the plateau:

Epoch   Val WER   Train Loss
1                 2.98 → 0.70
2       0.1021    0.52 → 0.42
3       0.0766    0.30 → 0.26
4       0.0692    0.18 → 0.17
5       0.0639    0.11 → 0.10
6       0.0631    0.07 → 0.07
7       0.0639    0.05 → 0.05
8       0.0628    0.04 → 0.03
9       0.0647    0.03 → 0.03
10      0.0636    0.02 → 0.03

The model reaches its best validation WER at epoch 8, with diminishing returns after epoch 3. Train loss continues to decrease while validation WER plateaus, the classic signature of mild overfitting to the training speakers without further generalisation. This has a practical implication: per-speaker adaptation could be done in minutes, not hours.

Technical decisions and trade-offs

Why Whisper Small, not Large? Tractability. Small fits in 5GB VRAM, trains on any GPU, and still demonstrates the adaptation pattern. If it works here, it scales up. If it doesn't work here, throwing more parameters at it won't help.

Why LoRA instead of full fine-tuning? Three reasons: (1) avoids catastrophic forgetting on healthy speech — the base model's encoder stays intact; (2) the adapter is 7MB vs 1GB for a full model — deployable on device; (3) composable — you can swap adapters per speaker or condition without retraining the base model.

Why not freeze the encoder entirely? I did. The LoRA targets are decoder-only. The encoder is completely unchanged. This was a deliberate architectural choice: the encoder's acoustic representations generalise across speakers, but the decoder's language model doesn't.

Text normalisation matters more than you'd think. Without normalisation, "dot" vs "Dot." gives 100% WER. The evaluation pipeline strips punctuation, lowercases, and applies Unicode NFKD normalisation before computing WER. SemScore uses raw text since semantic similarity benefits from surface form.
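The normalisation and the WER computation can be sketched end to end. In practice a library like jiwer would compute WER; the Levenshtein implementation below is a stdlib stand-in, and stripping everything that is neither a word character nor whitespace (which also drops NFKD combining marks) is my reading of "strips punctuation":

```python
import re
import unicodedata

# The three stated steps: Unicode NFKD, lowercase, strip punctuation.
def normalise(text: str) -> str:
    text = unicodedata.normalize("NFKD", text)
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation (and combining marks)
    return " ".join(text.split())

def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(r)

# The example from the text: "dot" vs "Dot." is 100% WER raw, 0% normalised.
assert wer("dot", "Dot.") == 1.0
assert wer(normalise("dot"), normalise("Dot.")) == 0.0
```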

Audio decoding on macOS is a minefield. HuggingFace datasets' default audio decoder (torchcodec) crashes on macOS because of fork() safety issues. The entire data pipeline uses soundfile with Audio(decode=False) and manual loading, a pattern that works on macOS, Linux, and cloud GPUs without changes.

Training hardware

Two configurations were used during development:

Debug run (2 epochs):

GPU               NVIDIA RTX PRO 6000 Blackwell Server Edition (95 GB VRAM)
CPU               AMD EPYC 9375F 32-Core (128 logical cores)
RAM               2,267 GB
Throughput        12.1 steps/s, 48 samples/s
GPU peak memory   2,500 MB
Total time        7.3 min

Full run (10 epochs):

GPU               NVIDIA B200 (178 GB VRAM)
CPU               AMD EPYC 9555 64-Core (224 logical cores)
RAM               2,267 GB
Throughput        9.5 steps/s, 153 samples/s
GPU peak memory   5,636 MB
Total time        21.4 min
Best val WER      0.0628 (epoch 8)

The model uses under 6 GB of VRAM at peak. Either GPU is massive overkill — a T4 (16 GB, ~$0.20/hr) or even a free Colab T4 would handle this fine, just slower (~3.5 steps/s vs 9.5). The B200 and RTX PRO 6000 were used because they were available on RunPod at the time, not because the workload required them.

Both runs used PyTorch 2.8.0, CUDA 12.8, Transformers 5.3.0, PEFT 0.18.1, and fp16 mixed precision.

Infrastructure

The project runs anywhere with one command:

git clone https://github.com/DavidBarbera/whisper-clinical-speech.git
cd whisper-clinical-speech
./run_all.sh          # prepare → train → evaluate baseline → evaluate LoRA → push to HuggingFace

Total pipeline cost on RunPod: ~$0.50 for data prep, training, and both evaluations.

What's next

The immediate next step is cross-domain evaluation on AphasiaBank — does an adapter trained on motor speech disorders (dysarthria) transfer to language disorders (aphasia)? The decoder hallucination problem is the same, but the speech patterns are different.

Longer term: on-device deployment via ExecuTorch — running the adapted model on a tablet used in therapy sessions, without cloud dependency. Clinical speech data is sensitive. The less of it that leaves the room, the better.

Links


David Barbera — PhD Cognitive Neuroscience (UCL). Built a CE-marked Class II medical device for aphasia assessment. Published at INTERSPEECH. Focused on speech AI for clinical populations and on-device deployment.

davidbarbera.github.io · Google Scholar · GitHub · HuggingFace