Fine-Tuning Whisper for Dysarthric Speech: 47% WER Reduction with 0.73% of Parameters
Whisper achieves near-human accuracy on standard speech. On dysarthric speech, it hallucinates entire sentences. I adapted less than 1% of the model to cut errors in half. Here's the approach, the results, and what I learned.
The problem
OpenAI's Whisper Small was trained on 680,000 hours of internet audio. On healthy speakers from the TORGO dysarthric speech corpus, it achieves 10.6% WER. On dysarthric speakers from the same corpus, it achieves 77.5% WER. One speaker with severe dysarthria (M04) hit 145.6% WER: the model generates more words than the speaker actually said, hallucinating fluent sentences from unintelligible input.
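WER above 100% is possible because insertions, deletions, and substitutions are all divided by the fixed reference length. A toy word-level implementation (illustrative only, not this project's evaluation code) makes that concrete:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard word-level Levenshtein distance
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(r)

# A one-word utterance "hallucinated" into a fluent sentence: six insertions
# against one reference word gives 600% WER.
print(wer("yes", "yes i would like that very much"))  # 6.0
```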
The root cause is the decoder. Whisper's encoder captures acoustic features reasonably well across speaker types. But the decoder's language model, trained on typical speech patterns, overrides uncertain acoustic evidence with fluent predictions. It "corrects" atypical speech into something a healthy speaker would have said.
The approach
Model: Whisper Small (244M parameters)
Method: LoRA on decoder attention layers only (q_proj, v_proj). Rank 16, alpha 32. This gives 1.77M trainable parameters, 0.73% of the model.
Why decoder-only? The encoder generalises across speakers. The decoder is where Whisper fails on clinical speech: its language model aggressively normalises atypical productions. Adapting only the decoder teaches the model to trust what the encoder heard rather than rewriting it.
Data: TORGO database, 5.5 hours of dysarthric speech and 8 hours of healthy controls. Speaker-independent train/test splits (no speaker leakage).
Training: 10 epochs, batch size 16 (effective), cosine LR schedule, fp16. Total training time: 21 minutes on a single NVIDIA B200. Total cloud compute cost: approximately $4.00 on RunPod.
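Decoder-only targeting of q_proj/v_proj can be expressed through PEFT's regex form of `target_modules` (when given a string, PEFT matches it with `re.fullmatch` against each module path). The pattern below is a plausible sketch, not necessarily the repo's exact config; module paths follow transformers' Whisper naming:

```python
import re

# Hypothetical decoder-only pattern; with PEFT you would pass it as
# LoraConfig(r=16, lora_alpha=32, target_modules=DECODER_QV).
DECODER_QV = r"model\.decoder\.layers\.\d+\.(self_attn|encoder_attn)\.(q_proj|v_proj)"

# Representative Whisper Small module paths (transformers naming scheme)
modules = [
    "model.encoder.layers.0.self_attn.q_proj",      # encoder: left untouched
    "model.decoder.layers.0.self_attn.q_proj",      # decoder self-attention: adapted
    "model.decoder.layers.0.self_attn.k_proj",      # k_proj: not targeted
    "model.decoder.layers.11.encoder_attn.v_proj",  # decoder cross-attention: adapted
]

targeted = [m for m in modules if re.fullmatch(DECODER_QV, m)]
print(targeted)
```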
Results
Baseline vs. LoRA
| | Baseline | + LoRA | Δ Relative |
|---|---|---|---|
| Overall | 31.6% WER | 16.9% WER | −47% |
| Dysarthric | 77.5% | 40.7% | −47% |
| Healthy | 10.6% | 6.0% | −43% |
Per speaker
| Speaker | Status | Baseline WER | LoRA WER | Δ |
|---|---|---|---|---|
| M04 | severe dysarthria | 145.6% | 65.3% | −55% |
| F03 | moderate dysarthria | 36.7% | 26.0% | −29% |
| FC02 | healthy | 9.9% | 6.0% | −39% |
| MC03 | healthy | 11.4% | 5.8% | −49% |
Semantic similarity (SemScore)
WER measures surface accuracy. SemScore (cosine similarity of sentence embeddings via all-MiniLM-L6-v2) measures meaning preservation, which is critical for clinical speech where surface form varies but intent is consistent.
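SemScore here reduces to the cosine similarity of two sentence embeddings. A minimal sketch (the embedding call is left as a comment because it assumes the sentence-transformers package; the cosine itself is plain Python):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# With sentence-transformers (assumed usage, not this repo's exact code):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   ref_emb, hyp_emb = model.encode([reference, hypothesis])
#   score = cosine(ref_emb, hyp_emb)

print(cosine([3.0, 4.0], [3.0, 4.0]))  # 1.0 for identical vectors
```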
| | Baseline | + LoRA |
|---|---|---|
| Healthy | 0.80 | 0.92 |
| Dysarthric | 0.47 | 0.64 |
| M04 (severe) | 0.32 | 0.51 |
| M04 (severe) | 0.32 | 0.51 |
Convergence speed
The model converged remarkably fast. Performance after just 2 epochs was virtually identical to the final 10-epoch result:
| | 2 epochs | 10 epochs |
|---|---|---|
| Dysarthric WER | 40.5% | 40.7% |
| Healthy WER | 5.9% | 6.0% |
| Overall WER | 16.8% | 16.9% |
| Overall WER | 16.8% | 16.9% |
The validation WER progression across all 10 epochs confirms the plateau:
| Epoch | Val WER | Train Loss |
|---|---|---|
| 1 | – | 2.98 → 0.70 |
| 2 | 0.1021 | 0.52 → 0.42 |
| 3 | 0.0766 | 0.30 → 0.26 |
| 4 | 0.0692 | 0.18 → 0.17 |
| 5 | 0.0639 | 0.11 → 0.10 |
| 6 | 0.0631 | 0.07 → 0.07 |
| 7 | 0.0639 | 0.05 → 0.05 |
| 8 | 0.0628 | 0.04 → 0.03 |
| 9 | 0.0647 | 0.03 → 0.03 |
| 10 | 0.0636 | 0.02 → 0.03 |
The model reaches its best validation WER at epoch 8, with diminishing returns after epoch 3. Train loss continues to decrease but validation WER plateaus: classic mild overfitting on the training speakers that doesn't generalise further. This has practical implications: per-speaker adaptation could be done in minutes, not hours.
Technical decisions and trade-offs
Why Whisper Small, not Large? Tractability. Small fits in 5GB VRAM, trains on any GPU, and still demonstrates the adaptation pattern. If it works here, it scales up. If it doesn't work here, throwing more parameters at it won't help.
Why LoRA instead of full fine-tuning? Three reasons: (1) it avoids catastrophic forgetting on healthy speech, because the base model's weights stay frozen; (2) the adapter is 7MB vs 1GB for a full model, so it's deployable on device; (3) it's composable: you can swap adapters per speaker or condition without retraining the base model.
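The 7MB figure is consistent with storing the 1.77M adapter weights in fp32 (an assumption about the saved dtype):

```python
params = 1_770_000       # trainable LoRA parameters reported above
bytes_per_param = 4      # assuming the adapter is saved in fp32
print(f"{params * bytes_per_param / 1e6:.1f} MB")  # ~7.1 MB
```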
Why not freeze the encoder entirely? I did. The LoRA targets are decoder-only. The encoder is completely unchanged. This was a deliberate architectural choice: the encoder's acoustic representations generalise across speakers, but the decoder's language model doesn't.
Text normalisation matters more than you'd think. Without normalisation, "dot" vs "Dot." gives 100% WER. The evaluation pipeline strips punctuation, lowercases, and applies Unicode NFKD normalisation before computing WER. SemScore uses raw text since semantic similarity benefits from surface form.
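A minimal normaliser in the spirit of that pipeline (a stdlib-only sketch; the actual evaluation code may differ in detail):

```python
import string
import unicodedata

def normalize(text: str) -> str:
    """NFKD-normalise, lowercase, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("Dot."))  # "dot" - now matches the reference "dot", so 0% WER
```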
Audio decoding on macOS is a minefield. HuggingFace datasets' default audio decoder (torchcodec) crashes on macOS due to fork() safety. The entire data pipeline uses soundfile with Audio(decode=False) and manual loading, a pattern that works on macOS, Linux, and cloud GPUs without changes.
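The repo's workaround relies on soundfile, which isn't reproduced here; but the pattern itself, decoding raw bytes to samples yourself instead of trusting the dataset's default decoder, can be shown with only the stdlib `wave` module:

```python
import io
import struct
import wave

# Build a tiny mono 16 kHz WAV in memory so the example is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 100, -100, 0))

# Manual decode: bytes in, sample rate and samples out - the same shape of
# loader used with Audio(decode=False), just with soundfile in the real pipeline.
buf.seek(0)
with wave.open(buf, "rb") as w:
    sr = w.getframerate()
    samples = struct.unpack(f"<{w.getnframes()}h", w.readframes(w.getnframes()))

print(sr, samples)  # 16000 (0, 100, -100, 0)
```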
Training hardware
Two configurations were used during development:
Debug run (2 epochs):

| Spec | Value |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition (95 GB VRAM) |
| CPU | AMD EPYC 9375F 32-Core (128 logical cores) |
| RAM | 2,267 GB |
| Throughput | 12.1 steps/s, 48 samples/s |
| GPU peak memory | 2,500 MB |
| Total time | 7.3 min |
Full run (10 epochs):

| Spec | Value |
|---|---|
| GPU | NVIDIA B200 (178 GB VRAM) |
| CPU | AMD EPYC 9555 64-Core (224 logical cores) |
| RAM | 2,267 GB |
| Throughput | 9.5 steps/s, 153 samples/s |
| GPU peak memory | 5,636 MB |
| Total time | 21.4 min |
| Best val WER | 0.0628 (epoch 8) |
The model uses under 6 GB of VRAM at peak. Either GPU is massive overkill: a T4 (16 GB, ~$0.20/hr) or even a free Colab T4 would handle this fine, just slower (~3.5 steps/s vs 9.5). The B200 and RTX PRO 6000 were used because they were considerably faster, not because the workload required them.
Both runs used PyTorch 2.8.0, CUDA 12.8, Transformers 5.3.0, PEFT 0.18.1, and fp16 mixed precision.
Infrastructure
The project runs anywhere with one command:
```bash
git clone https://github.com/DavidBarbera/whisper-clinical-speech.git
cd whisper-clinical-speech
./run_all.sh  # prepare → train → evaluate baseline → evaluate LoRA → push to HuggingFace
```
Key infrastructure choices:
- Timestamped runs: every train/eval run gets a unique directory under `outputs/`. Nothing overwrites.
- JSONL streaming: evaluation writes transcriptions line-by-line. Crash at utterance 4,000? You keep utterances 1–3,999.
- Auto-resume: restart evaluation and it picks up where it left off.
- Stable symlinks: `outputs/checkpoints/latest/final` and `outputs/results/latest_baseline_test/` always point to the most recent run.
- Platform detection: macOS gets `num_workers=0`, `OBJC_DISABLE_INITIALIZE_FORK_SAFETY`, and the soundfile fallback; Linux gets full parallelism. No flags to set.
- Hardware logging: every run logs CPU, GPU, RAM, VRAM, package versions, and per-epoch throughput and timing. When comparing runs across machines, you always know the context.
- Performance callback: a custom HuggingFace Trainer callback tracks per-step timing, per-epoch throughput stats (mean/min/max/p50), GPU peak memory, loss trajectory, and ETA.
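The JSONL streaming and auto-resume bullets combine into one simple pattern: append one JSON line per utterance, and on restart skip the ids already on disk. A sketch under assumed names (the function and field names are illustrative, not the repo's API):

```python
import json
import os
import tempfile

def evaluate_resumable(utterances, transcribe, out_path):
    """Append one JSON line per utterance; on restart, skip ids already written."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    with open(out_path, "a") as f:
        for utt_id, audio in utterances:
            if utt_id in done:
                continue
            f.write(json.dumps({"id": utt_id, "text": transcribe(audio)}) + "\n")
            f.flush()  # each line is durable even if the next utterance crashes

# Simulate a crash after two utterances, then a resumed run that adds a third.
path = os.path.join(tempfile.mkdtemp(), "eval.jsonl")
evaluate_resumable([("u1", None), ("u2", None)], lambda a: "hello", path)
evaluate_resumable([("u1", None), ("u2", None), ("u3", None)], lambda a: "hello", path)
with open(path) as f:
    lines = [json.loads(l) for l in f]
print([r["id"] for r in lines])  # ['u1', 'u2', 'u3'] - no duplicates after resume
```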
Total pipeline cost on RunPod: ~$4.00 for data prep + training + both evaluations.
What's next
The immediate next step is cross-domain evaluation on AphasiaBank: does an adapter trained on a motor speech disorder (dysarthria) transfer to a language disorder (aphasia)? The decoder hallucination problem is the same, but the speech patterns are different.
Longer term: on-device deployment via ExecuTorch, running the adapted model on a tablet used in therapy sessions, without cloud dependency. Clinical speech data is sensitive. The less of it that leaves the room, the better.