🎙️ Qwen3-ASR-1.7B-EN-Medical — English Speech Recognition

A medical-domain English automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-1.7B on the English subset of MultiMed mixed with Common Voice 17 English (train + validation). It outputs cased, punctuated English text and works as a drop-in replacement for the base model.

On MultiMed English (test) it reaches 16.50% normalized WER, essentially tied with the published MultiMed paper SOTA (16.62%). On Common Voice 17 English (test) it improves to 6.68% normalized WER vs the base model's 7.54%.

📊 Results

WER and CER on two held-out test sets: medical (the target domain) and general English (outside the medical domain, though Common Voice train and validation data are part of the training mix). All numbers are percentages computed after normalization (lowercase + strip punctuation), the standard protocol used by the MultiMed paper and the Open ASR Leaderboard, so they are directly comparable to other published results. "Zero-shot" is the unmodified Qwen/Qwen3-ASR-1.7B.

Test set	Samples	Zero-shot WER (%)	Fine-tuned WER (%)	Δ WER (pts)	Zero-shot CER (%)	Fine-tuned CER (%)
MultiMed English (test)	7,567	16.41	16.50	+0.09	12.60	12.45
Common Voice 17 EN (test)	16,393	7.54	6.68	-0.86	3.68	3.29
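
As a rough illustration of what "normalized" means for these numbers, the sketch below lowercases both strings, strips ASCII punctuation, and divides the word-level edit distance by the reference length. It is a simplified stand-in, not the project's scoring script or the Open ASR Leaderboard normalizer, which may apply further rules; `normalize` and `wer` are hypothetical helper names.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Casing and punctuation differences do not count as errors.
print(wer("The patient presented with chest pain.",
          "the patient presented with chest pain"))
# One substituted word out of six reference words.
print(wer("The patient presented with chest pain.",
          "The patient presents with chest pain."))
```

Corpus-level WER is normally the total number of word errors divided by the total number of reference words across all utterances, not the mean of per-utterance scores.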

For reference, the MultiMed paper's best published result is a multilingual fine-tune of Whisper-Small at 16.62% WER (arXiv 2409.14074, Table 6). Both the base Qwen3-ASR and this fine-tune match that level.

The interesting story here is general English. The fine-tune improves on the base model's CV17 WER by 0.86 absolute points (about 11% relative), while medical WER stays essentially flat (+0.09 points) and medical CER drops slightly (12.60 to 12.45). That is the opposite of catastrophic forgetting: including Common Voice in the training mix kept general English performance intact, and the medical data did not degrade it.
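
The absolute and relative figures for Common Voice can be reproduced from the table values with a few lines of arithmetic (variable names are illustrative only):

```python
base_wer = 7.54       # zero-shot WER on Common Voice 17 EN (test), in %
finetuned_wer = 6.68  # fine-tuned WER on the same test set, in %

absolute_gain = base_wer - finetuned_wer        # percentage points
relative_gain = absolute_gain / base_wer * 100  # percent of the baseline error

print(f"{absolute_gain:.2f} points absolute, {relative_gain:.1f}% relative")
```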

🧹 Reference / target normalisation

MultiMed transcripts are real-world clinical speech and inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:

Capitalise the first letter if it is lowercase.
Collapse trailing dots: any run of `.` or `…` at the end (for example `..`, `...` or `…`) is replaced with a single `.`.
Append a terminal period if the sentence does not already end in terminal punctuation (`.` `!` `?` `…`) or a closing bracket / quote (`)` `]` `}` `"` `'` etc.).

The exact function lives in src/evaluation/score_written_form.py of the project repository. Concretely:

Raw reference	Normalised
`the patient presented with chest pain`	`The patient presented with chest pain.`
`TAVI is indicated for severe aortic stenosis...`	`TAVI is indicated for severe aortic stenosis.`
`What is the dosage?`	`What is the dosage?` (unchanged)
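
The three rules can be sketched in a few lines of Python. This is a minimal re-implementation written from the description above, not the function in src/evaluation/score_written_form.py; the name `normalise_reference`, the leading/trailing whitespace stripping, and the exact set of closing characters (the rule ends in "etc.") are assumptions.

```python
import re

# Characters after which no period is appended: terminal punctuation plus
# closing brackets and quotes. The exact set is an assumption.
NO_PERIOD_AFTER = tuple(".!?…)]}\"'")


def normalise_reference(text: str) -> str:
    """Apply the three written-form rules to one reference transcript."""
    text = text.strip()  # assumption: surrounding whitespace is not meaningful
    if not text:
        return text
    # 1. Capitalise the first letter if it is lowercase.
    if text[0].islower():
        text = text[0].upper() + text[1:]
    # 2. Collapse any trailing run of "." and "…" into a single period.
    text = re.sub(r"[.…]+$", ".", text)
    # 3. Append a period unless the text already ends in terminal punctuation
    #    or a closing bracket / quote.
    if not text.endswith(NO_PERIOD_AFTER):
        text += "."
    return text


for raw in (
    "the patient presented with chest pain",
    "TAVI is indicated for severe aortic stenosis...",
    "What is the dosage?",
):
    print(repr(raw), "->", repr(normalise_reference(raw)))
```

Run on the three raw references from the table, this sketch reproduces the normalised column.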

Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.

🚀 How to use

Install the official qwen-asr package, then load this model exactly the same way you would load the base Qwen3-ASR:

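```shell
pip install qwen-asr
```

```python
import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint in bf16 on the first GPU.
model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-EN-Medical",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# transcribe() returns a list of results; print the text of the first one.
result = model.transcribe(audio="audio.wav", language="English")
print(result[0].text)
```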

Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.

🛠️ Training

Datasets:

leduckhai/MultiMed (English) — medical-domain speech (~84h, 25,497 clips after filtering)
fixie-ai/common_voice_17_0 (en) train + validation splits — Common Voice 17, crowdsourced English (~1.04M clips)

The two datasets are concatenated and reshuffled every epoch. CV17 dominates the mix at ~97.5% by clip count, which anchors general English while MultiMed steers the model toward clinical vocabulary.

Validation: MultiMed-en eval split (~2,807 clips) drives best-checkpoint selection. CV17 test stays fully held out.

Recipe: the official QwenLM SFT recipe, with these hyperparameters:

Parameter	Value
Learning rate	2e-05
Scheduler	linear
Warmup ratio	0.02
Per-device batch size	64
Gradient accumulation	3
Effective batch size	192
Epochs	3
Precision	bf16 mixed
Gradient checkpointing	enabled
Optimizer	AdamW (fused)

Trained on a single H100. The best checkpoint was selected by validation loss.
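
To make the table concrete, here is a small sketch of how the effective batch size follows from the per-device settings and how a linear schedule with a 0.02 warmup ratio behaves. The formula mirrors the common linear-warmup / linear-decay shape and is not taken from the training script; `linear_schedule_lr` is a hypothetical helper, and the total step count is a placeholder because the card does not state it.

```python
import math


def linear_schedule_lr(step: int, total_steps: int,
                       peak_lr: float = 2e-5, warmup_ratio: float = 0.02) -> float:
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


per_device_batch = 64
grad_accum = 3
num_gpus = 1  # "a single H100"
effective_batch = per_device_batch * grad_accum * num_gpus
print("effective batch size:", effective_batch)

total_steps = 10_000  # hypothetical; the real number of optimizer steps is not stated
for step in (0, 100, 200, 5_100, 10_000):
    print(step, linear_schedule_lr(step, total_steps))
```

With these placeholder numbers the rate climbs from zero to 2e-05 over the first 200 steps and then falls linearly back to zero at the final step.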

⚠️ Limitations

The medical training data is MultiMed speech (clinical consultations, surgical procedures, patient narratives). Performance on domains other than MultiMed and Common Voice is not guaranteed to improve over the base model.
Outputs English text. Cross-lingual or code-switched audio is not targeted.
Punctuation and casing are best-effort and inherit the inconsistencies of the underlying transcripts (mitigated, but not eliminated, by the normalisation step above).
Long-form clinical audio (>30s) is filtered out at training time; very long consultations may need to be chunked at inference.
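
One simple way to chunk a long recording is to cut it into fixed windows of at most 30 seconds with a small overlap and transcribe each window separately. The sketch below only computes the window boundaries; the 16 kHz sample rate, the 1-second overlap, and the helper name `chunk_bounds` are assumptions, and it is worth checking whether the upstream package already handles long inputs before adding your own chunking.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16_000,
                 max_seconds: float = 30.0,
                 overlap_seconds: float = 1.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for windows of at most max_seconds."""
    window = int(max_seconds * sample_rate)
    hop = window - int(overlap_seconds * sample_rate)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the window")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += hop
    return bounds


# A 70-second recording at 16 kHz is covered by three overlapping windows.
print(chunk_bounds(70 * 16_000))
```

Each `(start, end)` pair can be used to slice the waveform before transcription. Fixed windows can cut through a word, so splitting at silences and de-duplicating text in the overlap regions are likely to give cleaner transcripts.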

🙏 Acknowledgements

This model would not exist without the work of others. Thank you to:

The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-1.7B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe.
Khai Le-Duc and the MultiMed authors for releasing the MultiMed multilingual medical ASR dataset (leduckhai/MultiMed; paper: arXiv 2409.14074) that made this domain-specialised fine-tune possible.

Paper

MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder (arXiv 2409.14074, published Sep 21, 2024)
