🎙️ Qwen3-ASR-1.7B-EN-Medical — English Speech Recognition
A medical-domain English automatic speech recognition (ASR) model, fine-tuned from Qwen/Qwen3-ASR-1.7B on the English subset of MultiMed mixed with Common Voice 17 English (train + validation). It outputs cased, punctuated English text and works as a drop-in replacement for the base model.
On MultiMed English (test) it reaches 16.50% normalized WER, essentially tied with the published MultiMed paper SOTA (16.62%). On Common Voice 17 English (test) it improves to 6.68% normalized WER vs the base model's 7.54%.
📊 Results
WER and CER on two held-out test sets — medical (in-domain) and general English (out-of-domain). All numbers are normalized (lowercase + strip punctuation), the standard protocol used by the MultiMed paper and the Open ASR Leaderboard, so they are directly comparable to other published results. "Zero-shot" is the unmodified Qwen/Qwen3-ASR-1.7B.
| Test set | Samples | Zero-shot WER | Fine-tuned WER | Δ WER | Zero-shot CER | Fine-tuned CER |
|---|---|---|---|---|---|---|
| MultiMed English (test) | 7,567 | 16.41 | 16.50 | +0.09 | 12.60 | 12.45 |
| Common Voice 17 EN (test) | 16,393 | 7.54 | 6.68 | -0.86 | 3.68 | 3.29 |
For reference, the MultiMed paper's best published result is a multilingual Whisper-Small fine-tune at 16.62% WER (arXiv 2409.14074, Table 6). Both the base Qwen3-ASR and this fine-tune match that level.
The interesting story here is general English. The fine-tune improves on the base model's CV17 WER by 0.86 absolute points (about 11% relative), while MultiMed WER stays within 0.09 points of the base model and MultiMed CER improves slightly. That is the opposite of catastrophic forgetting: including Common Voice in the training mix kept the base distribution intact, and the medical exposure did not hurt general-English accuracy.
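To make the metric concrete, here is a minimal sketch of a corpus-level normalized WER and of the relative change quoted above. The `normalise`, `edit_distance`, `wer`, and `relative_change` helpers are illustrative names, not the project's evaluation code, and the normaliser only does what the text describes (lowercase + strip punctuation); real leaderboard normalisers typically do more.

```python
import re


def normalise(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep apostrophes inside words
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = sum(edit_distance(normalise(r), normalise(h))
                 for r, h in zip(refs, hyps))
    words = sum(len(normalise(r)) for r in refs)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


def relative_change(baseline: float, new: float) -> float:
    """Relative change in percent; negative means the new value is lower."""
    return 100.0 * (new - baseline) / baseline


# One substitution in six reference words.
print(wer(["The patient presented with chest pain."],
          ["the patient presents with chest pain"]))

# CV17 numbers from the table: 7.54 -> 6.68, roughly an 11% relative reduction.
print(round(relative_change(7.54, 6.68), 1))
```

Summing errors and reference words over the whole test set before dividing is the usual convention for reported WER, since it avoids over-weighting very short utterances.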
🧹 Reference / target normalisation
MultiMed transcripts come from real-world clinical speech and are inconsistent in casing and trailing punctuation. To give the model a clean, predictable target distribution we apply a small, deterministic written-form normalisation to every reference at load time, both during training and during evaluation:
- Capitalise the first letter if it is lowercase.
- Collapse trailing dots: any sequence of `.`, `…`, `..`, `...` at the end is replaced with a single `.`.
- Append a terminal period if the sentence does not already end in terminal punctuation (`. ! ? …`) or a closing bracket / quote (`) ] } " '` etc.).
The exact function lives in src/evaluation/score_written_form.py of the
project repository. Concretely:
| Raw reference | Normalised |
|---|---|
| the patient presented with chest pain | The patient presented with chest pain. |
| TAVI is indicated for severe aortic stenosis... | TAVI is indicated for severe aortic stenosis. |
| What is the dosage? | What is the dosage? (unchanged) |
Because the same normalisation is applied to references used for the zero-shot baseline above, the gain reported in the results table reflects the fine-tune itself — not a metric quirk caused by mismatched references.
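The three rules above can be sketched in a few lines of Python. This is a reconstruction from the description, not a copy of the project's function: the name `normalise_reference` and the exact set of closing brackets and quotes are assumptions.

```python
import re

TERMINAL = ".!?…"          # terminal punctuation listed in the rules
CLOSERS = ")]}\"'"         # assumed set of closing brackets / quotes


def normalise_reference(text: str) -> str:
    """Apply the written-form normalisation described above to one reference."""
    text = text.strip()
    if not text:
        return text
    # Rule 1: capitalise the first letter if it is lowercase.
    if text[0].islower():
        text = text[0].upper() + text[1:]
    # Rule 2: collapse any trailing run of dots or ellipsis characters to one ".".
    text = re.sub(r"[.…]+$", ".", text)
    # Rule 3: append a period unless the text already ends in terminal
    # punctuation or a closing bracket / quote.
    if text[-1] not in TERMINAL + CLOSERS:
        text += "."
    return text


for raw in ["the patient presented with chest pain",
            "TAVI is indicated for severe aortic stenosis...",
            "What is the dosage?"]:
    print(normalise_reference(raw))
```

The function is idempotent under these rules, so applying it at load time for both training and evaluation cannot drift the references further.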
🚀 How to use
Install the official qwen-asr package:

```bash
pip install qwen-asr
```

Then load this model exactly the same way you would load the base Qwen3-ASR:

```python
import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint in bfloat16 on the first GPU.
model = Qwen3ASRModel.from_pretrained(
    "yuriyvnv/Qwen3-ASR-1.7B-EN-Medical",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# Transcribe a local audio file and print the text of the first result.
result = model.transcribe(audio="audio.wav", language="English")
print(result[0].text)
```
Batch inference, automatic language detection, streaming, and vLLM serving all work identically to the base model — see the upstream Qwen3-ASR documentation for details.
🛠️ Training
Datasets:
- leduckhai/MultiMed (English) — medical-domain speech (~84h, 25,497 clips after filtering)
- fixie-ai/common_voice_17_0 (en) train + validation splits — Common Voice 17, crowdsourced English (~1.04M clips)
Concatenated and shuffled per epoch. CV17 dominates the mix at ~97.5% by clip count, which anchors general English while MultiMed steers the model toward clinical vocabulary.
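As a rough check on the mix, the sketch below computes each source's share by clip count and shows one way to concatenate and reshuffle per epoch. The helper names and the seeding scheme are illustrative assumptions, not the training code, and the CV17 count is the rounded ~1.04M figure quoted above, so the share comes out near 97.6% rather than exactly 97.5%.

```python
import random

MULTIMED_CLIPS = 25_497
CV17_CLIPS = 1_040_000  # rounded figure from the dataset list above


def mix_shares(n_multimed: int, n_cv17: int) -> tuple[float, float]:
    """Return (MultiMed %, CV17 %) of the combined training set by clip count."""
    total = n_multimed + n_cv17
    return 100.0 * n_multimed / total, 100.0 * n_cv17 / total


def epoch_order(n_multimed: int, n_cv17: int, epoch: int, seed: int = 0):
    """Concatenate both sources and shuffle with an epoch-dependent seed."""
    items = [("multimed", i) for i in range(n_multimed)]
    items += [("cv17", i) for i in range(n_cv17)]
    random.Random(seed + epoch).shuffle(items)
    return items


multimed_pct, cv17_pct = mix_shares(MULTIMED_CLIPS, CV17_CLIPS)
print(f"MultiMed: {multimed_pct:.1f}%  CV17: {cv17_pct:.1f}%")

# Tiny demonstration of the per-epoch reshuffle on a toy-sized mix.
print(epoch_order(2, 4, epoch=0))
print(epoch_order(2, 4, epoch=1))
```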
Validation: MultiMed-en eval split (~2,807 clips) drives best-checkpoint selection. CV17 test stays fully held out.
Recipe: the official QwenLM SFT recipe, with these hyperparameters:
| Parameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Scheduler | linear |
| Warmup ratio | 0.02 |
| Per-device batch size | 64 |
| Gradient accumulation | 3 |
| Effective batch size | 192 |
| Epochs | 3 |
| Precision | bf16 mixed |
| Gradient checkpointing | enabled |
| Optimizer | AdamW (fused) |
Trained on a single H100. The best checkpoint was selected by validation loss.
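The table's numbers fit together as follows. This is a back-of-the-envelope sketch, not the training script: the step counts are estimates derived from the approximate clip counts above, and `linear_lr` is a generic linear warmup plus linear decay that may differ in rounding details from the scheduler actually used.

```python
import math

PER_DEVICE_BATCH = 64
GRAD_ACCUM = 3
NUM_GPUS = 1            # single H100
EPOCHS = 3
PEAK_LR = 2e-5
WARMUP_RATIO = 0.02
APPROX_TRAIN_CLIPS = 25_497 + 1_040_000  # approximate, from the dataset list

# Effective batch size = per-device batch x accumulation steps x devices.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS

# Estimated optimizer steps (assumes the last partial batch is kept).
steps_per_epoch = math.ceil(APPROX_TRAIN_CLIPS / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(WARMUP_RATIO * total_steps)


def linear_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


print("effective batch size:", effective_batch)
print("estimated steps per epoch:", steps_per_epoch)
print("estimated total / warmup steps:", total_steps, warmup_steps)
print("lr at start, end of warmup, end of training:",
      linear_lr(0), linear_lr(warmup_steps), linear_lr(total_steps))
```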
⚠️ Limitations
- Trained on MultiMed medical-domain speech (clinical consultations, surgical procedures, patient narratives). Performance outside medical contexts is not guaranteed to improve over the base model.
- Outputs English text. Cross-lingual or code-switched audio is not targeted.
- Punctuation and casing are best-effort and inherit the inconsistencies of the underlying transcripts (mitigated, but not eliminated, by the normalisation step above).
- Long-form clinical audio (>30s) is filtered out at training time; very long consultations may need to be chunked at inference.
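For the long-audio limitation, one simple option is to split a recording into windows of at most 30 s with a small overlap and transcribe each window separately. The sketch below only computes the sample boundaries; the function name, the 16 kHz sample rate, and the 1 s overlap are assumptions, and how chunks are passed to the model and how overlapping text is merged is left to the upstream Qwen3-ASR documentation.

```python
def chunk_bounds(n_samples: int,
                 sample_rate: int = 16_000,
                 max_seconds: float = 30.0,
                 overlap_seconds: float = 1.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows over the audio."""
    if overlap_seconds >= max_seconds:
        raise ValueError("overlap must be shorter than the window")
    window = int(max_seconds * sample_rate)
    step = window - int(overlap_seconds * sample_rate)
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + window, n_samples)
        bounds.append((start, end))
        if end == n_samples:  # last window reached the end of the audio
            break
        start += step
    return bounds


# A 70 s recording at 16 kHz is covered by three windows of at most 30 s each.
for start, end in chunk_bounds(70 * 16_000):
    print(f"{start / 16_000:.1f}s -> {end / 16_000:.1f}s")
```

Cutting at fixed offsets can split a word in half; splitting on silence instead is a common refinement when a voice activity detector is available.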
🙏 Acknowledgements
This model would not exist without the work of others. Thank you to:
- The Qwen team at Alibaba Cloud for releasing Qwen3-ASR-1.7B — the backbone of this fine-tune — together with a clean, reproducible SFT recipe.
- Khai Le-Duc and the MultiMed authors for releasing the MultiMed multilingual medical ASR dataset (leduckhai/MultiMed, paper) that made this domain-specialised fine-tune possible.