How to use pluttodk/milo-asr with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="pluttodk/milo-asr")
# Load model directly
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("pluttodk/milo-asr", dtype="auto")

Milo-ASR is a Danish speech-to-text model based on Qwen3-ASR-1.7B, fine-tuned to understand Danish, both read-aloud speech and conversations/podcasts.
The model is trained on CoRal v2 plus Danish podcast data, so it performs well across domains. Most other models are good at only one of the two.
| Model | WER | CER |
|---|---|---|
| hviske-v2 (Whisper v2) | 17.40% | 7.96% |
| hviske-v3 (Whisper v3) | 21.62% | 9.22% |
| Milo-ASR | 23.24% | 11.17% |
| Whisper v3 Turbo | 40.35% | 15.51% |
| Qwen3-ASR base | 46.28% | 19.78% |

| Model | WER | CER |
|---|---|---|
| Milo-ASR | 21.82% | 15.64% |
| hviske-v2 (Whisper v2) | 50.67% | 38.31% |
| Whisper v3 Turbo | 67.03% | 45.98% |
| Qwen3-ASR base | 67.52% | 47.71% |
| hviske-v3 (Whisper v3) | 67.65% | 50.12% |
Milo-ASR is the only model that handles both domains well. On podcasts, its WER (21.82%) is about 2.3x lower than that of the next-best model, hviske-v2 (50.67%).
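For readers unfamiliar with the metrics: WER is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words, and CER is the same quantity at character level. The sketch below is a minimal illustration of that definition, not the evaluation script behind the tables above (any text normalization used there is not described here). The example sentences are made up. The last lines recompute the podcast ratio from the reported WERs.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one of four words is misspelled.
example_wer = wer("jeg bor i københavn", "jeg bor i købehavn")
example_cer = cer("jeg bor i københavn", "jeg bor i købehavn")

# Ratio of the reported podcast WERs (hviske-v2 vs. Milo-ASR).
podcast_ratio = 50.67 / 21.82

print(f"WER: {example_wer:.2%}, CER: {example_cer:.2%}")
print(f"Podcast WER ratio: {podcast_ratio:.2f}x")
```

Because a single wrong character makes the whole word count as an error, WER is always at least as sensitive as CER to small misspellings, which is why the tables report both.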
pip install qwen-asr transformers torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint in bfloat16 on the first GPU
model = Qwen3ASRModel.from_pretrained(
    "pluttodk/Milo-ASR",
    dtype="bfloat16",
    device_map="cuda:0",
)

# Transcribe a single Danish audio file
results = model.transcribe(
    audio="path/to/danish_audio.wav",
    language="Danish",
)
print(results[0].text)
# Pass a list of paths to transcribe several files in one call
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = model.transcribe(audio=audio_files, language="Danish")
for r in results:
    print(r.text)
# Load the model together with a forced aligner to get timestamps
model = Qwen3ASRModel.from_pretrained(
    "pluttodk/Milo-ASR",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    dtype="bfloat16",
    device_map="cuda:0",
)
results = model.transcribe(
    audio="path/to/audio.wav",
    language="Danish",
    return_time_stamps=True,
)
# Print each aligned item with its start and end time in seconds
for item in results[0].time_stamps.items:
    print(f"{item.start_time:.2f}s - {item.end_time:.2f}s: {item.text}")
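One common use of such timestamps is subtitle export. The following is a hedged sketch of formatting aligned items as SRT cues. It is not part of the model's API: `format_srt_time` and `to_srt` are hypothetical helpers, and the `(start, end, text)` tuples stand in for the `start_time`, `end_time`, and `text` fields printed above.

```python
def format_srt_time(seconds):
    """Convert seconds (float) to the SRT timestamp format HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start_sec, end_sec, text) tuples as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Made-up segments standing in for the aligner output.
segments = [(0.0, 0.42, "Hej"), (0.42, 1.25, "verden")]
print(to_srt(segments))
```

In practice, word-level items would usually be grouped into phrase-length cues before export; the sketch emits one cue per item to keep the mapping obvious.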
# Streaming transcription
model = Qwen3ASRModel.LLM(
    model="pluttodk/Milo-ASR",
    gpu_memory_utilization=0.8,
)
state = model.init_streaming_state(language="Danish", chunk_size_sec=2.0)
# audio_stream() is a placeholder for your own source of audio chunks
for audio_chunk in audio_stream():
    state = model.streaming_transcribe(audio_chunk, state)
    print(state.text)
# Finalize the transcript once the stream has ended
state = model.finish_streaming_transcribe(state)
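The streaming snippet calls `audio_stream()` without defining it. As a rough illustration, a chunk source could look like the generator below. This is a hypothetical helper, not part of `qwen_asr`: unlike the call above, it takes the samples as an argument, and it assumes 16 kHz mono audio held in a sliceable sequence (for example a NumPy array). The sample rate and chunk format the streaming API actually expects should be checked against the library.

```python
def audio_stream(samples, sample_rate=16_000, chunk_size_sec=2.0):
    """Yield consecutive fixed-length chunks from a sliceable sequence of samples.

    Hypothetical helper: the sample rate is an assumption, and the final
    chunk may be shorter than the others.
    """
    chunk_len = int(sample_rate * chunk_size_sec)
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# Five seconds of silence as stand-in audio.
fake_audio = [0.0] * (16_000 * 5)
chunks = list(audio_stream(fake_audio))
print([len(c) for c in chunks])
```

For live input, the same loop shape applies, with the generator reading from a microphone or network buffer instead of slicing a preloaded sequence.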
The model was fine-tuned in two stages:
| Parameter | Stage 2 |
|---|---|
| Learning rate | 1e-5 |
| Batch size | 8 (x4 grad acc = 32 effective) |
| Epochs | 8 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Precision | bfloat16 |
| Training steps | 35,560 |
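To make the numbers in the table concrete, here is a minimal sketch of the arithmetic they imply. It assumes that the 35,560 training steps are optimizer steps for Stage 2, that the warmup ratio applies to that total, and that the schedule is linear warmup followed by cosine decay to zero. The table only says "Cosine", so the exact shape is an assumption, and `lr_at` is an illustrative helper rather than the training code.

```python
import math

# Values taken from the Stage 2 table.
PEAK_LR = 1e-5
TOTAL_STEPS = 35_560
WARMUP_RATIO = 0.1
PER_STEP_BATCH = 8
GRAD_ACCUM = 4

effective_batch = PER_STEP_BATCH * GRAD_ACCUM   # 8 x 4
warmup_steps = int(TOTAL_STEPS * WARMUP_RATIO)  # 10% of all steps


def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay to zero.

    Assumed schedule shape; the source only states "Cosine" with warmup ratio 0.1.
    """
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"Effective batch size: {effective_batch}")
print(f"Warmup steps: {warmup_steps}")
for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at 1e-5 after 3,556 steps and falls to half that value midway through the decay phase.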
@misc{Milo-ASR,
author = {Rønnelund, Mathias Oliver Valdbjørn},
title = {Milo-ASR: Danish ASR Model based on Qwen3-ASR},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/pluttodk/Milo-ASR}
}
Base model
Qwen/Qwen3-ASR-1.7B