# Canary-Qwen-2.5B Fine-Tuned for ATC ASR (Encoder Unfrozen)
Fine-tuned nvidia/canary-qwen-2.5b on the UWB-ATCC corpus with the FastConformer encoder unfrozen for deeper domain adaptation.
## Results
| Model | Trainable Params (% of total) | WER |
|---|---|---|
| Canary-Qwen (zero-shot) | 0 | 81.49% |
| Canary-Qwen (LoRA only) | 27.8M (0.97%) | 23.32% |
| Canary-Qwen (encoder unfrozen) | 838.8M (32.8%) | 23.82% |
| W2V2 Large (no LM) | 317M (100%) | 14.54% |
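WER in the table is the standard word-level edit distance normalized by reference length. A minimal pure-Python sketch (the `wer` helper is mine, not from any of the libraries above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                           # deletion
                dp[i][j - 1] + 1,                           # insertion
                dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
            )
    return dp[len(r)][len(h)] / len(r)

# one dropped word out of six: WER = 1/6
print(wer("cleared to land runway two seven", "cleared land runway two seven"))
```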
## Key Finding
Training 32.8% of the parameters (encoder unfrozen) yields nearly the same WER as training 0.97% (LoRA only), suggesting the frozen LLM decoder, not the speech encoder, is the performance bottleneck.
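The trainable-fraction numbers above come from counting parameters with `requires_grad` set. A toy illustration of the freezing pattern, using stand-in modules (the `encoder`/`llm` attribute names here are placeholders, not the real SALM attribute names):

```python
import torch.nn as nn

class ToySALM(nn.Module):
    """Stand-in model: a tiny 'encoder' and 'llm', illustrative only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)   # plays the FastConformer role
        self.llm = nn.Linear(16, 64)       # plays the Qwen decoder role

model = ToySALM()
for p in model.parameters():
    p.requires_grad = False                # freeze everything
for p in model.encoder.parameters():
    p.requires_grad = True                 # unfreeze only the speech encoder

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable}/{total} = {trainable / total:.1%} trainable")
```

The same two sums over the real model give the 838.8M / 32.8% figures reported above.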
## Training
- Dataset: UWB-ATCC (Prague Airport ATC, 11,543 train / 2,886 test utterances)
- Steps: 10,000 | LR: 5e-4 | Warmup: 1,000
- LoRA: r=128, alpha=256, targets=[q_proj, v_proj]
- Encoder: FastConformer fully unfrozen (trainable)
- Strategy: FSDP across 4x RTX 2080 Ti | Precision: fp16-true (eps=1e-4)
- Framework: NVIDIA NeMo 2.8.0rc0
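The LoRA update in the config above is a scaled low-rank correction to each targeted projection: `W x` becomes `W x + (alpha/r) * B A x`. A numpy sketch of one such layer with the r=128, alpha=256 settings (shapes and names are mine, not NeMo's API):

```python
import numpy as np

d_in, d_out, r, alpha = 256, 256, 128, 256   # r=128, alpha=256 as configured
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))           # frozen base weight (e.g. q_proj)
A = rng.normal(size=(r, d_in)) * 0.01        # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init

def lora_forward(x):
    # base projection plus the low-rank update scaled by alpha / r
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
# with B zero-initialised, LoRA output matches the frozen base layer exactly
assert np.allclose(lora_forward(x), x @ W.T)
```

Zero-initialising `B` makes the adapted model start out identical to the base model, which is the usual LoRA initialisation.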
## Learning Curve
| Step | Encoder unfrozen (WER) | LoRA-only (WER) |
|---|---|---|
| 500 | 46.34% | 39.14% |
| 2,000 | 26.67% | 30.87% |
| 5,000 | 24.85% | 24.77% |
| 10,000 | 23.89% | 24.53% |
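Both runs share the schedule listed under Training (peak LR 5e-4, 1,000 warmup steps). A sketch assuming linear warmup to the peak and a constant LR afterwards; the post-warmup decay shape is not stated above, so the hold is an assumption:

```python
def lr_at(step: int, peak_lr: float = 5e-4, warmup: int = 1000) -> float:
    # linear warmup from 0 to peak_lr over `warmup` steps, then hold
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr
```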
## Usage
```python
import torch
from nemo.collections.speechlm2.models import SALM

# Load the base model, then overlay the fine-tuned weights.
model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')
state = torch.load('consolidated_model.pt', map_location='cpu')
# strict=False: the checkpoint may omit keys untouched by fine-tuning
model.load_state_dict(state, strict=False)
model.cuda().eval()

# The audio locator tag marks where the referenced WAV is transcribed.
answer_ids = model.generate(
    prompts=[[{
        'role': 'user',
        'content': f'Transcribe the following: {model.audio_locator_tag}',
        'audio': ['atc_audio.wav'],
    }]],
    max_new_tokens=128,
)
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))
```
## Part of
Pilot-to-ATC Research: a comparative evaluation of W2V2 vs. Canary-Qwen for ATC-domain ASR.