Canary-Qwen-2.5B Fine-Tuned for ATC ASR (Encoder Unfrozen)

Fine-tuned nvidia/canary-qwen-2.5b on the UWB-ATCC corpus with the FastConformer encoder unfrozen for deeper domain adaptation.

Results

Model                           Params Trained    WER
Canary-Qwen (zero-shot)         0                 81.49%
Canary-Qwen (LoRA only)         27.8M (0.97%)     23.32%
Canary-Qwen (encoder unfrozen)  838.8M (32.8%)    23.82%
W2V2 Large (no LM)              317M (100%)       14.54%
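For context, the relative WER reduction from zero-shot to each fine-tuned variant can be computed directly from the table above:

```python
def relative_wer_reduction(baseline: float, finetuned: float) -> float:
    """Percentage reduction in WER relative to the baseline."""
    return (baseline - finetuned) / baseline * 100

# Values from the results table.
zero_shot = 81.49
lora_only = 23.32
unfrozen = 23.82

print(f"LoRA only: {relative_wer_reduction(zero_shot, lora_only):.1f}% relative reduction")
print(f"Unfrozen:  {relative_wer_reduction(zero_shot, unfrozen):.1f}% relative reduction")
```

Both variants recover roughly 71% of the zero-shot error, which underlines how far out of domain ATC speech is for the base model.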

Key Finding

Training 32.8% of the parameters (encoder unfrozen) yields essentially the same WER as training 0.97% (LoRA only): 23.82% vs 23.32%. This suggests the frozen LLM decoder, not the speech encoder, is the performance bottleneck.
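The WER figures reported here are the standard metric: word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal stdlib implementation for spot-checking individual transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            diag, d[j] = d[j], min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                diag + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
            )
    return d[-1] / len(ref)

print(wer("cleared to land runway two seven", "cleared for land runway two seven"))
```

For corpus-level numbers, libraries such as jiwer aggregate edits over all utterances rather than averaging per-utterance rates.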

Training

  • Dataset: UWB-ATCC (Prague Airport ATC, 11,543 train / 2,886 test utterances)
  • Steps: 10,000 | LR: 5e-4 | Warmup: 1,000
  • LoRA: r=128, alpha=256, targets=[q_proj, v_proj]
  • Encoder: FastConformer fully unfrozen (trainable)
  • Strategy: FSDP across 4x RTX 2080 Ti | Precision: fp16-true (eps=1e-4)
  • Framework: NVIDIA NeMo 2.8.0rc0
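The hyperparameters above specify only the peak LR and a 1,000-step warmup; the decay curve is not stated, so the sketch below assumes a cosine decay to zero over the remaining steps (illustrative only, not taken from the run config):

```python
import math

PEAK_LR, WARMUP, TOTAL = 5e-4, 1_000, 10_000

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then an (assumed) cosine decay to zero."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(500), lr_at(1_000), lr_at(10_000))
```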

Learning Curve

Step     Unfrozen    LoRA-only
500      46.34%      39.14%
2,000    26.67%      30.87%
5,000    24.85%      24.77%
10,000   23.89%      24.53%
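Both curves flatten quickly; the per-interval WER deltas (computed from the table above) make the diminishing returns explicit:

```python
unfrozen = [46.34, 26.67, 24.85, 23.89]   # WER at steps 500, 2k, 5k, 10k
lora_only = [39.14, 30.87, 24.77, 24.53]

for name, curve in [("unfrozen", unfrozen), ("LoRA-only", lora_only)]:
    deltas = [round(a - b, 2) for a, b in zip(curve, curve[1:])]
    print(name, deltas)
```

Most of the gain for both runs lands before step 5,000; the last 5,000 steps buy under one point of WER.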

Usage

from nemo.collections.speechlm2.models import SALM
import torch

# Load the base model, then overlay the fine-tuned weights.
model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')
state = torch.load('consolidated_model.pt', map_location='cpu')
model.load_state_dict(state, strict=False)  # strict=False tolerates missing/unexpected keys
model.cuda().eval()

# The audio_locator_tag placeholder in the prompt marks where the audio clip is inserted.
answer_ids = model.generate(
    prompts=[[{
        'role': 'user',
        'content': f'Transcribe the following: {model.audio_locator_tag}',
        'audio': ['atc_audio.wav'],
    }]],
    max_new_tokens=128,
)
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))

Part of

Pilot-to-ATC Research: a comparative evaluation of W2V2 vs Canary-Qwen for ATC-domain ASR.
