# Canary-Qwen-2.5B Fine-Tuned for ATC ASR (Encoder Unfrozen)
Fine-tuned nvidia/canary-qwen-2.5b on the UWB-ATCC corpus with the FastConformer encoder unfrozen for deeper domain adaptation.
## Results
| Model | Trainable Params (% of total) | WER |
|---|---|---|
| Canary-Qwen (zero-shot) | 0 | 81.49% |
| Canary-Qwen (LoRA only) | 27.8M (0.97%) | 23.32% |
| Canary-Qwen (encoder unfrozen) | 838.8M (32.8%) | 23.82% |
| W2V2 Large (no LM) | 317M (100%) | 14.54% |
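WER in the table is the standard word-level edit distance normalized by reference length. A minimal pure-Python sketch (the `wer` helper is mine, not from any of the libraries above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                           # deletion
                dp[i][j - 1] + 1,                           # insertion
                dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
            )
    return dp[len(r)][len(h)] / len(r)

# one dropped word out of six: WER = 1/6
print(wer("cleared to land runway two seven", "cleared land runway two seven"))
```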
## Key Finding
Training 32.8% of the parameters (encoder unfrozen) yields nearly the same WER as training 0.97% (LoRA only), suggesting the frozen LLM decoder, not the speech encoder, is the performance bottleneck.
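The trainable-fraction numbers above come from counting parameters with `requires_grad` set. A toy illustration of the freezing pattern, using stand-in modules (the `encoder`/`llm` attribute names here are placeholders, not the real SALM attribute names):

```python
import torch.nn as nn

class ToySALM(nn.Module):
    """Stand-in model: a tiny 'encoder' and 'llm', illustrative only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)   # plays the FastConformer role
        self.llm = nn.Linear(16, 64)       # plays the Qwen decoder role

model = ToySALM()
for p in model.parameters():
    p.requires_grad = False                # freeze everything
for p in model.encoder.parameters():
    p.requires_grad = True                 # unfreeze only the speech encoder

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable}/{total} = {trainable / total:.1%} trainable")
```

The same two sums over the real model give the 838.8M / 32.8% figures reported above.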
## Training
- Dataset: UWB-ATCC (Prague Airport ATC, 11,543 train / 2,886 test utterances)
- Steps: 10,000 | LR: 5e-4 | Warmup: 1,000
- LoRA: r=128, alpha=256, targets=[q_proj, v_proj]
- Encoder: FastConformer fully unfrozen (trainable)
- Strategy: FSDP across 4x RTX 2080 Ti | Precision: fp16-true (eps=1e-4)
- Framework: NVIDIA NeMo 2.8.0rc0
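The LoRA update in the config above is a scaled low-rank correction to each targeted projection: `W x` becomes `W x + (alpha/r) * B A x`. A numpy sketch of one such layer with the r=128, alpha=256 settings (shapes and names are mine, not NeMo's API):

```python
import numpy as np

d_in, d_out, r, alpha = 256, 256, 128, 256   # r=128, alpha=256 as configured
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))           # frozen base weight (e.g. q_proj)
A = rng.normal(size=(r, d_in)) * 0.01        # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init

def lora_forward(x):
    # base projection plus the low-rank update scaled by alpha / r
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
# with B zero-initialised, LoRA output matches the frozen base layer exactly
assert np.allclose(lora_forward(x), x @ W.T)
```

Zero-initialising `B` makes the adapted model start out identical to the base model, which is the usual LoRA initialisation.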
## Learning Curve
| Step | Encoder unfrozen (WER) | LoRA-only (WER) |
|---|---|---|
| 500 | 46.34% | 39.14% |
| 2,000 | 26.67% | 30.87% |
| 5,000 | 24.85% | 24.77% |
| 10,000 | 23.89% | 24.53% |
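Both runs share the schedule listed under Training (peak LR 5e-4, 1,000 warmup steps). A sketch assuming linear warmup to the peak and a constant LR afterwards; the post-warmup decay shape is not stated above, so the hold is an assumption:

```python
def lr_at(step: int, peak_lr: float = 5e-4, warmup: int = 1000) -> float:
    # linear warmup from 0 to peak_lr over `warmup` steps, then hold
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr
```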
## Usage
```python
import torch
from nemo.collections.speechlm2.models import SALM

# Load the base model, then overlay the fine-tuned weights.
model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')
state = torch.load('consolidated_model.pt', map_location='cpu')
# strict=False: the checkpoint may omit keys untouched by fine-tuning
model.load_state_dict(state, strict=False)
model.cuda().eval()

# The audio locator tag marks where the referenced WAV is transcribed.
answer_ids = model.generate(
    prompts=[[{
        'role': 'user',
        'content': f'Transcribe the following: {model.audio_locator_tag}',
        'audio': ['atc_audio.wav'],
    }]],
    max_new_tokens=128,
)
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))
```
## Part of
Pilot-to-ATC Research: a comparative evaluation of W2V2 vs. Canary-Qwen for ATC-domain ASR.