
Parakeet-TDT-0.6B Vietnamese

A fine-tuned version of NVIDIA Parakeet-TDT-0.6B-v3 for Vietnamese and English automatic speech recognition (ASR).

Model Description

  • Architecture: FastConformer encoder + Token-and-Duration Transducer (TDT) decoder
  • Parameters: 0.6B
  • Tokenizer: 8,192-token SentencePiece BPE
  • Base model: nvidia/parakeet-tdt-0.6b-v3
  • Languages: Vietnamese (primary), English
  • Training data: 8,750h Vietnamese + 1,530h English (~10,280h total)

Evaluation Results

| Evaluation set | WER (word error rate) |
|----------------|-----------------------|
| Validation set | 5.68%                 |
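The card does not state how transcripts were normalized (casing, punctuation, numerals) before scoring. For reference, corpus-level WER is the total word-level edit distance (substitutions + deletions + insertions) divided by the total number of reference words, pooled across all utterances rather than averaged per utterance. A minimal sketch, assuming plain whitespace tokenization and no text normalization; `edit_distance` and `corpus_wer` are hypothetical helper names, not part of NeMo:

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions and insertions turning ref into hyp."""
    # prev[j] = distance between the first i-1 reference words and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of ref[i-1]
                curr[j - 1] + 1,     # insertion of hyp[j-1]
                prev[j - 1] + cost,  # substitution (or match if cost == 0)
            )
        prev = curr
    return prev[-1]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total word errors divided by total reference words (whitespace split, no normalization)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


# One substitution ("học" -> "chợ") over 5 reference words
print(corpus_wer(["tôi đi học", "xin chào"], ["tôi đi chợ", "xin chào"]))  # 0.2
```

Because Vietnamese marks tone with diacritics, "học" and "hoc" count as different words here; any normalization choice can shift the reported figure.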

Usage

import nemo.collections.asr as nemo_asr

# Load model
model = nemo_asr.models.ASRModel.from_pretrained("nmcuong/parakeet-tdt-0.6b-v3-vietnamese")

# Transcribe
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
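The fine-tuning data was restricted to clips between 0.1 s and 30 s, so inputs far outside that range differ from what the model saw during fine-tuning. A quick pre-flight check before calling `transcribe` can flag such files. The sketch below uses only the standard-library `wave` module, so it handles uncompressed WAV files only; `check_wav` and `WavCheck` are hypothetical names, and the 16 kHz mono expectation is an assumption carried over from the base Parakeet models rather than something this card states:

```python
import wave
from dataclasses import dataclass

# Duration bounds used to filter the fine-tuning data (from the card).
MIN_SECONDS = 0.1
MAX_SECONDS = 30.0
# Assumption: 16 kHz mono input, as expected by the base Parakeet models.
EXPECTED_RATE = 16_000
EXPECTED_CHANNELS = 1


@dataclass
class WavCheck:
    path: str
    seconds: float
    sample_rate: int
    channels: int
    warnings: list[str]


def check_wav(path: str) -> WavCheck:
    """Read a WAV header and report anything that differs from the assumed input format."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        seconds = wf.getnframes() / rate
    warnings = []
    if rate != EXPECTED_RATE:
        warnings.append(f"sample rate {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        warnings.append(f"{channels} channels, expected mono")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        warnings.append(
            f"duration {seconds:.2f} s outside the {MIN_SECONDS}-{MAX_SECONDS} s training range"
        )
    return WavCheck(path, seconds, rate, channels, warnings)
```

For example, `check_wav("audio.wav").warnings` returns an empty list for a 16 kHz mono clip of a few seconds. The check only warns; it does not resample or split audio.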

Training Details

  • Base model: nvidia/parakeet-tdt-0.6b-v3
  • Learning rate: 5e-5
  • Batch size: 48 (gradient accumulation: 2, effective batch size: 96)
  • Training steps: 130,000
  • GPUs: 2× H200
  • Precision: bf16-mixed
  • Audio duration filter: min 0.1s, max 30s
  • Max WER filter: 12.5%
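The card does not say what the 12.5% WER filter was measured against. The sketch below assumes each entry in a NeMo-style JSON-lines manifest (`audio_filepath`, `duration`, `text`) also carries a precomputed per-utterance `wer` score, which is a hypothetical field, and applies both filters listed above; `filter_manifest` is likewise a hypothetical helper:

```python
import json
from typing import Iterable, Iterator

MIN_DURATION = 0.1   # seconds, from the card
MAX_DURATION = 30.0  # seconds, from the card
MAX_WER = 0.125      # 12.5%, from the card


def filter_manifest(lines: Iterable[str]) -> Iterator[dict]:
    """Yield manifest entries that pass the duration and WER filters.

    Assumption: `wer` is a hypothetical precomputed field; entries without it are kept.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if not MIN_DURATION <= entry["duration"] <= MAX_DURATION:
            continue  # outside the 0.1-30 s duration range
        wer = entry.get("wer")
        if wer is not None and wer > MAX_WER:
            continue  # transcript judged too unreliable
        yield entry
```

Streaming the manifest line by line keeps memory use flat, which matters for a corpus of roughly 10,000 hours.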

Training Data

| Language   | Duration (hours) |
|------------|------------------|
| Vietnamese | 8,750            |
| English    | 1,530            |
| Total      | ~10,280          |

Acknowledgments
