---
language:
- ne
license: apache-2.0
base_model: openai/whisper-small
tags:
- whisper
- automatic-speech-recognition
- nepali
- fine-tuned
- audio
- asr
datasets:
- amitpant7/nepali-speech-to-text
metrics:
- wer
- cer
model-index:
- name: whisper-small-nepali
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: amitpant7/nepali-speech-to-text
      type: amitpant7/nepali-speech-to-text
      split: validation (10% hold-out)
    metrics:
    - type: wer
      value: 31.64
      name: WER (%)
    - type: cer
      value: 7.27
      name: CER (%)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# whisper-small-nepali

Fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [amitpant7/nepali-speech-to-text](https://huggingface.co/datasets/amitpant7/nepali-speech-to-text) dataset for **Nepali automatic speech recognition**.

Whisper-Small-Nepali is a fine-tuned automatic speech recognition (ASR) model based on OpenAI's Whisper architecture, optimized for transcribing Nepali speech.

---

## 📊 Evaluation Results

| Metric | Value |
|--------|-------|
| **WER (%)** | 31.64 |
| **CER (%)** | 7.27 |
| **Eval Loss** | 0.2154 |
| **Train Loss** | 1.0033 |
| **Epochs** | 30 |
| **Train time** | 26.5 min (RTX PRO 6000 Blackwell) |

> Evaluated on a 10% held-out validation split of `amitpant7/nepali-speech-to-text` (seed=42).
> Best checkpoint selected by lowest validation WER across all steps.

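
The card does not say which tool produced the WER and CER figures. The block below is only a rough, dependency-free sketch of how such corpus-level metrics are commonly defined: total edit distance divided by total reference length. It assumes whitespace word splitting and code-point-level characters. `edit_distance` and `error_rate` are hypothetical helper names, and the sample sentences are illustrative, not taken from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit):
    """Corpus-level error rate in percent; unit is "word" or "char"."""
    # Assumption: words are whitespace-separated and characters are Unicode
    # code points, so Devanagari combining marks count as separate characters.
    split = str.split if unit == "word" else list
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = split(ref), split(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * errors / total


# Illustrative pair: the hypothesis drops the final vowel sign of the last word.
references = ["म घर जान्छु"]
hypotheses = ["म घर जान्छ"]

wer = error_rate(references, hypotheses, "word")
cer = error_rate(references, hypotheses, "char")
print(f"WER: {wer:.2f}%  CER: {cer:.2f}%")
```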

---

## 📈 Training Curves

![Training Curves](training_curves.png)

*Train loss, validation loss, WER, CER, and LR cosine schedule across all 1140 steps.*

---

## 🚀 How to Use

```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("devrahulbanjara/whisper-small-nepali")
processor = WhisperProcessor.from_pretrained("devrahulbanjara/whisper-small-nepali")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load your audio (16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

---

## 🗂️ Dataset

| Field | Value |
|-------|-------|
| **Dataset** | [amitpant7/nepali-speech-to-text](https://huggingface.co/datasets/amitpant7/nepali-speech-to-text) |
| **Language** | Nepali (`ne`) |
| **Train split** | 90% of dataset |
| **Validation split** | 10% held-out (seed=42) |
| **Credit** | [amitpant7](https://huggingface.co/amitpant7) |

---

## ⚙️ Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| **Base model** | openai/whisper-small |
| **Epochs** | 30 |
| **Total steps** | 1140 |
| **Batch size (per device)** | 32 |
| **Gradient accumulation steps** | 2 |
| **Effective batch size** | 64 |
| **Learning rate** | 3e-5 |
| **LR scheduler** | cosine (warmup + decay) |
| **Warmup steps** | 150 |
| **Weight decay** | 0.01 |
| **Max grad norm** | 1.0 |
| **Dropout** | 0.1 |
| **Attention dropout** | 0.1 |
| **Early stopping patience** | 5 evaluations |
| **Early stopping threshold** | 0.05 WER improvement |
| **Eval every** | 80 steps |
| **Best model metric** | WER (lower is better) |
| **Precision** | bf16 (Blackwell native) |
| **Seed** | 42 |

---

## 🛡️ Regularisation & Augmentation

| Technique | Details |
|-----------|---------|
| **Weight decay (L2)** | λ=0.01 on all non-bias, non-LayerNorm params |
| **Dropout** | p=0.1 on encoder + decoder residual connections |
| **Attention dropout** | p=0.1 inside multi-head attention layers |
| **SpecAugment (time)** | 2× time masks, max 30 consecutive frames each |
| **SpecAugment (freq)** | 2× frequency masks, max 15 mel-bins each |
| **Gaussian noise** | Amplitude U[0.002, 0.010], applied with p=0.4 |
| **Early stopping** | Patience=5 evals, min delta=0.05 WER |
| **Gradient clipping** | Max L2 norm = 1.0 |

---

## 🏗️ Training Infrastructure

| Field | Value |
|-------|-------|
| **Hardware** | NVIDIA RTX PRO 6000 Blackwell Server Edition (24 physical / 48 logical cores) |
| **Framework** | HuggingFace Transformers `Seq2SeqTrainer` |
| **Experiment tracking** | Weights & Biases |
| **Train samples/sec** | 44.97 |
| **Total FLOPs** | 2.06 × 10¹⁹ |
| **Best model selection** | Lowest validation WER, restored via `load_best_model_at_end=True` |

---

## 📄 License

This model inherits the **Apache 2.0** license from [openai/whisper-small](https://huggingface.co/openai/whisper-small).
The training dataset ([amitpant7/nepali-speech-to-text](https://huggingface.co/datasets/amitpant7/nepali-speech-to-text)) is credited to its original author; please review its own license before any commercial use.

---

## 🙏 Acknowledgements

- [OpenAI Whisper](https://github.com/openai/whisper): base model architecture and weights
- [amitpant7](https://huggingface.co/amitpant7): Nepali speech-to-text dataset
- [HuggingFace Transformers](https://github.com/huggingface/transformers): training framework
- [Weights & Biases](https://wandb.ai): experiment tracking
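
For readers who want to approximate the augmentation listed under Regularisation & Augmentation, here is a minimal NumPy sketch. It is not the training code used for this model. It assumes the masks are applied to a `(n_mels, n_frames)` log-mel array, that masked regions are filled with the array mean, and that the Gaussian noise is added to the raw waveform; the card states none of these details. `spec_augment` and `add_gaussian_noise` are hypothetical names.

```python
import numpy as np


def spec_augment(mel, rng, n_time_masks=2, max_time=30, n_freq_masks=2, max_freq=15):
    """Mask random time and frequency bands of a (n_mels, n_frames) array."""
    out = mel.copy()
    n_mels, n_frames = out.shape
    mask_value = out.mean()  # assumption: fill value is not given in the card
    for _ in range(n_time_masks):
        width = int(rng.integers(0, max_time + 1))  # 0..max_time frames
        start = int(rng.integers(0, max(1, n_frames - width + 1)))
        out[:, start:start + width] = mask_value
    for _ in range(n_freq_masks):
        width = int(rng.integers(0, max_freq + 1))  # 0..max_freq mel bins
        start = int(rng.integers(0, max(1, n_mels - width + 1)))
        out[start:start + width, :] = mask_value
    return out


def add_gaussian_noise(wave, rng, min_amp=0.002, max_amp=0.010, p=0.4):
    """With probability p, add zero-mean Gaussian noise with amplitude ~ U[min_amp, max_amp]."""
    if rng.random() >= p:
        return wave  # left untouched in the remaining 1 - p of cases
    amp = float(rng.uniform(min_amp, max_amp))
    noise = rng.standard_normal(wave.shape).astype(wave.dtype)
    return wave + amp * noise


# Dummy inputs: 80 mel bins x 3000 frames, and one second of silence at 16 kHz.
rng = np.random.default_rng(42)
mel = rng.standard_normal((80, 3000)).astype(np.float32)
wave = np.zeros(16000, dtype=np.float32)

augmented = spec_augment(mel, rng)
noisy = add_gaussian_noise(wave, rng)
print(augmented.shape, noisy.shape)
```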