---
language:
- ne
license: apache-2.0
base_model: openai/whisper-small
tags:
- whisper
- automatic-speech-recognition
- nepali
- fine-tuned
- audio
- asr
datasets:
- amitpant7/nepali-speech-to-text
metrics:
- wer
- cer
model-index:
- name: whisper-small-nepali
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: amitpant7/nepali-speech-to-text
      type: amitpant7/nepali-speech-to-text
      split: validation (10% hold-out)
    metrics:
    - type: wer
      value: 31.64
      name: WER (%)
    - type: cer
      value: 7.27
      name: CER (%)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# whisper-small-nepali

Whisper-Small-Nepali is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small)
for **Nepali automatic speech recognition (ASR)**, trained on the
[amitpant7/nepali-speech-to-text](https://huggingface.co/datasets/amitpant7/nepali-speech-to-text) dataset.

---

## 📊 Evaluation Results

| Metric | Value |
|--------|-------|
| **WER (%)** | 31.64 |
| **CER (%)** | 7.27 |
| **Eval Loss** | 0.2154 |
| **Train Loss** | 1.0033 |
| **Epochs** | 30 |
| **Train time** | 26.5 min (RTX PRO 6000 Blackwell) |

> Evaluated on a 10% held-out validation split of `amitpant7/nepali-speech-to-text` (seed=42).
> Best checkpoint selected by lowest validation WER across all evaluations (run every 80 steps).
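
The card does not say which library produced these numbers. As a reference for what they measure, here is a minimal pure-Python sketch of the standard definitions: edit distance over words (WER) or characters (CER), divided by the reference length. The function names are illustrative. Text normalisation (punctuation, whitespace, Unicode forms) is left out, and CER here counts Unicode code points, so a Devanagari vowel sign counts as one character; both choices are assumptions and can shift the results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for a single utterance."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, counted over Unicode code points."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


ref = "म घर जान्छु"
hyp = "म घर जान्छ"
print(f"WER: {wer(ref, hyp):.2f}%  CER: {cer(ref, hyp):.2f}%")
```

Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.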

---

## 📈 Training Curves

![Training Curves](training_curves.png)

*Train loss, validation loss, WER, CER, and LR cosine schedule across all 1140 steps.*
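
The 1140-step figure can be cross-checked against the hyperparameters and throughput reported later in this card. The sketch below is back-of-the-envelope arithmetic only: the sample counts it prints are estimates inferred from rounded values, not figures reported by the author, and it assumes a single GPU.

```python
# Values reported elsewhere in this card.
total_steps = 1140
epochs = 30
per_device_batch = 32
grad_accum = 2
eval_every = 80
samples_per_sec = 44.97
train_minutes = 26.5  # rounded in the card, so derived numbers are approximate

effective_batch = per_device_batch * grad_accum           # samples per optimizer step
steps_per_epoch = total_steps // epochs                    # optimizer steps per epoch
full_batch_samples = steps_per_epoch * effective_batch     # samples/epoch if every step were full
n_evaluations = total_steps // eval_every                  # evaluations at 80-step intervals

# Throughput times wall-clock time gives the samples seen over all epochs.
est_samples_per_epoch = samples_per_sec * train_minutes * 60 / epochs

print(f"effective batch size      : {effective_batch}")
print(f"steps per epoch           : {steps_per_epoch}")
print(f"samples/epoch (full steps): {full_batch_samples}")
print(f"samples/epoch (throughput): {est_samples_per_epoch:.0f}")
print(f"scheduled evaluations     : {n_evaluations}")
```

The two per-epoch estimates land within a few percent of each other, which suggests the reported step count, batch size, throughput, and training time are mutually consistent.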

---

## 🚀 How to Use

```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model     = WhisperForConditionalGeneration.from_pretrained("devrahulbanjara/whisper-small-nepali")
processor = WhisperProcessor.from_pretrained("devrahulbanjara/whisper-small-nepali")
device    = "cuda" if torch.cuda.is_available() else "cpu"
model     = model.to(device)

# Load your audio; librosa resamples to 16 kHz and downmixes to mono
audio, sr = librosa.load("your_audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
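
Whisper's feature extractor pads or truncates each input to a 30-second window, so the snippet above transcribes only the first 30 seconds of a longer recording. One simple workaround, sketched here, is to cut the waveform into consecutive windows and transcribe each one. `chunk_bounds` is a hypothetical helper, not part of Transformers, and naive fixed-length cuts can split a word at a boundary.

```python
def chunk_bounds(n_samples, sampling_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices of consecutive windows covering the audio."""
    step = sampling_rate * chunk_seconds
    return [(start, min(start + step, n_samples)) for start in range(0, n_samples, step)]


# A 70-second clip at 16 kHz splits into 30 s + 30 s + 10 s.
for start, end in chunk_bounds(70 * 16000):
    print(start, end)

# With `audio`, `processor`, `model`, and `device` from the snippet above:
# texts = []
# for start, end in chunk_bounds(len(audio)):
#     feats = processor(audio[start:end], sampling_rate=16000,
#                       return_tensors="pt").input_features.to(device)
#     with torch.no_grad():
#         ids = model.generate(feats)
#     texts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
# print(" ".join(texts))
```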

---

## 🗂️ Dataset

| Field | Value |
|-------|-------|
| **Dataset** | [amitpant7/nepali-speech-to-text](https://huggingface.co/datasets/amitpant7/nepali-speech-to-text) |
| **Language** | Nepali (`ne`) |
| **Train split** | 90% of dataset |
| **Validation split** | 10% held-out (seed=42) |
| **Credit** | [amitpant7](https://huggingface.co/amitpant7) |
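
The card specifies a 90/10 split with seed 42 but not the code that produced it. With the `datasets` library this is commonly done with `Dataset.train_test_split(test_size=0.1, seed=42)`; whether the author used that call is an assumption. The library-free sketch below only illustrates the idea of a seeded hold-out split. `seeded_holdout_split` is a hypothetical helper and will not reproduce the author's exact indices.

```python
import random


def seeded_holdout_split(n_items, holdout_fraction=0.1, seed=42):
    """Shuffle indices with a fixed seed and hold out a fraction for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_items * holdout_fraction)
    return sorted(indices[n_holdout:]), sorted(indices[:n_holdout])


train_idx, val_idx = seeded_holdout_split(1000)
print(len(train_idx), len(val_idx))
```

Fixing the seed makes the split repeatable, so WER and CER from different runs are measured on the same held-out utterances.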

---

## ⚙️ Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| **Base model** | openai/whisper-small |
| **Epochs** | 30 |
| **Total steps** | 1140 |
| **Batch size (per device)** | 32 |
| **Gradient accumulation steps** | 2 |
| **Effective batch size** | 64 |
| **Learning rate** | 3e-5 |
| **LR scheduler** | cosine (warmup + decay) |
| **Warmup steps** | 150 |
| **Weight decay** | 0.01 |
| **Max grad norm** | 1.0 |
| **Dropout** | 0.1 |
| **Attention dropout** | 0.1 |
| **Early stopping patience** | 5 evaluations |
| **Early stopping threshold** | 0.05 WER improvement |
| **Eval every** | 80 steps |
| **Best model metric** | WER (lower is better) |
| **Precision** | bf16 (Blackwell native) |
| **Seed** | 42 |
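
To make the learning-rate schedule concrete, the sketch below evaluates it at a few steps using the values from the table. It assumes linear warmup followed by a single half-cosine decay to zero, which is the default behaviour of Transformers' `get_cosine_schedule_with_warmup`; the card itself only says "cosine (warmup + decay)". `lr_at` is an illustrative name.

```python
import math


def lr_at(step, base_lr=3e-5, warmup_steps=150, total_steps=1140):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 75, 150, 645, 1140):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 3e-5 at step 150, is half of that midway through the decay (step 645), and reaches zero at step 1140.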

---

## 🛡️ Regularisation & Augmentation

| Technique | Details |
|-----------|---------|
| **Weight decay (L2)** | λ=0.01 on all non-bias, non-LayerNorm params |
| **Dropout** | p=0.1 on encoder + decoder residual connections |
| **Attention dropout** | p=0.1 inside multi-head attention layers |
| **SpecAugment (time)** | 2× time masks, max 30 consecutive frames each |
| **SpecAugment (freq)** | 2× frequency masks, max 15 mel-bins each |
| **Gaussian noise** | Amplitude U[0.002, 0.010], applied with p=0.4 |
| **Early stopping** | Patience=5 evals, min delta=0.05 WER |
| **Gradient clipping** | Max L2 norm = 1.0 |
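
The two augmentations in the table can be sketched in plain Python as follows. Several details are assumptions, since the card does not state them: mask widths are drawn uniformly from 0 up to the maximum, masked cells are filled with 0.0, and the "amplitude" is taken to be the standard deviation of zero-mean Gaussian noise added to the raw waveform. A real training pipeline would apply the same logic to tensors.

```python
import random


def spec_augment(mel, rng, n_time_masks=2, max_time=30, n_freq_masks=2, max_freq=15, fill=0.0):
    """Return a masked copy of `mel`, a list of rows shaped [n_mels][n_frames]."""
    n_mels, n_frames = len(mel), len(mel[0])
    out = [row[:] for row in mel]
    for _ in range(n_time_masks):
        width = rng.randint(0, min(max_time, n_frames))
        start = rng.randint(0, n_frames - width)
        for row in out:  # blank the same frame range in every mel bin
            row[start:start + width] = [fill] * width
    for _ in range(n_freq_masks):
        height = rng.randint(0, min(max_freq, n_mels))
        start = rng.randint(0, n_mels - height)
        for m in range(start, start + height):  # blank whole mel bins
            out[m] = [fill] * n_frames
    return out


def add_gaussian_noise(wave, rng, p=0.4, min_amp=0.002, max_amp=0.010):
    """With probability p, add zero-mean Gaussian noise with std drawn from U[min_amp, max_amp]."""
    if rng.random() >= p:
        return list(wave)
    amp = rng.uniform(min_amp, max_amp)
    return [x + amp * rng.gauss(0.0, 1.0) for x in wave]


rng = random.Random(42)
mel = [[1.0] * 300 for _ in range(80)]  # dummy 80-bin x 300-frame spectrogram
masked = spec_augment(mel, rng)
n_masked = sum(v == 0.0 for row in masked for v in row)
print(f"masked cells: {n_masked} of {80 * 300}")
```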

---

## 🏗️ Training Infrastructure

| Field | Value |
|-------|-------|
| **Hardware** | NVIDIA RTX PRO 6000 Blackwell Server Edition GPU; host CPU with 24 physical / 48 logical cores |
| **Framework** | HuggingFace Transformers `Seq2SeqTrainer` |
| **Experiment tracking** | Weights & Biases |
| **Train samples/sec** | 44.97 |
| **Total FLOPs** | 2.06 × 10¹⁹ |
| **Best model selection** | Lowest validation WER, restored via `load_best_model_at_end=True` |
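
Early stopping and best-checkpoint selection work together: training halts once five consecutive evaluations fail to beat the best WER by more than the threshold, and the lowest-WER checkpoint is then restored. The sketch below simulates that rule on a made-up WER sequence. It is modelled loosely on Transformers' `EarlyStoppingCallback`, the WER values are invented for illustration, and treating the 0.05 threshold as WER percentage points is an assumption.

```python
def simulate_early_stopping(wer_history, patience=5, threshold=0.05):
    """Return (stop_index, best_index) for a lower-is-better metric.

    stop_index is None if the run completes without triggering early stopping.
    """
    best, best_idx, bad_evals = None, None, 0
    for i, wer in enumerate(wer_history):
        # Only improvements larger than the threshold reset the patience counter.
        significant = best is None or best - wer > threshold
        # Any improvement, however small, updates the best checkpoint.
        if best is None or wer < best:
            best, best_idx = wer, i
        bad_evals = 0 if significant else bad_evals + 1
        if bad_evals >= patience:
            return i, best_idx
    return None, best_idx


# Hypothetical validation WER at successive evaluations (not the author's data).
history = [60.0, 45.0, 38.0, 34.0, 33.0, 32.98, 33.1, 33.2, 33.05, 33.3, 33.4]
stop_at, best_at = simulate_early_stopping(history)
print(f"stopped at evaluation {stop_at}, best checkpoint from evaluation {best_at}")
```

Note that a small improvement (33.0 to 32.98 here) still counts as the best checkpoint even though it does not reset the patience counter.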

---

## 📄 License

This model inherits the **Apache 2.0** license from [openai/whisper-small](https://huggingface.co/openai/whisper-small).
The training dataset ([amitpant7/nepali-speech-to-text](https://huggingface.co/datasets/amitpant7/nepali-speech-to-text))
is credited to its original author. Review the dataset's license before any commercial use.

---

## 🙏 Acknowledgements

- [OpenAI Whisper](https://github.com/openai/whisper) — base model architecture and weights
- [amitpant7](https://huggingface.co/amitpant7) — Nepali speech-to-text dataset
- [HuggingFace Transformers](https://github.com/huggingface/transformers) — training framework
- [Weights & Biases](https://wandb.ai) — experiment tracking