---
license: apache-2.0
base_model: nineninesix/kyrgyz-whisper-medium
library_name: peft
pipeline_tag: automatic-speech-recognition
language:
- ky
tags:
- whisper
- peft
- lora
- adapter
- automatic-speech-recognition
- speech
datasets:
- fsicoli/common_voice_22_0
metrics:
- wer

model-index:
  - name: AleksTv/whisper-medium-ky-lora
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 22.0 (ky)
          type: fsicoli/common_voice_22_0
          config: ky
          split: test
        metrics:
          - type: wer
            name: WER (normalized)
            value: 16.2061
          - type: wer
            name: WER (orthographic)
            value: 19.1491
---

# Kyrgyz Whisper Medium — LoRA Adapter (PEFT)

This repository contains a **LoRA/PEFT adapter** for Kyrgyz automatic speech recognition (ASR), trained on top of `nineninesix/kyrgyz-whisper-medium`.

## Links

- Adapter (this repo): https://huggingface.co/AleksTv/whisper-medium-ky-lora
- Merged model (standalone, no PEFT needed): https://huggingface.co/AleksTv/whisper-medium-ky-merged
- Base model: https://huggingface.co/nineninesix/kyrgyz-whisper-medium
- Whisper paper: https://arxiv.org/abs/2212.04356
- Whisper Medium (architecture reference): https://huggingface.co/openai/whisper-medium

## What is this?

This repo provides **adapter weights only**. For inference, you must load the base model and then attach this adapter via PEFT.

If you want a single, standalone checkpoint, use the merged model linked above.
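
For the standalone route, a minimal sketch of loading the merged checkpoint without PEFT might look like the following. The helper name `load_merged_asr` is hypothetical, and the sketch assumes the merged repo ships its processor files and needs `trust_remote_code=True` like the base model; adjust if that does not hold.

```python
MERGED_ID = "AleksTv/whisper-medium-ky-merged"


def load_merged_asr(model_id: str = MERGED_ID):
    """Build an ASR pipeline from the merged checkpoint (downloads weights on first call)."""
    # Imported lazily so that defining this helper does not require the heavy dependencies.
    import torch
    from transformers import pipeline

    use_cuda = torch.cuda.is_available()
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch.float16 if use_cuda else torch.float32,
        device=0 if use_cuda else -1,
        # Assumption: mirrors how the processor is loaded for the base model.
        trust_remote_code=True,
    )
```

Calling `load_merged_asr()("path/to/audio.wav")["text"]` would then return the transcript, as in the adapter-based example.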

## Dataset

- Training/evaluation dataset: `fsicoli/common_voice_22_0` (config: `ky`)

## Results

Evaluation on Common Voice 22.0 Kyrgyz (`test` split); WER values are percentages:
- `WER` (normalized): **16.2061**
- `WER_ortho` (orthographic): **19.1491**
- `test_loss`: **0.1722**

Quick check (200 random test samples):
- `WER`: **16.1677**
- `WER_ortho`: **19.6021**

Note: WER depends on text normalization (punctuation and case), decoding settings, and audio preprocessing, so compare these numbers only against results computed with the same setup.
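
To illustrate why the two WER figures differ, here is a small self-contained sketch that scores the same hypothesis with and without normalization. The `normalize` function (lowercasing plus ASCII punctuation stripping) is an assumption for illustration only; it is not necessarily the normalizer used to produce the numbers above.

```python
import string


def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation; an illustrative normalizer only."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


reference = "Салам, дүйнө! Кандайсың?"
hypothesis = "салам дүйнө кандайсың"

wer_ortho = wer(reference, hypothesis)                       # raw text
wer_norm = wer(normalize(reference), normalize(hypothesis))  # normalized text
print(f"WER_ortho={wer_ortho:.2f}  WER={wer_norm:.2f}")
```

In this toy case every word differs only by case or punctuation, so the orthographic score is 100% while the normalized score is 0%.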

## Training details

LoRA fine-tuning summary:
- LoRA: `r=8`, `lora_alpha=16`, `lora_dropout=0.1`
- Target modules: `q_proj`, `v_proj`
- Steps: `max_steps=4000`
- Best checkpoint by WER: `checkpoint-4000` (WER=16.21)
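
As a rough sanity check on adapter size, the arithmetic below estimates the number of trainable LoRA parameters. It assumes the base model keeps the Whisper Medium layout (hidden size 1024, 24 encoder and 24 decoder layers), that `q_proj` and `v_proj` are matched in every attention block, and that standard LoRA scaling (`lora_alpha / r`) is used; none of these are confirmed by the training logs.

```python
# Hyperparameters from the summary above.
r, lora_alpha = 8, 16
target_modules = ("q_proj", "v_proj")

# Assumed architecture, taken from the openai/whisper-medium config.
d_model = 1024
encoder_layers = 24
decoder_layers = 24

# Attention blocks per layer: the encoder has self-attention only;
# the decoder has self-attention plus cross-attention.
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_modules = attention_blocks * len(target_modules)

# Each adapted d_model x d_model projection gains A (r x d_model) and B (d_model x r).
params_per_module = r * (d_model + d_model)
trainable_params = adapted_modules * params_per_module

scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"adapted modules:  {adapted_modules}")
print(f"trainable params: {trainable_params:,}")
print(f"LoRA scaling:     {scaling}")
```

Under these assumptions the adapter holds about 2.36M trainable parameters, a small fraction of the full model.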

Training progress (selected checkpoints):

| Step | Train loss | Val loss | WER_ortho (%) | WER (%) |
|---:|---:|---:|---:|---:|
| 500 | 0.7980 | 0.7911 | 44.3501 | 42.0754 |
| 1000 | 0.3980 | 0.2043 | 28.9947 | 27.8551 |
| 1500 | 0.1712 | 0.1821 | 20.7479 | 17.7343 |
| 2000 | 0.1734 | 0.1770 | 20.7569 | 17.6977 |
| 2500 | 0.1935 | 0.1743 | 19.7995 | 16.8192 |
| 3000 | 0.3406 | 0.1728 | 19.8988 | 16.9656 |
| 3500 | 0.3192 | 0.1724 | 19.3840 | 16.4074 |
| 4000 | 0.1499 | 0.1722 | 19.1491 | 16.2061 |

## How to use

### Install

```bash
pip install -U "transformers" "peft" "accelerate" "torch"
```

### Inference (Transformers pipeline + PEFT)

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

adapter_id = "AleksTv/whisper-medium-ky-lora"

peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path  # nineninesix/kyrgyz-whisper-medium

device = 0 if torch.cuda.is_available() else -1
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load without `device_map`: the pipeline below moves the model to `device`,
# and combining an accelerate device map with an explicit `device` can fail.
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
  base_id,
  torch_dtype=dtype,
  low_cpu_mem_usage=True,
  use_safetensors=True,
)

model = PeftModel.from_pretrained(base_model, adapter_id)

# The base model uses custom tokenizer components for Kyrgyz support.
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)

asr = pipeline(
  "automatic-speech-recognition",
  model=model,
  tokenizer=processor.tokenizer,
  feature_extractor=processor.feature_extractor,
  torch_dtype=dtype,  # cast input features to the model's dtype (needed for float16)
  device=device,
)

print(asr("path/to/audio.wav")["text"])
```

### Merge adapter into the base model (standalone weights)

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

adapter_id = "AleksTv/whisper-medium-ky-lora"

peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path

dtype = torch.float16 if torch.cuda.is_available() else torch.float32

base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
  base_id,
  torch_dtype=dtype,
  low_cpu_mem_usage=True,
  use_safetensors=True,
)

model = PeftModel.from_pretrained(base_model, adapter_id)
# Fold the LoRA updates into the base weights and drop the PEFT wrappers.
merged = model.merge_and_unload()

out_dir = "whisper-medium-ky-merged"
merged.save_pretrained(out_dir, safe_serialization=True)
AutoProcessor.from_pretrained(base_id, trust_remote_code=True).save_pretrained(out_dir)
```

## Limitations

- Quality may degrade on very noisy audio, far-field microphones, strong accents, code-switching, or long recordings without segmentation.
- For production use, add voice activity detection (VAD) or another segmentation step before transcription, and post-process the output text.
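
As one simple form of segmentation, the sketch below splits a long recording into fixed-length overlapping windows sized for Whisper's 30-second input. The helper `chunk_bounds` and its default overlap are hypothetical choices for illustration; this is not VAD, and stitching the per-window transcripts back together is not shown.

```python
def chunk_bounds(n_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# Example: a 70-second recording at 16 kHz.
windows = chunk_bounds(70 * 16_000)
for start, end in windows:
    print(f"{start / 16_000:5.1f}s -> {end / 16_000:5.1f}s")
```

The Transformers ASR pipeline also accepts a `chunk_length_s` argument at call time (for example `asr("path/to/audio.wav", chunk_length_s=30)`), which handles windowing and merging internally and may be the simpler option.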

## License

Apache-2.0.