---
license: apache-2.0
base_model: nineninesix/kyrgyz-whisper-medium
library_name: peft
pipeline_tag: automatic-speech-recognition
language:
- ky
tags:
- whisper
- peft
- lora
- adapter
- automatic-speech-recognition
- speech
datasets:
- fsicoli/common_voice_22_0
metrics:
- wer
model-index:
- name: AleksTv/whisper-medium-ky-lora
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 22.0 (ky)
      type: fsicoli/common_voice_22_0
      config: ky
      split: test
    metrics:
    - type: wer
      name: WER (normalized)
      value: 16.2061
    - type: wer
      name: WER (orthographic)
      value: 19.1491
---

# Kyrgyz Whisper Medium: LoRA Adapter (PEFT)

This repository contains a **LoRA/PEFT adapter** for Kyrgyz automatic speech recognition (ASR).

## Links

- Adapter (this repo): https://huggingface.co/AleksTv/whisper-medium-ky-lora
- Merged model (standalone, no PEFT needed): https://huggingface.co/AleksTv/whisper-medium-ky-merged
- Base model: https://huggingface.co/nineninesix/kyrgyz-whisper-medium
- Whisper paper: https://arxiv.org/abs/2212.04356
- Whisper Medium (architecture reference): https://huggingface.co/openai/whisper-medium

## What is this?

This repo provides **adapter weights only**. For inference, you must load the base model and then attach this adapter via PEFT.

If you want a single, standalone checkpoint, use the merged model linked above.

## Dataset

- Training/evaluation dataset: `fsicoli/common_voice_22_0` (config: `ky`)

## Results

Evaluation on Common Voice 22.0 Kyrgyz (`test` split):

- `WER` (normalized): **16.2061**
- `WER_ortho` (orthographic): **19.1491**
- `test_loss`: **0.1722**

Quick check (200 random test samples):

- `WER`: **16.1677**
- `WER_ortho`: **19.6021**

Note: WER depends on text normalization (punctuation/case), decoding settings, and audio preprocessing.
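To illustrate why the normalized and orthographic scores differ, here is a minimal sketch in plain Python. The `normalize` helper and the toy sentence pair are assumptions made for illustration; the normalizer used to produce the numbers above is not documented here and may behave differently.

```python
import unicodedata


def normalize(text: str) -> str:
    """Illustrative normalizer (an assumption, not the one behind the reported scores)."""
    lowered = text.lower()
    # Replace every Unicode punctuation character (category P*) with a space.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in lowered
    )
    # Collapse runs of whitespace.
    return " ".join(no_punct.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Toy example (made up): the hypothesis drops the last word and all punctuation.
reference = "Салам, дүйнө! Бул сыноо."
hypothesis = "Салам дүйнө бул"

wer_ortho = wer(reference, hypothesis)
wer_norm = wer(normalize(reference), normalize(hypothesis))
print(f"WER_ortho: {wer_ortho:.2f}")
print(f"WER: {wer_norm:.2f}")
```

On this toy pair the orthographic score is 100.00, because every reference word differs from the hypothesis in case or punctuation, while the normalized score is 25.00, reflecting only the one missing word.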
## Training details

LoRA fine-tuning summary:

- LoRA: `r=8`, `lora_alpha=16`, `lora_dropout=0.1`
- Target modules: `q_proj`, `v_proj`
- Steps: `max_steps=4000`
- Best checkpoint by WER: `checkpoint-4000` (WER=16.21)

Training progress (selected checkpoints):

| Step | Train loss | Val loss | WER_ortho | WER |
|---:|---:|---:|---:|---:|
| 500 | 0.7980 | 0.7911 | 44.3501 | 42.0754 |
| 1000 | 0.3980 | 0.2043 | 28.9947 | 27.8551 |
| 1500 | 0.1712 | 0.1821 | 20.7479 | 17.7343 |
| 2000 | 0.1734 | 0.1770 | 20.7569 | 17.6977 |
| 2500 | 0.1935 | 0.1743 | 19.7995 | 16.8192 |
| 3000 | 0.3406 | 0.1728 | 19.8988 | 16.9656 |
| 3500 | 0.3192 | 0.1724 | 19.3840 | 16.4074 |
| 4000 | 0.1499 | 0.1722 | 19.1491 | 16.2061 |

## How to use

### Install

```bash
pip install -U "transformers" "peft" "accelerate" "torch"
```

### Inference (Transformers pipeline + PEFT)

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

adapter_id = "AleksTv/whisper-medium-ky-lora"

# Read the base model ID from the adapter config.
peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path  # nineninesix/kyrgyz-whisper-medium

device = 0 if torch.cuda.is_available() else -1
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# No device_map here: the pipeline below places the model via `device`.
# Combining device_map="auto" with an explicit pipeline device can raise an error.
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    base_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# Attach the LoRA adapter to the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)

# The base model uses custom tokenizer components for Kyrgyz support.
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

print(asr("path/to/audio.wav")["text"])
```

### Merge adapter into the base model (standalone weights)

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

adapter_id = "AleksTv/whisper-medium-ky-lora"

# Read the base model ID from the adapter config.
peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path

dtype = torch.float16 if torch.cuda.is_available() else torch.float32

base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    base_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# Attach the adapter, then fold the LoRA weights into the base weights.
model = PeftModel.from_pretrained(base_model, adapter_id)
merged = model.merge_and_unload()

# Save the merged model together with the processor so the directory is standalone.
out_dir = "whisper-medium-ky-merged"
merged.save_pretrained(out_dir, safe_serialization=True)
AutoProcessor.from_pretrained(base_id, trust_remote_code=True).save_pretrained(out_dir)
```

## Limitations

- Quality may degrade on very noisy audio, far-field microphones, strong accents, code-switching, or long recordings without segmentation.
- For production use, you typically want voice activity detection (VAD) or segmentation before transcription, plus post-processing of the output text.

## License

Apache-2.0.