Kyutai STT 1B — Icelandic (fine-tune)
A streaming speech-to-text model fine-tuned from kyutai/stt-1b-en_fr for Icelandic.
The fine-tune extends the text vocabulary with Icelandic sub-words and adds two task-domain prompts so the same checkpoint can either transcribe Icelandic or translate Icelandic → English.
The model was trained with a 5-second output delay, so append about 5 seconds of trailing silence to your audio input; otherwise the last words of the utterance are never emitted.
⚠️ Work in progress. Output quality is not production-grade. Icelandic transcription is intelligible but makes obvious word-level mistakes. English translation is fluent but loose (paraphrasing and the occasional hallucination).
Training data
Open-licensed Icelandic speech corpora paired with:
- Original Icelandic transcripts
- Machine-translated English transcripts. No human-translated parallel speech data was used.
Quick start
import numpy as np
import torch
from transformers import (
    KyutaiSpeechToTextForConditionalGeneration,
    KyutaiSpeechToTextProcessor,
)
from transformers.generation import LogitsProcessor

DOMAIN_TOKENS = {
    "asr_is_is": [9318, 8002, 8003, 9193],  # <is>: Icelandic ASR
    "asr_is_en": [9318, 8032, 8015, 9193],  # <en>: Icelandic → English
}


class ForcePrefix(LogitsProcessor):
    """Force the first N tokens to the domain prefix, then force <pad> for
    `pad_steps` more steps (the model's asr_delay window). Without the pad
    window the model can prematurely emit an end-of-utterance token (\\n / .)
    and stay in pad mode forever, producing no transcript.
    """

    def __init__(self, prefix, prompt_len=1, pad_steps=12, pad_token_id=3):
        self.prefix = prefix
        self.prompt_len = prompt_len
        self.pad_steps = pad_steps
        self.pad_token_id = pad_token_id

    def __call__(self, input_ids, scores):
        # Index of the token about to be generated, counted from the end of the prompt.
        i = input_ids.shape[1] - self.prompt_len
        if 0 <= i < len(self.prefix):
            # Still inside the domain prefix: allow only the next prefix token.
            scores[:] = float("-inf")
            scores[:, self.prefix[i]] = 0.0
        elif i < len(self.prefix) + self.pad_steps:
            # Inside the forced pad window: allow only <pad>.
            scores[:] = float("-inf")
            scores[:, self.pad_token_id] = 0.0
        return scores


device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

model_id = "mideind/kyutai-stt-1b-is-en"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device).eval()

# audio: 1-D float32 numpy array, 24 kHz mono, [-1, 1] range.
# IMPORTANT: append ~5 s of trailing silence so the model has time to flush
# its delayed text output past the asr_delay boundary.
audio_with_silence = np.concatenate([audio, np.zeros(5 * 24000, dtype=np.float32)])

inputs = processor(audio=audio_with_silence, sampling_rate=24000)
inputs = {k: v.to(device) for k, v in inputs.items()}

domain = "asr_is_is"  # or "asr_is_en"
with torch.no_grad():
    out = model.generate(
        **inputs,
        logits_processor=[ForcePrefix(DOMAIN_TOKENS[domain])],
    )

text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)
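The quick start assumes that audio already holds 24 kHz mono samples in the [-1, 1] range. As a minimal sketch of one way to get there with only the Python standard library, the hypothetical helper read_wav_mono below reads a 16-bit PCM WAV file, averages the channels, and scales the samples. The helper names and the 16-bit restriction are assumptions of this sketch, not part of the model's API, and no resampling is performed.

```python
import struct
import wave

TARGET_SR = 24_000  # sample rate the model expects, per this model card


def read_wav_mono(path):
    """Read a 16-bit PCM WAV file and return (samples, sample_rate).

    `samples` is a list of floats in [-1, 1], downmixed to mono by
    averaging the channels. This sketch does not resample.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())

    # Interleaved little-endian int16 samples: ch0, ch1, ..., ch0, ch1, ...
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    mono = [
        sum(ints[i:i + n_channels]) / (n_channels * 32768.0)
        for i in range(0, len(ints), n_channels)
    ]
    return mono, sample_rate


def needs_resampling(sample_rate):
    """True if the audio must be resampled before it is given to the model."""
    return sample_rate != TARGET_SR
```

Convert the result with np.asarray(samples, dtype=np.float32) before passing it to the processor. If needs_resampling returns True, resample the audio to 24 kHz first with an audio library of your choice.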
Domains
The fine-tune supports two task-domain prompts. They must be injected as the first 4 generated text tokens (see ForcePrefix above). Without one of them, the model produces only padding tokens.
| domain | text-token sequence | task |
|---|---|---|
| asr_is_is | [9318, 8002, 8003, 9193] (decodes to <is>) | Icelandic speech → Icelandic text |
| asr_is_en | [9318, 8032, 8015, 9193] (decodes to <en>) | Icelandic speech → English text |
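To make the forcing schedule concrete, here is a small standalone sketch that mirrors the branching of the ForcePrefix processor from the quick start without needing the model. The function name forced_token is hypothetical; the pad token id 3 and the 12-step pad window are the defaults from the quick-start snippet.

```python
PREFIXES = {
    "asr_is_is": [9318, 8002, 8003, 9193],
    "asr_is_en": [9318, 8032, 8015, 9193],
}
PAD_TOKEN_ID = 3  # default pad_token_id in the quick-start ForcePrefix
PAD_STEPS = 12    # default pad_steps in the quick-start ForcePrefix


def forced_token(step, domain, pad_steps=PAD_STEPS, pad_token_id=PAD_TOKEN_ID):
    """Return the token id forced at 0-based generation step `step`,
    or None once the model is allowed to decode freely."""
    if step < 0:
        raise ValueError("step must be non-negative")
    prefix = PREFIXES[domain]
    if step < len(prefix):
        return prefix[step]       # domain prefix phase
    if step < len(prefix) + pad_steps:
        return pad_token_id       # forced pad window
    return None                   # free decoding


# First 18 steps for the translation domain:
# 4 prefix tokens, 12 forced pads, then unconstrained steps.
schedule = [forced_token(s, "asr_is_en") for s in range(18)]
```

With the defaults, steps 0 to 3 carry the domain prefix, steps 4 to 15 are forced to pad, and step 16 is the first one the model chooses itself.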
Architecture
Same architecture as kyutai/stt-1b-en_fr-trfs with an extended text vocabulary. Pure decoder-only transformer over text + audio codebooks, with a built-in Mimi audio codec (inlined into the same checkpoint).
| property | value |
|---|---|
| Architecture | KyutaiSpeechToTextForConditionalGeneration |
| Hidden size | 2048 |
| Layers | 16 |
| Attention heads | 16 (MHA, head_dim 128) |
| FFN dim (fused gated SiLU) | 11264 (= 2 × 5632) |
| Position encoding | RoPE, θ = 100000 |
| Context (sliding window) | 375 frames @ 12.5 Hz = 30 s |
| Text vocab | 9323 (out vocab 9322 + 1 unused start slot) |
| Audio codebooks | 32 × 2049 (Mimi @ 12.5 Hz) |
| Audio sample rate | 24000 Hz, mono |
| ASR output delay | ~5 s |
| Parameters | ~1.07 B (LM) + ~0.13 B (codec) |
| Precision on disk | LM bf16, codec f32 |
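Several rows in the table are related by simple arithmetic. The sketch below recomputes them from the listed values as a sanity check; the variable names are illustrative only, and the trailing-silence length uses the approximate 5 s delay quoted above.

```python
SAMPLE_RATE_HZ = 24_000   # audio sample rate
FRAME_RATE_HZ = 12.5      # Mimi frame rate
CONTEXT_FRAMES = 375      # sliding-window context
HIDDEN_SIZE = 2048
NUM_HEADS = 16
FFN_HALF_DIM = 5632       # one half of the fused gated projection
ASR_DELAY_S = 5           # approximate output delay

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # 1920.0 audio samples per frame
context_seconds = CONTEXT_FRAMES / FRAME_RATE_HZ    # 30.0 s of audio in the window
head_dim = HIDDEN_SIZE // NUM_HEADS                 # 128
ffn_dim = 2 * FFN_HALF_DIM                          # 11264
silence_samples = ASR_DELAY_S * SAMPLE_RATE_HZ      # 120000 zeros to append
```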
Evaluation
Evaluated on 100 examples from a filtered subset of the Samrómur test split. Metrics are reported both raw (case- and punctuation-sensitive) and normalised (lowercased, punctuation stripped, whitespace squeezed).
| task | metric | raw | normalised |
|---|---|---|---|
| asr_is_is (transcription) | WER ↓ | 13.43 % | 11.22 % |
| asr_is_en (translation) | BLEU ↑ | 37.79 | 38.91 |
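The difference between the raw and normalised columns can be illustrated with a short sketch. It assumes a plain reading of the normalisation described above (lowercase, strip punctuation, squeeze whitespace) and a standard word-level edit distance; the exact normaliser and scoring tool used for the reported numbers are not stated here, so treat the function names and details as assumptions.

```python
import re


def normalise(text):
    """Lowercase, strip punctuation, and squeeze whitespace."""
    text = text.lower()
    # \w is Unicode-aware in Python 3, so Icelandic letters (þ, ð, æ, ö) are kept.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Góðan daginn."
hypothesis = "góðan daginn"
raw_wer = word_error_rate(reference, hypothesis)                         # 1.0
norm_wer = word_error_rate(normalise(reference), normalise(hypothesis))  # 0.0
```

In this example every word differs only in case or punctuation, so the raw score counts both words as errors while the normalised score counts none. Corpus-level WER is usually computed as total edits over total reference words rather than as an average of per-utterance scores.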
The translation BLEU looks very good on this small in-domain sample, but the model does not generalise well outside the Samrómur read-speech distribution. Expect significant quality degradation on conversational, accented, noisy, or otherwise out-of-distribution audio.
Limitations
- Trailing silence is mandatory. The model emits text with an internal delay; without ~5 s of trailing silence after the last spoken word, the final words are never emitted.
- Domain prompt is mandatory. No prompt → only <pad> tokens.
- Forced pad window is recommended. The trained model can occasionally emit a premature end-of-utterance token (\n / .) within its ASR-delay window and then refuse to emit any further text. Forcing <pad> for the first ~12 steps after the domain prefix avoids this (see the snippet above).
- The English translation domain (asr_is_en) produces fairly fluent English but is not a faithful translator; paraphrasing and hallucination are expected.
- Icelandic transcription quality: the model produces decent transcriptions but is not always "well behaved".
- Tested with transformers==v5.9.0 and torch==2.11.0.