Kyutai STT 1B — Icelandic (fine-tune)

A streaming speech-to-text model fine-tuned from kyutai/stt-1b-en_fr for Icelandic. The fine-tune extends the text vocabulary with Icelandic sub-words and adds two task-domain prompts so the same checkpoint can either transcribe Icelandic or translate Icelandic → English. The model was trained with a 5-second output delay, so append about 5 seconds of trailing silence to your audio input; otherwise the final words are never emitted.

⚠️ Work in progress. Output quality is not production-grade. Icelandic transcription is intelligible but makes obvious word-level mistakes. English translation is fluent but loose (paraphrasing and the occasional hallucination).

Training data

Open-licensed Icelandic speech corpora paired with:

  • Original Icelandic transcripts
  • Machine-translated English transcripts. No human-translated parallel speech data was used.

Quick start

import torch
from transformers import (
    KyutaiSpeechToTextForConditionalGeneration,
    KyutaiSpeechToTextProcessor,
)
from transformers.generation import LogitsProcessor

DOMAIN_TOKENS = {
    "asr_is_is": [9318, 8002, 8003, 9193],   # <is>  — Icelandic ASR
    "asr_is_en": [9318, 8032, 8015, 9193],   # <en>  — Icelandic → English
}


class ForcePrefix(LogitsProcessor):
    """Force the first N tokens to the domain prefix, then force <pad> for
    `pad_steps` more steps (the model's asr_delay window). Without the pad
    window the model can prematurely emit an end-of-utterance token (\\n / .)
    and stay in pad mode forever, producing no transcript.
    """
    def __init__(self, prefix, prompt_len=1, pad_steps=12, pad_token_id=3):
        self.prefix = prefix
        self.prompt_len = prompt_len
        self.pad_steps = pad_steps
        self.pad_token_id = pad_token_id

    def __call__(self, input_ids, scores):
        i = input_ids.shape[1] - self.prompt_len
        if 0 <= i < len(self.prefix):
            scores[:] = float("-inf")
            scores[:, self.prefix[i]] = 0.0
        elif i < len(self.prefix) + self.pad_steps:
            scores[:] = float("-inf")
            scores[:, self.pad_token_id] = 0.0
        return scores


device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

model_id = "mideind/kyutai-stt-1b-is-en"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device).eval()

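import numpy as np

# audio: 1-D float32 numpy array, 24 kHz mono, values in [-1, 1].
# Loading is left to the caller. One option (an assumption, not part of the
# original recipe; "speech.wav" is a placeholder path):
#   audio, _ = librosa.load("speech.wav", sr=24000, mono=True)
# IMPORTANT: append ~5 s of trailing silence so the model has time to flush
# its delayed text output past the asr_delay boundary.
audio_with_silence = np.concatenate([audio, np.zeros(5 * 24000, dtype=np.float32)])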

inputs = processor(audio=audio_with_silence, sampling_rate=24000)
inputs = {k: v.to(device) for k, v in inputs.items()}

domain = "asr_is_is"  # or "asr_is_en"
with torch.no_grad():
    out = model.generate(
        **inputs,
        logits_processor=[ForcePrefix(DOMAIN_TOKENS[domain])],
    )

text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)

Domains

The fine-tune supports two task-domain prompts. They must be injected as the first 4 generated text tokens (see ForcePrefix above). Without one of them, the model produces only padding tokens.

| domain | text-token sequence | task |
|---|---|---|
| asr_is_is | [9318, 8002, 8003, 9193] (decodes to <is>) | Icelandic speech → Icelandic text |
| asr_is_en | [9318, 8032, 8015, 9193] (decodes to <en>) | Icelandic speech → English text |
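
The forcing schedule that ForcePrefix applies can be read as a pure function of the generation step. The following is a minimal sketch, assuming the defaults from the Quick start snippet (pad_steps=12, pad_token_id=3). `forced_token` is a hypothetical helper name used only for illustration; it is not part of transformers or of this checkpoint.

```python
from typing import Optional, Sequence

# Token IDs copied from the domain table above.
DOMAIN_TOKENS = {
    "asr_is_is": [9318, 8002, 8003, 9193],
    "asr_is_en": [9318, 8032, 8015, 9193],
}


def forced_token(
    step: int,
    prefix: Sequence[int],
    pad_steps: int = 12,
    pad_token_id: int = 3,
) -> Optional[int]:
    """Return the token forced at 0-based generation step `step`,
    or None once the model is free to choose its own token."""
    if step < 0:
        # Choice made in this sketch: negative steps are left unconstrained.
        return None
    if step < len(prefix):
        return prefix[step]      # domain prefix
    if step < len(prefix) + pad_steps:
        return pad_token_id      # forced pad window
    return None                  # unconstrained decoding


# First 18 steps for the translation domain: 4 prefix tokens,
# 12 forced pads, then 2 unconstrained steps.
schedule = [forced_token(s, DOMAIN_TOKENS["asr_is_en"]) for s in range(18)]
print(schedule)
```

Keeping the schedule separate from the logits masking makes it easy to check how many steps are constrained before wiring it into a LogitsProcessor.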

Architecture

Same architecture as kyutai/stt-1b-en_fr-trfs with an extended text vocabulary. Pure decoder-only transformer over text + audio codebooks, with a built-in Mimi audio codec (inlined into the same checkpoint).

| property | value |
|---|---|
| Architecture | KyutaiSpeechToTextForConditionalGeneration |
| Hidden size | 2048 |
| Layers | 16 |
| Attention heads | 16 (MHA, head_dim 128) |
| FFN dim (fused gated SiLU) | 11264 (= 2 × 5632) |
| Position encoding | RoPE, θ = 100000 |
| Context (sliding window) | 375 frames @ 12.5 Hz = 30 s |
| Text vocab | 9323 (out vocab 9322 + 1 unused start slot) |
| Audio codebooks | 32 × 2049 (Mimi @ 12.5 Hz) |
| Audio sample rate | 24000 Hz, mono |
| ASR output delay | ~5 s |
| Parameters | ~1.07 B (LM) + ~0.13 B (codec) |
| Precision on disk | LM bf16, codec f32 |
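
The derived figures in the table follow from a handful of base values. Below is a quick sanity check that uses only numbers stated in this card; the variable names are illustrative, and the pad-window figure assumes the pad_steps=12 default from the Quick start snippet.

```python
# Base values copied from the architecture table.
hidden_size = 2048
num_heads = 16
ffn_half = 5632
frame_rate_hz = 12.5
sample_rate_hz = 24_000
context_frames = 375
asr_delay_s = 5.0
pad_steps = 12  # default used by ForcePrefix in the Quick start snippet

head_dim = hidden_size // num_heads                    # per-head width
ffn_dim = 2 * ffn_half                                 # fused gated projection
samples_per_frame = sample_rate_hz / frame_rate_hz     # audio samples per frame
context_seconds = context_frames / frame_rate_hz       # sliding-window length
delay_frames = asr_delay_s * frame_rate_hz             # delay in frames
silence_samples = int(asr_delay_s * sample_rate_hz)    # trailing silence to append
pad_window_s = pad_steps / frame_rate_hz               # forced pad window in seconds

print(f"head_dim          = {head_dim}")
print(f"ffn_dim           = {ffn_dim}")
print(f"samples per frame = {samples_per_frame}")
print(f"context           = {context_seconds} s")
print(f"ASR delay         = {delay_frames} frames")
print(f"trailing silence  = {silence_samples} samples")
print(f"pad window        = {pad_window_s:.2f} s")
```

Note that the 5 s delay corresponds to 62.5 frames at 12.5 Hz, while the 12-step forced pad window covers just under one second of it.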

Evaluation

Evaluated on 100 examples from a filtered subset of the Samrómur test split. Metrics are reported both raw (case- and punctuation-sensitive) and normalised (lowercased, punctuation stripped, whitespace squeezed).

| task | metric | raw | normalised |
|---|---|---|---|
| asr_is_is (transcription) | WER ↓ | 13.43 % | 11.22 % |
| asr_is_en (translation) | BLEU ↑ | 37.79 | 38.91 |

The translation BLEU looks very good on this in-domain sample, but the model does not generalise well outside the Samrómur read-speech distribution: expect significant quality degradation on conversational, accented, noisy, or otherwise out-of-distribution audio.
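
The difference between the raw and normalised columns can be illustrated with a small example. This is a sketch only: the card does not state which normalisation code or WER tool was used, so `normalise` and `wer` below are assumptions that follow the description above (lowercase, strip punctuation, squeeze whitespace) and the standard word-level edit-distance definition of WER.

```python
import unicodedata
from typing import List


def normalise(text: str) -> str:
    """Lowercase, drop Unicode punctuation, squeeze whitespace."""
    lowered = text.lower()
    no_punct = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(no_punct.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example pair, not taken from the evaluation set.
ref = "Halló, heimur! Þetta er próf."
hyp = "halló heimur þetta er prófið"

raw_wer = wer(ref, hyp)
norm_wer = wer(normalise(ref), normalise(hyp))
print(f"raw WER        = {raw_wer:.2f}")
print(f"normalised WER = {norm_wer:.2f}")
```

In this toy pair, case and punctuation differences count as errors in the raw score, while only the genuinely different final word remains after normalisation.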

Limitations

  • Trailing silence is mandatory. The model emits text with an internal delay; without ~5 s of trailing silence after the last spoken word, the final words are never emitted.
  • Domain prompt is mandatory. No prompt → only <pad> tokens.
  • Forced pad window is recommended. The trained model can occasionally emit a premature end-of-utterance token (\n / .) within its ASR-delay window and then refuse to emit any further text. Forcing <pad> for the first ~12 steps after the domain prefix avoids this (see the snippet above).
  • The English translation domain (asr_is_en) produces fairly fluent English but is not a faithful translator: expect paraphrasing and occasional hallucination.
  • Icelandic transcription (asr_is_is) is of decent quality but not fully reliable: expect obvious word-level mistakes.
  • Tested with transformers==5.9.0 and torch==2.11.0.