How to use kruatech/gigaam-v3-mlx with MLX:
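```shell
# Download the model from the Hub
# (the extra is quoted so that zsh, the default macOS shell, does not expand the brackets)
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir gigaam-v3-mlx kruatech/gigaam-v3-mlx
```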
GigaAM v3 · MLX Runtime Bundle
Offline Russian speech recognition for Apple devices.
Native MLX inference on iPhone, iPad, and Mac — no Python, no cloud, no server required at runtime.
Converted from ai-sage/GigaAM-v3 (e2e_rnnt revision).
No additional training or fine-tuning. The native runtime produces identical decoded text to the Python reference on validated test inputs.
Platforms
| Platform | Target | Status |
|---|---|---|
| iOS | iPhone | ✅ verified on iPhone 15 Pro Max |
| iPadOS | iPad | beta / needs verification |
| macOS | Apple Silicon Mac | ✅ verified |
Quality
Evaluated on Apple M1 Max, release build.
Normalization: lowercase, punctuation removed.
| Dataset | Type | Clips | WER | CER |
|---|---|---|---|---|
| Common Voice Scripted Speech 25.0 ru | read speech | 995 | 6.76% | 2.55% |
| Common Voice Spontaneous Speech 3.0 ru | spontaneous speech | 276 | 9.86% | 5.83% |
Spontaneous speech is the harder scenario: conversational style, filler words, and non-standard vocabulary. Read speech is expected to yield a lower WER, and the 6.76% on Scripted Speech versus 9.86% on Spontaneous Speech is consistent with that.
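The normalization and error rates above can be approximated along the following lines. This is a minimal sketch, not the evaluation script behind the table: the function names are hypothetical, and details such as ё/е handling or whether scores are pooled over the corpus or averaged per clip are not specified in this card.

```swift
// Hypothetical helpers; the card only states "lowercase, punctuation removed".
func normalizedWords(_ text: String) -> [String] {
    let cleaned = text.lowercased().filter { !$0.isPunctuation }
    return cleaned.split(whereSeparator: { $0.isWhitespace }).map { String($0) }
}

// Levenshtein distance: substitutions, insertions, and deletions all cost 1.
func editDistance<T: Equatable>(_ reference: [T], _ hypothesis: [T]) -> Int {
    var previous = Array(0...hypothesis.count)
    for (i, r) in reference.enumerated() {
        var current = [i + 1] + [Int](repeating: 0, count: hypothesis.count)
        for (j, h) in hypothesis.enumerated() {
            let substitution = previous[j] + (r == h ? 0 : 1)
            current[j + 1] = min(substitution, previous[j + 1] + 1, current[j] + 1)
        }
        previous = current
    }
    return previous[hypothesis.count]
}

// Word error rate: word-level edit distance divided by reference length.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = normalizedWords(reference)
    let hyp = normalizedWords(hypothesis)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}

// Character error rate: same idea over the characters of the normalized text.
func characterErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = Array(normalizedWords(reference).joined(separator: " "))
    let hyp = Array(normalizedWords(hypothesis).joined(separator: " "))
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}
```

With this normalization, `"Привет, мир!"` and `"привет мир"` compare as identical, so differences in casing and punctuation do not count as errors.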
Performance · Apple M1 Max
Short audio (~6 s)
| Runtime | Total time | RTF | Speed |
|---|---|---|---|
| Native Apple MLX | ~0.168 s | ~0.028 | ~35.8× realtime |
| Python reference | ~0.701 s | ~0.117 | ~8.6× realtime |
The native runtime is approximately 4.2× faster than the Python reference on the same hardware.
Long-form (911 s · 15 min 11 s)
| Metric | Value |
|---|---|
| Chunk size | 20.0 s |
| Chunk count | 46 |
| Total transcription time | 24.2 s |
| Real-time factor | 0.027 |
| Speed | ~37.6× realtime |
| Peak resident memory | ~1.15 GB |
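The real-time factor and speed figures follow directly from the two durations: RTF is processing time divided by audio duration, and speed is its reciprocal. A small sketch of that arithmetic, with hypothetical type and function names:

```swift
struct Throughput {
    let realTimeFactor: Double  // processing time / audio duration
    let speedup: Double         // audio duration / processing time
}

func throughput(processingSeconds: Double, audioSeconds: Double) -> Throughput {
    precondition(processingSeconds > 0 && audioSeconds > 0, "durations must be positive")
    let rtf = processingSeconds / audioSeconds
    return Throughput(realTimeFactor: rtf, speedup: 1 / rtf)
}

// Long-form figures from the table above: 24.2 s to transcribe 911 s of audio.
let longForm = throughput(processingSeconds: 24.2, audioSeconds: 911)
// longForm.realTimeFactor rounds to 0.027 and longForm.speedup to 37.6.
```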
Stage breakdown (long-form)
| Stage | Time | Share |
|---|---|---|
| Audio load | 0.019 s | ~0.1% |
| Mel frontend | 5.993 s | ~24.8% |
| Encoder | 5.463 s | ~22.6% |
| RNNT greedy decoding | 12.736 s | ~52.6% |
| — RNNT decoder | 1.821 s | ~7.5% |
| — RNNT joint | 10.680 s | ~44.1% |
| — RNNT readback | 0.169 s | ~0.7% |
| Tokenizer | 0.003 s | ~0.0% |
The main bottleneck is currently the RNNT joint network, which accounts for ~44% of total transcription time during greedy decoding.
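To see why the joint network dominates, it helps to look at the shape of a greedy transducer decoding loop. The sketch below is the textbook algorithm, not the code of this runtime: `predict`, `joint`, `State`, and the `maxSymbolsPerFrame` safeguard are all assumptions introduced for illustration.

```swift
// Greedy transducer decoding over hypothetical `predict` and `joint` closures.
// `State` stands in for whatever recurrent state the prediction network carries.
func greedyRNNTDecode<State>(
    encoderFrames: [[Float]],
    blankID: Int,
    maxSymbolsPerFrame: Int,
    initialState: State,
    predict: (Int?, State) -> ([Float], State),
    joint: ([Float], [Float]) -> [Float]
) -> [Int] {
    var tokens: [Int] = []
    // Prime the prediction network with "no previous token".
    var (predictorOutput, state) = predict(nil, initialState)

    for frame in encoderFrames {
        var emitted = 0
        while emitted < maxSymbolsPerFrame {
            // One joint evaluation per (frame, step) pair.
            let logits = joint(frame, predictorOutput)
            guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }),
                  best != blankID else { break }  // blank: move on to the next frame
            tokens.append(best)
            // The prediction network only runs after a non-blank emission.
            let next = predict(best, state)
            predictorOutput = next.0
            state = next.1
            emitted += 1
        }
    }
    return tokens
}
```

In this formulation the joint runs at least once per encoder frame and once more for every emitted token, while the prediction network runs only once per emitted token. That is in line with the joint taking a much larger share of decoding time than the decoder in the table above.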
Benchmark · 276 clips (CV Spontaneous Speech 3.0)
| Mode | Time / clip | RTF avg | RTF median | Peak RSS |
|---|---|---|---|---|
| Cold (per-process) | 1.87 s | 0.16 | 0.12 | 992 MB |
| Warm (batch, model in memory) | 0.49 s | 0.05 | 0.03 | 981 MB |
Cold — each clip starts a new process including model load (~1.4 s overhead).
Warm — model loaded once, all clips processed sequentially in a single process.
Memory
| Scenario | Peak RSS |
|---|---|
| Native MLX · short audio | ~1.10 GB |
| Native MLX · long-form audio | ~1.15 GB |
| Native MLX · benchmark (276 clips) | ~992 MB |
| Python reference · short audio | ~1.76 GB |
Runtime Pipeline
Audio file / PCM samples
→ native audio loader
→ 16 kHz mono Float32 PCM
→ mel spectrogram frontend
→ Conformer encoder
→ RNNT greedy decoder
→ SentencePiece tokenizer
→ text
All preprocessing assets (hann_window, mel_filterbank) are bundled so native runtimes
can reproduce the original pipeline exactly without any Python audio libraries.
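A native runtime might read the bundled preprocessing assets as follows. This is a sketch under a stated assumption: the `.f32.bin` files are taken to be headerless little-endian Float32 arrays, which the file names suggest but this card does not state. `decodeFloat32Array` and `AssetError` are hypothetical names.

```swift
import Foundation

enum AssetError: Error {
    case sizeNotMultipleOfFour(Int)
}

// Assumption: the data is a headerless, little-endian Float32 array.
// Apple arm64 is little-endian, so the bytes are copied without swapping.
func decodeFloat32Array(_ data: Data) throws -> [Float] {
    let width = MemoryLayout<Float>.size
    guard data.count % width == 0 else {
        throw AssetError.sizeNotMultipleOfFour(data.count)
    }
    var values = [Float](repeating: 0, count: data.count / width)
    _ = values.withUnsafeMutableBufferPointer { data.copyBytes(to: $0) }
    return values
}
```

A caller would pass `try Data(contentsOf: url)` for `hann_window.f32.bin` and then check the element count against the frontend configuration (for example, 320 values if the file holds exactly one window) before using it.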
Requirements
This bundle is designed for native Apple platform development.
It is not a Python package — there is nothing to pip install.
Native runtime (target)
| Requirement | Value |
|---|---|
| Platform | iOS, iPadOS, and macOS — any version supported by MLX Swift |
| Architecture | arm64 Apple devices |
| Framework | Apple MLX |
| Language | Swift |
| Runtime deps | MLX Swift runtime and bundled model assets |
Swift SDK and CLI: kruatech/gigaam-v3-mlx
To use this bundle in a native app, add the MLX Swift package to your Xcode project:
https://github.com/ml-explore/mlx-swift
Repository Files
README.md
.gitattributes
manifest.json
checksums.sha256
weights.fp16.safetensors
tokenizer.model
tokenizer_vocab.json
hann_window.f32.bin
mel_filterbank_mel_freq.f32.bin
| File | Description |
|---|---|
| weights.fp16.safetensors | FP16 model weights (MLX-compatible) |
| tokenizer.model | SentencePiece tokenizer model |
| tokenizer_vocab.json | Vocabulary export for native tokenizer implementations |
| manifest.json | Runtime manifest: model, frontend, and decoding metadata |
| hann_window.f32.bin | Hann window for mel frontend |
| mel_filterbank_mel_freq.f32.bin | Mel filterbank for mel frontend |
| checksums.sha256 | SHA-256 checksums for integrity verification |
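Integrity verification against `checksums.sha256` could look roughly like this on Apple platforms. The sketch assumes each line of the file has the usual `shasum` layout (`<hex digest>  <file name>`), which this card does not confirm. `sha256Hex`, `parseChecksums`, and `verifyBundle` are hypothetical names.

```swift
import CryptoKit
import Foundation

// Lowercase hex SHA-256 digest of a block of data.
func sha256Hex(_ data: Data) -> String {
    SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

// Assumption: each line looks like "<64 hex chars>  <file name>".
func parseChecksums(_ text: String) -> [String: String] {
    var expected: [String: String] = [:]
    for line in text.split(whereSeparator: { $0.isNewline }) {
        let parts = line.split(separator: " ", maxSplits: 1, omittingEmptySubsequences: true)
        guard parts.count == 2 else { continue }
        // Drop the padding space and the optional "*" binary-mode marker.
        let name = parts[1].trimmingCharacters(in: CharacterSet(charactersIn: " *"))
        expected[name] = parts[0].lowercased()
    }
    return expected
}

// Returns the names of files whose digest differs from the listed one.
func verifyBundle(at directory: URL) throws -> [String] {
    let listingURL = directory.appendingPathComponent("checksums.sha256")
    let listing = try String(contentsOf: listingURL, encoding: .utf8)
    var mismatched: [String] = []
    for (name, digest) in parseChecksums(listing) {
        let fileURL = directory.appendingPathComponent(name)
        // Memory-map large files such as the weights instead of reading them eagerly.
        let data = try Data(contentsOf: fileURL, options: .mappedIfSafe)
        if sha256Hex(data) != digest { mismatched.append(name) }
    }
    return mismatched.sorted()
}
```

An empty result from `verifyBundle(at:)` means every listed file matched its digest.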
Architecture
| Component | Value |
|---|---|
| Encoder type | Conformer |
| Encoder layers | 16 |
| Model dimension | 768 |
| Attention heads | 16 |
| Attention type | Rotary self-attention |
| Convolution kernel size | 5 |
| Subsampling | Conv1D · factor 4 |
| Prediction network | RNNT predictor |
| Joint network | RNNT joint |
| Decoding | Greedy RNNT |
| Tokenizer | SentencePiece |
| Vocabulary size | 1024 |
| Blank ID | 1024 |
| Output classes | 1025 |
| Precision | FP16 |
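With 1024 vocabulary entries and blank at ID 1024, turning decoded IDs into text can be sketched as below. The layout of `tokenizer_vocab.json` is not described in this card, so the `pieces` array (piece string indexed by token ID) is an assumption, as is the function name. The `▁` (U+2581) word-boundary marker is the standard SentencePiece convention. Control and byte-fallback pieces, if the model uses any, are not handled.

```swift
import Foundation

// Assumption: `pieces[id]` is the SentencePiece piece for token `id`.
func detokenize(_ ids: [Int], pieces: [String], blankID: Int) -> String {
    let text = ids
        .filter { $0 != blankID && pieces.indices.contains($0) }  // drop blank and out-of-range IDs
        .map { pieces[$0] }
        .joined()
        .replacingOccurrences(of: "\u{2581}", with: " ")          // word-boundary marker to space
    return text.trimmingCharacters(in: .whitespaces)
}
```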
Frontend Configuration
| Parameter | Value |
|---|---|
| Sample rate | 16 000 Hz |
| Channels | 1 (mono) |
| Mel bins | 64 |
| FFT size | 320 |
| Window length | 320 |
| Hop length | 160 |
| Center | false |
| Mel scale | HTK |
| Mel normalization | none |
| Power | 2.0 |
Effective feature hop: 10 ms before encoder subsampling.
Encoder subsampling factor: 4 → one encoder frame ≈ 40 ms of audio.
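The frame arithmetic implied by this configuration can be checked with a short sketch. The struct and function names are hypothetical; the formula is the usual frame count for a non-centered STFT.

```swift
struct Frontend {
    var sampleRate = 16_000
    var windowLength = 320
    var hopLength = 160
    var subsampling = 4
}

// With center = false, only windows that fit entirely inside the signal are used.
func melFrameCount(samples: Int, frontend: Frontend = Frontend()) -> Int {
    guard samples >= frontend.windowLength else { return 0 }
    return (samples - frontend.windowLength) / frontend.hopLength + 1
}

// Audio duration covered by one encoder frame, in milliseconds.
func encoderFrameMilliseconds(frontend: Frontend = Frontend()) -> Double {
    Double(frontend.hopLength * frontend.subsampling * 1000) / Double(frontend.sampleRate)
}
```

One second of audio (16 000 samples) gives 99 mel frames, which is consistent with the `feature_shape [64, 99]` reported under Validation. The exact number of encoder frames after subsampling depends on the convolution padding, which this card does not specify.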
Long-Form Transcription
Long audio is processed in chunks to keep memory usage bounded and inference latency predictable.
Recommended settings:
chunk_seconds: 20
overlap_seconds: 0–2
sample_rate: 16000
channels: mono
Future runtimes may add VAD segmentation, overlap-aware merging, and timestamp-aware chunking for improved accuracy on long-form content.
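A host runtime could plan chunks from the recommended settings roughly as follows. This is a sketch with hypothetical names; it only computes sample ranges and leaves out how text from overlapping chunks would be merged.

```swift
// Splits `totalSamples` into fixed-length chunks.
// Consecutive chunks share `overlapSeconds` of audio.
func chunkRanges(totalSamples: Int, sampleRate: Int = 16_000,
                 chunkSeconds: Double = 20, overlapSeconds: Double = 0) -> [Range<Int>] {
    let chunk = Int(chunkSeconds * Double(sampleRate))
    let step = chunk - Int(overlapSeconds * Double(sampleRate))
    precondition(chunk > 0 && step > 0, "overlap must be shorter than the chunk")
    var ranges: [Range<Int>] = []
    var start = 0
    while start < totalSamples {
        let end = min(start + chunk, totalSamples)
        ranges.append(start..<end)
        if end == totalSamples { break }  // the last chunk may be shorter
        start += step
    }
    return ranges
}

// 911 s of 16 kHz audio in 20 s chunks without overlap.
let plan = chunkRanges(totalSamples: 911 * 16_000)
// plan.count == 46, the chunk count reported in the long-form benchmark.
```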
Validation
The conversion was validated against the original PyTorch/Hugging Face model using tensor-level golden references at each stage of the pipeline.
Validated components:
- Audio frontend · mel spectrogram
- Pre-encoder
- Conformer feed-forward blocks
- Rotary self-attention
- Conformer convolution block
- Full Conformer layer · encoder stack
- RNNT predictor · RNNT joint network
- RNNT greedy decoding
- SentencePiece tokenizer
- Full WAV-to-text pipeline (end-to-end)
Selected numerical results:
Mel frontend
feature_shape [64, 99]
max_abs_diff 0.0004234314
mean_abs_diff 2.8040542e-05
Encoder stack
stack_max_abs_diff 2.5629997e-06
stack_mean_abs_diff 3.8420205e-07
Full encoder
output_shape [1, 768, 25]
max_abs_diff 2.682209e-06
mean_abs_diff 4.0401252e-07
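The `max_abs_diff` and `mean_abs_diff` figures are element-wise comparisons between a native tensor and its golden reference. A minimal sketch of that metric over flattened arrays, with a hypothetical function name (the actual validation tooling is not part of this bundle):

```swift
// Element-wise absolute difference between two flattened tensors.
func absDiffStats(_ a: [Float], _ b: [Float]) -> (max: Float, mean: Float) {
    precondition(a.count == b.count && !a.isEmpty, "tensors must have the same non-zero size")
    var maxDiff: Float = 0
    var sum: Double = 0  // accumulate in Double to limit rounding error
    for (x, y) in zip(a, b) {
        let d = abs(x - y)
        maxDiff = max(maxDiff, d)
        sum += Double(d)
    }
    return (maxDiff, Float(sum / Double(a.count)))
}
```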
End-to-end: the native runtime and the Python reference produce identical decoded text for the same input audio.
Limitations
- Optimized for native Apple MLX runtimes; not intended for server or Python-based inference.
- Long audio should be processed in chunks by the host runtime.
- Word-level timestamps are not included in the bundle.
- iPadOS not yet verified on device.
- This repository contains model assets only — no application code, no Swift SDK source.
License
MIT — follows the license of the original ai-sage/GigaAM-v3 model.
Citation & Attribution
If you use this bundle, please also cite the original GigaAM model:
@misc{gigaam-v3,
author = {ai-sage},
title = {GigaAM-v3},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/ai-sage/GigaAM-v3}}
}