GigaAM v3 · MLX Runtime Bundle

Offline Russian speech recognition for Apple devices.
Native MLX inference on iPhone, iPad, and Mac — no Python, no cloud, no server required at runtime.

Converted from ai-sage/GigaAM-v3 (e2e_rnnt revision).
No additional training or fine-tuning was performed. On validated test inputs, the native runtime produces the same decoded text as the Python reference.


Platforms

| Platform | Target | Status |
|---|---|---|
| iOS | iPhone | ✅ verified on iPhone 15 Pro Max |
| iPadOS | iPad | beta / needs verification |
| macOS | Apple Silicon Mac | ✅ verified |

Quality

Evaluated on Apple M1 Max, release build.
Normalization: lowercase, punctuation removed.

| Dataset | Type | Clips | WER | CER |
|---|---|---|---|---|
| Common Voice Scripted Speech 25.0 ru | read speech | 995 | 6.76% | 2.55% |
| Common Voice Spontaneous Speech 3.0 ru | spontaneous speech | 276 | 9.86% | 5.83% |

Spontaneous speech is the harder scenario: conversational style, filler words, and non-standard vocabulary. WER is therefore expected to be lower on read speech, and the results bear this out (6.76% on Scripted Speech versus 9.86% on Spontaneous Speech).
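
For readers who want to score their own transcripts in a comparable way, the following is a minimal sketch of WER with the normalization described above (lowercase, punctuation removed). The function names are illustrative, and two details are assumptions: punctuation is deleted rather than replaced by a space, and errors are pooled over the whole corpus rather than averaged per clip. The evaluation behind the table may differ on either point.

```swift
import Foundation

/// Lowercases the text, drops Unicode punctuation, and splits on whitespace.
/// Assumption: punctuation is deleted outright (not replaced by a space).
func normalizedWords(_ text: String) -> [String] {
    var scalars = String.UnicodeScalarView()
    for scalar in text.lowercased().unicodeScalars
    where !CharacterSet.punctuationCharacters.contains(scalar) {
        scalars.append(scalar)
    }
    return String(scalars).split(whereSeparator: { $0.isWhitespace }).map(String.init)
}

/// Levenshtein distance (substitutions, insertions, deletions) between two sequences.
func editDistance<T: Equatable>(_ reference: [T], _ hypothesis: [T]) -> Int {
    var previous = Array(0...hypothesis.count)
    for (i, r) in reference.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: hypothesis.count)
        for (j, h) in hypothesis.enumerated() {
            let substitution = previous[j] + (r == h ? 0 : 1)
            current[j + 1] = min(substitution, previous[j + 1] + 1, current[j] + 1)
        }
        previous = current
    }
    return previous[hypothesis.count]
}

/// Corpus-level WER: total word errors divided by total reference words.
func wordErrorRate(pairs: [(reference: String, hypothesis: String)]) -> Double {
    var errors = 0
    var referenceWords = 0
    for pair in pairs {
        let reference = normalizedWords(pair.reference)
        errors += editDistance(reference, normalizedWords(pair.hypothesis))
        referenceWords += reference.count
    }
    return referenceWords == 0 ? 0 : Double(errors) / Double(referenceWords)
}
```

CER can be computed the same way by passing `Array(text)` (characters) to `editDistance` instead of word arrays.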


Performance · Apple M1 Max

Short audio (~6 s)

| Runtime | Total time | RTF | Speed |
|---|---|---|---|
| Native Apple MLX | ~0.168 s | ~0.028 | ~35.8× realtime |
| Python reference | ~0.701 s | ~0.117 | ~8.6× realtime |

The native runtime is approximately 4.2× faster than the Python reference on the same hardware.
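
As a quick check on how these figures relate, here is a small sketch that derives the real-time factor (processing time divided by audio duration) and the speedup ratio from the timings in the table. The clip length is only given as "~6 s", so the value 6.0 below is an approximation, and the helper name is illustrative.

```swift
import Foundation

/// Real-time factor: processing time divided by audio duration (lower is faster).
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    processingSeconds / audioSeconds
}

// Timings from the short-audio table. The audio length is approximate.
let audioSeconds = 6.0
let nativeSeconds = 0.168
let pythonSeconds = 0.701

let nativeRTF = realTimeFactor(processingSeconds: nativeSeconds, audioSeconds: audioSeconds)
let pythonRTF = realTimeFactor(processingSeconds: pythonSeconds, audioSeconds: audioSeconds)
let speedup = pythonSeconds / nativeSeconds

print(String(format: "native RTF %.3f, Python RTF %.3f, speedup %.1fx",
             nativeRTF, pythonRTF, speedup))
```

With the rounded inputs this gives RTF values of about 0.028 and 0.117 and a ratio of about 4.2, in line with the table.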

Long-form (911 s · 15 min 11 s)

| Metric | Value |
|---|---|
| Chunk size | 20.0 s |
| Chunk count | 46 |
| Total transcription time | 24.2 s |
| Real-time factor | 0.027 |
| Speed | ~37.6× realtime |
| Peak resident memory | ~1.15 GB |

Stage breakdown (long-form)

| Stage | Time | Share |
|---|---|---|
| Audio load | 0.019 s | ~0.1% |
| Mel frontend | 5.993 s | ~24.8% |
| Encoder | 5.463 s | ~22.6% |
| RNNT greedy decoding | 12.736 s | ~52.6% |
| ↳ RNNT decoder | 1.821 s | ~7.5% |
| ↳ RNNT joint | 10.680 s | ~44.1% |
| ↳ RNNT readback | 0.169 s | ~0.7% |
| Tokenizer | 0.003 s | ~0.0% |

The current main bottleneck is the RNNT joint network during greedy decoding.
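
To illustrate why the joint network dominates, below is a minimal sketch of a standard greedy RNNT decoding loop. It is not the bundle's implementation: `GreedyRNNT`, the closure signatures, and `maxSymbolsPerFrame` are hypothetical stand-ins, and the toy networks at the bottom exist only to make the example runnable. The point is structural: the joint is evaluated once per encoder frame plus once per emitted token, while the predictor runs only when a token is emitted.

```swift
/// Index of the largest value (first one wins on ties).
func argmax(_ values: [Float]) -> Int {
    var best = 0
    for i in values.indices where values[i] > values[best] { best = i }
    return best
}

/// Hypothetical greedy RNNT decoder; the networks are injected as closures.
struct GreedyRNNT<State> {
    let blankID: Int
    let maxSymbolsPerFrame: Int  // assumed safety cap, not taken from the bundle
    let initialState: State
    /// Predictor: last emitted token (nil at start) and state -> output and new state.
    let predict: (Int?, State) -> ([Float], State)
    /// Joint: encoder frame and predictor output -> logits over tokens plus blank.
    let joint: ([Float], [Float]) -> [Float]

    func decode(encoderFrames: [[Float]]) -> (tokens: [Int], jointCalls: Int) {
        var tokens: [Int] = []
        var jointCalls = 0
        var (predictorOutput, state) = predict(nil, initialState)
        for frame in encoderFrames {
            var emitted = 0
            while emitted < maxSymbolsPerFrame {
                let best = argmax(joint(frame, predictorOutput))
                jointCalls += 1
                if best == blankID { break }  // blank: advance to the next frame
                tokens.append(best)
                emitted += 1
                (predictorOutput, state) = predict(best, state)
            }
        }
        return (tokens, jointCalls)
    }
}

// Toy networks: 3 tokens plus blank (ID 3). Each frame carries a target ID;
// the joint emits it unless it equals the last emitted token.
let toy = GreedyRNNT<Int>(
    blankID: 3,
    maxSymbolsPerFrame: 4,
    initialState: 0,
    predict: { token, calls in ([Float(token ?? -1)], calls + 1) },
    joint: { frame, predictorOutput in
        let target = Int(frame[0])
        var logits = [Float](repeating: 0, count: 4)
        logits[target == Int(predictorOutput[0]) ? 3 : target] = 1
        return logits
    }
)
let result = toy.decode(encoderFrames: [[0], [0], [3], [2]])
print(result.tokens, result.jointCalls)
```

In this toy run, four frames and two emitted tokens lead to six joint evaluations. For this model the blank ID would be 1024, as listed under Architecture.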

Benchmark · 276 clips (CV Spontaneous Speech 3.0)

| Mode | Time / clip | RTF avg | RTF median | Peak RSS |
|---|---|---|---|---|
| Cold (per-process) | 1.87 s | 0.16 | 0.12 | 992 MB |
| Warm (batch, model in memory) | 0.49 s | 0.05 | 0.03 | 981 MB |

Cold — each clip starts a new process including model load (~1.4 s overhead).
Warm — model loaded once, all clips processed sequentially in a single process.

Memory

| Scenario | Peak RSS |
|---|---|
| Native MLX · short audio | ~1.10 GB |
| Native MLX · long-form audio | ~1.15 GB |
| Native MLX · benchmark (276 clips) | ~992 MB |
| Python reference · short audio | ~1.76 GB |

Runtime Pipeline

Audio file / PCM samples
  → native audio loader
  → 16 kHz mono Float32 PCM
  → mel spectrogram frontend
  → Conformer encoder
  → RNNT greedy decoder
  → SentencePiece tokenizer
  → text

All preprocessing assets (hann_window, mel_filterbank) are bundled so native runtimes can reproduce the original pipeline exactly without any Python audio libraries.
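
A minimal sketch of reading one of these assets in Swift follows. It assumes the `.f32.bin` files are raw, headerless, little-endian Float32 arrays, which the file names suggest but this README does not state. The function name is illustrative, and array shapes should be taken from `manifest.json` rather than guessed.

```swift
import Foundation

/// Reads a file of packed Float32 values into an array.
/// Assumption: raw little-endian samples with no header (native byte order on arm64).
func loadFloat32Array(from url: URL) throws -> [Float] {
    let data = try Data(contentsOf: url)
    let stride = MemoryLayout<Float>.size
    precondition(data.count % stride == 0, "file size is not a multiple of 4 bytes")
    var values = [Float](repeating: 0, count: data.count / stride)
    // Copy the raw bytes into the array's storage; this avoids alignment issues.
    _ = values.withUnsafeMutableBufferPointer { data.copyBytes(to: $0) }
    return values
}
```

Under that assumption, `hann_window.f32.bin` would be expected to contain 320 values, matching the window length in the frontend configuration; a differing count would indicate that the layout assumption is wrong.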


Requirements

This bundle is designed for native Apple platform development.
It is not a Python package — there is nothing to pip install.

Native runtime (target)

| Requirement | Value |
|---|---|
| Platform | iOS, iPadOS, and macOS (any version supported by MLX Swift) |
| Architecture | arm64 Apple devices |
| Framework | Apple MLX |
| Language | Swift |
| Runtime deps | MLX Swift runtime and bundled model assets |

Swift SDK and CLI: kruatech/gigaam-v3-mlx

To use this bundle in a native app, add the MLX Swift package to your Xcode project:

https://github.com/ml-explore/mlx-swift

Repository Files

README.md
.gitattributes
manifest.json
checksums.sha256
weights.fp16.safetensors
tokenizer.model
tokenizer_vocab.json
hann_window.f32.bin
mel_filterbank_mel_freq.f32.bin

| File | Description |
|---|---|
| weights.fp16.safetensors | FP16 model weights (MLX-compatible) |
| tokenizer.model | SentencePiece tokenizer model |
| tokenizer_vocab.json | Vocabulary export for native tokenizer implementations |
| manifest.json | Runtime manifest: model, frontend, and decoding metadata |
| hann_window.f32.bin | Hann window for mel frontend |
| mel_filterbank_mel_freq.f32.bin | Mel filterbank for mel frontend |
| checksums.sha256 | SHA-256 checksums for integrity verification |
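
The integrity check can be sketched in Swift with CryptoKit as shown below. The sketch assumes `checksums.sha256` uses the common `sha256sum` layout (one `<hex digest>  <file name>` pair per line); this README does not specify the format, and the function names are illustrative.

```swift
import CryptoKit
import Foundation

/// Lowercase hex SHA-256 digest of the given data.
func sha256Hex(_ data: Data) -> String {
    SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

/// Parses "<hex digest>  <file name>" lines into [file name: digest].
/// Assumption: sha256sum-style layout; malformed lines are skipped.
func parseChecksums(_ text: String) -> [String: String] {
    var expected: [String: String] = [:]
    for line in text.split(whereSeparator: \.isNewline) {
        let parts = line.split(separator: " ", maxSplits: 1, omittingEmptySubsequences: true)
        guard parts.count == 2 else { continue }
        // Strip the padding space and the optional "*" binary-mode marker.
        let name = parts[1].trimmingCharacters(in: CharacterSet(charactersIn: " *"))
        expected[name] = parts[0].lowercased()
    }
    return expected
}

/// Returns the names of files whose digest does not match the listing.
func verifyBundle(at directory: URL) throws -> [String] {
    let listing = try String(
        contentsOf: directory.appendingPathComponent("checksums.sha256"), encoding: .utf8)
    var mismatched: [String] = []
    for (name, digest) in parseChecksums(listing) {
        let fileURL = directory.appendingPathComponent(name)
        let data = try Data(contentsOf: fileURL, options: .mappedIfSafe)
        if sha256Hex(data) != digest { mismatched.append(name) }
    }
    return mismatched.sorted()
}
```

An empty result from `verifyBundle(at:)` means every listed file matched. Memory-mapping the weights file keeps the check from loading it fully into RAM.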

Architecture

| Component | Value |
|---|---|
| Encoder type | Conformer |
| Encoder layers | 16 |
| Model dimension | 768 |
| Attention heads | 16 |
| Attention type | Rotary self-attention |
| Convolution kernel size | 5 |
| Subsampling | Conv1D · factor 4 |
| Prediction network | RNNT predictor |
| Joint network | RNNT joint |
| Decoding | Greedy RNNT |
| Tokenizer | SentencePiece |
| Vocabulary size | 1024 |
| Blank ID | 1024 |
| Output classes | 1025 |
| Precision | FP16 |

Frontend Configuration

| Parameter | Value |
|---|---|
| Sample rate | 16 000 Hz |
| Channels | 1 (mono) |
| Mel bins | 64 |
| FFT size | 320 |
| Window length | 320 |
| Hop length | 160 |
| Center | false |
| Mel scale | HTK |
| Mel normalization | none |
| Power | 2.0 |

Effective feature hop: 10 ms before encoder subsampling.
Encoder subsampling factor: 4 → one encoder frame ≈ 40 ms of audio.
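
The frame arithmetic can be sketched as follows. With `center = false`, only windows that fit entirely inside the signal are counted. The encoder frame count assumes ceiling division by the subsampling factor; the exact value depends on the convolution padding, which this README does not specify. Function names are illustrative.

```swift
// Values from the frontend configuration table.
let windowLength = 320
let hopLength = 160
let subsamplingFactor = 4

/// Mel frames for an unpadded (center = false) STFT.
func melFrameCount(sampleCount: Int) -> Int {
    sampleCount < windowLength ? 0 : 1 + (sampleCount - windowLength) / hopLength
}

/// Encoder frames after subsampling.
/// Assumption: ceiling division; the real count depends on conv padding.
func encoderFrameCount(melFrames: Int) -> Int {
    (melFrames + subsamplingFactor - 1) / subsamplingFactor
}

let oneSecond = melFrameCount(sampleCount: 16_000)        // 1 s of 16 kHz audio
let oneSecondEncoder = encoderFrameCount(melFrames: oneSecond)
let chunkFrames = melFrameCount(sampleCount: 20 * 16_000) // one 20 s chunk
print(oneSecond, oneSecondEncoder, chunkFrames)
```

For 16 000 samples this gives 99 mel frames and 25 encoder frames, which agrees with the `[64, 99]` and `[1, 768, 25]` shapes reported in the Validation section.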


Long-Form Transcription

Long audio is processed in chunks to keep memory usage bounded and inference latency predictable.

Recommended settings:

chunk_seconds: 20
overlap_seconds: 0–2
sample_rate: 16000
channels: mono
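
A host runtime could plan chunk boundaries along these lines. This is a sketch under the settings above, not the SDK's chunker: `Chunk` and `planChunks` are hypothetical names, and merging the text of overlapping chunks is left out.

```swift
/// A half-open time range [start, end) in seconds.
struct Chunk: Equatable {
    let start: Double
    let end: Double
}

/// Splits audio of the given length into fixed-size chunks with optional overlap.
/// The last chunk is shortened to end exactly at the end of the audio.
func planChunks(totalSeconds: Double,
                chunkSeconds: Double = 20,
                overlapSeconds: Double = 0) -> [Chunk] {
    precondition(overlapSeconds >= 0 && chunkSeconds > overlapSeconds)
    let step = chunkSeconds - overlapSeconds
    var chunks: [Chunk] = []
    var start = 0.0
    while start < totalSeconds {
        let end = min(start + chunkSeconds, totalSeconds)
        chunks.append(Chunk(start: start, end: end))
        if end >= totalSeconds { break }
        start += step
    }
    return chunks
}

let plan = planChunks(totalSeconds: 911)
print(plan.count, plan.last!)
```

With no overlap, the 911 s recording from the long-form benchmark splits into 46 chunks, the last one covering 900 s to 911 s. Multiply the boundaries by 16 000 to get sample offsets.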

Future runtimes may add VAD segmentation, overlap-aware merging, and timestamp-aware chunking for improved accuracy on long-form content.


Validation

The conversion was validated against the original PyTorch/Hugging Face model using tensor-level golden references at each stage of the pipeline.

Validated components:

  • Audio frontend · mel spectrogram
  • Pre-encoder
  • Conformer feed-forward blocks
  • Rotary self-attention
  • Conformer convolution block
  • Full Conformer layer · encoder stack
  • RNNT predictor · RNNT joint network
  • RNNT greedy decoding
  • SentencePiece tokenizer
  • Full WAV-to-text pipeline (end-to-end)

Selected numerical results:

Mel frontend
  feature_shape      [64, 99]
  max_abs_diff       0.0004234314
  mean_abs_diff      2.8040542e-05

Encoder stack
  stack_max_abs_diff   2.5629997e-06
  stack_mean_abs_diff  3.8420205e-07

Full encoder
  output_shape       [1, 768, 25]
  max_abs_diff       2.682209e-06
  mean_abs_diff      4.0401252e-07

End-to-end: the native runtime and the Python reference produce identical decoded text for the same input audio.


Limitations

  • Optimized for native Apple MLX runtimes; not intended for server or Python-based inference.
  • Long audio should be processed in chunks by the host runtime.
  • Word-level timestamps are not included in the bundle.
  • iPadOS not yet verified on device.
  • This repository contains model assets only — no application code, no Swift SDK source.

License

MIT — follows the license of the original ai-sage/GigaAM-v3 model.


Citation & Attribution

If you use this bundle, please also cite the original GigaAM model:

@misc{gigaam-v3,
  author       = {ai-sage},
  title        = {GigaAM-v3},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ai-sage/GigaAM-v3}}
}