This repo contains a llama.cpp-compatible implementation of https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602.
1. Architecture Overview
Voxtral Realtime 4B (model ID: Voxtral-Mini-4B-Realtime-2602) is a dual-stream multimodal speech-to-text model from Mistral AI. Unlike conventional encoder-decoder ASR models, it uses a streaming protocol in which audio and text embeddings are summed at every decoder position during inference.
Dual-Stream Inference Protocol
At each position pos, the decoder input is:
input[pos] = audio_embed[pos] + text_embed[token_at_pos]
- Prefix phase (positions 0..38):
  - `token_at_pos = prompt_ids[pos]`
  - `prompt_ids = [BOS] + [STREAMING_PAD] * 38`
  - The 38 pad tokens are 32 left-pad tokens + 6 delay tokens (480 ms transcription delay)
- Autoregressive phase (positions 39..n_audio-1):
  - `token_at_pos = previously_generated_token`
This is fundamentally different from standard multimodal models where audio/image embeddings are simply prepended to the text sequence.
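The position rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the token ids `BOS = 1` and `STREAMING_PAD = 32` are hypothetical placeholders (the real ids come from the tokenizer), and `token_at` / `decoder_input` are names invented here, not functions from the repository.

```python
BOS = 1             # hypothetical token id, for illustration only
STREAMING_PAD = 32  # hypothetical token id, for illustration only
N_LEFT_PAD = 32     # left-pad tokens
N_DELAY = 6         # delay tokens (6 * 80 ms = 480 ms)

# Prefix: BOS followed by 38 pad tokens, 39 positions in total.
PROMPT_IDS = [BOS] + [STREAMING_PAD] * (N_LEFT_PAD + N_DELAY)


def token_at(pos, generated):
    """Return the text token whose embedding is added at position pos.

    generated[k] is the token sampled from the decoder output at
    position len(PROMPT_IDS) - 1 + k.
    """
    if pos < len(PROMPT_IDS):
        return PROMPT_IDS[pos]                  # prefix phase
    return generated[pos - len(PROMPT_IDS)]     # autoregressive phase


def decoder_input(audio_embed, text_embed, pos, generated):
    """Element-wise sum of the audio and text embeddings at position pos."""
    tok = token_at(pos, generated)
    return [a + t for a, t in zip(audio_embed[pos], text_embed[tok])]
```

The key point is that the audio embedding is indexed by position while the text embedding is indexed by token id, so both streams advance in lockstep.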
2. Model Components
2.1 Mel Spectrogram Preprocessor
| Parameter | Value |
|---|---|
| Sample rate | 16,000 Hz |
| FFT size (n_fft) | 400 |
| Hop length | 160 |
| Window length | 400 (Hann, periodic) |
| Mel bins | 128 |
| Mel scale | Slaney (linear below 1 kHz, logarithmic above) |
| Normalization | Fixed max = 1.5 (GLOBAL_LOG_MEL_MAX) |
Key difference from Whisper: The normalization uses a fixed maximum value of 1.5 rather than a data-dependent maximum. The formula is:
log_spec = log10(max(mel_spec, 1e-10))
log_spec = max(log_spec, 1.5 - 8.0) # clamp floor at -6.5
log_spec = (log_spec + 4.0) / 4.0 # rescale: [-6.5, 1.5] maps to [-0.625, 1.375]
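As a minimal runnable sketch of the three steps above, the following function applies the fixed-max normalization to a flat list of mel power values. The function name `normalize_log_mel` is an assumption made for this example; the constants come from the table and formula above.

```python
import math

GLOBAL_LOG_MEL_MAX = 1.5  # fixed maximum, not derived from the input


def normalize_log_mel(mel_power):
    """Apply log10, clamp at (max - 8.0), then rescale with (x + 4) / 4."""
    out = []
    for v in mel_power:
        log_v = math.log10(max(v, 1e-10))             # avoid log10(0)
        log_v = max(log_v, GLOBAL_LOG_MEL_MAX - 8.0)  # floor at -6.5
        out.append((log_v + 4.0) / 4.0)
    return out
```

Silence (power 0) lands on the floor value -0.625, and a power of 1.0 maps to exactly 1.0. Because the maximum is a constant, the result for a frame does not depend on the rest of the audio, which suits streaming.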
Streaming padding: Before mel computation, the raw audio is padded:
- Left: 32 * 1280 = 40,960 zeros (32 left-pad tokens)
- Right: alignment + 17 * 1280 = 21,760 zeros (right-pad tokens)
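The padding sizes can be computed as below. This sketch assumes that one token covers 1,280 samples (80 ms at 16 kHz, consistent with 6 delay tokens = 480 ms) and, more speculatively, that "alignment" means rounding the raw audio up to a whole number of tokens. The source does not define the alignment term, so treat that line as an assumption.

```python
SAMPLES_PER_TOKEN = 1280   # 80 ms at 16 kHz
N_LEFT_PAD_TOKENS = 32
N_RIGHT_PAD_TOKENS = 17


def streaming_pad_sizes(n_samples):
    """Return (left, right) zero-padding lengths in samples."""
    left = N_LEFT_PAD_TOKENS * SAMPLES_PER_TOKEN
    # Assumption: alignment rounds the audio up to a multiple of one token.
    alignment = (-n_samples) % SAMPLES_PER_TOKEN
    right = alignment + N_RIGHT_PAD_TOKENS * SAMPLES_PER_TOKEN
    return left, right
```

Under that assumption, one second of audio (16,000 samples) gets 40,960 zeros on the left and 640 + 21,760 = 22,400 on the right, for a padded length of 79,360 samples (62 tokens).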
2.2 Causal Audio Encoder
The encoder is a 32-layer causal transformer with the following architecture:
| Parameter | Value |
|---|---|
| Dimension (n_embd) | 1,280 |
| Layers | 32 |
| Attention heads | 32 |
| Head dimension | 64 (note: 64 != 1280/32 = 40) |
| KV heads | 32 (no GQA) |
| FFN hidden dim | 5,120 |
| Sliding window | 750 |
| RoPE theta | 1,000,000 |
| Norm epsilon | 1e-5 |
| Activation | SwiGLU (SiLU gate) |
| Normalization | RMSNorm |
2.2.1 Causal Conv1d Stem
Two convolutional layers with causal (left-only) padding:
- Conv0: kernel=3, stride=1, in=128, out=1280
  - Left-pad by 2 (= kernel_size - stride)
  - Followed by GELU activation
- Conv1: kernel=3, stride=2, in=1280, out=1280
  - Left-pad by 1 (= kernel_size - stride)
  - Plus alignment padding on the right
  - Followed by GELU activation
  - 2x temporal downsampling
After convolution, the sequence is left-truncated to a multiple of the stack factor (4).
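To make the left-only padding concrete, here is a single-channel sketch of a causal 1-D convolution and the left truncation step. It is a simplification under stated assumptions: it ignores channels, bias, the GELU activation, and the right-side alignment padding, and the function names are invented for this example.

```python
def causal_conv1d(x, kernel, stride):
    """1-D convolution with left-only padding of (kernel_size - stride)."""
    k = len(kernel)
    padded = [0.0] * (k - stride) + list(x)   # no padding on the right
    out = []
    for start in range(0, len(padded) - k + 1, stride):
        out.append(sum(w * padded[start + i] for i, w in enumerate(kernel)))
    return out


def left_truncate(frames, factor=4):
    """Drop leading frames so the length is a multiple of factor."""
    drop = len(frames) % factor
    return frames[drop:]
```

With stride 1 the output has the same length as the input and output `t` only sees inputs `t-2..t`. With stride 2 the length is roughly halved, matching the 2x temporal downsampling of Conv1.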
2.2.2 Attention with Sliding Window
Each attention layer uses:
- Q, K, V projections: Q and V have bias; K has no bias
- RoPE: Interleaved style (mode=0), theta=1M, applied to Q and K
- Causal + sliding window mask: Position j can attend to position i only if `i <= j` AND `i >= j - 749`
- Scale: `1/sqrt(64)` (using actual head_dim, not n_embd/n_head)
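The mask rule can be written out directly. The sketch below builds an additive mask (0 where attention is allowed, negative infinity elsewhere) with rows indexed by the query position j and columns by the key position i. The function name is hypothetical, and a small window is used in the usage note for readability.

```python
def sliding_window_mask(seq_len, window=750):
    """mask[j][i] is 0.0 if query j may attend to key i, else -inf."""
    neg_inf = float("-inf")
    return [
        [0.0 if (i <= j and i >= j - (window - 1)) else neg_inf
         for i in range(seq_len)]
        for j in range(seq_len)
    ]
```

For example, `sliding_window_mask(5, window=3)[4]` allows only keys 2, 3, and 4. With the default `window=750` the lower bound becomes `j - 749`, as stated above.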
2.2.3 SwiGLU FFN
gate = SiLU(x @ W1)
up = x @ W3
down = (gate * up) @ W2 + bias
Where W1 is the gate projection, W3 is the up projection, and W2 is the down projection (with bias).
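A dependency-free sketch of this FFN on plain Python lists follows. It is only meant to show the data flow; the weight layout (`w[i][j]` maps input `i` to output `j`, so that `matvec(x, w)` computes `x @ W`) and the helper names are assumptions of this example.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(x, w):
    """Compute x @ w for x of length n and w of shape [n][m]."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def swiglu_ffn(x, w1, w3, w2, bias):
    gate = [silu(v) for v in matvec(x, w1)]        # gate projection + SiLU
    up = matvec(x, w3)                             # up projection
    hidden = [g * u for g, u in zip(gate, up)]     # element-wise gating
    return [d + b for d, b in zip(matvec(hidden, w2), bias)]  # down + bias
```

Only the down projection carries a bias here, matching the description above.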
2.3 Frame Stacking and Adapter
After the encoder, frames are stacked 4x:
- Input: [seq_len, 1280]
- After stacking: [seq_len/4, 5120]
- After adapter MLP: [seq_len/4, 3072]
The adapter is: Linear(5120, 9216) -> GELU -> Linear(9216, 3072) (no bias).
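The stacking step is a reshape that merges each group of 4 consecutive frames into one wider frame. A minimal sketch, assuming the frames are concatenated along the feature axis in temporal order (the source does not state the ordering explicitly):

```python
def stack_frames(frames, factor=4):
    """[seq_len, dim] -> [seq_len / factor, dim * factor]."""
    assert len(frames) % factor == 0, "truncate to a multiple of factor first"
    return [
        [v for frame in frames[i:i + factor] for v in frame]
        for i in range(0, len(frames), factor)
    ]
```

With `dim = 1280` and `factor = 4` each stacked frame has 5,120 features, which is the adapter's input width.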
2.4 Text Decoder (Modified Llama)
The decoder is a standard 26-layer Llama model with one key addition:
2.4.1 Adaptive RMSNorm (ada_rms_norm_t_cond)
Each decoder layer has an adaptive normalization module that conditions the FFN on a time embedding derived from the transcription delay:
t_cond = sinusoidal_embedding(n_delay_tokens, dim=3072)
# Per layer:
ada_hidden = GELU(t_cond @ ada_down.T) # [3072] -> [32]
ada_scale = ada_hidden @ ada_up.T # [32] -> [3072]
ffn_input = ffn_norm(x) * (1 + ada_scale)
For the GGUF implementation, this is precomputed at conversion time for a fixed delay of 480ms (6 delay tokens).
The precomputed (1 + ada_scale) vector is stored as blk.{i}.ffn_ada_norm_up.weight and applied as a simple element-wise multiplication after RMSNorm.
2.4.2 Tied Embeddings
The output projection (lm_head) shares weights with the input embedding table (tok_embeddings.weight). The embedding table is stored at mm_streams_embeddings.embedding_module.tok_embeddings.weight in the original safetensors.
| Parameter | Value |
|---|---|
| Dimension | 3,072 |
| Layers | 26 |
| Attention heads | 32 |
| KV heads | 8 (GQA ratio = 4) |
| Head dimension | 128 |
| FFN hidden dim | 9,216 |
| Sliding window | 8,192 |
| RoPE theta | 1,000,000 |
| Vocab size | 131,072 |
| Ada norm dim | 32 |
3. GGUF Conversion
3.1 Conversion Pipeline
The model is converted from Mistral's safetensors format to two GGUF files:
- Text model (`voxtral-realtime-4b-text-f16.gguf`): Llama decoder + ada_norm tensors
- Multimodal projector (`voxtral-realtime-4b-mmproj-f16.gguf`): Encoder + adapter
3.2 Key Tensor Mappings
Encoder (mmproj GGUF)
| Safetensors Name | GGUF Name |
|---|---|
| `whisper_encoder.conv_layers.0.conv.weight` | `v.enc_conv1d.1.weight` |
| `whisper_encoder.conv_layers.0.conv.bias` | `v.enc_conv1d.1.bias` |
| `whisper_encoder.conv_layers.1.conv.weight` | `v.enc_conv1d.2.weight` |
| `whisper_encoder.conv_layers.1.conv.bias` | `v.enc_conv1d.2.bias` |
| `whisper_encoder.transformer.layers.{i}.attention.wq.weight` | `v.blk.{i}.attn_q.weight` |
| `whisper_encoder.transformer.layers.{i}.attention.wq.bias` | `v.blk.{i}.attn_q.bias` |
| `whisper_encoder.transformer.layers.{i}.attention.wk.weight` | `v.blk.{i}.attn_k.weight` |
| `whisper_encoder.transformer.layers.{i}.attention.wv.weight` | `v.blk.{i}.attn_v.weight` |
| `whisper_encoder.transformer.layers.{i}.attention.wv.bias` | `v.blk.{i}.attn_v.bias` |
| `whisper_encoder.transformer.layers.{i}.attention.wo.weight` | `v.blk.{i}.attn_out.weight` |
| `whisper_encoder.transformer.layers.{i}.attention.wo.bias` | `v.blk.{i}.attn_out.bias` |
| `whisper_encoder.transformer.layers.{i}.feed_forward.w1.weight` | `v.blk.{i}.ffn_gate.weight` |
| `whisper_encoder.transformer.layers.{i}.feed_forward.w2.weight` | `v.blk.{i}.ffn_down.weight` |
| `whisper_encoder.transformer.layers.{i}.feed_forward.w2.bias` | `v.blk.{i}.ffn_down.bias` |
| `whisper_encoder.transformer.layers.{i}.feed_forward.w3.weight` | `v.blk.{i}.ffn_up.weight` |
| `audio_language_projection.0.weight` | `v.mm_audio_mlp.1.weight` |
| `audio_language_projection.2.weight` | `v.mm_audio_mlp.2.weight` |
Decoder (text GGUF)
| Safetensors Name | GGUF Name |
|---|---|
| `tok_embeddings.weight` | `token_embd.weight` (tied with output) |
| `layers.{i}.ada_rms_norm_t_cond.0.weight` | (consumed during conversion) |
| `layers.{i}.ada_rms_norm_t_cond.2.weight` | `blk.{i}.ffn_ada_norm_up.weight` (precomputed) |
3.3 Ada Norm Precomputation
During GGUF conversion, the ada_rms_norm_t_cond weights are precomputed:
# For each layer:
t_cond = sinusoidal_embedding(6, dim=3072) # fixed 480ms delay
ada_hidden = GELU(linear(t_cond, ada_down)) # [3072] -> [32]
ada_scale = linear(ada_hidden, ada_up) # [32] -> [3072]
precomputed = (1.0 + ada_scale) # [3072]
# Stored as blk.{i}.ffn_ada_norm_up.weight
This eliminates the need for runtime time embedding computation and reduces the ada_norm to a single element-wise multiply.
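The precomputation can be illustrated with a small self-contained sketch. Everything here is a toy under stated assumptions: the layout of `sinusoidal_embedding` (cosine half followed by sine half, base 10000) is a guess and may differ from the real model, GELU is taken as the exact erf form, and the weights in the usage note are made up. Only the overall flow (embed the delay, down-project, GELU, up-project, add 1) follows the pseudocode above.

```python
import math


def gelu(v):
    """Exact (erf-based) GELU."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def sinusoidal_embedding(t, dim, theta=10000.0):
    # Assumption: cos half then sin half; the real layout may differ.
    half = dim // 2
    freqs = [math.exp(-math.log(theta) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


def linear(x, w):
    """Compute x @ w.T for w of shape [out][in]."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]


def precompute_ada_scale(n_delay_tokens, ada_down, ada_up, dim):
    """Return the (1 + ada_scale) vector for one layer."""
    t_cond = sinusoidal_embedding(n_delay_tokens, dim)
    hidden = [gelu(v) for v in linear(t_cond, ada_down)]
    return [1.0 + s for s in linear(hidden, ada_up)]
```

Because `n_delay_tokens` is the only runtime input and it is fixed at 6, the returned vector is a constant per layer and can be stored in the GGUF file. If `ada_up` were all zeros the vector would be all ones, which leaves `ffn_norm(x)` unchanged.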
4. Implementation Details
4.1 Files Modified/Created
New Files
| File | Purpose |
|---|---|
| `tools/mtmd/models/voxtral-realtime-enc.cpp` | Causal audio encoder graph builder |
| `tools/mtmd/voxtral-stream.h` | Streaming audio infrastructure (ring buffer, mic capture, mel) |
| `tools/mtmd/voxtral-stream.cpp` | Streaming infrastructure implementation |
| `tools/mtmd/voxtral-stream-cli.cpp` | Dedicated CLI with dual-stream inference protocol |
Modified Files
| File | Changes |
|---|---|
| `tools/mtmd/clip.cpp` | Added PROJECTOR_TYPE_VOXTRAL_REALTIME graph dispatch, input setup (positions, sliding window mask), patch count computation |
| `tools/mtmd/mtmd-audio.h` | Added mtmd_audio_preprocessor_voxtral_rt class |
| `tools/mtmd/mtmd-audio.cpp` | Implemented Voxtral RT preprocessor (center padding, streaming padding, fixed-max normalization); added fixed_max to filter_params |
| `tools/mtmd/mtmd.cpp` | Route PROJECTOR_TYPE_VOXTRAL_REALTIME to Voxtral RT preprocessor |
| `tools/mtmd/CMakeLists.txt` | Added llama-voxtral-stream build target |
| `src/models/llama.cpp` | Added adaptive RMSNorm application in FFN block |
| `src/llama-arch.h` | Added LLM_TENSOR_FFN_ADA_NORM_DOWN/UP enum values |
| `src/llama-arch.cpp` | Added tensor name mappings and info entries |
| `src/llama-model.h` | Added ffn_ada_norm_down/up to llama_layer struct |
| `src/llama-model.cpp` | Load ada_norm tensors for LLM_ARCH_LLAMA |
| `convert_hf_to_gguf.py` | Tied embeddings fix, ada_norm precomputation, Mistral format support |
| `gguf-py/gguf/constants.py` | Added FFN_ADA_NORM_DOWN/UP model tensor enums |
| `gguf-py/gguf/tensor_mapping.py` | Added ada_norm tensor name mappings |
4.2 Encoder Graph (voxtral-realtime-enc.cpp)
The encoder graph is built as a single ggml_cgraph with 1048 nodes:
- Causal Conv1d: Uses `ggml_pad_ext` for asymmetric left-padding + `ggml_conv_1d` with zero padding
- RoPE: `ggml_rope_ext` with mode=0 (interleaved), theta=1M
- Sliding Window Attention: Pre-computed `[seq_len, seq_len]` mask passed to `ggml_soft_max_ext`
- SwiGLU FFN: `build_ffn` with `FFN_SILU` and gate tensor
- Frame Stacking: `build_stack` with factor=4
- Adapter: `build_ffn` with `FFN_GELU_ERF`
4.3 Decoder Modifications (src/models/llama.cpp)
A single conditional block was added to the Llama FFN:
// After ffn_norm, before build_ffn:
if (model.layers[il].ffn_ada_norm_up) {
cur = ggml_mul(ctx0, cur, model.layers[il].ffn_ada_norm_up);
}
4.4 Audio Preprocessor (mtmd-audio.cpp)
The mtmd_audio_preprocessor_voxtral_rt class:
- Applies streaming padding (left=40,960, right=alignment+21,760 zeros)
- Computes mel spectrogram with `center_padding=true` (matching `torch.stft`)
- Uses `fixed_max=1.5` for normalization (not data-dependent)
- Drops first frame if mel length is odd
- Outputs a single mel chunk (no 3000-frame splitting)
4.5 Dual-Stream CLI (voxtral-stream-cli.cpp)
The CLI implements the full Voxtral Realtime inference protocol:
- Token Embedding Table: Loaded from the text model GGUF file at startup using `gguf_init_from_file`. Supports F32, F16, and quantized types via `ggml_get_type_traits`.
- Prefill: Combines audio embeddings with prompt token embeddings (`BOS + 38*STREAMING_PAD`) and feeds them as `llama_batch.embd`.
- Autoregressive Generation: At each step, combines `audio_embed[pos] + text_embed[prev_token]` and feeds the result as an embedding.
- Streaming Markers: Filters `[STREAMING_PAD]` and `[STREAMING_WORD]` tokens from output.
5. Numerical Parity Validation
5.1 Methodology
Numerical parity was validated by comparing the C++ llama.cpp implementation against a standalone Python reference implementation (reference_inference.py) that uses PyTorch + safetensors directly (no vLLM or transformers dependency).
5.2 Component-by-Component Validation
5.2.1 Mel Spectrogram
| Metric | Value |
|---|---|
| Max absolute difference | 0.00002 |
| Mean absolute difference | 0.0000001 |
| Frames with diff > 0.001 | 0 |
Method: Created parity_check2.py that computes mel using both torch.stft (Python reference) and a NumPy simulation of the C++ log_mel_spectrogram function with center_padding=true and fixed_max=1.5. The mel filterbanks are identical (max diff = 0.0). The tiny remaining difference comes from floating-point FFT implementation differences.
5.2.2 Causal Conv1d
Validated by: Ensuring the Python reference causal_conv1d function (left-pad only) matches the C++ ggml_pad_ext + ggml_conv_1d implementation. The output sequence lengths match exactly.
Key fix: The initial implementation used ggml_conv_1d_ph (symmetric half-padding), which was replaced with ggml_pad_ext (asymmetric left-only padding) to match the causal behavior.
5.2.3 RoPE
Validated by: The Python reference uses interleaved RoPE:
x1 = x[..., ::2] # even indices
x2 = x[..., 1::2] # odd indices
o1 = x1 * cos - x2 * sin
o2 = x2 * cos + x1 * sin
out = stack([o1, o2], dim=-1).flatten(-2)
The C++ uses ggml_rope_ext with mode=0 which implements the same interleaved pattern.
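For a single head vector, the interleaved rotation can be written without tensor libraries as below. The per-pair frequency `theta ** (-2k / d)` is the standard RoPE schedule and is assumed here rather than taken from the source. The function name is hypothetical.

```python
import math


def rope_interleaved(x, pos, theta=1_000_000.0):
    """Rotate adjacent pairs (x[2k], x[2k+1]) by a position-dependent angle."""
    d = len(x)
    out = [0.0] * d
    for k in range(d // 2):
        angle = pos * theta ** (-2.0 * k / d)   # assumed standard schedule
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = x[2 * k], x[2 * k + 1]         # even / odd elements
        out[2 * k] = x1 * c - x2 * s
        out[2 * k + 1] = x2 * c + x1 * s
    return out
```

Each pair is rotated in its own 2-D plane, so position 0 leaves the vector unchanged and the vector norm is preserved at every position.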
5.2.4 Sliding Window Attention
Validated by: The Python reference uses:
attn_mask = (kv_abs <= qi_abs) & (kv_abs >= (qi_abs - (window - 1)))
The C++ uses a pre-computed mask tensor filled with:
mask[i][j] = (j <= i && j >= i - 749) ? 0.0f : -INFINITY;
Both implement the same causal + sliding window pattern with window=750.
5.2.5 Adaptive RMSNorm
Validated by: The precomputed approach was verified by comparing the precomputed (1 + ada_scale) vector against the runtime computation in the Python reference. Since the delay is fixed at 480ms (6 tokens), the precomputed values are exact.
5.2.6 Full Pipeline (End-to-End)
| Test | Python Reference | C++ (llama.cpp) | Match |
|---|---|---|---|
| test01_20s.wav (24s speech) | "Dancing in the masquerade, idle truth in plain sight jaded, pop, roll, click, shot, who will I be today or not? But such a tide as moving seems asleep, too full for sound and foam, when that which drew from out the boundless deep turns again home, twilight and evening bell, and after that," | Identical | Word-for-word |
The C++ implementation produces word-for-word identical output to the Python reference on the test audio file. This is consistent with numerical parity across all components: mel spectrogram, encoder, adapter, and decoder (including ada_norm and the dual-stream protocol).
5.3 Parity Test Scripts
| Script | Purpose |
|---|---|
| `debug_encoder.py` | Layer-0 forward pass trace for NaN debugging |
| `reference_inference.py` | Full standalone Python inference pipeline |
| `parity_check.py` | Mel spectrogram comparison (Python vs Whisper-style) |
| `parity_check2.py` | Mel spectrogram comparison (Python vs C++ center-padded) |
| `benchmark_hf.py` | HF Python performance benchmark |
6. Streaming Infrastructure
6.1 Components
The streaming infrastructure (voxtral-stream.h/.cpp) provides cross-platform real-time audio capture and processing:
6.1.1 Ring Buffer (voxtral_ring_buffer)
- Lock-free circular buffer using `std::atomic` for thread-safe producer-consumer
- Used between the audio capture callback and the main processing thread
6.1.2 Microphone Capture (voxtral_mic_capture)
- Uses miniaudio (already vendored in llama.cpp) for cross-platform audio input
- Supports Windows, macOS, and Linux
- Configurable sample rate and buffer duration
- Audio callback writes to ring buffer
6.1.3 Streaming Mel Preprocessor (voxtral_streaming_mel)
- Incremental mel spectrogram computation from raw PCM samples
- Self-contained FFT and mel filterbank cache
- Produces `voxtral_mel_chunk` structures compatible with the mtmd pipeline
6.2 CLI Modes
The llama-voxtral-stream CLI supports two modes:
- Microphone mode (default): Captures audio until Ctrl+C, then transcribes
- File mode (`--image <file.wav>`): Loads and transcribes a WAV file
7. Performance Benchmarks
7.1 Test Configuration
| Parameter | Value |
|---|---|
| Test file | test01_20s.wav (24.0 seconds, 16kHz mono) |
| Audio tokens | 349 |
| Generated tokens | 311 |
| Hardware | CPU only (no GPU offload) |
| Precision | F16 weights, F32 compute |
| Platform | Windows 10, MSVC 19.44 |
7.2 Latency Comparison
| Stage | Python HF (ms) | C++ llama.cpp (ms) | Speedup |
|---|---|---|---|
| Mel spectrogram | 9.6 | (included in encode) | - |
| Encoder (32 layers) | 4,766 | 9,352 (encode total) | 0.5x |
| Adapter | 38 | (included in encode) | - |
| Decoder prefill (39 tok) | 469 | 605 | 0.8x |
| Decoder generation (310 steps) | 76,230 | 31,109 | 2.4x |
| Total inference | 81,514 | 42,709 | 1.9x |
| Metric | Python HF | C++ llama.cpp |
|---|---|---|
| Per-token generation | 245.9 ms | 100.4 ms |
| Real-time factor (RTF) | 3.40x | 1.78x |
| Tokens per second (generation) | 4.1 tok/s | 10.0 tok/s |
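The derived metrics in the second table follow from the totals in the first one. The short script below reproduces them from the reported timings, assuming 24.0 seconds of audio and 310 generation steps as listed above. The helper name `summarize` is invented for this example.

```python
AUDIO_SECONDS = 24.0
GENERATION_STEPS = 310


def summarize(total_ms, generation_ms):
    """Derive per-token latency, real-time factor, and throughput."""
    per_token_ms = generation_ms / GENERATION_STEPS
    return {
        "per_token_ms": round(per_token_ms, 1),
        "rtf": round(total_ms / 1000.0 / AUDIO_SECONDS, 2),
        "tok_per_s": round(1000.0 / per_token_ms, 1),
    }


python_hf = summarize(81_514, 76_230)
llama_cpp = summarize(42_709, 31_109)
```

An RTF above 1.0 means processing takes longer than the audio lasts, so both CPU-only runs are slower than real time.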
7.3 Memory Comparison
| Metric | Python HF | C++ llama.cpp (ctx=2048) | C++ llama.cpp (ctx=131072) |
|---|---|---|---|
| Baseline | 417 MB | - | - |
| After encoder | 2,636 MB | - | - |
| After decoder load | 22,153 MB | - | - |
| Peak memory | 22,300 MB | 10,607 MB | 23,718 MB |
| Delta from baseline | +21,883 MB | - | - |
Key observations:
- The Python HF reference loads all 26 decoder layers into memory as F32 tensors, consuming ~21 GB
- The C++ implementation with `ctx=2048` uses only 10.6 GB (2.1x less than Python)
- The C++ default context (131072) allocates a 13 GB KV cache, which inflates memory to 23.7 GB
- With a practical context size (2048), the C++ implementation is significantly more memory-efficient
7.4 Memory Breakdown (C++, ctx=2048)
| Component | Size |
|---|---|
| Text model (F16) | 6,541 MB |
| Encoder model (F16) | 1,899 MB |
| KV cache (2048 cells) | 208 MB |
| Compute buffer | 268 MB |
| Encoder compute buffer | 346 MB |
| Token embedding table (F32) | ~1,536 MB |
| Other overhead | ~100 MB |
| Total | ~10,607 MB |
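Two of the entries above can be checked from the decoder parameters in section 2.4. The sketch assumes the KV cache stores K and V as F16 for all 26 layers and every context cell (no sliding-window trimming), and that sizes are reported in MiB. These assumptions are not stated in the source, but they reproduce the reported figures.

```python
N_LAYERS = 26
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_F16 = 2  # assumption: K and V are cached as F16


def kv_cache_mib(n_ctx):
    """KV cache size in MiB for a context of n_ctx cells."""
    bytes_per_cell = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_F16  # K and V
    return n_ctx * bytes_per_cell / (1024 * 1024)


def embedding_table_mib(vocab=131_072, dim=3_072, bytes_per_value=4):
    """Size in MiB of the F32 token embedding table."""
    return vocab * dim * bytes_per_value / (1024 * 1024)
```

`kv_cache_mib(2048)` gives 208, `kv_cache_mib(131072)` gives 13,312 (the 13 GB cache mentioned in 7.3), and `embedding_table_mib()` gives 1,536.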
7.5 Summary
| Metric | Python HF | C++ llama.cpp | Winner |
|---|---|---|---|
| Total latency | 81.5s | 42.7s | C++ (1.9x faster) |
| Generation speed | 4.1 tok/s | 10.0 tok/s | C++ (2.4x faster) |
| Peak memory (practical) | 22.3 GB | 10.6 GB | C++ (2.1x less) |
| Real-time factor | 3.40x | 1.78x | C++ (closer to real-time) |
| Output quality | Reference | Word-for-word identical | Tie |
& "llama-voxtral-stream.exe" -m "voxtral-realtime-4b-text-f16.gguf" --mmproj "voxtral-realtime-4b-mmproj-f16.gguf" --image "test01_20s.wav" -n 100 -t 8 --no-mmproj-offload 2>&1
9. File Manifest
New Files Created
tools/mtmd/models/voxtral-realtime-enc.cpp - Causal audio encoder graph
tools/mtmd/voxtral-stream.h - Streaming infrastructure header
tools/mtmd/voxtral-stream.cpp - Streaming infrastructure implementation
tools/mtmd/voxtral-stream-cli.cpp - Dual-stream inference CLI
Modified llama.cpp Core Files
src/llama-arch.h - LLM_TENSOR_FFN_ADA_NORM_DOWN/UP enums
src/llama-arch.cpp - Tensor name mappings, info entries
src/llama-model.h - ffn_ada_norm_down/up in llama_layer
src/llama-model.cpp - Load ada_norm tensors
src/models/llama.cpp - Apply ada_norm in FFN block
Modified Multimodal Files
tools/mtmd/clip.cpp - Voxtral RT graph dispatch, input setup, patch count
tools/mtmd/mtmd-audio.h - Voxtral RT preprocessor class
tools/mtmd/mtmd-audio.cpp - Voxtral RT preprocessor, fixed_max normalization
tools/mtmd/mtmd.cpp - Route to Voxtral RT preprocessor
tools/mtmd/CMakeLists.txt - Build target for llama-voxtral-stream
Modified Conversion Files
convert_hf_to_gguf.py - Tied embeddings, ada_norm precomputation
gguf-py/gguf/constants.py - FFN_ADA_NORM_DOWN/UP tensor enums
gguf-py/gguf/tensor_mapping.py - Ada norm tensor name mappings
Test and Benchmark Scripts
reference_inference.py - Standalone Python reference implementation
debug_encoder.py - Encoder layer-0 trace for NaN debugging
parity_check.py - Mel spectrogram comparison v1
parity_check2.py - Mel spectrogram comparison v2 (center-padded)
benchmark_hf.py - Python HF performance benchmark
10. CUDA GPU Acceleration
Build with CUDA
# Reconfigure with CUDA enabled
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
# Build all targets (includes CUDA kernel compilation, ~13 min first time)
cmake --build build --config Release
# Or build only the voxtral streaming tool
cmake --build build --config Release --target llama-voxtral-stream
Model Quantization
Three text model variants were produced from the original Mistral-format safetensors:
| Model | File | Size | Description |
|---|---|---|---|
| F32 | `voxtral-realtime-4b-text-f32.gguf` | 13,089 MB | Full precision, highest accuracy |
| F16 | `voxtral-realtime-4b-text-f16.gguf` | 6,549 MB | Half precision (original conversion) |
| Q8_0 | `voxtral-realtime-4b-text-q8_0.gguf` | 3,483 MB | 8-bit quantized, best speed/quality |
The audio encoder (mmproj) remains at F16 (1,900 MB) for all configurations.
Quantization commands:
# F16 -> F32 (for reference/parity testing)
llama-quantize --allow-requantize voxtral-realtime-4b-text-f16.gguf voxtral-realtime-4b-text-f32.gguf F32 8
# F16 -> Q8_0 (recommended for production)
llama-quantize voxtral-realtime-4b-text-f16.gguf voxtral-realtime-4b-text-q8_0.gguf Q8_0 8
GPU Performance Results
Test: test01_20s.wav (24.0 seconds of audio), 100 tokens generated, RTX 5090 Laptop GPU, 8 threads.
| Config | Encode | Decode | Total | tok/s | RTF | Speedup vs CPU |
|---|---|---|---|---|---|---|
| CPU F16 (baseline) | 12,497 ms | 10,370 ms | 25,061 ms | 9.5 | 1.00x | 1.0x |
| GPU F32 `-ngl 99` | 286 ms | 2,893 ms | 3,179 ms | 34.6 | 0.13x | 3.6x decode |
| GPU F16 `-ngl 99` | 460 ms | 1,522 ms | 1,982 ms | 65.7 | 0.08x | 6.8x decode |
| GPU Q8_0 `-ngl 99` | 278 ms | 1,047 ms | 1,325 ms | 95.5 | 0.06x | 10x decode |
Key observations:
- Audio encoder: 12.5s -> 278ms (45x faster on GPU)
- Text decode: 9.5 -> 95.5 tok/s (10x faster with Q8_0)
- RTF 0.06x: Processes 24s of audio in 1.3s (about 18x faster than real-time)
- Q8_0 transcription quality: Identical words, minor capitalization differences ("masquerade" vs "Masquerade")
- F32 is slower than F16: Expected, as F32 uses 2x memory bandwidth with no quality benefit for inference
Run Commands
File transcription (recommended: Q8_0 on GPU):
llama-voxtral-stream.exe ^
-m voxtral-realtime-4b-text-q8_0.gguf ^
--mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
--image test01_20s.wav ^
-n 500 -t 8 -ngl 99 -c 2048
File transcription (F32 on GPU, for parity testing):
llama-voxtral-stream.exe ^
-m voxtral-realtime-4b-text-f32.gguf ^
--mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
--image test01_20s.wav ^
-n 500 -t 8 -ngl 99 -c 2048
File transcription (F16 on GPU):
llama-voxtral-stream.exe ^
-m voxtral-realtime-4b-text-f16.gguf ^
--mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
--image test01_20s.wav ^
-n 500 -t 8 -ngl 99 -c 2048
Live microphone transcription (Q8_0 on GPU):
llama-voxtral-stream.exe -m voxtral-realtime-4b-text-q8_0.gguf --mmproj voxtral-realtime-4b-mmproj-f16.gguf -n 500 -t 8 -ngl 99 -c 2048
CPU-only mode (no GPU, for comparison):
llama-voxtral-stream.exe ^
-m voxtral-realtime-4b-text-f16.gguf ^
--mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
--image test01_20s.wav ^
-n 500 -t 8 --no-mmproj-offload -ngl 0 -c 2048
CLI Flags Reference
| Flag | Description |
|---|---|
| `-m <path>` | Text model GGUF file (required) |
| `--mmproj <path>` | Audio encoder GGUF file (required) |
| `--image <path>` | Audio file to transcribe (omit for live mic) |
| `-n <N>` | Max tokens to generate per chunk (default: 500) |
| `-t <N>` | CPU threads (default: 4) |
| `-ngl <N>` | GPU layers to offload (99 or all for full GPU) |
| `-c <N>` | Context size (default: 2048, sufficient for streaming) |
| `--no-mmproj-offload` | Force audio encoder to CPU |
| `--verbose-prompt` | Show full model-loading and debug output |
VRAM Usage
| Config | Text Model | Encoder | KV Cache | Total VRAM |
|---|---|---|---|---|
| GPU F32 | ~13.1 GB | ~1.9 GB | ~0.2 GB | ~15.2 GB |
| GPU F16 | ~6.5 GB | ~1.9 GB | ~0.2 GB | ~8.6 GB |
| GPU Q8_0 | ~3.5 GB | ~1.9 GB | ~0.2 GB | ~5.6 GB |