This repo contains a llama.cpp compatible implementation of https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602

1. Architecture Overview

Voxtral Realtime 4B (model ID: Voxtral-Mini-4B-Realtime-2602) is a dual-stream multimodal speech-to-text model from Mistral AI. Unlike conventional encoder-decoder ASR models, it uses a unique streaming protocol where audio and text embeddings are combined at every position during inference.

Dual-Stream Inference Protocol

At each position pos, the decoder input is:

input[pos] = audio_embed[pos] + text_embed[token_at_pos]

Prefix phase (positions 0..38): token_at_pos = prompt_ids[pos]
- prompt_ids = [BOS] + [STREAMING_PAD] * 38
- 32 left-pad tokens + 6 delay tokens (480ms transcription delay)
Autoregressive phase (positions 39..n_audio-1): token_at_pos = previously_generated_token

This is fundamentally different from standard multimodal models where audio/image embeddings are simply prepended to the text sequence.
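As a rough illustration of the protocol above, the following Python sketch builds the 39-token prefix and forms the per-position decoder input. The token ids `BOS` and `STREAMING_PAD`, and the helper names, are hypothetical placeholders; the real ids come from the tokenizer.

```python
# Hypothetical token ids, chosen only for illustration.
BOS, STREAMING_PAD = 1, 32

N_LEFT_PAD = 32  # left-pad tokens
N_DELAY = 6      # delay tokens: 6 * 80 ms = 480 ms


def build_prompt_ids():
    """Prefix fed at positions 0..38: BOS followed by 38 padding tokens."""
    return [BOS] + [STREAMING_PAD] * (N_LEFT_PAD + N_DELAY)


def token_at(pos, prompt_ids, generated):
    """Prefix phase reads the prompt; afterwards, the previously generated token."""
    if pos < len(prompt_ids):
        return prompt_ids[pos]
    return generated[pos - len(prompt_ids)]


def decoder_input(audio_embed, text_embed, pos, token_id):
    """Element-wise sum of the audio frame at `pos` and the token embedding."""
    return [a + t for a, t in zip(audio_embed[pos], text_embed[token_id])]
```

Here `generated[0]` is the token sampled from the output at position 38, which becomes the text-side input at position 39.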

2. Model Components

2.1 Mel Spectrogram Preprocessor

Parameter	Value
Sample rate	16,000 Hz
FFT size (n_fft)	400
Hop length	160
Window length	400 (Hann, periodic)
Mel bins	128
Mel scale	Slaney (linear below 1 kHz, log above)
Normalization	Fixed max = 1.5 (GLOBAL_LOG_MEL_MAX)

Key difference from Whisper: The normalization uses a fixed maximum value of 1.5 rather than a data-dependent maximum. The formula is:

log_spec = log10(max(mel_spec, 1e-10))
log_spec = max(log_spec, 1.5 - 8.0)  # clamp floor at -6.5
log_spec = (log_spec + 4.0) / 4.0     # rescale: floor -6.5 -> -0.625, fixed max 1.5 -> 1.375
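
The three steps can be checked on scalar values with a small pure-Python sketch (the function name is illustrative, not taken from the repository). A value at the clamp floor maps to -0.625, and a log-mel value of exactly 1.5 maps to 1.375.

```python
import math

GLOBAL_LOG_MEL_MAX = 1.5  # fixed maximum from the table above


def normalize_log_mel(mel_power):
    """Apply the three-step formula to a list of mel power values."""
    out = []
    for p in mel_power:
        log_spec = math.log10(max(p, 1e-10))
        log_spec = max(log_spec, GLOBAL_LOG_MEL_MAX - 8.0)  # floor at -6.5
        out.append((log_spec + 4.0) / 4.0)
    return out
```

Because the maximum is a constant, each frame can be normalized as soon as it is computed, without waiting for the rest of the audio.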

Streaming padding: Before mel computation, the raw audio is padded:

Left: 32 * 1280 = 40,960 zeros (32 left-pad tokens)
Right: alignment padding plus 17 * 1280 = 21,760 zeros (17 right-pad tokens)
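
The padding arithmetic can be sketched as follows. One token spans 1280 samples (80 ms at 16 kHz). The sketch assumes that "alignment" means rounding the raw audio up to a whole number of tokens; that interpretation is an assumption, not something stated above.

```python
SAMPLES_PER_TOKEN = 1280  # 80 ms at 16 kHz
LEFT_PAD_TOKENS = 32
RIGHT_PAD_TOKENS = 17


def padded_layout(n_samples):
    """Return (left_pad, right_pad, n_audio_tokens) for a raw sample count."""
    left = LEFT_PAD_TOKENS * SAMPLES_PER_TOKEN
    # Assumption: "alignment" rounds the audio up to a whole number of tokens.
    alignment = -n_samples % SAMPLES_PER_TOKEN
    right = alignment + RIGHT_PAD_TOKENS * SAMPLES_PER_TOKEN
    total = left + n_samples + right
    return left, right, total // SAMPLES_PER_TOKEN


print(padded_layout(24 * 16000))  # (40960, 21760, 349)
```

Under this assumption a 24.0 s clip yields 349 audio positions, which agrees with the audio token count reported in the benchmark configuration in section 7.1.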

2.2 Causal Audio Encoder

The encoder is a 32-layer causal transformer with the following architecture:

Parameter	Value
Dimension (n_embd)	1,280
Layers	32
Attention heads	32
Head dimension	64 (note: 64 != 1280/32 = 40)
KV heads	32 (no GQA)
FFN hidden dim	5,120
Sliding window	750
RoPE theta	1,000,000
Norm epsilon	1e-5
Activation	SwiGLU (SiLU gate)
Normalization	RMSNorm

2.2.1 Causal Conv1d Stem

Two convolutional layers with causal (left-only) padding:

Conv0: kernel=3, stride=1, in=128, out=1280
- Left-pad by 2 (= kernel_size - stride)
- Followed by GELU activation
Conv1: kernel=3, stride=2, in=1280, out=1280
- Left-pad by 1 (= kernel_size - stride)
- Plus alignment padding on the right
- Followed by GELU activation
- 2x temporal downsampling

After convolution, the sequence is left-truncated to a multiple of the stack factor (4).
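
A single-channel sketch of the causal padding rule, in plain Python. It is meant only to show why left-padding by kernel_size - stride keeps each output from reading future samples; the real stem has 128 to 1280 channels and a GELU after each layer, which are omitted here.

```python
def causal_conv1d(x, weights, stride):
    """Single-channel causal convolution over a list of floats.

    Left-pads with (kernel_size - stride) zeros, so output t never reads
    samples later than input index t * stride + stride - 1.
    """
    kernel = len(weights)
    padded = [0.0] * (kernel - stride) + list(x)
    n_out = (len(padded) - kernel) // stride + 1
    return [
        sum(w * padded[t * stride + k] for k, w in enumerate(weights))
        for t in range(n_out)
    ]


def truncate_to_multiple(frames, factor=4):
    """Drop frames from the left so the length is a multiple of `factor`."""
    return frames[len(frames) % factor:]
```

With stride 1 the output length equals the input length, and with stride 2 it is halved, which is the 2x temporal downsampling noted for Conv1.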

2.2.2 Attention with Sliding Window

Each attention layer uses:

Q, K, V projections: Q and V have bias; K has no bias
RoPE: Interleaved style (mode=0), theta=1M, applied to Q and K
Causal + sliding window mask: Position j can attend to position i only if i <= j AND i >= j - 749
Scale: 1/sqrt(64) (using actual head_dim, not n_embd/n_head)

2.2.3 SwiGLU FFN

gate = SiLU(x @ W1)
up   = x @ W3
down = (gate * up) @ W2 + bias

Where W1 is the gate projection, W3 is the up projection, and W2 is the down projection (with bias).
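
The same computation written out for small vectors in plain Python. The weight layout (rows indexed by input feature, so that `matvec` computes `x @ w`) is chosen to mirror the formula above and is not necessarily how the tensors are stored on disk.

```python
import math


def silu(v):
    """Numerically stable v * sigmoid(v)."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def matvec(x, w):
    """Compute x @ w for x of length n_in and w with n_in rows, n_out columns."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_ffn(x, w1, w3, w2, bias):
    gate = [silu(g) for g in matvec(x, w1)]     # SiLU(x @ W1)
    up = matvec(x, w3)                          # x @ W3
    hidden = [g * u for g, u in zip(gate, up)]  # gate * up
    return [d + b for d, b in zip(matvec(hidden, w2), bias)]
```

For an all-zero input the gate is zero, so the output reduces to the down-projection bias.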

2.3 Frame Stacking and Adapter

After the encoder, frames are stacked 4x:

Input: [seq_len, 1280]
After stacking: [seq_len/4, 5120]
After adapter MLP: [seq_len/4, 3072]

The adapter is: Linear(5120, 9216) -> GELU -> Linear(9216, 3072) (no bias).
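
A minimal sketch of the 4x stacking step, assuming that each output frame is the concatenation of four consecutive encoder frames in temporal order (the function name is illustrative):

```python
def stack_frames(frames, factor=4):
    """Concatenate each group of `factor` consecutive frames along features.

    frames: list of seq_len vectors; seq_len must be a multiple of `factor`.
    """
    if len(frames) % factor != 0:
        raise ValueError("sequence length must be a multiple of the stack factor")
    return [
        [v for frame in frames[i:i + factor] for v in frame]
        for i in range(0, len(frames), factor)
    ]
```

With 1280-wide frames this gives 5120-wide stacked frames, one per 80 ms of audio, which is why the encoder output is first truncated to a multiple of 4.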

2.4 Text Decoder (Modified Llama)

The decoder is a standard 26-layer Llama model with one key addition:

2.4.1 Adaptive RMSNorm (`ada_rms_norm_t_cond`)

Each decoder layer has an adaptive normalization module that conditions the FFN on a time embedding derived from the transcription delay:

t_cond = sinusoidal_embedding(n_delay_tokens, dim=3072)
# Per layer:
ada_hidden = GELU(t_cond @ ada_down.T)   # [3072] -> [32]
ada_scale = ada_hidden @ ada_up.T         # [32] -> [3072]
ffn_input = ffn_norm(x) * (1 + ada_scale)

For the GGUF implementation, this is precomputed at conversion time for a fixed delay of 480ms (6 delay tokens). The precomputed (1 + ada_scale) vector is stored as blk.{i}.ffn_ada_norm_up.weight and applied as a simple element-wise multiplication after RMSNorm.

2.4.2 Tied Embeddings

The output projection (lm_head) shares weights with the input embedding table (tok_embeddings.weight). The embedding table is stored at mm_streams_embeddings.embedding_module.tok_embeddings.weight in the original safetensors.

Decoder parameters:

Parameter	Value
Dimension	3,072
Layers	26
Attention heads	32
KV heads	8 (GQA ratio = 4)
Head dimension	128
FFN hidden dim	9,216
Sliding window	8,192
RoPE theta	1,000,000
Vocab size	131,072
Ada norm dim	32

3. GGUF Conversion

3.1 Conversion Pipeline

The model is converted from Mistral's safetensors format to two GGUF files:

Text model (voxtral-realtime-4b-text-f16.gguf): Llama decoder + ada_norm tensors
Multimodal projector (voxtral-realtime-4b-mmproj-f16.gguf): Encoder + adapter

3.2 Key Tensor Mappings

Encoder (mmproj GGUF)

Safetensors Name	GGUF Name
`whisper_encoder.conv_layers.0.conv.weight`	`v.enc_conv1d.1.weight`
`whisper_encoder.conv_layers.0.conv.bias`	`v.enc_conv1d.1.bias`
`whisper_encoder.conv_layers.1.conv.weight`	`v.enc_conv1d.2.weight`
`whisper_encoder.conv_layers.1.conv.bias`	`v.enc_conv1d.2.bias`
`whisper_encoder.transformer.layers.{i}.attention.wq.weight`	`v.blk.{i}.attn_q.weight`
`whisper_encoder.transformer.layers.{i}.attention.wq.bias`	`v.blk.{i}.attn_q.bias`
`whisper_encoder.transformer.layers.{i}.attention.wk.weight`	`v.blk.{i}.attn_k.weight`
`whisper_encoder.transformer.layers.{i}.attention.wv.weight`	`v.blk.{i}.attn_v.weight`
`whisper_encoder.transformer.layers.{i}.attention.wv.bias`	`v.blk.{i}.attn_v.bias`
`whisper_encoder.transformer.layers.{i}.attention.wo.weight`	`v.blk.{i}.attn_out.weight`
`whisper_encoder.transformer.layers.{i}.attention.wo.bias`	`v.blk.{i}.attn_out.bias`
`whisper_encoder.transformer.layers.{i}.feed_forward.w1.weight`	`v.blk.{i}.ffn_gate.weight`
`whisper_encoder.transformer.layers.{i}.feed_forward.w2.weight`	`v.blk.{i}.ffn_down.weight`
`whisper_encoder.transformer.layers.{i}.feed_forward.w2.bias`	`v.blk.{i}.ffn_down.bias`
`whisper_encoder.transformer.layers.{i}.feed_forward.w3.weight`	`v.blk.{i}.ffn_up.weight`
`audio_language_projection.0.weight`	`v.mm_audio_mlp.1.weight`
`audio_language_projection.2.weight`	`v.mm_audio_mlp.2.weight`

Decoder (text GGUF)

Safetensors Name	GGUF Name
`tok_embeddings.weight`	`token_embd.weight` (tied with output)
`layers.{i}.ada_rms_norm_t_cond.0.weight`	(consumed during conversion)
`layers.{i}.ada_rms_norm_t_cond.2.weight`	`blk.{i}.ffn_ada_norm_up.weight` (precomputed)

3.3 Ada Norm Precomputation

During GGUF conversion, the ada_rms_norm_t_cond weights are precomputed:

# For each layer:
t_cond = sinusoidal_embedding(6, dim=3072)  # fixed 480ms delay
ada_hidden = GELU(linear(t_cond, ada_down))  # [3072] -> [32]
ada_scale = linear(ada_hidden, ada_up)        # [32] -> [3072]
precomputed = (1.0 + ada_scale)               # [3072]
# Stored as blk.{i}.ffn_ada_norm_up.weight

This eliminates the need for runtime time embedding computation and reduces the ada_norm to a single element-wise multiply.
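
A small runnable version of the precomputation, for illustration only. It takes `t_cond` as an input rather than computing the sinusoidal embedding, assumes weights stored as [n_out][n_in] as in torch.nn.Linear, and assumes the erf-based GELU; none of these details are confirmed by the text above.

```python
import math


def gelu(v):
    """Exact (erf-based) GELU; the variant used by the model is an assumption."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight):
    """y = weight @ x, with weight stored as [n_out][n_in]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def precompute_ada_scale(t_cond, ada_down, ada_up):
    """Conversion side: return the per-layer vector (1 + ada_scale)."""
    hidden = [gelu(h) for h in linear(t_cond, ada_down)]
    return [1.0 + s for s in linear(hidden, ada_up)]


def apply_ada_norm(normed, precomputed):
    """Runtime side: element-wise multiply applied after RMSNorm."""
    return [n * p for n, p in zip(normed, precomputed)]
```

Since `t_cond` depends only on the delay, the result is a constant per layer, so only `apply_ada_norm` has to run at inference time.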

4. Implementation Details

4.1 Files Modified/Created

New Files

File	Purpose
`tools/mtmd/models/voxtral-realtime-enc.cpp`	Causal audio encoder graph builder
`tools/mtmd/voxtral-stream.h`	Streaming audio infrastructure (ring buffer, mic capture, mel)
`tools/mtmd/voxtral-stream.cpp`	Streaming infrastructure implementation
`tools/mtmd/voxtral-stream-cli.cpp`	Dedicated CLI with dual-stream inference protocol

Modified Files

File	Changes
`tools/mtmd/clip.cpp`	Added `PROJECTOR_TYPE_VOXTRAL_REALTIME` graph dispatch, input setup (positions, sliding window mask), patch count computation
`tools/mtmd/mtmd-audio.h`	Added `mtmd_audio_preprocessor_voxtral_rt` class
`tools/mtmd/mtmd-audio.cpp`	Implemented Voxtral RT preprocessor (center padding, streaming padding, fixed-max normalization); added `fixed_max` to `filter_params`
`tools/mtmd/mtmd.cpp`	Route `PROJECTOR_TYPE_VOXTRAL_REALTIME` to Voxtral RT preprocessor
`tools/mtmd/CMakeLists.txt`	Added `llama-voxtral-stream` build target
`src/models/llama.cpp`	Added adaptive RMSNorm application in FFN block
`src/llama-arch.h`	Added `LLM_TENSOR_FFN_ADA_NORM_DOWN/UP` enum values
`src/llama-arch.cpp`	Added tensor name mappings and info entries
`src/llama-model.h`	Added `ffn_ada_norm_down/up` to `llama_layer` struct
`src/llama-model.cpp`	Load ada_norm tensors for LLM_ARCH_LLAMA
`convert_hf_to_gguf.py`	Tied embeddings fix, ada_norm precomputation, Mistral format support
`gguf-py/gguf/constants.py`	Added `FFN_ADA_NORM_DOWN/UP` model tensor enums
`gguf-py/gguf/tensor_mapping.py`	Added ada_norm tensor name mappings

4.2 Encoder Graph (`voxtral-realtime-enc.cpp`)

The encoder graph is built as a single ggml_cgraph with 1048 nodes:

Causal Conv1d: Uses ggml_pad_ext for asymmetric left-padding, then ggml_conv_1d with its own padding set to 0
RoPE: ggml_rope_ext with mode=0 (interleaved), theta=1M
Sliding Window Attention: Pre-computed [seq_len, seq_len] mask passed to ggml_soft_max_ext
SwiGLU FFN: build_ffn with FFN_SILU and gate tensor
Frame Stacking: build_stack with factor=4
Adapter: build_ffn with FFN_GELU_ERF

4.3 Decoder Modifications (`src/models/llama.cpp`)

A single conditional block was added to the Llama FFN:

// After ffn_norm, before build_ffn:
if (model.layers[il].ffn_ada_norm_up) {
    cur = ggml_mul(ctx0, cur, model.layers[il].ffn_ada_norm_up);
}

4.4 Audio Preprocessor (`mtmd-audio.cpp`)

The mtmd_audio_preprocessor_voxtral_rt class:

Applies streaming padding (left=40,960, right=alignment+21,760 zeros)
Computes mel spectrogram with center_padding=true (matching torch.stft)
Uses fixed_max=1.5 for normalization (not data-dependent)
Drops first frame if mel length is odd
Outputs a single mel chunk (no 3000-frame splitting)

4.5 Dual-Stream CLI (`voxtral-stream-cli.cpp`)

The CLI implements the full Voxtral Realtime inference protocol:

Token Embedding Table: Loaded from the text model GGUF file at startup using gguf_init_from_file. Supports F32, F16, and quantized types via ggml_get_type_traits.
Prefill: Combines audio embeddings with prompt token embeddings (BOS + 38*STREAMING_PAD) and feeds them as llama_batch.embd.
Autoregressive Generation: At each step, combines audio_embed[pos] + text_embed[prev_token] and feeds as embedding.
Streaming Markers: Filters [STREAMING_PAD] and [STREAMING_WORD] tokens from output.
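
The marker filtering in the last item can be pictured with a short sketch. It works on decoded token strings; whether the CLI filters by string or by token id is not stated above, so treat this as an assumption.

```python
STREAMING_MARKERS = {"[STREAMING_PAD]", "[STREAMING_WORD]"}


def visible_text(token_pieces):
    """Join decoded token pieces, skipping the streaming control markers."""
    return "".join(p for p in token_pieces if p not in STREAMING_MARKERS)
```

For example, `visible_text(["[STREAMING_PAD]", "Hello", "[STREAMING_WORD]", " world"])` returns `"Hello world"`.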

5. Numerical Parity Validation

5.1 Methodology

Numerical parity was validated by comparing the C++ llama.cpp implementation against a standalone Python reference implementation (reference_inference.py) that uses PyTorch + safetensors directly (no vLLM or transformers dependency).

5.2 Component-by-Component Validation

5.2.1 Mel Spectrogram

Metric	Value
Max absolute difference	0.00002
Mean absolute difference	0.0000001
Frames with diff > 0.001	0

Method: Created parity_check2.py that computes mel using both torch.stft (Python reference) and a NumPy simulation of the C++ log_mel_spectrogram function with center_padding=true and fixed_max=1.5. The mel filterbanks are identical (max diff = 0.0). The tiny remaining difference comes from floating-point FFT implementation differences.

5.2.2 Causal Conv1d

Validated by: Ensuring the Python reference causal_conv1d function (left-pad only) matches the C++ ggml_pad_ext + ggml_conv_1d implementation. The output sequence lengths match exactly.

Key fix: The initial implementation used ggml_conv_1d_ph (symmetric half-padding), which was replaced with ggml_pad_ext (asymmetric left-only padding) to match the causal behavior.

5.2.3 RoPE

Validated by: The Python reference uses interleaved RoPE:

x1 = x[..., ::2]   # even indices
x2 = x[..., 1::2]  # odd indices
o1 = x1 * cos - x2 * sin
o2 = x2 * cos + x1 * sin
out = stack([o1, o2], dim=-1).flatten(-2)

The C++ uses ggml_rope_ext with mode=0 which implements the same interleaved pattern.
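
A dependency-free version of the interleaved rotation for a single head vector is sketched below. The frequency schedule `theta ** (-i / d)` for even `i` is the standard RoPE choice and is assumed here; the function name is illustrative.

```python
import math


def rope_interleaved(x, pos, theta=1_000_000.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... by pos * freq."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)  # i = 0, 2, 4, ...
        c, s = math.cos(pos * freq), math.sin(pos * freq)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i + 1] * c + x[i] * s)
    return out
```

Two quick sanity checks for any implementation: position 0 leaves the vector unchanged, and the rotation preserves the vector norm at every position.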

5.2.4 Sliding Window Attention

Validated by: The Python reference uses:

attn_mask = (kv_abs <= qi_abs) & (kv_abs >= (qi_abs - (window - 1)))

The C++ uses a pre-computed mask tensor filled with:

mask[i][j] = (j <= i && j >= i - 749) ? 0.0f : -INFINITY;

Both implement the same causal + sliding window pattern with window=750.
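
The equivalence of the two formulations can be checked directly on a small sequence. The sketch below rebuilds both masks in plain Python; the function names are illustrative.

```python
def mask_reference(seq_len, window):
    """Boolean form used by the Python reference: True means 'may attend'."""
    return [
        [(kv <= q) and (kv >= q - (window - 1)) for kv in range(seq_len)]
        for q in range(seq_len)
    ]


def mask_additive(seq_len, window):
    """Additive form passed to softmax: 0.0 where allowed, -inf elsewhere."""
    neg_inf = float("-inf")
    return [
        [0.0 if (j <= i and j >= i - (window - 1)) else neg_inf
         for j in range(seq_len)]
        for i in range(seq_len)
    ]


def masks_agree(seq_len, window):
    ref, add = mask_reference(seq_len, window), mask_additive(seq_len, window)
    return all(
        (add[i][j] == 0.0) == ref[i][j]
        for i in range(seq_len)
        for j in range(seq_len)
    )
```

With `window=750` the lower bound becomes `i - 749`, matching the constant in the C++ expression.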

5.2.5 Adaptive RMSNorm

Validated by: The precomputed approach was verified by comparing the precomputed (1 + ada_scale) vector against the runtime computation in the Python reference. Since the delay is fixed at 480ms (6 tokens), the precomputed values are exact.

5.2.6 Full Pipeline (End-to-End)

Test	Python Reference	C++ (llama.cpp)	Match
test01_20s.wav (24s speech)	"Dancing in the masquerade, idle truth in plain sight jaded, pop, roll, click, shot, who will I be today or not? But such a tide as moving seems asleep, too full for sound and foam, when that which drew from out the boundless deep turns again home, twilight and evening bell, and after that,"	Identical	Word-for-word

The C++ implementation produces word-for-word identical output to the Python reference on the test audio file. Together with the component checks above, this indicates that the full pipeline (mel spectrogram, encoder, adapter, and decoder, including ada_norm and the dual-stream protocol) matches the reference at the level of decoded text on this sample.

5.3 Parity Test Scripts

Script	Purpose
`debug_encoder.py`	Layer-0 forward pass trace for NaN debugging
`reference_inference.py`	Full standalone Python inference pipeline
`parity_check.py`	Mel spectrogram comparison (Python vs Whisper-style)
`parity_check2.py`	Mel spectrogram comparison (Python vs C++ center-padded)
`benchmark_hf.py`	HF Python performance benchmark

6. Streaming Infrastructure

6.1 Components

The streaming infrastructure (voxtral-stream.h/.cpp) provides cross-platform real-time audio capture and processing:

6.1.1 Ring Buffer (`voxtral_ring_buffer`)

Lock-free circular buffer using std::atomic for thread-safe producer-consumer
Used between the audio capture callback and the main processing thread
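
The head/tail bookkeeping of a single-producer, single-consumer ring buffer can be sketched in Python as below. This is not lock-free and is not the C++ implementation; the class name and the policy of dropping samples when the buffer is full are assumptions for illustration.

```python
class RingBuffer:
    """Single-producer / single-consumer ring buffer (index arithmetic only)."""

    def __init__(self, capacity):
        self._buf = [0.0] * (capacity + 1)  # one slot stays empty
        self._head = 0  # next write index (producer side)
        self._tail = 0  # next read index (consumer side)

    def write(self, samples):
        """Write as many samples as fit; return how many were written."""
        written = 0
        for s in samples:
            nxt = (self._head + 1) % len(self._buf)
            if nxt == self._tail:  # buffer full
                break
            self._buf[self._head] = s
            self._head = nxt
            written += 1
        return written

    def read(self, max_samples):
        """Read up to max_samples in FIFO order."""
        out = []
        while self._tail != self._head and len(out) < max_samples:
            out.append(self._buf[self._tail])
            self._tail = (self._tail + 1) % len(self._buf)
        return out
```

Only the producer advances the head and only the consumer advances the tail, which is the property that lets a C++ version replace locks with atomic loads and stores.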

6.1.2 Microphone Capture (`voxtral_mic_capture`)

Uses miniaudio (already vendored in llama.cpp) for cross-platform audio input
Supports Windows, macOS, and Linux
Configurable sample rate and buffer duration
Audio callback writes to ring buffer

6.1.3 Streaming Mel Preprocessor (`voxtral_streaming_mel`)

Incremental mel spectrogram computation from raw PCM samples
Self-contained FFT and mel filterbank cache
Produces voxtral_mel_chunk structures compatible with the mtmd pipeline

6.2 CLI Modes

The llama-voxtral-stream CLI supports two modes:

Microphone mode (default): Captures audio until Ctrl+C, then transcribes
File mode (--image <file.wav>): Loads and transcribes a WAV file

7. Performance Benchmarks

7.1 Test Configuration

Parameter	Value
Test file	`test01_20s.wav` (24.0 seconds, 16kHz mono)
Audio tokens	349
Generated tokens	311
Hardware	CPU only (no GPU offload)
Precision	F16 weights, F32 compute
Platform	Windows 10, MSVC 19.44

7.2 Latency Comparison

Stage	Python HF (ms)	C++ llama.cpp (ms)	Speedup
Mel spectrogram	9.6	(included in encode)	-
Encoder (32 layers)	4,766	9,352 (encode total)	0.5x
Adapter	38	(included in encode)	-
Decoder prefill (39 tok)	469	605	0.8x
Decoder generation (310 steps)	76,230	31,109	2.4x
Total inference	81,514	42,709	1.9x

Metric	Python HF	C++ llama.cpp
Per-token generation	245.9 ms	100.4 ms
Real-time factor (RTF)	3.40x	1.78x
Tokens per second (generation)	4.1 tok/s	10.0 tok/s
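
The derived metrics follow from the stage timings by simple arithmetic, sketched here with illustrative helper names:

```python
AUDIO_SECONDS = 24.0
GENERATION_STEPS = 310


def real_time_factor(total_ms, audio_seconds=AUDIO_SECONDS):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return total_ms / (audio_seconds * 1000.0)


def per_token_ms(generation_ms, n_steps=GENERATION_STEPS):
    return generation_ms / n_steps


def tokens_per_second(generation_ms, n_steps=GENERATION_STEPS):
    return n_steps / (generation_ms / 1000.0)
```

For instance, `real_time_factor(42709)` is about 1.78 and `per_token_ms(31109)` is about 100.4 ms, matching the C++ column.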

7.3 Memory Comparison

Metric	Python HF	C++ llama.cpp (ctx=2048)	C++ llama.cpp (ctx=131072)
Baseline	417 MB	-	-
After encoder	2,636 MB	-	-
After decoder load	22,153 MB	-	-
Peak memory	22,300 MB	10,607 MB	23,718 MB
Delta from baseline	+21,883 MB	-	-

Key observations:

The Python HF reference loads all 26 decoder layers into memory as F32 tensors, consuming ~21 GB
The C++ implementation with ctx=2048 uses only 10.6 GB (2.1x less than Python)
The C++ default context (131072) allocates a 13 GB KV cache, which inflates memory to 23.7 GB
With a practical context size (2048), the C++ implementation is significantly more memory-efficient
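
The KV cache figures can be reproduced from the decoder parameters (26 layers, 8 KV heads, head dimension 128). The sketch assumes an F16 cache (2 bytes per element) allocated for the full context; both are assumptions about the runtime configuration.

```python
def kv_cache_mib(n_ctx, n_layers=26, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Combined size of the K and V caches in MiB."""
    n_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx
    return n_bytes / (1024 * 1024)


print(kv_cache_mib(2048))    # 208.0
print(kv_cache_mib(131072))  # 13312.0
```

The 2048-cell result matches the 208 MB KV cache entry in the memory breakdown, and the 131072-cell result is 13 GiB.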

7.4 Memory Breakdown (C++, ctx=2048)

Component	Size
Text model (F16)	6,541 MB
Encoder model (F16)	1,899 MB
KV cache (2048 cells)	208 MB
Compute buffer	268 MB
Encoder compute buffer	346 MB
Token embedding table (F32)	~1,536 MB
Other overhead	~100 MB
Total	~10,607 MB

7.5 Summary

Metric	Python HF	C++ llama.cpp	Winner
Total latency	81.5s	42.7s	C++ (1.9x faster)
Generation speed	4.1 tok/s	10.0 tok/s	C++ (2.4x faster)
Peak memory (practical)	22.3 GB	10.6 GB	C++ (2.1x less)
Real-time factor	3.40x	1.78x	C++ (closer to real-time)
Output quality	Reference	Word-for-word identical	Tie

Example invocation (PowerShell; `--no-mmproj-offload` keeps the audio encoder on the CPU):

& "llama-voxtral-stream.exe" -m "voxtral-realtime-4b-text-f16.gguf" --mmproj "voxtral-realtime-4b-mmproj-f16.gguf" --image "test01_20s.wav" -n 100 -t 8 --no-mmproj-offload 2>&1

9. File Manifest

New Files Created

tools/mtmd/models/voxtral-realtime-enc.cpp   - Causal audio encoder graph
tools/mtmd/voxtral-stream.h                  - Streaming infrastructure header
tools/mtmd/voxtral-stream.cpp                - Streaming infrastructure implementation
tools/mtmd/voxtral-stream-cli.cpp            - Dual-stream inference CLI

Modified llama.cpp Core Files

src/llama-arch.h          - LLM_TENSOR_FFN_ADA_NORM_DOWN/UP enums
src/llama-arch.cpp        - Tensor name mappings, info entries
src/llama-model.h         - ffn_ada_norm_down/up in llama_layer
src/llama-model.cpp       - Load ada_norm tensors
src/models/llama.cpp      - Apply ada_norm in FFN block

Modified Multimodal Files

tools/mtmd/clip.cpp       - Voxtral RT graph dispatch, input setup, patch count
tools/mtmd/mtmd-audio.h   - Voxtral RT preprocessor class
tools/mtmd/mtmd-audio.cpp - Voxtral RT preprocessor, fixed_max normalization
tools/mtmd/mtmd.cpp       - Route to Voxtral RT preprocessor
tools/mtmd/CMakeLists.txt - Build target for llama-voxtral-stream

Modified Conversion Files

convert_hf_to_gguf.py             - Tied embeddings, ada_norm precomputation
gguf-py/gguf/constants.py         - FFN_ADA_NORM_DOWN/UP tensor enums
gguf-py/gguf/tensor_mapping.py    - Ada norm tensor name mappings

Test and Benchmark Scripts

reference_inference.py    - Standalone Python reference implementation
debug_encoder.py          - Encoder layer-0 trace for NaN debugging
parity_check.py           - Mel spectrogram comparison v1
parity_check2.py          - Mel spectrogram comparison v2 (center-padded)
benchmark_hf.py           - Python HF performance benchmark

10. CUDA GPU Acceleration

Build with CUDA

# Reconfigure with CUDA enabled
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release

# Build all targets (includes CUDA kernel compilation, ~13 min first time)
cmake --build build --config Release

# Or build only the voxtral streaming tool
cmake --build build --config Release --target llama-voxtral-stream

Model Quantization

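Three text model variants are provided. F16 is the direct conversion from the original Mistral-format safetensors; F32 and Q8_0 were derived from the F16 file with llama-quantize.

Model	File	Size	Description
F32	`voxtral-realtime-4b-text-f32.gguf`	13,089 MB	Full-precision storage (upcast from F16), for reference/parity testing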
F16	`voxtral-realtime-4b-text-f16.gguf`	6,549 MB	Half precision (original conversion)
Q8_0	`voxtral-realtime-4b-text-q8_0.gguf`	3,483 MB	8-bit quantized, best speed/quality

The audio encoder (mmproj) remains at F16 (1,900 MB) for all configurations.

Quantization commands:

# F16 -> F32 (for reference/parity testing)
llama-quantize --allow-requantize voxtral-realtime-4b-text-f16.gguf voxtral-realtime-4b-text-f32.gguf F32 8

# F16 -> Q8_0 (recommended for production)
llama-quantize voxtral-realtime-4b-text-f16.gguf voxtral-realtime-4b-text-q8_0.gguf Q8_0 8

GPU Performance Results

Test: test01_20s.wav (24.0 seconds of audio), 100 tokens generated, RTX 5090 Laptop GPU, 8 threads.

Config	Encode	Decode	Total	tok/s	RTF	Speedup vs CPU
CPU F16 (baseline)	12,497 ms	10,370 ms	25,061 ms	9.5	1.04x	1.0x
GPU F32 `-ngl 99`	286 ms	2,893 ms	3,179 ms	34.6	0.13x	3.6x decode
GPU F16 `-ngl 99`	460 ms	1,522 ms	1,982 ms	65.7	0.08x	6.8x decode
GPU Q8_0 `-ngl 99`	278 ms	1,047 ms	1,325 ms	95.5	0.06x	10x decode

Key observations:

Audio encoder: 12.5s -> 278ms (45x faster on GPU)
Text decode: 9.5 -> 95.5 tok/s (10x faster with Q8_0)
RTF 0.06x: Processes 24s of audio in 1.3s (about 18x faster than real-time)
Q8_0 transcription quality: Identical words, minor capitalization differences ("masquerade" vs "Masquerade")
F32 is slower than F16: Expected, as F32 uses 2x memory bandwidth with no quality benefit for inference

Run Commands

File transcription (recommended: Q8_0 on GPU):

llama-voxtral-stream.exe ^
  -m voxtral-realtime-4b-text-q8_0.gguf ^
  --mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
  --image test01_20s.wav ^
  -n 500 -t 8 -ngl 99 -c 2048

File transcription (F32 on GPU, for parity testing):

llama-voxtral-stream.exe ^
  -m voxtral-realtime-4b-text-f32.gguf ^
  --mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
  --image test01_20s.wav ^
  -n 500 -t 8 -ngl 99 -c 2048

File transcription (F16 on GPU):

llama-voxtral-stream.exe ^
  -m voxtral-realtime-4b-text-f16.gguf ^
  --mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
  --image test01_20s.wav ^
  -n 500 -t 8 -ngl 99 -c 2048

Live microphone transcription (Q8_0 on GPU):

llama-voxtral-stream.exe -m voxtral-realtime-4b-text-q8_0.gguf --mmproj voxtral-realtime-4b-mmproj-f16.gguf  -n 500 -t 8 -ngl 99 -c 2048

CPU-only mode (no GPU, for comparison):

llama-voxtral-stream.exe ^
  -m voxtral-realtime-4b-text-f16.gguf ^
  --mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
  --image test01_20s.wav ^
  -n 500 -t 8 --no-mmproj-offload -ngl 0 -c 2048

CLI Flags Reference

Flag	Description
`-m <path>`	Text model GGUF file (required)
`--mmproj <path>`	Audio encoder GGUF file (required)
`--image <path>`	Audio file to transcribe (omit for live mic)
`-n <N>`	Max tokens to generate per chunk (default: 500)
`-t <N>`	CPU threads (default: 4)
`-ngl <N>`	GPU layers to offload (99 or `all` for full GPU)
`-c <N>`	Context size (default: 2048, sufficient for streaming)
`--no-mmproj-offload`	Force audio encoder to CPU
`--verbose-prompt`	Show full model-loading and debug output

VRAM Usage

Config	Text Model	Encoder	KV Cache	Total VRAM
GPU F32	~13.1 GB	~1.9 GB	~0.2 GB	~15.2 GB
GPU F16	~6.5 GB	~1.9 GB	~0.2 GB	~8.6 GB
GPU Q8_0	~3.5 GB	~1.9 GB	~0.2 GB	~5.6 GB
