This repo contains a llama.cpp-compatible implementation of https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602

1. Architecture Overview

Voxtral Realtime 4B (model ID: Voxtral-Mini-4B-Realtime-2602) is a dual-stream multimodal speech-to-text model from Mistral AI. Unlike conventional encoder-decoder ASR models, it uses a streaming protocol in which the audio embedding and the text embedding are summed at every position during inference.

Dual-Stream Inference Protocol

At each position pos, the decoder input is:

input[pos] = audio_embed[pos] + text_embed[token_at_pos]
  • Prefix phase (positions 0..38): token_at_pos = prompt_ids[pos]
    • prompt_ids = [BOS] + [STREAMING_PAD] * 38
    • 32 left-pad tokens + 6 delay tokens (480ms transcription delay)
  • Autoregressive phase (positions 39..n_audio-1): token_at_pos = previously_generated_token

This is fundamentally different from standard multimodal models where audio/image embeddings are simply prepended to the text sequence.
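The position rule above can be sketched in a few lines of plain Python. This is only an illustration of the indexing, not the CLI's actual code: `BOS_ID` and `STREAMING_PAD_ID` are placeholder values rather than the real vocabulary ids, and embeddings are shown as plain lists.

```python
# Sketch of the dual-stream rule: input[pos] = audio_embed[pos] + text_embed[token].
# BOS_ID and STREAMING_PAD_ID are placeholders, not the real vocabulary ids.
BOS_ID = 1
STREAMING_PAD_ID = 2
N_LEFT_PAD = 32
N_DELAY = 6  # 6 tokens * 80 ms = 480 ms


def build_prompt_ids():
    """Return the 39-token prefix: BOS followed by 38 streaming pads."""
    return [BOS_ID] + [STREAMING_PAD_ID] * (N_LEFT_PAD + N_DELAY)


def token_at(pos, prompt_ids, generated):
    """Pick the token whose embedding is added at `pos`.

    `generated` holds the tokens sampled so far; generated[0] is the token
    produced from the last prefix position.
    """
    if pos < len(prompt_ids):
        return prompt_ids[pos]                # prefix phase
    return generated[pos - len(prompt_ids)]   # autoregressive phase


def decoder_input(pos, audio_embed, text_embed, prompt_ids, generated):
    """Element-wise sum of the audio frame and the token embedding."""
    tok = token_at(pos, prompt_ids, generated)
    return [a + t for a, t in zip(audio_embed[pos], text_embed[tok])]
```

Because every position carries an audio frame, the sequence length is set by the audio, and the text stream only decides which embedding is added at each step.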


2. Model Components

2.1 Mel Spectrogram Preprocessor

Parameter Value
Sample rate 16,000 Hz
FFT size (n_fft) 400
Hop length 160
Window length 400 (Hann, periodic)
Mel bins 128
Mel scale Slaney (linear below 1 kHz, logarithmic above)
Normalization Fixed max = 1.5 (GLOBAL_LOG_MEL_MAX)

Key difference from Whisper: The normalization uses a fixed maximum value of 1.5 rather than a data-dependent maximum. The formula is:

log_spec = log10(max(mel_spec, 1e-10))
log_spec = max(log_spec, 1.5 - 8.0)  # clamp floor at -6.5
log_spec = (log_spec + 4.0) / 4.0     # floor -6.5 maps to -0.625; 1.5 maps to 1.375
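As a runnable illustration of the three steps, here is a scalar version in plain Python. The function name is made up for this example; a real implementation would apply the same arithmetic to the whole mel matrix.

```python
import math

GLOBAL_LOG_MEL_MAX = 1.5  # fixed maximum from the parameter table


def normalize_log_mel(mel_power):
    """Apply the fixed-max normalization to one mel power value."""
    log_spec = math.log10(max(mel_power, 1e-10))
    log_spec = max(log_spec, GLOBAL_LOG_MEL_MAX - 8.0)  # floor at -6.5
    return (log_spec + 4.0) / 4.0
```

Since the floor does not depend on the input, silence always maps to the same value (-0.625), which is what makes the normalization usable on a stream.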

Streaming padding: Before mel computation, the raw audio is padded:

  • Left: 32 * 1280 = 40,960 zeros (32 left-pad tokens)
  • Right: alignment + 17 * 1280 = 21,760 zeros (right-pad tokens)
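The padding sizes can be worked out as below. The README does not spell out what "alignment" means, so this sketch assumes it rounds the audio up to a whole number of 1,280-sample tokens; treat that line as a guess.

```python
SAMPLES_PER_TOKEN = 1280   # 80 ms at 16 kHz
N_LEFT_PAD_TOKENS = 32
N_RIGHT_PAD_TOKENS = 17


def streaming_pad_sizes(n_samples):
    """Return (left, right) zero-padding lengths in samples."""
    left = N_LEFT_PAD_TOKENS * SAMPLES_PER_TOKEN
    # Assumption: "alignment" rounds the audio up to a whole number of tokens.
    align = (-n_samples) % SAMPLES_PER_TOKEN
    right = align + N_RIGHT_PAD_TOKENS * SAMPLES_PER_TOKEN
    return left, right
```

Under that assumption, a 24.0 s file (384,000 samples) pads to 446,720 samples, which is 349 tokens of 1,280 samples each.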

2.2 Causal Audio Encoder

The encoder is a 32-layer causal transformer with the following architecture:

Parameter Value
Dimension (n_embd) 1,280
Layers 32
Attention heads 32
Head dimension 64 (note: 64 != 1280/32 = 40)
KV heads 32 (no GQA)
FFN hidden dim 5,120
Sliding window 750
RoPE theta 1,000,000
Norm epsilon 1e-5
Activation SwiGLU (SiLU gate)
Normalization RMSNorm

2.2.1 Causal Conv1d Stem

Two convolutional layers with causal (left-only) padding:

  • Conv0: kernel=3, stride=1, in=128, out=1280
    • Left-pad by 2 (= kernel_size - stride)
    • Followed by GELU activation
  • Conv1: kernel=3, stride=2, in=1280, out=1280
    • Left-pad by 1 (= kernel_size - stride)
    • Plus alignment padding on the right
    • Followed by GELU activation
    • 2x temporal downsampling

After convolution, the sequence is left-truncated to a multiple of the stack factor (4).
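The left-only padding rule and the final truncation can be illustrated with a single-channel toy version in plain Python. It leaves out the GELU activations, the channel dimensions and the right-side alignment padding, so it only demonstrates the padding arithmetic and the causality property.

```python
def causal_conv1d(x, kernel, stride):
    """Single-channel causal convolution: left-pad by kernel_size - stride."""
    pad = len(kernel) - stride
    padded = [0.0] * pad + list(x)
    out = []
    for start in range(0, len(padded) - len(kernel) + 1, stride):
        window = padded[start:start + len(kernel)]
        out.append(sum(w * v for w, v in zip(kernel, window)))
    return out


def truncate_left_to_multiple(frames, factor=4):
    """Drop the oldest frames so the length is a multiple of `factor`."""
    return frames[len(frames) % factor:]
```

With kernel=3, stride 1 keeps the sequence length and stride 2 halves it, and no output frame depends on input samples that come after it.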

2.2.2 Attention with Sliding Window

Each attention layer uses:

  • Q, K, V projections: Q and V have bias; K has no bias
  • RoPE: Interleaved style (mode=0), theta=1M, applied to Q and K
  • Causal + sliding window mask: Position j can attend to position i only if i <= j AND i >= j - 749
  • Scale: 1/sqrt(64) (using actual head_dim, not n_embd/n_head)
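The mask condition can be written out directly. The following is a small plain-Python sketch (the function name is made up for this example) that builds the additive mask as nested lists.

```python
import math


def sliding_window_mask(seq_len, window=750):
    """mask[q][k] is 0.0 where query q may attend to key k, else -inf."""
    return [
        [0.0 if (k <= q and k >= q - (window - 1)) else -math.inf
         for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

Each query row has at most `window` zeros, ending at the query's own position.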

2.2.3 SwiGLU FFN

gate = SiLU(x @ W1)
up   = x @ W3
down = (gate * up) @ W2 + bias

Where W1 is the gate projection, W3 is the up projection, and W2 is the down projection (with bias).
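For reference, the same three lines as a runnable plain-Python sketch on small lists. Matrices are stored as rows of columns so that `x @ W` reads naturally; that layout is a choice made for this example, not a statement about the checkpoint.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def matvec(x, w):
    """x @ W for a vector x of length n and W given as n rows of m columns."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_ffn(x, w1, w3, w2, bias):
    gate = [silu(v) for v in matvec(x, w1)]   # gate projection + SiLU
    up = matvec(x, w3)                        # up projection
    hidden = [g * u for g, u in zip(gate, up)]
    return [d + b for d, b in zip(matvec(hidden, w2), bias)]  # down + bias
```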

2.3 Frame Stacking and Adapter

After the encoder, frames are stacked 4x:

  • Input: [seq_len, 1280]
  • After stacking: [seq_len/4, 5120]
  • After adapter MLP: [seq_len/4, 3072]

The adapter is: Linear(5120, 9216) -> GELU -> Linear(9216, 3072) (no bias).
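Frame stacking amounts to a reshape. The sketch below assumes that each group of four consecutive frames is concatenated in time order along the feature axis; the README gives the shapes but not the ordering.

```python
def stack_frames(frames, factor=4):
    """Concatenate each group of `factor` consecutive frames into one vector."""
    assert len(frames) % factor == 0, "truncate to a multiple of factor first"
    return [
        [v for frame in frames[i:i + factor] for v in frame]
        for i in range(0, len(frames), factor)
    ]
```

With 1,280-wide encoder frames this turns `[seq_len, 1280]` into `[seq_len/4, 5120]`, the adapter's input width.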

2.4 Text Decoder (Modified Llama)

The decoder is a standard 26-layer Llama model with one key addition:

2.4.1 Adaptive RMSNorm (ada_rms_norm_t_cond)

Each decoder layer has an adaptive normalization module that conditions the FFN on a time embedding derived from the transcription delay:

t_cond = sinusoidal_embedding(n_delay_tokens, dim=3072)
# Per layer:
ada_hidden = GELU(t_cond @ ada_down.T)   # [3072] -> [32]
ada_scale = ada_hidden @ ada_up.T         # [32] -> [3072]
ffn_input = ffn_norm(x) * (1 + ada_scale)

For the GGUF implementation, this is precomputed at conversion time for a fixed delay of 480ms (6 delay tokens). The precomputed (1 + ada_scale) vector is stored as blk.{i}.ffn_ada_norm_up.weight and applied as a simple element-wise multiplication after RMSNorm.

2.4.2 Tied Embeddings

The output projection (lm_head) shares weights with the input embedding table (tok_embeddings.weight). The embedding table is stored at mm_streams_embeddings.embedding_module.tok_embeddings.weight in the original safetensors.

Parameter Value
Dimension 3,072
Layers 26
Attention heads 32
KV heads 8 (GQA ratio = 4)
Head dimension 128
FFN hidden dim 9,216
Sliding window 8,192
RoPE theta 1,000,000
Vocab size 131,072
Ada norm dim 32

3. GGUF Conversion

3.1 Conversion Pipeline

The model is converted from Mistral's safetensors format to two GGUF files:

  1. Text model (voxtral-realtime-4b-text-f16.gguf): Llama decoder + ada_norm tensors
  2. Multimodal projector (voxtral-realtime-4b-mmproj-f16.gguf): Encoder + adapter

3.2 Key Tensor Mappings

Encoder (mmproj GGUF)

Safetensors Name GGUF Name
whisper_encoder.conv_layers.0.conv.weight v.enc_conv1d.1.weight
whisper_encoder.conv_layers.0.conv.bias v.enc_conv1d.1.bias
whisper_encoder.conv_layers.1.conv.weight v.enc_conv1d.2.weight
whisper_encoder.conv_layers.1.conv.bias v.enc_conv1d.2.bias
whisper_encoder.transformer.layers.{i}.attention.wq.weight v.blk.{i}.attn_q.weight
whisper_encoder.transformer.layers.{i}.attention.wq.bias v.blk.{i}.attn_q.bias
whisper_encoder.transformer.layers.{i}.attention.wk.weight v.blk.{i}.attn_k.weight
whisper_encoder.transformer.layers.{i}.attention.wv.weight v.blk.{i}.attn_v.weight
whisper_encoder.transformer.layers.{i}.attention.wv.bias v.blk.{i}.attn_v.bias
whisper_encoder.transformer.layers.{i}.attention.wo.weight v.blk.{i}.attn_out.weight
whisper_encoder.transformer.layers.{i}.attention.wo.bias v.blk.{i}.attn_out.bias
whisper_encoder.transformer.layers.{i}.feed_forward.w1.weight v.blk.{i}.ffn_gate.weight
whisper_encoder.transformer.layers.{i}.feed_forward.w2.weight v.blk.{i}.ffn_down.weight
whisper_encoder.transformer.layers.{i}.feed_forward.w2.bias v.blk.{i}.ffn_down.bias
whisper_encoder.transformer.layers.{i}.feed_forward.w3.weight v.blk.{i}.ffn_up.weight
audio_language_projection.0.weight v.mm_audio_mlp.1.weight
audio_language_projection.2.weight v.mm_audio_mlp.2.weight

Decoder (text GGUF)

Safetensors Name GGUF Name
tok_embeddings.weight token_embd.weight (tied with output)
layers.{i}.ada_rms_norm_t_cond.0.weight (consumed during conversion)
layers.{i}.ada_rms_norm_t_cond.2.weight blk.{i}.ffn_ada_norm_up.weight (precomputed)

3.3 Ada Norm Precomputation

During GGUF conversion, the ada_rms_norm_t_cond weights are precomputed:

# For each layer:
t_cond = sinusoidal_embedding(6, dim=3072)  # fixed 480ms delay
ada_hidden = GELU(linear(t_cond, ada_down))  # [3072] -> [32]
ada_scale = linear(ada_hidden, ada_up)        # [32] -> [3072]
precomputed = (1.0 + ada_scale)               # [3072]
# Stored as blk.{i}.ffn_ada_norm_up.weight

This eliminates the need for runtime time embedding computation and reduces the ada_norm to a single element-wise multiply.
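A self-contained sketch of the precomputation in plain Python follows. Two parts are assumptions: the `sinusoidal_embedding` layout (cosines then sines, theta=10000) is a generic choice, not confirmed for this model, and GELU is taken to be the exact erf form. The default `dim=8` is only for illustration; the model uses 3072 with an inner dimension of 32.

```python
import math


def gelu(v):
    """Exact (erf) GELU; the variant used by the model is an assumption."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def sinusoidal_embedding(t, dim, theta=10000.0):
    """Hypothetical cos/sin time embedding; the real layout may differ."""
    half = dim // 2
    freqs = [math.exp(-math.log(theta) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


def linear(x, weight):
    """weight has shape [out][in], like a bias-free torch.nn.Linear."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weight]


def precompute_ada_scale(ada_down, ada_up, n_delay_tokens=6, dim=8):
    """Return the (1 + ada_scale) vector for a fixed delay."""
    t_cond = sinusoidal_embedding(n_delay_tokens, dim)
    ada_hidden = [gelu(v) for v in linear(t_cond, ada_down)]
    return [1.0 + v for v in linear(ada_hidden, ada_up)]


def apply_ada_norm(normed_x, precomputed):
    """Runtime cost after precomputation: one element-wise multiply."""
    return [x * s for x, s in zip(normed_x, precomputed)]
```

Nothing in the result depends on the input activations, which is why the vector can be baked into the GGUF once per layer.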


4. Implementation Details

4.1 Files Modified/Created

New Files

File Purpose
tools/mtmd/models/voxtral-realtime-enc.cpp Causal audio encoder graph builder
tools/mtmd/voxtral-stream.h Streaming audio infrastructure (ring buffer, mic capture, mel)
tools/mtmd/voxtral-stream.cpp Streaming infrastructure implementation
tools/mtmd/voxtral-stream-cli.cpp Dedicated CLI with dual-stream inference protocol

Modified Files

File Changes
tools/mtmd/clip.cpp Added PROJECTOR_TYPE_VOXTRAL_REALTIME graph dispatch, input setup (positions, sliding window mask), patch count computation
tools/mtmd/mtmd-audio.h Added mtmd_audio_preprocessor_voxtral_rt class
tools/mtmd/mtmd-audio.cpp Implemented Voxtral RT preprocessor (center padding, streaming padding, fixed-max normalization); added fixed_max to filter_params
tools/mtmd/mtmd.cpp Route PROJECTOR_TYPE_VOXTRAL_REALTIME to Voxtral RT preprocessor
tools/mtmd/CMakeLists.txt Added llama-voxtral-stream build target
src/models/llama.cpp Added adaptive RMSNorm application in FFN block
src/llama-arch.h Added LLM_TENSOR_FFN_ADA_NORM_DOWN/UP enum values
src/llama-arch.cpp Added tensor name mappings and info entries
src/llama-model.h Added ffn_ada_norm_down/up to llama_layer struct
src/llama-model.cpp Load ada_norm tensors for LLM_ARCH_LLAMA
convert_hf_to_gguf.py Tied embeddings fix, ada_norm precomputation, Mistral format support
gguf-py/gguf/constants.py Added FFN_ADA_NORM_DOWN/UP model tensor enums
gguf-py/gguf/tensor_mapping.py Added ada_norm tensor name mappings

4.2 Encoder Graph (voxtral-realtime-enc.cpp)

The encoder graph is built as a single ggml_cgraph with 1048 nodes:

  1. Causal Conv1d: Uses ggml_pad_ext for asymmetric left-padding + ggml_conv_1d with zero padding
  2. RoPE: ggml_rope_ext with mode=0 (interleaved), theta=1M
  3. Sliding Window Attention: Pre-computed [seq_len, seq_len] mask passed to ggml_soft_max_ext
  4. SwiGLU FFN: build_ffn with FFN_SILU and gate tensor
  5. Frame Stacking: build_stack with factor=4
  6. Adapter: build_ffn with FFN_GELU_ERF

4.3 Decoder Modifications (src/models/llama.cpp)

A single conditional block was added to the Llama FFN:

// After ffn_norm, before build_ffn:
if (model.layers[il].ffn_ada_norm_up) {
    cur = ggml_mul(ctx0, cur, model.layers[il].ffn_ada_norm_up);
}

4.4 Audio Preprocessor (mtmd-audio.cpp)

The mtmd_audio_preprocessor_voxtral_rt class:

  1. Applies streaming padding (left=40,960, right=alignment+21,760 zeros)
  2. Computes mel spectrogram with center_padding=true (matching torch.stft)
  3. Uses fixed_max=1.5 for normalization (not data-dependent)
  4. Drops first frame if mel length is odd
  5. Outputs a single mel chunk (no 3000-frame splitting)

4.5 Dual-Stream CLI (voxtral-stream-cli.cpp)

The CLI implements the full Voxtral Realtime inference protocol:

  1. Token Embedding Table: Loaded from the text model GGUF file at startup using gguf_init_from_file. Supports F32, F16, and quantized types via ggml_get_type_traits.
  2. Prefill: Combines audio embeddings with prompt token embeddings (BOS + 38*STREAMING_PAD) and feeds them as llama_batch.embd.
  3. Autoregressive Generation: At each step, combines audio_embed[pos] + text_embed[prev_token] and feeds as embedding.
  4. Streaming Markers: Filters [STREAMING_PAD] and [STREAMING_WORD] tokens from output.

5. Numerical Parity Validation

5.1 Methodology

Numerical parity was validated by comparing the C++ llama.cpp implementation against a standalone Python reference implementation (reference_inference.py) that uses PyTorch + safetensors directly (no vLLM or transformers dependency).

5.2 Component-by-Component Validation

5.2.1 Mel Spectrogram

Metric Value
Max absolute difference 0.00002
Mean absolute difference 0.0000001
Frames with diff > 0.001 0

Method: Created parity_check2.py that computes mel using both torch.stft (Python reference) and a NumPy simulation of the C++ log_mel_spectrogram function with center_padding=true and fixed_max=1.5. The mel filterbanks are identical (max diff = 0.0). The tiny remaining difference comes from floating-point FFT implementation differences.

5.2.2 Causal Conv1d

Validated by: Ensuring the Python reference causal_conv1d function (left-pad only) matches the C++ ggml_pad_ext + ggml_conv_1d implementation. The output sequence lengths match exactly.

Key fix: The initial implementation used ggml_conv_1d_ph (symmetric half-padding), which was replaced with ggml_pad_ext (asymmetric left-only padding) to match the causal behavior.

5.2.3 RoPE

Validated by: The Python reference uses interleaved RoPE:

x1 = x[..., ::2]   # even indices
x2 = x[..., 1::2]  # odd indices
o1 = x1 * cos - x2 * sin
o2 = x2 * cos + x1 * sin
out = stack([o1, o2], dim=-1).flatten(-2)

The C++ uses ggml_rope_ext with mode=0 which implements the same interleaved pattern.
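A runnable plain-Python version of the interleaved rotation for a single head vector is shown below. The frequency schedule theta^(-i/dim) for the pair starting at index i is the usual RoPE convention and is assumed here.

```python
import math


def rope_interleaved(x, pos, theta=1_000_000.0):
    """Rotate consecutive (even, odd) pairs of one head vector at `pos`."""
    dim = len(x)
    out = [0.0] * dim
    for i in range(0, dim, 2):
        angle = pos * theta ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i + 1] * c + x[i] * s
    return out
```

Each pair undergoes a pure rotation, so the vector norm is unchanged and position 0 is the identity.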

5.2.4 Sliding Window Attention

Validated by: The Python reference uses:

attn_mask = (kv_abs <= qi_abs) & (kv_abs >= (qi_abs - (window - 1)))

The C++ uses a pre-computed mask tensor filled with:

mask[i][j] = (j <= i && j >= i - 749) ? 0.0f : -INFINITY;

Both implement the same causal + sliding window pattern with window=750.

5.2.5 Adaptive RMSNorm

Validated by: The precomputed approach was verified by comparing the precomputed (1 + ada_scale) vector against the runtime computation in the Python reference. Since the delay is fixed at 480ms (6 tokens), the precomputed values are exact.

5.2.6 Full Pipeline (End-to-End)

| Test | Python Reference | C++ (llama.cpp) | Match |
|---|---|---|---|
| test01_20s.wav (24s speech) | "Dancing in the masquerade, idle truth in plain sight jaded, pop, roll, click, shot, who will I be today or not? But such a tide as moving seems asleep, too full for sound and foam, when that which drew from out the boundless deep turns again home, twilight and evening bell, and after that," | Identical | Word-for-word |

The C++ implementation produces word-for-word identical output to the Python reference on the test audio file. This confirms numerical parity across all components: mel spectrogram, encoder, adapter, and decoder (including ada_norm and the dual-stream protocol).

5.3 Parity Test Scripts

Script Purpose
debug_encoder.py Layer-0 forward pass trace for NaN debugging
reference_inference.py Full standalone Python inference pipeline
parity_check.py Mel spectrogram comparison (Python vs Whisper-style)
parity_check2.py Mel spectrogram comparison (Python vs C++ center-padded)
benchmark_hf.py HF Python performance benchmark

6. Streaming Infrastructure

6.1 Components

The streaming infrastructure (voxtral-stream.h/.cpp) provides cross-platform real-time audio capture and processing:

6.1.1 Ring Buffer (voxtral_ring_buffer)

  • Lock-free circular buffer using std::atomic for thread-safe producer-consumer
  • Used between the audio capture callback and the main processing thread

6.1.2 Microphone Capture (voxtral_mic_capture)

  • Uses miniaudio (already vendored in llama.cpp) for cross-platform audio input
  • Supports Windows, macOS, and Linux
  • Configurable sample rate and buffer duration
  • Audio callback writes to ring buffer

6.1.3 Streaming Mel Preprocessor (voxtral_streaming_mel)

  • Incremental mel spectrogram computation from raw PCM samples
  • Self-contained FFT and mel filterbank cache
  • Produces voxtral_mel_chunk structures compatible with the mtmd pipeline

6.2 CLI Modes

The llama-voxtral-stream CLI supports two modes:

  1. Microphone mode (default): Captures audio until Ctrl+C, then transcribes
  2. File mode (--image <file.wav>): Loads and transcribes a WAV file

7. Performance Benchmarks

7.1 Test Configuration

Parameter Value
Test file test01_20s.wav (24.0 seconds, 16kHz mono)
Audio tokens 349
Generated tokens 311
Hardware CPU only (no GPU offload)
Precision F16 weights, F32 compute
Platform Windows 10, MSVC 19.44

7.2 Latency Comparison

| Stage | Python HF (ms) | C++ llama.cpp (ms) | Speedup |
|---|---|---|---|
| Mel spectrogram | 9.6 | (included in encode) | - |
| Encoder (32 layers) | 4,766 | 9,352 (encode total) | 0.5x |
| Adapter | 38 | (included in encode) | - |
| Decoder prefill (39 tok) | 469 | 605 | 0.8x |
| Decoder generation (310 steps) | 76,230 | 31,109 | 2.4x |
| Total inference | 81,514 | 42,709 | 1.9x |

| Metric | Python HF | C++ llama.cpp |
|---|---|---|
| Per-token generation | 245.9 ms | 100.4 ms |
| Real-time factor (RTF) | 3.40x | 1.78x |
| Tokens per second (generation) | 4.1 tok/s | 10.0 tok/s |
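The derived metrics follow from the stage timings and the 24.0 s clip length. A short sketch of the arithmetic, with helper names invented for this example:

```python
AUDIO_MS = 24_000        # 24.0 s test file
GENERATION_STEPS = 310


def per_token_ms(generation_ms, steps=GENERATION_STEPS):
    return generation_ms / steps


def real_time_factor(total_ms, audio_ms=AUDIO_MS):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return total_ms / audio_ms


def tokens_per_second(generation_ms, steps=GENERATION_STEPS):
    return steps * 1000.0 / generation_ms
```

For example, `real_time_factor(42_709)` is about 1.78, so on this CPU the C++ path still takes longer than the audio it transcribes.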

7.3 Memory Comparison

| Metric | Python HF | C++ llama.cpp (ctx=2048) | C++ llama.cpp (ctx=131072) |
|---|---|---|---|
| Baseline | 417 MB | - | - |
| After encoder | 2,636 MB | - | - |
| After decoder load | 22,153 MB | - | - |
| Peak memory | 22,300 MB | 10,607 MB | 23,718 MB |
| Delta from baseline | +21,883 MB | - | - |

Key observations:

  • The Python HF reference loads all 26 decoder layers into memory as F32 tensors, consuming ~21 GB
  • The C++ implementation with ctx=2048 uses only 10.6 GB (2.1x less than Python)
  • The C++ default context (131072) allocates a 13 GB KV cache, which inflates memory to 23.7 GB
  • With a practical context size (2048), the C++ implementation is significantly more memory-efficient

7.4 Memory Breakdown (C++, ctx=2048)

Component Size
Text model (F16) 6,541 MB
Encoder model (F16) 1,899 MB
KV cache (2048 cells) 208 MB
Compute buffer 268 MB
Encoder compute buffer 346 MB
Token embedding table (F32) ~1,536 MB
Other overhead ~100 MB
Total ~10,607 MB
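Two of these figures can be reproduced from the decoder parameters (26 layers, 8 KV heads, head dimension 128, vocabulary 131,072, width 3,072). The sketch assumes the KV cache stores F16 values (2 bytes each) and that MB here means MiB.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_ctx, n_layers=26, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Size of the K and V caches, assuming F16 cache entries."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / MIB


def embedding_table_mib(vocab=131_072, dim=3_072, bytes_per_elem=4):
    """Size of the F32 token embedding table kept by the CLI."""
    return vocab * dim * bytes_per_elem / MIB
```

`kv_cache_mib(2048)` gives 208 and `kv_cache_mib(131072)` gives 13,312 (about 13 GB), in line with the KV cache sizes quoted for the two context lengths.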

7.5 Summary

| Metric | Python HF | C++ llama.cpp | Winner |
|---|---|---|---|
| Total latency | 81.5s | 42.7s | C++ (1.9x faster) |
| Generation speed | 4.1 tok/s | 10.0 tok/s | C++ (2.4x faster) |
| Peak memory (practical) | 22.3 GB | 10.6 GB | C++ (2.1x less) |
| Real-time factor | 3.40x | 1.78x | C++ (closer to real-time) |
| Output quality | Reference | Word-for-word identical | Tie |

& "llama-voxtral-stream.exe" -m "voxtral-realtime-4b-text-f16.gguf" --mmproj "voxtral-realtime-4b-mmproj-f16.gguf" --image "test01_20s.wav" -n 100 -t 8 --no-mmproj-offload 2>&1

9. File Manifest

New Files Created

tools/mtmd/models/voxtral-realtime-enc.cpp   - Causal audio encoder graph
tools/mtmd/voxtral-stream.h                  - Streaming infrastructure header
tools/mtmd/voxtral-stream.cpp                - Streaming infrastructure implementation
tools/mtmd/voxtral-stream-cli.cpp            - Dual-stream inference CLI

Modified llama.cpp Core Files

src/llama-arch.h          - LLM_TENSOR_FFN_ADA_NORM_DOWN/UP enums
src/llama-arch.cpp        - Tensor name mappings, info entries
src/llama-model.h         - ffn_ada_norm_down/up in llama_layer
src/llama-model.cpp       - Load ada_norm tensors
src/models/llama.cpp      - Apply ada_norm in FFN block

Modified Multimodal Files

tools/mtmd/clip.cpp       - Voxtral RT graph dispatch, input setup, patch count
tools/mtmd/mtmd-audio.h   - Voxtral RT preprocessor class
tools/mtmd/mtmd-audio.cpp - Voxtral RT preprocessor, fixed_max normalization
tools/mtmd/mtmd.cpp       - Route to Voxtral RT preprocessor
tools/mtmd/CMakeLists.txt - Build target for llama-voxtral-stream

Modified Conversion Files

convert_hf_to_gguf.py             - Tied embeddings, ada_norm precomputation
gguf-py/gguf/constants.py         - FFN_ADA_NORM_DOWN/UP tensor enums
gguf-py/gguf/tensor_mapping.py    - Ada norm tensor name mappings

Test and Benchmark Scripts

reference_inference.py    - Standalone Python reference implementation
debug_encoder.py          - Encoder layer-0 trace for NaN debugging
parity_check.py           - Mel spectrogram comparison v1
parity_check2.py          - Mel spectrogram comparison v2 (center-padded)
benchmark_hf.py           - Python HF performance benchmark

10. CUDA GPU Acceleration

Build with CUDA

# Reconfigure with CUDA enabled
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release

# Build all targets (includes CUDA kernel compilation, ~13 min first time)
cmake --build build --config Release

# Or build only the voxtral streaming tool
cmake --build build --config Release --target llama-voxtral-stream

Model Quantization

Three text model variants are available. The F16 file is the original conversion from the Mistral-format safetensors; the F32 and Q8_0 files were derived from it with llama-quantize:

| Model | File | Size | Description |
|---|---|---|---|
| F32 | voxtral-realtime-4b-text-f32.gguf | 13,089 MB | Full precision, highest accuracy |
| F16 | voxtral-realtime-4b-text-f16.gguf | 6,549 MB | Half precision (original conversion) |
| Q8_0 | voxtral-realtime-4b-text-q8_0.gguf | 3,483 MB | 8-bit quantized, best speed/quality |

The audio encoder (mmproj) remains at F16 (1,900 MB) for all configurations.

Quantization commands:

# F16 -> F32 (for reference/parity testing)
llama-quantize --allow-requantize voxtral-realtime-4b-text-f16.gguf voxtral-realtime-4b-text-f32.gguf F32 8

# F16 -> Q8_0 (recommended for production)
llama-quantize voxtral-realtime-4b-text-f16.gguf voxtral-realtime-4b-text-q8_0.gguf Q8_0 8

GPU Performance Results

Test: test01_20s.wav (24.0 seconds of audio), 100 tokens generated, RTX 5090 Laptop GPU, 8 threads.

| Config | Encode | Decode | Total | tok/s | RTF | Speedup vs CPU |
|---|---|---|---|---|---|---|
| CPU F16 (baseline) | 12,497 ms | 10,370 ms | 25,061 ms | 9.5 | 1.00x | 1.0x |
| GPU F32 -ngl 99 | 286 ms | 2,893 ms | 3,179 ms | 34.6 | 0.13x | 3.6x decode |
| GPU F16 -ngl 99 | 460 ms | 1,522 ms | 1,982 ms | 65.7 | 0.08x | 6.8x decode |
| GPU Q8_0 -ngl 99 | 278 ms | 1,047 ms | 1,325 ms | 95.5 | 0.06x | 10x decode |

Key observations:

  • Audio encoder: 12.5s -> 278ms (45x faster on GPU)
  • Text decode: 9.5 -> 95.5 tok/s (10x faster with Q8_0)
  • RTF 0.06x: Processes 24s of audio in 1.3s (about 18x faster than real-time)
  • Q8_0 transcription quality: Identical words, minor capitalization differences ("masquerade" vs "Masquerade")
  • F32 is slower than F16: Expected, as F32 uses 2x memory bandwidth with no quality benefit for inference

Run Commands

File transcription (recommended: Q8_0 on GPU):

llama-voxtral-stream.exe ^
  -m voxtral-realtime-4b-text-q8_0.gguf ^
  --mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
  --image test01_20s.wav ^
  -n 500 -t 8 -ngl 99 -c 2048

File transcription (F32 on GPU, for parity testing):

llama-voxtral-stream.exe ^
  -m voxtral-realtime-4b-text-f32.gguf ^
  --mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
  --image test01_20s.wav ^
  -n 500 -t 8 -ngl 99 -c 2048

File transcription (F16 on GPU):

llama-voxtral-stream.exe ^
  -m voxtral-realtime-4b-text-f16.gguf ^
  --mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
  --image test01_20s.wav ^
  -n 500 -t 8 -ngl 99 -c 2048

Live microphone transcription (Q8_0 on GPU):

llama-voxtral-stream.exe -m voxtral-realtime-4b-text-q8_0.gguf --mmproj voxtral-realtime-4b-mmproj-f16.gguf  -n 500 -t 8 -ngl 99 -c 2048

CPU-only mode (no GPU, for comparison):

llama-voxtral-stream.exe ^
  -m voxtral-realtime-4b-text-f16.gguf ^
  --mmproj voxtral-realtime-4b-mmproj-f16.gguf ^
  --image test01_20s.wav ^
  -n 500 -t 8 --no-mmproj-offload -ngl 0 -c 2048

CLI Flags Reference

Flag Description
-m <path> Text model GGUF file (required)
--mmproj <path> Audio encoder GGUF file (required)
--image <path> Audio file to transcribe (omit for live mic)
-n <N> Max tokens to generate per chunk (default: 500)
-t <N> CPU threads (default: 4)
-ngl <N> GPU layers to offload (99 or all for full GPU)
-c <N> Context size (default: 2048, sufficient for streaming)
--no-mmproj-offload Force audio encoder to CPU
--verbose-prompt Show full model-loading and debug output

VRAM Usage

| Config | Text Model | Encoder | KV Cache | Total VRAM |
|---|---|---|---|---|
| GPU F32 | ~13.1 GB | ~1.9 GB | ~0.2 GB | ~15.2 GB |
| GPU F16 | ~6.5 GB | ~1.9 GB | ~0.2 GB | ~8.6 GB |
| GPU Q8_0 | ~3.5 GB | ~1.9 GB | ~0.2 GB | ~5.6 GB |