@@ -19,14 +19,25 @@ language:
   - pt
   - ar
   - hi
-  - zh
-  - ja
 ---
 # Voxtral TTS Q4 GGUF
 Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) in GGUF format. For use with [voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs).
 ## Model Details
 - **Base model:** [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
@@ -37,60 +48,80 @@ Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Vox
 ## What is Quantized
-| Component | Tensors | Quantization |
-|-----------|---------|-------------|
-| Backbone (Ministral 3B, 26 layers) | attention + FFN weights | Q4_0 |
-| Flow-matching transformer (3 layers) | attention + FFN + projections | Q4_0 |
-| Token embeddings [131072, 3072] | | Q4_0 |
-| Semantic codebook output [8320, 3072] | | Q4_0 |
-| Codec decoder (8 transformer + 5 conv layers) | all weights | F32 |
-| RMSNorm, LayerScale, QK-norm | | F32 |
-| Audio codebook embeddings [9088, 3072] | | F32 |
-| VQ codebook [8192, 256] | | F32 |
-Codec weights are stored as F32 with pre-fused weight normalization (g * v / ||v||).
 ## Benchmarks
 NVIDIA DGX Spark (GB10, LPDDR5x), "The quick brown fox jumps over the lazy dog":
-| Euler Steps | Gen Time | Audio | RTF | Quality (Whisper) |
-|-------------|----------|-------|-----|-------------------|
-| 8 (default) | 5.4s | 3.36s | 1.61x | Perfect |
-| 4 | 3.7s | 2.96s | 1.24x | Perfect |
-| **3** | **3.2s** | **2.64s** | **~1.0x** | **Perfect** |
-RTF < 1.0 = faster than real-time. Quality verified with Whisper large-v3.
 ## Usage
-### Native (Rust)
 ```bash
 # Download
 uv run --with huggingface_hub \
   hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models
-# Use via Rust API (Q4 TTS CLI coming soon)
-use voxtral_mini_realtime::gguf::Q4TtsModelLoader;
-let mut loader = Q4TtsModelLoader::from_file(Path::new("models/voxtral-tts-q4.gguf"))?;
-let (backbone, fm, codec) = loader.load(&device)?;
 ```
 ### Browser (WASM + WebGPU)
-Split into shards for browser loading:
 ```bash
-mkdir -p models/voxtral-tts-q4-shards
-split -b 512m models/voxtral-tts-q4.gguf models/voxtral-tts-q4-shards/shard-
 ```
-Voice embeddings are loaded separately from [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) (`voice_embedding/*.safetensors`).
-## Quantization Script
-Generated with:
 ```bash
 uv run --with safetensors --with torch --with numpy --with packaging \
@@ -98,3 +129,10 @@ uv run --with safetensors --with torch --with numpy --with packaging \
 ```
 Source: [scripts/quantize_tts_gguf.py](https://github.com/TrevorS/voxtral-mini-realtime-rs/blob/main/scripts/quantize_tts_gguf.py)

   - pt
   - ar
   - hi
+  - it
+  - nl
 ---
 # Voxtral TTS Q4 GGUF
 Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) in GGUF format. For use with [voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs).
+**[Try the browser demo](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts)** — runs entirely client-side via WASM + WebGPU.
+## Files
+| File | Size | Description |
+|------|------|-------------|
+| `voxtral-tts-q4.gguf` | 2.67 GB | Full Q4 model (single file, for native use) |
+| `shard-{aa..af}` | 6 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
+| `voice_embedding/*.safetensors` | ~50-200 KB each | 20 voice presets across 9 languages |
+| `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
 ## Model Details
 - **Base model:** [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
 ## What is Quantized
+| Component | Quantization |
+|-----------|-------------|
+| Backbone (Ministral 3B, 26 layers) — attention + FFN | Q4_0 |
+| Flow-matching transformer (3 layers) — attention + FFN + projections | Q4_0 |
+| Token embeddings [131072, 3072] | Q4_0 |
+| Semantic codebook output [8320, 3072] | Q4_0 |
+| Codec decoder (8 transformer + 5 conv layers) | F32 |
+| RMSNorm, LayerScale, QK-norm, small projections | F32 |
+| Audio codebook embeddings [9088, 3072] | F32 |
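For readers unfamiliar with Q4_0, the following is a minimal sketch of the block layout as it is commonly described for ggml: 32 weights share one half-precision scale, and each weight is stored as a 4-bit code, giving 18 bytes per block (4.5 bits per weight). The function names are hypothetical and this is not the repository's quantization script; treat it as an illustration under those assumptions.

```python
import struct

BLOCK = 32  # weights per Q4_0 block


def quantize_q4_0(block):
    """Quantize 32 floats into 18 bytes: fp16 scale followed by 16 packed bytes."""
    assert len(block) == BLOCK
    # The scale comes from the element with the largest magnitude, sign kept.
    extreme = max(block, key=abs)
    d = extreme / -8.0
    inv = 1.0 / d if d != 0.0 else 0.0
    # Map each weight to an integer code in [0, 15]; code 8 represents zero.
    codes = [min(15, int(x * inv + 8.5)) for x in block]
    # Element i goes in the low nibble, element i + 16 in the high nibble.
    packed = bytes(codes[i] | (codes[i + 16] << 4) for i in range(16))
    return struct.pack("<e", d) + packed


def dequantize_q4_0(raw):
    """Recover 32 approximate floats from one 18-byte Q4_0 block."""
    (d,) = struct.unpack("<e", raw[:2])
    qs = raw[2:]
    low = [(b & 0x0F) - 8 for b in qs]
    high = [(b >> 4) - 8 for b in qs]
    return [q * d for q in low + high]
```

Round-tripping a block through `quantize_q4_0` and `dequantize_q4_0` shows the per-weight error that the Q4_0 rows in the table accept in exchange for the smaller file.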
+Codec weights are stored as F32 with pre-fused weight normalization (g * v / ||v||).
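Pre-fusing weight normalization means replacing the stored pair (g, v) with the single tensor w = g * v / ||v||, so no norm has to be computed at load or inference time. The sketch below assumes one gain per output row and a norm taken over the rest of each row (the PyTorch `weight_norm` default); the function name and the list-of-lists layout are illustrative only.

```python
import math


def fuse_weight_norm(g, v):
    """Return w = g * v / ||v||, using one norm per output row.

    g: list of gains, one per output row (assumed layout).
    v: list of rows, each a flat list of direction weights.
    """
    fused = []
    for gain, row in zip(g, v):
        norm = math.sqrt(sum(x * x for x in row))
        fused.append([gain * x / norm for x in row])
    return fused


# A single row [3, 4] has norm 5, so a gain of 2 rescales it to length 2.
w = fuse_weight_norm([2.0], [[3.0, 4.0]])
```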
 ## Benchmarks
 NVIDIA DGX Spark (GB10, LPDDR5x), "The quick brown fox jumps over the lazy dog":
+| Euler Steps | RTF | Quality (Whisper large-v3) |
+|-------------|-----|---------------------------|
+| 8 (default) | 1.61x | Perfect |
+| 4 | 1.24x | Perfect |
+| **3** | **~1.0x** (real-time) | **Perfect** |
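+RTF is generation time divided by audio duration, so values below 1.0 are faster than real time.
+Optimizations: batched classifier-free guidance (CFG), fused QKV and gate/up projections, pre-allocated KV cache.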
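As a quick check on how the RTF column is derived, here is a small sketch that divides generation time by audio duration. The input figures (5.4 s to generate 3.36 s of audio at 8 Euler steps) are taken from the earlier revision of this table; the function name is hypothetical.

```python
def real_time_factor(gen_seconds, audio_seconds):
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return gen_seconds / audio_seconds


# 8 Euler steps: 5.4 s of compute for 3.36 s of audio.
rtf = real_time_factor(5.4, 3.36)
print(f"RTF = {rtf:.2f}x")
```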
 ## Usage
+### Native CLI
 ```bash
 # Download
 uv run --with huggingface_hub \
   hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models
+# Synthesize (unified voxtral CLI)
+cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
+  speak --text "Hello world" --voice casual_female --gguf models/voxtral-tts-q4.gguf
+# Real-time with 3 Euler steps
+cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
+  speak --text "Hello world" --gguf models/voxtral-tts-q4.gguf --euler-steps 3
+# List voices
+cargo run --release --features "wgpu,cli,hub" --bin voxtral -- speak --list-voices
 ```
 ### Browser (WASM + WebGPU)
+Shards are pre-split for browser loading. The [TTS demo](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts) loads them automatically.
+For local development, build the WASM package and serve the shards:
 ```bash
+wasm-pack build --target web --no-default-features --features wasm
+bun serve.mjs  # serves shards from models/voxtral-tts-q4-shards/
 ```
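If only the shards are available locally, they can be joined back into a single file for native use. The sketch below assumes the shards are plain byte-range splits of the original GGUF (as produced by `split -b`), so concatenating them in name order restores the model. `join_shards` is a hypothetical helper, not part of the repository. The magic check relies on GGUF files beginning with the four bytes `GGUF`.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def join_shards(shard_dir, out_path, chunk_size=1 << 20):
    """Concatenate shard-* files in name order into out_path.

    Returns the shard names in the order they were written.
    """
    shards = sorted(Path(shard_dir).glob("shard-*"))
    if not shards:
        raise FileNotFoundError(f"no shard-* files in {shard_dir}")
    with open(out_path, "wb") as out:
        for shard in shards:
            with open(shard, "rb") as src:
                # Copy in chunks so a 512 MB shard is never held in memory at once.
                while chunk := src.read(chunk_size):
                    out.write(chunk)
    # Sanity check: the joined file should start with the GGUF magic.
    with open(out_path, "rb") as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError("joined file does not start with the GGUF magic")
    return [s.name for s in shards]
```

For example, `join_shards("models/voxtral-tts-q4-shards", "models/voxtral-tts-q4.gguf")` would rebuild the single-file model next to the shard directory.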
+### Available Voices
+20 presets across 9 languages:
+| Voice | Language |
+|-------|----------|
+| casual_female, casual_male | English |
+| neutral_female, neutral_male | English |
+| cheerful_female | English |
+| fr_female, fr_male | French |
+| de_female, de_male | German |
+| es_female, es_male | Spanish |
+| it_female, it_male | Italian |
+| pt_female, pt_male | Portuguese |
+| nl_female, nl_male | Dutch |
+| hi_female, hi_male | Hindi |
+| ar_male | Arabic |
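The table above can be mirrored as a small lookup, for instance to validate a `--voice` argument before invoking the CLI. This is only a sketch: the `VOICES` mapping is copied from the table, and `language_of` is a hypothetical helper rather than anything shipped with the project.

```python
VOICES = {
    "English": ["casual_female", "casual_male", "neutral_female",
                "neutral_male", "cheerful_female"],
    "French": ["fr_female", "fr_male"],
    "German": ["de_female", "de_male"],
    "Spanish": ["es_female", "es_male"],
    "Italian": ["it_female", "it_male"],
    "Portuguese": ["pt_female", "pt_male"],
    "Dutch": ["nl_female", "nl_male"],
    "Hindi": ["hi_female", "hi_male"],
    "Arabic": ["ar_male"],
}


def language_of(voice):
    """Return the language a preset belongs to, or raise KeyError if unknown."""
    for language, names in VOICES.items():
        if voice in names:
            return language
    raise KeyError(f"unknown voice preset: {voice}")
```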
+## Quantization Script
 ```bash
 uv run --with safetensors --with torch --with numpy --with packaging \
 ```
 Source: [scripts/quantize_tts_gguf.py](https://github.com/TrevorS/voxtral-mini-realtime-rs/blob/main/scripts/quantize_tts_gguf.py)
+## Related
+- **Code:** [TrevorS/voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs)
+- **ASR Model:** [TrevorJS/voxtral-mini-realtime-gguf](https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf)
+- **ASR Demo:** [TrevorJS/voxtral-mini-realtime](https://huggingface.co/spaces/TrevorJS/voxtral-mini-realtime)
+- **TTS Demo:** [TrevorJS/voxtral-4b-tts](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts)