Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -19,14 +19,25 @@ language:
|
|
| 19 |
- pt
|
| 20 |
- ar
|
| 21 |
- hi
|
| 22 |
-
-
|
| 23 |
-
-
|
| 24 |
---
|
| 25 |
|
| 26 |
# Voxtral TTS Q4 GGUF
|
| 27 |
|
| 28 |
Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) in GGUF format. For use with [voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs).
|
| 29 |
|
| 30 |
## Model Details
|
| 31 |
|
| 32 |
- **Base model:** [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
|
|
@@ -37,60 +48,80 @@ Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Vox
|
|
| 37 |
|
| 38 |
## What is Quantized
|
| 39 |
|
| 40 |
-
| Component |
|
| 41 |
-
|-----------|---------
|
| 42 |
-
| Backbone (Ministral 3B, 26 layers)
|
| 43 |
-
| Flow-matching transformer (3 layers)
|
| 44 |
-
| Token embeddings [131072, 3072] |
|
| 45 |
-
| Semantic codebook output [8320, 3072] |
|
| 46 |
-
| Codec decoder (8 transformer + 5 conv layers) |
|
| 47 |
-
| RMSNorm, LayerScale, QK-norm
|
| 48 |
-
| Audio codebook embeddings [9088, 3072] |
|
| 49 |
-
| VQ codebook [8192, 256] | | F32 |
|
| 50 |
|
| 51 |
-
Codec weights
|
| 52 |
|
| 53 |
## Benchmarks
|
| 54 |
|
| 55 |
NVIDIA DGX Spark (GB10, LPDDR5x), "The quick brown fox jumps over the lazy dog":
|
| 56 |
|
| 57 |
-
| Euler Steps |
|
| 58 |
-
|-------------|-----
|
| 59 |
-
| 8 (default) |
|
| 60 |
-
| 4 |
|
| 61 |
-
| **3** | **
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
## Usage
|
| 66 |
|
| 67 |
-
### Native
|
| 68 |
|
| 69 |
```bash
|
| 70 |
# Download
|
| 71 |
uv run --with huggingface_hub \
|
| 72 |
hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models
|
| 73 |
|
| 74 |
-
#
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
```
|
| 79 |
|
| 80 |
### Browser (WASM + WebGPU)
|
| 81 |
|
| 82 |
-
|
| 83 |
|
| 84 |
```bash
|
| 85 |
-
|
| 86 |
-
|
| 87 |
```
|
| 88 |
|
| 89 |
-
|
| 90 |
|
| 91 |
-
|
| 92 |
|
| 93 |
-
|
| 94 |
|
| 95 |
```bash
|
| 96 |
uv run --with safetensors --with torch --with numpy --with packaging \
|
|
@@ -98,3 +129,10 @@ uv run --with safetensors --with torch --with numpy --with packaging \
|
|
| 98 |
```
|
| 99 |
|
| 100 |
Source: [scripts/quantize_tts_gguf.py](https://github.com/TrevorS/voxtral-mini-realtime-rs/blob/main/scripts/quantize_tts_gguf.py)
|
| 19 |
- pt
|
| 20 |
- ar
|
| 21 |
- hi
|
| 22 |
+
- it
|
| 23 |
+
- nl
|
| 24 |
---
|
| 25 |
|
| 26 |
# Voxtral TTS Q4 GGUF
|
| 27 |
|
| 28 |
Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) in GGUF format. For use with [voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs).
|
| 29 |
|
| 30 |
+
**[Try the browser demo](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts)** — runs entirely client-side via WASM + WebGPU.
|
| 31 |
+
|
| 32 |
+
## Files
|
| 33 |
+
|
| 34 |
+
| File | Size | Description |
|
| 35 |
+
|------|------|-------------|
|
| 36 |
+
| `voxtral-tts-q4.gguf` | 2.67 GB | Full Q4 model (single file, for native use) |
|
| 37 |
+
| `shard-{aa..af}` | 6 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
|
| 38 |
+
| `voice_embedding/*.safetensors` | ~50-200 KB each | 20 voice presets across 9 languages |
|
| 39 |
+
| `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
|
| 40 |
+
|
| 41 |
## Model Details
|
| 42 |
|
| 43 |
- **Base model:** [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
|
|
|
|
| 48 |
|
| 49 |
## What is Quantized
|
| 50 |
|
| 51 |
+
| Component | Quantization |
|
| 52 |
+
|-----------|-------------|
|
| 53 |
+
| Backbone (Ministral 3B, 26 layers) — attention + FFN | Q4_0 |
|
| 54 |
+
| Flow-matching transformer (3 layers) — attention + FFN + projections | Q4_0 |
|
| 55 |
+
| Token embeddings [131072, 3072] | Q4_0 |
|
| 56 |
+
| Semantic codebook output [8320, 3072] | Q4_0 |
|
| 57 |
+
| Codec decoder (8 transformer + 5 conv layers) | F32 |
|
| 58 |
+
| RMSNorm, LayerScale, QK-norm, small projections | F32 |
|
| 59 |
+
| Audio codebook embeddings [9088, 3072] | F32 |
|
| 60 |
|
| 61 |
+
Codec weights are stored as F32, with weight normalization pre-fused into the stored tensors.
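
To make the Q4_0 rows in the table concrete, here is a minimal Python sketch of block-wise 4-bit quantization in the style of GGML's Q4_0 layout: 32 weights per block, one fp16 scale, and 32 four-bit codes, for 18 bytes per block. The function names are illustrative and are not taken from `quantize_tts_gguf.py`; the actual script may differ in rounding and packing details.

```python
import struct

BLOCK_SIZE = 32  # Q4_0 groups weights into blocks of 32


def quantize_block(values):
    """Return (scale, codes) for one block; codes are integers in 0..15."""
    assert len(values) == BLOCK_SIZE
    # The largest-magnitude element (sign kept) is mapped to code 0, i.e. -8.
    extreme = max(values, key=abs)
    raw_scale = extreme / -8.0
    inv = 1.0 / raw_scale if raw_scale != 0.0 else 0.0
    # Round to nearest by adding 8.5 and truncating; clamp the top code to 15.
    codes = [min(15, int(v * inv + 8.5)) for v in values]
    # The scale is stored as a 16-bit float, so round-trip it through fp16.
    # (Assumes the scale fits in fp16 range, which holds for typical weights.)
    scale = struct.unpack("<e", struct.pack("<e", raw_scale))[0]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights: (code - 8) * scale."""
    return [(c - 8) * scale for c in codes]


def pack_block(scale, codes):
    """Pack one block into 18 bytes: fp16 scale followed by 16 nibble pairs."""
    half = BLOCK_SIZE // 2
    # Element i goes in the low nibble of byte i, element i + 16 in the high nibble.
    nibbles = bytes(codes[i] | (codes[i + half] << 4) for i in range(half))
    return struct.pack("<e", scale) + nibbles


# Illustrative input, not real model weights.
weights = [(i - 16) / 100.0 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
packed = pack_block(scale, codes)
print(len(packed), len(packed) * 8 / BLOCK_SIZE)  # 18 bytes, 4.5 bits per weight
```

At 4.5 bits per weight instead of 32, the tensors marked Q4_0 shrink by roughly 7x, while the small F32 tensors (norms, codec) keep full precision.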
|
| 62 |
|
| 63 |
## Benchmarks
|
| 64 |
|
| 65 |
NVIDIA DGX Spark (GB10, LPDDR5x), "The quick brown fox jumps over the lazy dog":
|
| 66 |
|
| 67 |
+
| Euler Steps | RTF | Quality (Whisper large-v3) |
|
| 68 |
+
|-------------|-----|---------------------------|
|
| 69 |
+
| 8 (default) | 1.61x | Perfect |
|
| 70 |
+
| 4 | 1.24x | Perfect |
|
| 71 |
+
| **3** | **~1.0x** (real-time) | **Perfect** |
|
| 72 |
|
| 73 |
+
Optimizations: batched classifier-free guidance (CFG), fused QKV and gate/up projections, pre-allocated KV cache.
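
The table does not define RTF. Reading it by the common convention (synthesis time divided by the duration of the generated audio, so 1.0x is real-time and lower is faster), it can be computed as in the sketch below. The timing and duration values are made-up inputs for illustration, not measurements from this benchmark.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Wall-clock synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_real_time(rtf, tolerance=0.05):
    """True when synthesis keeps up with playback, within a small tolerance."""
    return rtf <= 1.0 + tolerance


# Hypothetical run: 4.83 s of compute for a 3.0 s clip.
rtf = real_time_factor(4.83, 3.0)
print(f"{rtf:.2f}x", is_real_time(rtf))  # 1.61x False
```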
|
| 74 |
|
| 75 |
## Usage
|
| 76 |
|
| 77 |
+
### Native CLI
|
| 78 |
|
| 79 |
```bash
|
| 80 |
# Download
|
| 81 |
uv run --with huggingface_hub \
|
| 82 |
hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models
|
| 83 |
|
| 84 |
+
# Synthesize (unified voxtral CLI)
|
| 85 |
+
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
|
| 86 |
+
speak --text "Hello world" --voice casual_female --gguf models/voxtral-tts-q4.gguf
|
| 87 |
+
|
| 88 |
+
# Real-time with 3 Euler steps
|
| 89 |
+
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
|
| 90 |
+
speak --text "Hello world" --gguf models/voxtral-tts-q4.gguf --euler-steps 3
|
| 91 |
+
|
| 92 |
+
# List voices
|
| 93 |
+
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- speak --list-voices
|
| 94 |
```
|
| 95 |
|
| 96 |
### Browser (WASM + WebGPU)
|
| 97 |
|
| 98 |
+
Shards are pre-split for browser loading. The [TTS demo](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts) loads them automatically.
|
| 99 |
|
| 100 |
+
For local dev:
|
| 101 |
```bash
|
| 102 |
+
wasm-pack build --target web --no-default-features --features wasm
|
| 103 |
+
bun serve.mjs # serves shards from models/voxtral-tts-q4-shards/
|
| 104 |
```
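
Outside the browser, the shards can in principle be stitched back into a single file. The sketch below assumes the `shard-aa` through `shard-af` files are plain byte-wise slices of `voxtral-tts-q4.gguf` in lexicographic order (as the Unix `split` tool would produce). This README does not confirm that, so treat it as an assumption and prefer downloading the single-file GGUF when possible. `join_shards` and `looks_like_gguf` are hypothetical helpers, not part of the project.

```python
import shutil
from pathlib import Path


def join_shards(shard_dir, out_path, prefix="shard-"):
    """Concatenate shard files in name order and return the names used."""
    shards = sorted(p for p in Path(shard_dir).glob(prefix + "*") if p.is_file())
    if not shards:
        raise FileNotFoundError(f"no files matching {prefix}* in {shard_dir}")
    with open(out_path, "wb") as out:
        for shard in shards:
            with open(shard, "rb") as src:
                # Stream in chunks so a 512 MB shard is never held in memory at once.
                shutil.copyfileobj(src, out)
    return [p.name for p in shards]


def looks_like_gguf(path):
    """Cheap sanity check: GGUF files begin with the 4-byte magic b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

For example, `join_shards("models/voxtral-tts-q4-shards", "models/joined.gguf")` followed by `looks_like_gguf("models/joined.gguf")` would give a quick check that the first shard really is the head of a GGUF file.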
|
| 105 |
|
| 106 |
+
### Available Voices
|
| 107 |
|
| 108 |
+
20 presets across 9 languages:
|
| 109 |
+
|
| 110 |
+
| Voice | Language |
|
| 111 |
+
|-------|----------|
|
| 112 |
+
| casual_female, casual_male | English |
|
| 113 |
+
| neutral_female, neutral_male | English |
|
| 114 |
+
| cheerful_female | English |
|
| 115 |
+
| fr_female, fr_male | French |
|
| 116 |
+
| de_female, de_male | German |
|
| 117 |
+
| es_female, es_male | Spanish |
|
| 118 |
+
| it_female, it_male | Italian |
|
| 119 |
+
| pt_female, pt_male | Portuguese |
|
| 120 |
+
| nl_female, nl_male | Dutch |
|
| 121 |
+
| hi_female, hi_male | Hindi |
|
| 122 |
+
| ar_male | Arabic |
|
| 123 |
|
| 124 |
+
## Quantization Script
|
| 125 |
|
| 126 |
```bash
|
| 127 |
uv run --with safetensors --with torch --with numpy --with packaging \
|
|
|
|
| 129 |
```
|
| 130 |
|
| 131 |
Source: [scripts/quantize_tts_gguf.py](https://github.com/TrevorS/voxtral-mini-realtime-rs/blob/main/scripts/quantize_tts_gguf.py)
|
| 132 |
+
|
| 133 |
+
## Related
|
| 134 |
+
|
| 135 |
+
- **Code:** [TrevorS/voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs)
|
| 136 |
+
- **ASR Model:** [TrevorJS/voxtral-mini-realtime-gguf](https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf)
|
| 137 |
+
- **ASR Demo:** [TrevorJS/voxtral-mini-realtime](https://huggingface.co/spaces/TrevorJS/voxtral-mini-realtime)
|
| 138 |
+
- **TTS Demo:** [TrevorJS/voxtral-4b-tts](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts)
|