TrevorJS committed
Commit e5bc231 · verified · 1 Parent(s): bb08bb9

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +68 -30
README.md CHANGED
@@ -19,14 +19,25 @@ language:
  - pt
  - ar
  - hi
- - zh
- - ja
  ---

  # Voxtral TTS Q4 GGUF

  Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) in GGUF format. For use with [voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs).

  ## Model Details

  - **Base model:** [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
@@ -37,60 +48,80 @@ Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Vox

  ## What is Quantized

- | Component | Tensors | Quantization |
- |-----------|---------|-------------|
- | Backbone (Ministral 3B, 26 layers) | attention + FFN weights | Q4_0 |
- | Flow-matching transformer (3 layers) | attention + FFN + projections | Q4_0 |
- | Token embeddings [131072, 3072] | | Q4_0 |
- | Semantic codebook output [8320, 3072] | | Q4_0 |
- | Codec decoder (8 transformer + 5 conv layers) | all weights | F32 |
- | RMSNorm, LayerScale, QK-norm | | F32 |
- | Audio codebook embeddings [9088, 3072] | | F32 |
- | VQ codebook [8192, 256] | | F32 |

- Codec weights are stored as F32 with pre-fused weight normalization (g * v / ||v||).

  ## Benchmarks

  NVIDIA DGX Spark (GB10, LPDDR5x), "The quick brown fox jumps over the lazy dog":

- | Euler Steps | Gen Time | Audio | RTF | Quality (Whisper) |
- |-------------|----------|-------|-----|-------------------|
- | 8 (default) | 5.4s | 3.36s | 1.61x | Perfect |
- | 4 | 3.7s | 2.96s | 1.24x | Perfect |
- | **3** | **3.2s** | **2.64s** | **~1.0x** | **Perfect** |

- RTF < 1.0 = faster than real-time. Quality verified with Whisper large-v3.

  ## Usage

- ### Native (Rust)

  ```bash
  # Download
  uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models

- # Use via Rust API (Q4 TTS CLI coming soon)
- use voxtral_mini_realtime::gguf::Q4TtsModelLoader;
- let mut loader = Q4TtsModelLoader::from_file(Path::new("models/voxtral-tts-q4.gguf"))?;
- let (backbone, fm, codec) = loader.load(&device)?;
  ```

  ### Browser (WASM + WebGPU)

- Split into shards for browser loading:

  ```bash
- mkdir -p models/voxtral-tts-q4-shards
- split -b 512m models/voxtral-tts-q4.gguf models/voxtral-tts-q4-shards/shard-
  ```

- Voice embeddings are loaded separately from [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) (`voice_embedding/*.safetensors`).

- ## Quantization Script

- Generated with:

  ```bash
  uv run --with safetensors --with torch --with numpy --with packaging \
@@ -98,3 +129,10 @@ uv run --with safetensors --with torch --with numpy --with packaging \
  ```

  Source: [scripts/quantize_tts_gguf.py](https://github.com/TrevorS/voxtral-mini-realtime-rs/blob/main/scripts/quantize_tts_gguf.py)

@@ -19,14 +19,25 @@ language:
  - pt
  - ar
  - hi
+ - it
+ - nl
  ---

  # Voxtral TTS Q4 GGUF

  Q4_0 quantized weights for [Voxtral 4B TTS](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) in GGUF format. For use with [voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs).

+ **[Try the browser demo](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts)** — runs entirely client-side via WASM + WebGPU.
+
+ ## Files
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `voxtral-tts-q4.gguf` | 2.67 GB | Full Q4 model (single file, for native use) |
+ | `shard-{aa..af}` | 6 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
+ | `voice_embedding/*.safetensors` | ~50-200 KB each | 20 voice presets across 9 languages |
+ | `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
+
  ## Model Details

  - **Base model:** [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
@@ -37,60 +48,80 @@

  ## What is Quantized

+ | Component | Quantization |
+ |-----------|-------------|
+ | Backbone (Ministral 3B, 26 layers) attention + FFN | Q4_0 |
+ | Flow-matching transformer (3 layers) attention + FFN + projections | Q4_0 |
+ | Token embeddings [131072, 3072] | Q4_0 |
+ | Semantic codebook output [8320, 3072] | Q4_0 |
+ | Codec decoder (8 transformer + 5 conv layers) | F32 |
+ | RMSNorm, LayerScale, QK-norm, small projections | F32 |
+ | Audio codebook embeddings [9088, 3072] | F32 |
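
To make the Q4_0 rows in the table above more concrete, here is a minimal sketch of block quantization in the style of ggml's Q4_0: 32 weights share one scale, and each weight becomes a 4-bit code, two codes per byte. In ggml the scale is stored as f16, which gives 18 bytes per 32 weights (4.5 bits per weight); this sketch keeps the scale as `f32` to stay std-only. The names `Q4Block`, `quantize_block`, and `dequantize_block` are illustrative, and it is an assumption that the repository's quantization script follows this layout exactly.

```rust
/// Number of weights that share one scale in a Q4_0-style block.
const BLOCK: usize = 32;

/// One quantized block: a scale plus 32 four-bit codes packed into 16 bytes.
/// (Illustrative type; ggml stores the scale as f16 rather than f32.)
struct Q4Block {
    scale: f32,
    quants: [u8; BLOCK / 2],
}

/// Quantize 32 floats to 4-bit codes in 0..=15, where code 8 represents zero.
fn quantize_block(x: &[f32; BLOCK]) -> Q4Block {
    // Find the element with the largest magnitude, keeping its sign.
    let mut max = 0.0f32;
    for &v in x {
        if v.abs() > max.abs() {
            max = v;
        }
    }
    // That element maps to code 0, i.e. to -8 * scale after dequantization.
    let scale = max / -8.0;
    let inv = if scale != 0.0 { 1.0 / scale } else { 0.0 };
    let code = |v: f32| -> u8 { ((v * inv + 8.5) as i32).clamp(0, 15) as u8 };

    let mut quants = [0u8; BLOCK / 2];
    for i in 0..BLOCK / 2 {
        // Low nibble holds element i, high nibble holds element i + 16.
        quants[i] = code(x[i]) | (code(x[i + BLOCK / 2]) << 4);
    }
    Q4Block { scale, quants }
}

/// Recover approximate floats: value = (code - 8) * scale.
fn dequantize_block(b: &Q4Block) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for i in 0..BLOCK / 2 {
        out[i] = ((b.quants[i] & 0x0F) as f32 - 8.0) * b.scale;
        out[i + BLOCK / 2] = ((b.quants[i] >> 4) as f32 - 8.0) * b.scale;
    }
    out
}

fn main() {
    let mut x = [0.0f32; BLOCK];
    for (i, v) in x.iter_mut().enumerate() {
        *v = i as f32 - 16.0;
    }
    let block = quantize_block(&x);
    let y = dequantize_block(&block);
    println!("scale = {}, first = {}, last = {}", block.scale, y[0], y[BLOCK - 1]);
}
```

The reconstruction error per weight is bounded by the block's scale, which is one plausible reason to keep small, sensitive tensors such as norms in F32.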
 
+ Codec weights stored as F32 with pre-fused weight normalization.
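
For readers unfamiliar with the term, "pre-fused weight normalization" can be sketched as follows. Weight normalization stores a direction tensor `v` and a gain `g`, and the effective weight is `g * v / ||v||`; fusing means computing that product once at export time so the runtime only sees a plain weight matrix. The function name `fuse_weight_norm` is hypothetical, and taking the norm per output row is an assumption (it matches the common per-output-channel convention, but the actual script may differ).

```rust
/// Fuse weight normalization ahead of time.
/// Row r of the result is g[r] * v[r] / ||v[r]||, with the L2 norm taken
/// over each row (assumed per-output-channel convention).
fn fuse_weight_norm(v: &[Vec<f32>], g: &[f32]) -> Vec<Vec<f32>> {
    assert_eq!(v.len(), g.len(), "expected one gain per output row");
    v.iter()
        .zip(g)
        .map(|(row, &gain)| {
            let norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
            // Guard against all-zero rows to avoid dividing by zero.
            let s = if norm > 0.0 { gain / norm } else { 0.0 };
            row.iter().map(|x| x * s).collect()
        })
        .collect()
}

fn main() {
    // A single row (3, 4) has norm 5; with gain 10 every entry is doubled.
    let fused = fuse_weight_norm(&[vec![3.0, 4.0]], &[10.0]);
    println!("{:?}", fused);
}
```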

  ## Benchmarks

  NVIDIA DGX Spark (GB10, LPDDR5x), "The quick brown fox jumps over the lazy dog":

+ | Euler Steps | RTF | Quality (Whisper large-v3) |
+ |-------------|-----|---------------------------|
+ | 8 (default) | 1.61x | Perfect |
+ | 4 | 1.24x | Perfect |
+ | **3** | **~1.0x** (real-time) | **Perfect** |

+ Optimizations: batched CFG, fused QKV+gate/up projections, pre-allocated KV cache.
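
The RTF column can be reproduced with a one-line calculation, assuming RTF here means generation time divided by audio duration (so values below 1.0 are faster than real time). The sketch below uses the 8-step figures from the removed version of the table (5.4 s to generate 3.36 s of audio); the function name `real_time_factor` is illustrative.

```rust
/// Real-time factor: seconds of compute per second of generated audio.
/// Values below 1.0 mean synthesis is faster than playback.
fn real_time_factor(generation_secs: f64, audio_secs: f64) -> f64 {
    generation_secs / audio_secs
}

fn main() {
    // 8 Euler steps: 5.4 s of compute for 3.36 s of audio.
    let rtf = real_time_factor(5.4, 3.36);
    println!("RTF = {rtf:.2}x, faster than real time: {}", rtf < 1.0);
}
```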

  ## Usage

+ ### Native CLI

  ```bash
  # Download
  uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models

+ # Synthesize (unified voxtral CLI)
+ cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
+ speak --text "Hello world" --voice casual_female --gguf models/voxtral-tts-q4.gguf
+
+ # Real-time with 3 Euler steps
+ cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
+ speak --text "Hello world" --gguf models/voxtral-tts-q4.gguf --euler-steps 3
+
+ # List voices
+ cargo run --release --features "wgpu,cli,hub" --bin voxtral -- speak --list-voices
  ```

  ### Browser (WASM + WebGPU)

+ Shards are pre-split for browser loading. The [TTS demo](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts) loads them automatically.

+ For local dev:
  ```bash
+ wasm-pack build --target web --no-default-features --features wasm
+ bun serve.mjs # serves shards from models/voxtral-tts-q4-shards/
  ```
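
If only the shards are at hand and the single-file GGUF is needed for native use, the pieces can in principle be concatenated back together. The sketch below assumes the shards are plain byte-range splits named `shard-aa`, `shard-ab`, and so on, so that lexicographic order matches the original byte order; the helper `join_shards` and the temporary demo paths are hypothetical and not part of the repository.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Concatenate every `shard-*` file in `dir`, in lexicographic order, into `out`.
/// Returns the total number of bytes written.
fn join_shards(dir: &Path, out: &Path) -> io::Result<u64> {
    let mut shards: Vec<PathBuf> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| {
            p.file_name()
                .and_then(|n| n.to_str())
                .map_or(false, |n| n.starts_with("shard-"))
        })
        .collect();
    // Suffixes aa, ab, ac, ... sort into the original byte order.
    shards.sort();

    let mut writer = io::BufWriter::new(File::create(out)?);
    let mut total = 0u64;
    for shard in &shards {
        total += io::copy(&mut File::open(shard)?, &mut writer)?;
    }
    writer.flush()?;
    Ok(total)
}

fn main() -> io::Result<()> {
    // Demo with two tiny stand-in shards in a temp directory (hypothetical paths).
    let dir = std::env::temp_dir().join("voxtral-shard-demo");
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("shard-aa"), b"GGUF")?;
    fs::write(dir.join("shard-ab"), b"data")?;
    let out = std::env::temp_dir().join("voxtral-shard-demo.gguf");
    let n = join_shards(&dir, &out)?;
    println!("wrote {n} bytes to {}", out.display());
    Ok(())
}
```

Keeping the output file outside the shard directory avoids it ever being picked up as an input.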

+ ### Available Voices

+ 20 presets across 9 languages:
+
+ | Voice | Language |
+ |-------|----------|
+ | casual_female, casual_male | English |
+ | neutral_female, neutral_male | English |
+ | cheerful_female | English |
+ | fr_female, fr_male | French |
+ | de_female, de_male | German |
+ | es_female, es_male | Spanish |
+ | it_female, it_male | Italian |
+ | pt_female, pt_male | Portuguese |
+ | nl_female, nl_male | Dutch |
+ | hi_female, hi_male | Hindi |
+ | ar_male | Arabic |

+ ## Quantization Script

  ```bash
  uv run --with safetensors --with torch --with numpy --with packaging \
@@ -98,3 +129,10 @@
  ```

  Source: [scripts/quantize_tts_gguf.py](https://github.com/TrevorS/voxtral-mini-realtime-rs/blob/main/scripts/quantize_tts_gguf.py)
+
+ ## Related
+
+ - **Code:** [TrevorS/voxtral-mini-realtime-rs](https://github.com/TrevorS/voxtral-mini-realtime-rs)
+ - **ASR Model:** [TrevorJS/voxtral-mini-realtime-gguf](https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf)
+ - **ASR Demo:** [TrevorJS/voxtral-mini-realtime](https://huggingface.co/spaces/TrevorJS/voxtral-mini-realtime)
+ - **TTS Demo:** [TrevorJS/voxtral-4b-tts](https://huggingface.co/spaces/TrevorJS/voxtral-4b-tts)