# Qwen3-TTS-Tokenizer-12Hz-48kHz
A fine-tuned variant of Qwen/Qwen3-TTS-Tokenizer-12Hz that decodes speech tokens to 48 kHz audio instead of the original 24 kHz, with no custom code required. Since only the decoder is fine-tuned, the codec remains fully compatible with the original tokenizer: you can drop this model in as a replacement decoder for Qwen3-TTS to obtain higher-quality audio output without any changes to the encoder or the TTS model itself.
## Audio Samples
Sample audio encoded with Qwen/Qwen3-TTS-Tokenizer-12Hz and decoded with both the original 24 kHz model and this 48 kHz model.
| Audio |
|---|
| Original Hi-Fi-CAPTAIN (CC BY-NC-SA 4.0) |
| Reconstructed 24 kHz (base model) |
| Reconstructed 48 kHz (this model) |
*Mel spectrogram comparison*
## Overview
In the HuggingFace ecosystem, model weights are always distributed with a config.json that defines the architecture; transformers dynamically instantiates the model graph from these parameters before loading the weights, and Qwen3-TTS fully adheres to this convention. Its 12 Hz codec decoder performs 1920× upsampling (12.5 Hz × 1920 = 24 kHz) governed by the upsample_rates list in config.json. This 48 kHz variant achieves double the output sample rate purely by appending 2 to that list ([8, 5, 4, 3] → [8, 5, 4, 3, 2]), letting transformers instantiate one additional DecoderBlock at load time; no code changes are required.
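The effect of the config change can be sanity-checked with a few lines of arithmetic (the rate lists below are taken from the description above):

```python
from math import prod

base_rates = [8, 5, 4, 3]    # upsample_rates in the base config.json
new_rates = [8, 5, 4, 3, 2]  # this model: one extra DecoderBlock appended

# Appending a 2 doubles the total upsampling ratio, which doubles
# the output sample rate: 24 kHz -> 48 kHz.
ratio = prod(new_rates) // prod(base_rates)
print(ratio)          # 2
print(24000 * ratio)  # 48000
```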
Key properties:
| Property | Value |
|---|---|
| Input sample rate | 24 kHz |
| Output sample rate | 48 kHz |
| Added parameters | ~95K (<0.1% of total) |
| `model_type` | `qwen3_tts_tokenizer_12hz` (standard upstream) |
| Custom code required | No |
## Usage
### With Qwen3TTSTokenizer (encode / decode)
```python
from qwen_tts import Qwen3TTSTokenizer
import soundfile as sf

# Load the 48 kHz model
tokenizer = Qwen3TTSTokenizer.from_pretrained("takuma104/Qwen3-TTS-Tokenizer-12Hz-48kHz")

# Encode audio to tokens
audio, sr = sf.read("input.wav", dtype="float32")
encoded = tokenizer.encode(audios=audio, sr=sr)

# Decode tokens back to 48 kHz audio
reconstructed, out_sr = tokenizer.decode(encoded)

# out_sr == 48000
sf.write("output_48k.wav", reconstructed[0], out_sr)
```
### With Qwen3TTSModel (full TTS pipeline)
Replace the speech tokenizer inside Qwen3TTSModel to get 48 kHz output from the full TTS system:
```python
from qwen_tts import Qwen3TTSModel, Qwen3TTSTokenizer
import torch
import IPython.display as ipd

# Load TTS model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="auto",
    dtype=torch.bfloat16,
)

# Swap in the 48 kHz tokenizer
model.model.speech_tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "takuma104/Qwen3-TTS-Tokenizer-12Hz-48kHz"
)

# Generate speech at 48 kHz
wavs, sr = model.generate_custom_voice(
    text="Hello, this is a 48 kHz TTS output.",
    language="English",
    speaker="Ono_Anna",
    instruct="A clear and natural female voice.",
)

# sr == 48000
ipd.display(ipd.Audio(wavs[0], rate=sr))
```
### Direct decode from audio codes
```python
import numpy as np
import torch
from qwen_tts import Qwen3TTSTokenizer

tokenizer = Qwen3TTSTokenizer.from_pretrained("takuma104/Qwen3-TTS-Tokenizer-12Hz-48kHz")

# audio_codes: numpy array of shape (seq_len, 16)
audio_codes = np.load("codes.npy")
codes_tensor = torch.tensor(audio_codes).unsqueeze(0)  # [1, seq_len, 16]
result = tokenizer.model.decode(codes_tensor)

# result.audio_values[0] is a 48 kHz waveform
```
## Architecture
The decoder stack is extended by one block compared to the base model. All decoder modules are initialized from the base model weights and fine-tuned end-to-end:
```
Base model (24 kHz)               This model (48 kHz)
───────────────────────────────   ─────────────────────────────────────────────
decoder[0] CausalConvNet          decoder[0] CausalConvNet        (from base, fine-tuned)
decoder[1] DecoderBlock rate=8    decoder[1] DecoderBlock rate=8  (from base, fine-tuned)
decoder[2] DecoderBlock rate=5    decoder[2] DecoderBlock rate=5  (from base, fine-tuned)
decoder[3] DecoderBlock rate=4    decoder[3] DecoderBlock rate=4  (from base, fine-tuned)
decoder[4] DecoderBlock rate=3    decoder[4] DecoderBlock rate=3  (from base, fine-tuned)
decoder[5] SnakeBeta(96)          decoder[5] DecoderBlock rate=2  ← new (randomly initialized, trained)
decoder[6] CausalConvNet(96→1)    decoder[6] SnakeBeta(48)        ← new (randomly initialized, trained)
                                  decoder[7] CausalConvNet(48→1)  ← new (randomly initialized, trained)
```
The three new modules at the end (~95K parameters) are randomly initialized. The existing decoder modules are initialized from the base model weights and all trained together end-to-end.
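The upstream DecoderBlock internals are not reproduced here, but the shape arithmetic of a rate-2 upsampling stage can be sketched with a plain transposed convolution in PyTorch. This is a hypothetical illustration only; layer names, kernel sizes, and the activation are assumptions, not the actual Qwen3-TTS implementation:

```python
import torch
import torch.nn as nn

class UpsampleSketch(nn.Module):
    """Illustrative rate-2 upsampling stage (not the real DecoderBlock)."""

    def __init__(self, in_ch: int, out_ch: int, rate: int = 2):
        super().__init__()
        # kernel_size = 2 * rate with padding = rate // 2 makes the
        # transposed conv multiply the temporal length exactly by `rate`.
        self.up = nn.ConvTranspose1d(
            in_ch, out_ch, kernel_size=2 * rate, stride=rate, padding=rate // 2
        )
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.up(x))

x = torch.randn(1, 96, 100)              # [batch, channels, frames]
y = UpsampleSketch(96, 48, rate=2)(x)
print(tuple(y.shape))                    # (1, 48, 200): time axis doubled
```

The channel halving (96 → 48) mirrors the diagram above, where the new final stages operate on 48 channels before the 48→1 output convolution.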
## Training

- Base model: Qwen/Qwen3-TTS-Tokenizer-12Hz
- Trained layers: all decoder modules (decoder[0–7], `--num_frozen 0`)
- Loss functions: adversarial loss + feature matching loss + multi-resolution Mel loss + global RMS loss
- Training framework: GAN-based (generator + discriminator), with `accelerate` + mixed precision (bf16)
- Training data: approx. 1.5K hours of 48 kHz speech audio (WebDataset format, with paired audio codes)
- Training machine/time: 1× RTX 5090, 120 hours
## Training Code

- Obsolete: the training code used for the currently released model weights
- New codebase: https://github.com/takuma104/Qwen3-TTS-Tokenizer-12Hz-Trainer
## Training Logs

See `run_gan11` and `run_gan12` at https://wandb.ai/takuma104/qwen3-tts-decoder-block-48k-gan.
## Limitations
- The encoder operates at 24 kHz; input audio is internally downsampled to 24 kHz for encoding.
- The decoder extends only the output stage; the latent representation and codebooks are identical to the base model.
- Quality above 12 kHz is synthesized by the new decoder layers and may not perfectly reconstruct fine high-frequency details in all recording conditions.
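Because the encoder path is unchanged, 48 kHz input is always reduced to 24 kHz before tokenization. If you want that step to be explicit rather than relying on the tokenizer's internal handling, you can resample up front; a minimal sketch using `scipy.signal.resample_poly` (the choice of resampler is an assumption, any high-quality resampler works):

```python
import numpy as np
from scipy.signal import resample_poly

# One second of synthetic 48 kHz audio as a stand-in for a real recording
sr_in, sr_enc = 48000, 24000
audio = np.random.randn(sr_in).astype(np.float32)

# Polyphase resampling from 48 kHz down to the encoder's 24 kHz
audio_24k = resample_poly(audio, up=sr_enc, down=sr_in)
print(len(audio_24k))  # 24000
```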
## Citation
If you use this model, please also cite the original Qwen3-TTS work.
## License

Apache 2.0, same as the base model.
## Acknowledgements

The idea of using asymmetrical sample rates for the encoder input and decoder output comes from NandemoGHS/Anime-XCodec2-44.1kHz-v2.