Qwen3-TTS-Tokenizer-12Hz-48kHz

A fine-tuned variant of Qwen/Qwen3-TTS-Tokenizer-12Hz that decodes speech tokens to 48 kHz audio instead of the original 24 kHz, with no custom code required. Since only the decoder is fine-tuned, the codec remains fully compatible with the original tokenizer: you can drop this model in as a replacement decoder for Qwen3-TTS to obtain higher-quality audio output without any changes to the encoder or the TTS model itself.

Audio Samples

Sample audio encoded with Qwen/Qwen3-TTS-Tokenizer-12Hz and decoded with both the original 24 kHz model and this 48 kHz model.

(Audio players: an original sample from Hi-Fi-CAPTAIN (CC BY-NC-SA 4.0), its reconstruction at 24 kHz with the base model, and its reconstruction at 48 kHz with this model.)

Mel spectrogram comparison

(Figure: mel spectrograms comparing the 24 kHz and 48 kHz reconstructions.)

Overview

In the Hugging Face ecosystem, model weights are distributed with a config.json that defines the architecture; transformers dynamically instantiates the model graph from these parameters before loading the weights, and Qwen3-TTS fully adheres to this convention. Its 12 Hz codec decoder performs 1920× upsampling (12.5 Hz × 1920 = 24 kHz), governed by the upsample_rates list in config.json. This 48 kHz variant doubles the output sample rate purely by appending 2 to that list ([8, 5, 4, 3] → [8, 5, 4, 3, 2]), letting transformers instantiate one additional DecoderBlock at load time. No code changes are required.
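Concretely, the change amounts to one edited field in config.json. A minimal excerpt for illustration (the exact nesting of upsample_rates inside the real file may differ):

```json
{
  "model_type": "qwen3_tts_tokenizer_12hz",
  "upsample_rates": [8, 5, 4, 3, 2]
}
```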

Key properties:

Property              Value
────────────────────  ────────────────────────────────────────────
Input sample rate     24 kHz
Output sample rate    48 kHz
Added parameters      ~95K (<0.1% of total)
model_type            qwen3_tts_tokenizer_12hz (standard upstream)
Custom code required  No

Usage

With Qwen3TTSTokenizer (encode / decode)

from qwen_tts import Qwen3TTSTokenizer
import soundfile as sf

# Load the 48 kHz model
tokenizer = Qwen3TTSTokenizer.from_pretrained("takuma104/Qwen3-TTS-Tokenizer-12Hz-48kHz")

# Encode audio to tokens
audio, sr = sf.read("input.wav", dtype="float32")
encoded = tokenizer.encode(audios=audio, sr=sr)

# Decode tokens back to 48 kHz audio
reconstructed, out_sr = tokenizer.decode(encoded)
# out_sr == 48000
sf.write("output_48k.wav", reconstructed[0], out_sr)
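As a sanity check on the decode step, the expected output length follows directly from the 12.5 Hz token rate; the 3840 samples-per-token figure below is implied by the sample rates stated above, not an API guarantee:

```python
TOKEN_RATE_HZ = 12.5  # speech tokens per second in the 12Hz codec family
OUT_SR = 48_000       # output sample rate of this model

samples_per_token = OUT_SR / TOKEN_RATE_HZ  # 48000 / 12.5 = 3840.0

def expected_samples(num_tokens: int) -> int:
    """Approximate decoded waveform length for a token sequence."""
    return int(num_tokens * samples_per_token)

# e.g. a ~5-second utterance is about 63 tokens:
print(expected_samples(63))  # 241920 samples = 5.04 s at 48 kHz
```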

With Qwen3TTSModel (full TTS pipeline)

Replace the speech tokenizer inside Qwen3TTSModel to get 48 kHz output from the full TTS system:

from qwen_tts import Qwen3TTSModel, Qwen3TTSTokenizer
import torch
import IPython.display as ipd

# Load TTS model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="auto",
    dtype=torch.bfloat16,
)

# Swap in the 48 kHz tokenizer
model.model.speech_tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "takuma104/Qwen3-TTS-Tokenizer-12Hz-48kHz"
)

# Generate speech at 48 kHz
wavs, sr = model.generate_custom_voice(
    text="Hello, this is a 48 kHz TTS output.",
    language="English",
    speaker="Ono_Anna",
    instruct="A clear and natural female voice.",
)
# sr == 48000
ipd.display(ipd.Audio(wavs[0], rate=sr))

Direct decode from audio codes

import numpy as np
import torch
from qwen_tts import Qwen3TTSTokenizer

tokenizer = Qwen3TTSTokenizer.from_pretrained("takuma104/Qwen3-TTS-Tokenizer-12Hz-48kHz")

# audio_codes: numpy array of shape (seq_len, 16)
audio_codes = np.load("codes.npy")
codes_tensor = torch.tensor(audio_codes).unsqueeze(0)  # [1, seq_len, 16]

result = tokenizer.model.decode(codes_tensor)
# result.audio_values[0] is a 48 kHz waveform

Architecture

The decoder stack is extended by one upsampling stage compared to the base model:

Base model (24 kHz)             This model (48 kHz)
────────────────────            ──────────────────────────────
decoder[0] CausalConvNet        decoder[0] CausalConvNet        (initialized from base, fine-tuned)
decoder[1] DecoderBlock rate=8  decoder[1] DecoderBlock rate=8  (initialized from base, fine-tuned)
decoder[2] DecoderBlock rate=5  decoder[2] DecoderBlock rate=5  (initialized from base, fine-tuned)
decoder[3] DecoderBlock rate=4  decoder[3] DecoderBlock rate=4  (initialized from base, fine-tuned)
decoder[4] DecoderBlock rate=3  decoder[4] DecoderBlock rate=3  (initialized from base, fine-tuned)
decoder[5] SnakeBeta(96)        decoder[5] DecoderBlock rate=2  ★ new block (randomly initialized, trained)
decoder[6] CausalConvNet(96→1)  decoder[6] SnakeBeta(48)        ★ new (randomly initialized, trained)
                                decoder[7] CausalConvNet(48→1)  ★ new (randomly initialized, trained)

The three new modules at the end (~95K parameters) are randomly initialized. The existing decoder modules are initialized from the base model weights and all trained together end-to-end.
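The output sample rate implied by the upsample_rates list can be checked with a little arithmetic. Note that the product of the base rates (8·5·4·3 = 480) accounts for only part of the stated 1920× factor, so a residual ×4 from the rest of the decoder is assumed here, inferred purely from the figures quoted in the Overview:

```python
from math import prod

TOKEN_RATE_HZ = 12.5
RESIDUAL_FACTOR = 4  # assumed: 1920 / (8*5*4*3), upsampling not captured by upsample_rates

def output_sample_rate(upsample_rates):
    """Sample rate implied by a given upsample_rates list."""
    return TOKEN_RATE_HZ * RESIDUAL_FACTOR * prod(upsample_rates)

print(output_sample_rate([8, 5, 4, 3]))     # 24000.0 (base model)
print(output_sample_rate([8, 5, 4, 3, 2]))  # 48000.0 (this model)
```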

Training

  • Base model: Qwen/Qwen3-TTS-Tokenizer-12Hz
  • Trained layers: all decoder modules (decoder[0–7], --num_frozen 0)
  • Loss functions: adversarial loss + feature matching loss + multi-resolution Mel loss + global RMS loss
  • Training framework: GAN-based (generator + discriminator), with accelerate + mixed precision (bf16)
  • Training data: ~1.5K hours of 48 kHz speech audio with paired audio codes (WebDataset format)
  • Training hardware/time: 1× RTX 5090, 120 hours

Training Code

Training Logs

See run_gan11 & run_gan12.

https://wandb.ai/takuma104/qwen3-tts-decoder-block-48k-gan

Limitations

  • The encoder operates at 24 kHz: input audio is internally downsampled to 24 kHz for encoding.
  • The decoder extends only the output stage; the latent representation and codebooks are identical to the base model.
  • Quality above 12 kHz is synthesized by the new decoder layers and may not perfectly reconstruct fine high-frequency details in all recording conditions.
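Because the encoder sees only 24 kHz audio, everything above its 12 kHz Nyquist limit is hallucinated by the new decoder stages. One way to quantify how much of a signal lives in that synthesized band is a spectral energy ratio; a self-contained illustration (naive DFT, not part of the codec API):

```python
import math

def highband_energy_fraction(signal, sr, cutoff_hz=12_000.0):
    """Fraction of spectral energy above cutoff_hz (naive DFT; fine for short signals)."""
    n = len(signal)
    total = above = 0.0
    for k in range(n // 2 + 1):  # one-sided spectrum
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        power = re * re + im * im
        total += power
        if k * sr / n > cutoff_hz:
            above += power
    return above / total if total else 0.0

# A 15 kHz tone sampled at 48 kHz has nearly all of its energy above 12 kHz:
sr = 48_000
tone = [math.sin(2 * math.pi * 15_000 * t / sr) for t in range(480)]
print(highband_energy_fraction(tone, sr))  # close to 1.0
```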

Citation

If you use this model, please also cite the original Qwen3-TTS work.

License

Apache 2.0 β€” same as the base model.

Acknowledgements

The idea of using asymmetric sample rates for the encoder input and decoder output comes from NandemoGHS/Anime-XCodec2-44.1kHz-v2.
