OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Paper: arXiv:2604.00688
Fine-tuned version of OmniVoice on 1,000 hours of high-quality Vietnamese speech data, optimized for Vietnamese voice cloning and text-to-speech.
| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice (0.6B params) |
| Architecture | Diffusion Language Model (non-autoregressive, iterative masked decoding) |
| Training steps | 8,000 |
| Training time | 6 hours |
| Hardware | NVIDIA H200 SXM (141 GB VRAM) |
| Output sample rate | 24,000 Hz |
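Unlike an autoregressive TTS model that emits one token at a time, a diffusion language model decodes non-autoregressively: it starts from a fully masked token sequence and fills it in over a fixed number of refinement steps (the `num_step` parameter used in the examples below). A toy sketch of that loop, with a random stand-in predictor and an assumed cosine unmasking schedule, purely illustrative and not the actual OmniVoice implementation:

```python
import math
import random

MASK = None  # placeholder for a masked position

def toy_predictor(length):
    """Stand-in for the diffusion LM: random token proposals + confidences."""
    proposals = [random.randrange(100) for _ in range(length)]
    confidence = [random.random() for _ in range(length)]
    return proposals, confidence

def iterative_masked_decode(length, num_step=8):
    tokens = [MASK] * length
    for step in range(num_step):
        proposals, confidence = toy_predictor(length)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Cosine schedule: fraction of positions that stay masked after this step.
        keep = int(math.cos((step + 1) / num_step * math.pi / 2) * length)
        num_to_unmask = len(masked) - keep
        if num_to_unmask <= 0:
            continue
        # Commit the most confident masked positions; re-predict the rest next step.
        masked.sort(key=lambda i: confidence[i], reverse=True)
        for i in masked[:num_to_unmask]:
            tokens[i] = proposals[i]
    return tokens
```

Because every position is refined in parallel, the cost scales with `num_step` rather than with sequence length, which is why a small `num_step` (e.g. 8) gives the low latencies reported below.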
Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus
| Property | Value |
|---|---|
| Duration | ~1,000 hours |
| Samples | 664,125 |
| Speakers | 152 (multi-region, diverse accents) |
| Language | Vietnamese (100%) |
| Audio duration | 0.63 – 32.1 seconds per sample |
| Quality | Cleaned, noise-free, sentence-level boundary trimming |
| Domains | News, entertainment, education, conversational |
| License | CC-BY-NC-SA-4.0 (research use only) |
```bash
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
pip install omnivoice
```
```python
import torch
import torchaudio
from omnivoice import OmniVoice

# Load the Vietnamese fine-tuned model
model = OmniVoice.from_pretrained(
    "splendor1811/omnivoice-vietnamese",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Zero-shot voice cloning
audio = model.generate(
    text="Xin chào, đây là mô hình tổng hợp giọng nói tiếng Việt.",  # "Hello, this is a Vietnamese speech synthesis model."
    language="vietnamese",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)
torchaudio.save("output.wav", audio[0], 24000)
```
```python
# Create the voice prompt once (caches the encoded reference audio)
voice_prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)

# Reuse it across multiple generations, with no re-encoding cost
audio = model.generate(
    text="Em chào anh, em gọi từ tổng đài ngân hàng.",  # "Hello, I'm calling from the bank's call center."
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
)
```
```python
from omnivoice import OmniVoiceGenerationConfig

# Apply torch.compile for faster inference
torch.set_float32_matmul_precision("high")
model.llm = torch.compile(model.llm, mode="reduce-overhead", dynamic=True)

# Warmup (triggers compilation)
config = OmniVoiceGenerationConfig(num_step=8, guidance_scale=2.0)
for _ in range(3):
    model.generate(
        text="Xin chào.",  # "Hello."
        language="vietnamese",
        voice_clone_prompt=voice_prompt,
        generation_config=config,
    )

# Production inference at num_step=8 for speed
audio = model.generate(
    text="Dạ chào anh, anh có cần hỗ trợ gì không ạ?",  # "Hello sir, is there anything I can help you with?"
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
    generation_config=config,
)
```
Benchmarked on an NVIDIA L4 GPU with `num_step=8` and `torch.compile(mode="reduce-overhead")`:
| Metric | Value |
|---|---|
| RTF (Real-Time Factor) | ~0.07 (14x real-time) |
| TTFB (time to first byte, short sentence) | ~210 ms |
| P95 TTFB at CCU=4 (4 concurrent users) | ~1.26 s |
| Sample rate | 24,000 Hz |
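RTF here is wall-clock synthesis time divided by the duration of the generated audio, so ~0.07 means one second of speech takes ~70 ms to synthesize (≈14× faster than real time). A minimal, model-agnostic way to measure it; the `fake_generate` stand-in below is illustrative only and would be replaced by a call to `model.generate`:

```python
import time

def measure_rtf(generate_fn, text, sample_rate=24_000):
    """RTF = synthesis wall-clock time / duration of the generated audio.
    Values below 1.0 mean faster than real time."""
    start = time.perf_counter()
    audio = generate_fn(text)  # 1-D sequence of samples
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate
    return elapsed / duration

# Stand-in synthesizer that "generates" 2 s of silence at 24 kHz.
def fake_generate(text):
    return [0.0] * 48_000

rtf = measure_rtf(fake_generate, "Xin chào.")
```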
OmniVoice is a massively multilingual zero-shot TTS model supporting 600+ languages, built on a diffusion language model architecture with Qwen3-0.6B as the backbone.
Key features of the base model:

- Zero-shot voice cloning from a short reference clip
- Coverage of 600+ languages
- Paralinguistic control tags such as `[laughter]`, `[breath]`, etc.

Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}
```
```bibtex
@dataset{dolly_audio_2025,
  title={Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus},
  author={Nguyen, Vinh Huy and Nguyen, Dinh Thuan},
  year={2025},
  publisher={Dolly AI Team},
  howpublished={\url{https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese}},
  note={Released under CC-BY-NC-SA-4.0. Research use only.}
}
```