OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Paper: arXiv:2604.00688
Fine-tuned version of OmniVoice on 1,000 hours of high-quality Vietnamese speech data, optimized for Vietnamese voice cloning and text-to-speech.
| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice (0.6B params) |
| Architecture | Diffusion Language Model (non-autoregressive, iterative masked decoding) |
| Training steps | 8,000 |
| Training time | 6 hours |
| Hardware | NVIDIA H200 SXM (141 GB VRAM) |
| Output sample rate | 24,000 Hz |
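Unlike an autoregressive TTS model that emits one token at a time, a diffusion language model decodes non-autoregressively: it starts from a fully masked token sequence and fills it in over a fixed number of refinement steps (the `num_step` parameter used in the examples below). A toy sketch of that loop, with a random stand-in predictor and an assumed cosine unmasking schedule, purely illustrative and not the actual OmniVoice implementation:

```python
import math
import random

MASK = None  # placeholder for a masked position

def toy_predictor(length):
    """Stand-in for the diffusion LM: random token proposals + confidences."""
    proposals = [random.randrange(100) for _ in range(length)]
    confidence = [random.random() for _ in range(length)]
    return proposals, confidence

def iterative_masked_decode(length, num_step=8):
    tokens = [MASK] * length
    for step in range(num_step):
        proposals, confidence = toy_predictor(length)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Cosine schedule: fraction of positions that stay masked after this step.
        keep = int(math.cos((step + 1) / num_step * math.pi / 2) * length)
        num_to_unmask = len(masked) - keep
        if num_to_unmask <= 0:
            continue
        # Commit the most confident masked positions; re-predict the rest next step.
        masked.sort(key=lambda i: confidence[i], reverse=True)
        for i in masked[:num_to_unmask]:
            tokens[i] = proposals[i]
    return tokens
```

Because every position is refined in parallel, the cost scales with `num_step` rather than with sequence length, which is why a small `num_step` (e.g. 8) gives the low latencies reported below.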
Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus
| Property | Value |
|---|---|
| Duration | ~1,000 hours |
| Samples | 664,125 |
| Speakers | 152 (multi-region, diverse accents) |
| Language | Vietnamese (100%) |
| Audio duration | 0.63 – 32.1 seconds per sample |
| Quality | Cleaned, noise-free, sentence-level boundary trimming |
| Domains | News, entertainment, education, conversational |
| License | CC-BY-NC-SA-4.0 (research use only) |
```bash
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
pip install omnivoice
```
```python
import torch
import torchaudio
from omnivoice import OmniVoice

# Load the Vietnamese fine-tuned model
model = OmniVoice.from_pretrained(
    "splendor1811/omnivoice-vietnamese",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Zero-shot voice cloning
audio = model.generate(
    text="Xin chào, đây là mô hình tổng hợp giọng nói tiếng Việt.",  # "Hello, this is a Vietnamese speech synthesis model."
    language="vietnamese",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)
torchaudio.save("output.wav", audio[0], 24000)
```
```python
# Create the voice prompt once (caches the encoded reference audio)
voice_prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)

# Reuse it across multiple generations, with no re-encoding cost
audio = model.generate(
    text="Em chào anh, em gọi từ tổng đài ngân hàng.",  # "Hello, I'm calling from the bank's call center."
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
)
```
```python
from omnivoice import OmniVoiceGenerationConfig

# Apply torch.compile for faster inference
torch.set_float32_matmul_precision("high")
model.llm = torch.compile(model.llm, mode="reduce-overhead", dynamic=True)

# Warmup (triggers compilation)
config = OmniVoiceGenerationConfig(num_step=8, guidance_scale=2.0)
for _ in range(3):
    model.generate(
        text="Xin chào.",  # "Hello."
        language="vietnamese",
        voice_clone_prompt=voice_prompt,
        generation_config=config,
    )

# Production inference at num_step=8 for speed
audio = model.generate(
    text="Dạ chào anh, anh có cần hỗ trợ gì không ạ?",  # "Hello sir, is there anything I can help you with?"
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
    generation_config=config,
)
```
Benchmarked on an NVIDIA L4 GPU with `num_step=8` and `torch.compile(mode="reduce-overhead")`:
| Metric | Value |
|---|---|
| RTF (Real-Time Factor) | ~0.07 (14x real-time) |
| TTFB (time to first byte, short sentence) | ~210 ms |
| P95 TTFB at CCU=4 (4 concurrent users) | ~1.26 s |
| Sample rate | 24,000 Hz |
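RTF here is wall-clock synthesis time divided by the duration of the generated audio, so ~0.07 means one second of speech takes ~70 ms to synthesize (≈14× faster than real time). A minimal, model-agnostic way to measure it; the `fake_generate` stand-in below is illustrative only and would be replaced by a call to `model.generate`:

```python
import time

def measure_rtf(generate_fn, text, sample_rate=24_000):
    """RTF = synthesis wall-clock time / duration of the generated audio.
    Values below 1.0 mean faster than real time."""
    start = time.perf_counter()
    audio = generate_fn(text)  # 1-D sequence of samples
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate
    return elapsed / duration

# Stand-in synthesizer that "generates" 2 s of silence at 24 kHz.
def fake_generate(text):
    return [0.0] * 48_000

rtf = measure_rtf(fake_generate, "Xin chào.")
```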
OmniVoice is a massively multilingual zero-shot TTS model supporting 600+ languages, built on a diffusion language model architecture with Qwen3-0.6B as the backbone.
Key features of the base model:

- Zero-shot voice cloning from a short reference clip
- Coverage of 600+ languages
- Paralinguistic control tags such as `[laughter]`, `[breath]`, etc.

Citation

```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}
```
```bibtex
@dataset{dolly_audio_2025,
  title={Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus},
  author={Nguyen, Vinh Huy and Nguyen, Dinh Thuan},
  year={2025},
  publisher={Dolly AI Team},
  howpublished={\url{https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese}},
  note={Released under CC-BY-NC-SA-4.0. Research use only.}
}
```