OmniVoice Vietnamese — Fine-tuned for Vietnamese Speech Synthesis

A version of OmniVoice fine-tuned on ~1,000 hours of high-quality Vietnamese speech, optimized for Vietnamese voice cloning and text-to-speech.

Training Details

| Parameter | Value |
|---|---|
| Base model | k2-fsa/OmniVoice (0.6B params) |
| Architecture | Diffusion Language Model (non-autoregressive, iterative masked decoding) |
| Training steps | 8,000 |
| Training time | 6 hours |
| Hardware | NVIDIA H200 SXM (141 GB VRAM) |
| Output sample rate | 24,000 Hz |

Dataset

Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus

| Property | Value |
|---|---|
| Duration | ~1,000 hours |
| Samples | 664,125 |
| Speakers | 152 (multi-region, diverse accents) |
| Language | Vietnamese (100%) |
| Audio duration | 0.63 – 32.1 seconds per sample |
| Quality | Cleaned, noise-free, sentence-level boundary trimming |
| Domains | News, entertainment, education, conversational |
| License | CC-BY-NC-SA-4.0 (research use only) |
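
As a quick sanity check, the reported duration and sample count imply an average clip length of about 5.4 seconds, comfortably inside the stated 0.63–32.1 s per-sample range:

```python
# Corpus statistics from the table above.
total_hours = 1_000
num_samples = 664_125

# Average clip length in seconds.
avg_seconds = total_hours * 3600 / num_samples
print(f"{avg_seconds:.2f}")  # ≈ 5.42
```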

Installation

Step 1: Install PyTorch (NVIDIA GPU)

```bash
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```

Step 2: Install OmniVoice

```bash
pip install omnivoice
```

Usage

Python API

```python
import torch
import torchaudio
from omnivoice import OmniVoice

# Load the Vietnamese fine-tuned model
model = OmniVoice.from_pretrained(
    "splendor1811/omnivoice-vietnamese",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Zero-shot voice cloning
audio = model.generate(
    text="Xin chào, đây là mô hình tổng hợp giọng nói tiếng Việt.",
    language="vietnamese",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)

torchaudio.save("output.wav", audio[0], 24000)
```
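
Since the training clips top out at roughly 32 seconds, long inputs are typically split into sentences and synthesized chunk by chunk. The helper below is an illustrative sketch, not part of the omnivoice API; a production system would use a proper Vietnamese sentence segmenter:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., !, or ? followed by whitespace.

    Illustrative only; does not handle abbreviations or ellipses.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("Xin chào anh. Em gọi từ tổng đài. Anh cần hỗ trợ gì ạ?")
# → ['Xin chào anh.', 'Em gọi từ tổng đài.', 'Anh cần hỗ trợ gì ạ?']
```

Each chunk can then be passed to `model.generate(...)` with the same voice prompt and the resulting waveforms concatenated.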

With Cached Voice Prompt (recommended for serving)

```python
# Create voice prompt once (caches the encoded reference audio)
voice_prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
)

# Reuse for multiple generations — no re-encoding cost
audio = model.generate(
    text="Em chào anh, em gọi từ tổng đài ngân hàng.",
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
)
```

With torch.compile (recommended for production)

```python
from omnivoice import OmniVoiceGenerationConfig

# Apply torch.compile for faster inference
torch.set_float32_matmul_precision("high")
model.llm = torch.compile(model.llm, mode="reduce-overhead", dynamic=True)

# Warmup (triggers compilation)
config = OmniVoiceGenerationConfig(num_step=8, guidance_scale=2.0)
for _ in range(3):
    model.generate(
        text="Xin chào.",
        language="vietnamese",
        voice_clone_prompt=voice_prompt,
        generation_config=config,
    )

# Production inference at num_step=8 for speed
audio = model.generate(
    text="Dạ chào anh, anh có cần hỗ trợ gì không ạ?",
    language="vietnamese",
    voice_clone_prompt=voice_prompt,
    generation_config=config,
)
```

Performance

Benchmarked on NVIDIA L4 GPU with num_step=8 and torch.compile(mode="reduce-overhead"):

| Metric | Value |
|---|---|
| RTF (Real-Time Factor) | ~0.07 (≈14x real time) |
| TTFB (short sentence) | ~210 ms |
| P95 TTFB at 4 concurrent users (CCU=4) | ~1.26 s |
| Sample rate | 24,000 Hz |
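
RTF is the ratio of synthesis wall time to the duration of the generated audio; values below 1.0 mean faster than real time. A small illustrative helper (not part of omnivoice) for reproducing the figure above:

```python
def real_time_factor(wall_seconds: float, num_samples: int,
                     sample_rate: int = 24_000) -> float:
    """RTF = synthesis wall time / duration of generated audio."""
    return wall_seconds / (num_samples / sample_rate)

# e.g. 0.5 s of wall time to produce 7 s of audio at 24 kHz:
rtf = real_time_factor(0.5, 7 * 24_000)  # ≈ 0.071, i.e. ~14x real time
```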

Base Model: OmniVoice

OmniVoice is a massively multilingual zero-shot TTS model supporting 600+ languages, built on a diffusion language model architecture with Qwen3-0.6B as the backbone.

Key features of the base model:

  • 600+ languages — broadest coverage among zero-shot TTS models
  • Voice cloning — state-of-the-art from short reference audio
  • Voice design — control via speaker attributes (gender, age, pitch, accent)
  • Non-verbal symbols — [laughter], [breath], etc.
  • Fast inference — RTF as low as 0.025 (40x real-time)
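
Non-verbal symbols are written inline as bracketed tags in the input text. Only [laughter] and [breath] are confirmed by the list above; the helper below is an illustrative pre-flight check, and any larger tag set is an assumption:

```python
import re

# Tags confirmed by the feature list above; extend as needed.
KNOWN_TAGS = {"[laughter]", "[breath]"}

def check_tags(text: str) -> tuple[list[str], list[str]]:
    """Split bracketed tags in a prompt into (known, unknown) lists."""
    tags = re.findall(r"\[[a-z_]+\]", text)
    known = [t for t in tags if t in KNOWN_TAGS]
    unknown = [t for t in tags if t not in KNOWN_TAGS]
    return known, unknown

known, unknown = check_tags("Xin chào [laughter] mọi người [sigh]!")
# → known = ['[laughter]'], unknown = ['[sigh]']
```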

Citation

@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}

@dataset{dolly_audio_2025,
  title={Dolly-Audio: Vietnamese Multi-Speaker High-Quality Speech Corpus},
  author={Nguyen, Vinh Huy and Nguyen, Dinh Thuan},
  year={2025},
  publisher={Dolly AI Team},
  howpublished={\url{https://huggingface.co/datasets/dolly-vn/dolly-audio-1000h-vietnamese}},
  note={Released under CC-BY-NC-SA-4.0. Research use only.}
}