This is a model card for the GitHub repo tts_arabic.

Arabic TTS models (FastPitch, MixerTTS) from the tts-arabic-pytorch repo, exported to ONNX format and usable as a Python package for offline speech synthesis.

Audio samples can be found here.

Install with

pip install git+https://github.com/nipponjo/tts_arabic.git

Examples

# %%
from tts_arabic import tts

# %%
text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(text, speaker=2, pace=0.9, play=True)

# %% Buckwalter transliteration
text = ">als~alAmu Ealaykum yA Sadiyqiy."
wave = tts(text, speaker=0, play=True)

# %% Unvocalized input
text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
wave = tts(text_unvoc, play=True, vowelizer='shakkelha')
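The tts function also returns the synthesized waveform, so it can be post-processed or written out manually. A minimal sketch, assuming the return value is a NumPy float array in [-1, 1] at the default 22.05 kHz vocoder sample rate:

# %% Post-process the returned waveform (sketch; assumes a NumPy float
# array in [-1, 1] at the default 22.05 kHz sample rate)
import numpy as np
from scipy.io import wavfile

wave = tts(text_unvoc, vowelizer='shakkelha', play=False)
pcm16 = (np.clip(np.asarray(wave), -1.0, 1.0) * 32767).astype(np.int16)
wavfile.write('qahwa.wav', 22050, pcm16)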

Pretrained models

| Model | Model ID | Type | #params | Paper | Output |
|---|---|---|---|---|---|
| FastPitch | fastpitch | Text->Mel | 46.3M | arxiv | Mel (80 bins) |
| MixerTTS | mixer128 | Text->Mel | 2.9M | arxiv | Mel (80 bins) |
| MixerTTS | mixer80 | Text->Mel | 1.5M | arxiv | Mel (80 bins) |
| HiFi-GAN | hifigan | Vocoder | 13.9M | arxiv | Wave (22.05kHz) |
| Vocos | vocos | Vocoder | 13.4M | arxiv | Wave (22.05kHz) |
| Vocos | vocos44 | Vocoder | 14.0M | arxiv | Wave (44.1kHz) |

The sequence of transformations is as follows:

Text → Phonemizer → Phonemes → Tokenizer → Token Ids → Text->Mel model → Mel spectrogram → Vocoder model → Wave

The Text->Mel models map token ids to mel frames. All models use the 80-bin configuration proposed by HiFi-GAN; this mel spectrogram contains frequencies up to 8 kHz. The vocoder models map the mel spectrogram to a waveform. The vocoders with vocoder_id hifigan and vocos artificially extend the bandwidth to 11025 Hz, and vocos44 to 22050 Hz. Samples for comparing the models can be found here.
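Since all Text->Mel models emit the same 80-bin mel format, any of them can presumably be paired with any vocoder via the model_id and vocoder_id arguments. A sketch pairing the smallest Text->Mel model with the 44.1 kHz vocoder (assuming arbitrary model/vocoder combinations are accepted):

# %% Pair the smallest Text->Mel model with the 44.1 kHz vocoder
# (sketch; assumes any model/vocoder combination is accepted)
from tts_arabic import tts

text = ">als~alAmu Ealaykum yA Sadiyqiy."
wave = tts(text, model_id='mixer80', vocoder_id='vocos44', play=True)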

Manuscript

More information about how the models were trained can be found in the manuscript Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis (arXiv | ResearchGate).

TTS options

from tts_arabic import tts

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(
    text,                   # input text
    speaker=1,              # speaker id (0, 1, 2 or 3)
    pace=1,                 # speaking pace
    denoise=0.005,          # vocoder denoiser strength
    volume=0.9,             # maximum amplitude (between 0 and 1)
    play=True,              # play the audio?
    pitch_mul=1,            # pitch multiplier
    pitch_add=0,            # pitch offset
    vowelizer=None,         # vowelizer model id (see below)
    model_id='fastpitch',   # model id of the Text->Mel model
    vocoder_id='hifigan',   # model id of the vocoder model
    cuda=None,              # optional; CUDA device index
    save_to='./test.wav',   # optional; save the audio as a WAV file
    bits_per_sample=32,     # bit depth when save_to is set (8, 16 or 32)
)
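For example, to synthesize on the first CUDA device and write a 16-bit WAV, using only the options listed above:

# %% Synthesize on GPU 0 and save as 16-bit PCM
from tts_arabic import tts

wave = tts(
    ">als~alAmu Ealaykum yA Sadiyqiy.",
    cuda=0,                  # run inference on CUDA device 0
    save_to='./salam.wav',   # write the result to a WAV file
    bits_per_sample=16,      # 16-bit samples
)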

Vowelizer models

| Model | Model ID | Paper | Repo | Architecture |
|---|---|---|---|---|
| CATT | catt_eo | arxiv | github | Transformer Encoder |
| Shakkelha | shakkelha | arxiv | github | Bi-LSTM |
| Shakkala | shakkala | - | github | Bi-LSTM |
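Any of these model IDs can be passed as the vowelizer argument to diacritize unvocalized input before synthesis. A short sketch comparing the three models on the same text:

# %% Compare vowelizer models on unvocalized input
from tts_arabic import tts

text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
for vowelizer_id in ('catt_eo', 'shakkelha', 'shakkala'):
    tts(text_unvoc, vowelizer=vowelizer_id, play=True)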

References

The vocoder vocos44 was converted from patriotyk/vocos-mel-hifigan-compat-44100khz.

The vowelizer catt_eo was converted from best_eo_mlm_ns_epoch_193.pt at https://github.com/abjadai/catt/releases/tag/v2 (License: Apache-2.0).
