This is a model card for the GitHub repo tts_arabic.

Arabic TTS models (FastPitch, MixerTTS) from the tts-arabic-pytorch repo, exported to ONNX format and usable as a Python package for offline speech synthesis.

Audio samples can be found here.

Install with

pip install git+https://github.com/nipponjo/tts_arabic.git

Examples

# %%
from tts_arabic import tts

# %%
text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(text, speaker=2, pace=0.9, play=True)

# %% Buckwalter transliteration
text = ">als~alAmu Ealaykum yA Sadiyqiy."
wave = tts(text, speaker=0, play=True)

# %% Unvocalized input
text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
wave = tts(text_unvoc, play=True, vowelizer='shakkelha')
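The tts function also returns the synthesized waveform, so it can be post-processed or written out manually. A minimal sketch, assuming the return value is a NumPy float array in [-1, 1] at the default 22.05 kHz vocoder sample rate:

# %% Post-process the returned waveform (sketch; assumes a NumPy float
# array in [-1, 1] at the default 22.05 kHz sample rate)
import numpy as np
from scipy.io import wavfile

wave = tts(text_unvoc, vowelizer='shakkelha', play=False)
pcm16 = (np.clip(np.asarray(wave), -1.0, 1.0) * 32767).astype(np.int16)
wavfile.write('qahwa.wav', 22050, pcm16)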

Pretrained models

| Model | Model ID | Type | #params | Paper | Output |
|---|---|---|---|---|---|
| FastPitch | fastpitch | Text->Mel | 46.3M | arxiv | Mel (80 bins) |
| MixerTTS | mixer128 | Text->Mel | 2.9M | arxiv | Mel (80 bins) |
| MixerTTS | mixer80 | Text->Mel | 1.5M | arxiv | Mel (80 bins) |
| HiFi-GAN | hifigan | Vocoder | 13.9M | arxiv | Wave (22.05kHz) |
| Vocos | vocos | Vocoder | 13.4M | arxiv | Wave (22.05kHz) |
| Vocos | vocos44 | Vocoder | 14.0M | arxiv | Wave (44.1kHz) |

The sequence of transformations is as follows:

Text → Phonemizer → Phonemes → Tokenizer → Token Ids → Text->Mel model → Mel spectrogram → Vocoder model → Wave

The Text->Mel models map token ids to mel frames. All models use the 80-bin configuration proposed by HiFi-GAN; this mel spectrogram contains frequencies up to 8 kHz. The vocoder models map the mel spectrogram to a waveform. The vocoders with vocoder_id hifigan and vocos artificially extend the bandwidth to 11025 Hz, and vocos44 to 22050 Hz. Samples for comparing the models can be found here.
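Since all Text->Mel models emit the same 80-bin mel format, any of them can presumably be paired with any vocoder via the model_id and vocoder_id arguments. A sketch pairing the smallest Text->Mel model with the 44.1 kHz vocoder (assuming arbitrary model/vocoder combinations are accepted):

# %% Pair the smallest Text->Mel model with the 44.1 kHz vocoder
# (sketch; assumes any model/vocoder combination is accepted)
from tts_arabic import tts

text = ">als~alAmu Ealaykum yA Sadiyqiy."
wave = tts(text, model_id='mixer80', vocoder_id='vocos44', play=True)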

Manuscript

More information about how the models were trained can be found in the manuscript Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis (arXiv | ResearchGate).

TTS options

from tts_arabic import tts

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(
    text,                   # input text
    speaker=1,              # speaker id (0, 1, 2 or 3)
    pace=1,                 # speaking pace
    denoise=0.005,          # vocoder denoiser strength
    volume=0.9,             # maximum amplitude (between 0 and 1)
    play=True,              # play the audio?
    pitch_mul=1,            # pitch multiplier
    pitch_add=0,            # pitch offset
    vowelizer=None,         # vowelizer model id (see below)
    model_id='fastpitch',   # model id of the Text->Mel model
    vocoder_id='hifigan',   # model id of the vocoder model
    cuda=None,              # optional; CUDA device index
    save_to='./test.wav',   # optional; save the audio as a WAV file
    bits_per_sample=32,     # bit depth when save_to is set (8, 16 or 32)
)
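For example, to synthesize on the first CUDA device and write a 16-bit WAV, using only the options listed above:

# %% Synthesize on GPU 0 and save as 16-bit PCM
from tts_arabic import tts

wave = tts(
    ">als~alAmu Ealaykum yA Sadiyqiy.",
    cuda=0,                  # run inference on CUDA device 0
    save_to='./salam.wav',   # write the result to a WAV file
    bits_per_sample=16,      # 16-bit samples
)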

Vowelizer models

| Model | Model ID | Paper | Repo | Architecture |
|---|---|---|---|---|
| CATT | catt_eo | arxiv | github | Transformer Encoder |
| Shakkelha | shakkelha | arxiv | github | Bi-LSTM |
| Shakkala | shakkala | - | github | Bi-LSTM |
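Any of these model IDs can be passed as the vowelizer argument to diacritize unvocalized input before synthesis. A short sketch comparing the three models on the same text:

# %% Compare vowelizer models on unvocalized input
from tts_arabic import tts

text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
for vowelizer_id in ('catt_eo', 'shakkelha', 'shakkala'):
    tts(text_unvoc, vowelizer=vowelizer_id, play=True)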

References

The vocoder vocos44 was converted from patriotyk/vocos-mel-hifigan-compat-44100khz.

The vowelizer catt_eo was converted from best_eo_mlm_ns_epoch_193.pt at https://github.com/abjadai/catt/releases/tag/v2 (License: Apache-2.0).
