Irodori-TTS-500M-v2

Code · WandB · Demo Space

Irodori-TTS-500M-v2 is a Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. The architecture and training design largely follow Echo-TTS, using continuous latents as the generation target. It supports zero-shot voice cloning from reference audio.
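As a rough illustration of the rectified-flow objective this family of models trains on (a sketch, not the repository's actual training code): the network learns a velocity field that transports Gaussian noise to data latents along straight-line interpolation paths. Shapes and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a batch of "data" latents x1 and Gaussian noise x0.
x0 = rng.standard_normal((4, 100, 32))   # noise          (batch, frames, latent_dim)
x1 = rng.standard_normal((4, 100, 32))   # target latents (batch, frames, latent_dim)

# Sample a timestep t in [0, 1] per example and interpolate linearly.
t = rng.uniform(size=(4, 1, 1))
x_t = (1.0 - t) * x0 + t * x1

# Rectified flow regresses the constant straight-line velocity x1 - x0.
v_target = x1 - x0
v_pred = np.zeros_like(v_target)         # placeholder for the DiT's prediction

loss = np.mean((v_pred - v_target) ** 2)
```

At inference time, the learned velocity field is integrated from t = 0 (noise) to t = 1 (latents) with an ODE solver, and the resulting latents are decoded to a waveform by the VAE.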

A unique feature of this model is emoji-based style and sound-effect control: by inserting specific emojis into the input text, you can control speaking styles, emotions, and even sound effects in the generated audio.

🌟 Key Features

  • Flow Matching TTS: Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
  • Voice Cloning: Zero-shot voice cloning from a short reference audio clip.
  • Emoji-based Style Control: Control speaking styles, emotions, and sound effects by embedding emojis directly in the input text. See EMOJI_ANNOTATIONS.md for the full list of supported emojis and their effects.

✨ What's New in v2

This version brings several improvements over the original Irodori-TTS-500M:

  • Upgraded VAE: Switched the audio VAE to Aratako/Semantic-DACVAE-Japanese-32dim, enabling higher-quality Japanese speech generation.
  • Extended Training: The number of training steps was increased 2.5-fold, yielding better convergence, stability, and overall audio fidelity.
  • Data & Preprocessing Improvements: Implemented refined text preprocessing pipelines and stricter data filtering to enhance the model's robustness and output quality.

๐Ÿ—๏ธ Architecture

The model (approximately 500M parameters) consists of three main components:

  1. Text Encoder: Token embeddings initialized from llm-jp/llm-jp-3-150m, followed by self-attention + SwiGLU transformer layers with RoPE.
  2. Reference Latent Encoder: Encodes patched reference audio latents for speaker/style conditioning via self-attention + SwiGLU layers.
  3. Diffusion Transformer: Joint-attention DiT blocks with Low-Rank AdaLN (timestep-conditioned adaptive layer normalization), half-RoPE, and SwiGLU MLPs.
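A hedged sketch of what "Low-Rank AdaLN" typically denotes (the exact parameterization used here is an assumption): the per-timestep scale and shift applied after layer normalization are produced from the timestep embedding through a low-rank bottleneck, which cuts the parameter count of standard AdaLN. Dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_emb, rank = 512, 256, 32      # illustrative sizes, not the model's

# Low-rank factors: d_emb -> rank -> (scale, shift), each of size d_model.
W_down = rng.standard_normal((d_emb, rank)) * 0.02
W_up = rng.standard_normal((rank, 2 * d_model)) * 0.02

def low_rank_adaln(x, t_emb):
    """LayerNorm whose scale/shift are predicted from the timestep embedding."""
    mod = t_emb @ W_down @ W_up                      # (batch, 2*d_model)
    scale, shift = np.split(mod, 2, axis=-1)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    x_norm = (x - mu) / (sigma + 1e-6)
    return x_norm * (1.0 + scale[:, None, :]) + shift[:, None, :]

x = rng.standard_normal((2, 10, d_model))            # (batch, seq, d_model)
t_emb = rng.standard_normal((2, d_emb))              # timestep embedding
y = low_rank_adaln(x, t_emb)
```

With a zero timestep embedding the modulation vanishes and the block reduces to a plain layer norm, which is why such layers are often initialized near zero.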

Audio is represented as continuous latent sequences via the Aratako/Semantic-DACVAE-Japanese-32dim codec (32-dim), enabling high-quality 48kHz waveform reconstruction.
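To make the reference encoder's input concrete, here is a sketch (with an assumed patch size and frame count, neither taken from the repository) of how a 32-dim latent sequence might be patched along the time axis before conditioning:

```python
import numpy as np

def patchify(latents: np.ndarray, patch_size: int) -> np.ndarray:
    """Group consecutive latent frames into patches along the time axis.

    latents: (frames, 32) continuous DACVAE latents.
    Returns: (frames // patch_size, patch_size * 32).
    """
    frames, dim = latents.shape
    usable = (frames // patch_size) * patch_size   # drop the ragged tail
    return latents[:usable].reshape(-1, patch_size * dim)

# e.g. 103 latent frames extracted from a reference clip
ref = np.arange(103 * 32, dtype=np.float64).reshape(103, 32)
patches = patchify(ref, patch_size=4)
print(patches.shape)   # (25, 128)
```

Patching shortens the sequence the reference encoder attends over while keeping every latent dimension intact.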

🎧 Audio Samples

1. Standard TTS

Basic Japanese text-to-speech generation (without reference audio).

| Case | Text | Generated Audio |
| --- | --- | --- |
| Sample 1 | "お電話ありがとうございます。ただいま電話が大変混み合っております。恐れ入りますが、発信音のあとに、ご用件をお話しください。" | (audio) |
| Sample 2 | "その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。" | (audio) |

2. Emoji Annotation Control

Examples of controlling speaking style and effects with emojis. For the full list of supported emojis, see EMOJI_ANNOTATIONS.md.

| Case | Text (with Emoji) | Generated Audio |
| --- | --- | --- |
| Sample 1 | なーに、どうしたの？…え？もっと近づいてほしい？…👂😮‍💨👂😮‍💨こういうのが好きなんだ？ | (audio) |
| Sample 2 | うぅ…😭そんなに酷いこと、言わないで…😭 | (audio) |
| Sample 3 | 🤧🤧ごめんね、風邪引いちゃってて🤧…大丈夫、ただの風邪だからすぐ治るよ🥺 | (audio) |
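Because the annotations are ordinary Unicode characters inside the input string, they can be inspected or stripped with plain string handling. The emoji set below is a small hypothetical subset for illustration; the authoritative list is in EMOJI_ANNOTATIONS.md.

```python
# Hypothetical subset of supported style/effect emojis (see EMOJI_ANNOTATIONS.md).
STYLE_EMOJIS = {"😭", "🤧", "🥺", "👂"}

def split_annotations(text: str):
    """Separate style-emoji annotations from the text to be spoken."""
    emojis = [ch for ch in text if ch in STYLE_EMOJIS]
    spoken = "".join(ch for ch in text if ch not in STYLE_EMOJIS)
    return emojis, spoken

emojis, spoken = split_annotations("うぅ…😭そんなに酷いこと、言わないで…😭")
print(emojis)   # ['😭', '😭']
```

Note that multi-codepoint annotations such as 😮‍💨 (an emoji ZWJ sequence) would need sequence-aware matching rather than the per-character scan shown here.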

3. Voice Cloning (Zero-shot)

Examples of cloning a voice from a reference audio clip.

| Case | Reference Audio | Generated Audio |
| --- | --- | --- |
| Example 1 | (audio) | (audio) |
| Example 2 | (audio) | (audio) |

🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 GitHub: Aratako/Irodori-TTS

📊 Training Data & Annotation

The model was trained on a high-quality Japanese speech dataset, refined with improved data filtering in v2. To enable the emoji-based style control, the training texts were enriched with emoji annotations. These annotations were automatically generated and labeled using a fine-tuned model based on Qwen/Qwen3-Omni-30B-A3B-Instruct.

โš ๏ธ Limitations

  • Japanese Only: This model currently supports Japanese text input only.
  • Emoji Control: While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always perfectly consistent.
  • Audio Quality: Quality depends on training data characteristics. Performance may vary for voices or speaking styles underrepresented in the training data.
  • Kanji Reading Accuracy: The model's ability to accurately read Kanji is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand.
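One hedged workaround for the Kanji-reading limitation: pre-convert words the model misreads into kana with a small user dictionary before synthesis. The mapping below is purely illustrative; a full morphological analyzer could be swapped in for broader coverage.

```python
# Illustrative reading dictionary; extend with words the model misreads.
READINGS = {
    "生憎": "あいにく",
    "欠伸": "あくび",
}

def normalize_readings(text: str) -> str:
    """Replace hard-to-read Kanji words with their kana readings."""
    for word, kana in READINGS.items():
        text = text.replace(word, kana)
    return text

print(normalize_readings("生憎の雨ですね。"))   # あいにくの雨ですね。
```

Running this normalization as a preprocessing step keeps the model's input within the readings it handles reliably.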

📜 License & Ethical Restrictions

License

This model is released under the MIT License.

Ethical Restrictions

In addition to the license terms, the following ethical restrictions apply:

  1. No Impersonation: Do not use this model to clone or impersonate the voice of any individual (e.g., voice actors, celebrities, public figures) without their explicit consent.
  2. No Misinformation: Do not use this model to generate deepfakes or synthetic speech intended to mislead others or spread misinformation.
  3. Disclaimer: The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.

๐Ÿ™ Acknowledgments

This project builds upon the following works:

We would also like to extend our special thanks to Respair for the inspiration behind the emoji annotation feature.

๐Ÿ–Š๏ธ Citation

If you use Irodori-TTS-v2 in your research or project, please cite it as follows:

@misc{irodori-tts-v2,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2}}
}