Streaming Speech Translation Pipeline

Real-time English → Russian speech translation: Audio In → ASR → NMT → TTS → Audio Out

Translates spoken English into spoken Russian with streaming output over WebSocket.

Input is currently limited to English because of the NeMo ASR model. The output language depends on what TranslateGemma (NMT) and XTTSv2 (TTS) support, so you can change both components to target another language.

Architecture

Audio Input → ASR (ONNX) → NMT (GGUF) → TTS (ONNX) → Audio Output
  (PCM16)   Conformer RNN-T  TranslateGemma  XTTSv2     (PCM16)
  • ASR: NVIDIA NeMo FastConformer RNN-T (cache-aware streaming, ONNX)
  • NMT: TranslateGemma 4B (GGUF Q8_0, llama-cpp-python) with streaming segmentation and translation merging
  • TTS: XTTSv2 with GPT-2 AR model + HiFi-GAN vocoder (ONNX), 24kHz output

See ARCHITECTURE.md for detailed design documentation.
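
The stage chain above can be pictured as workers joined by queues. The following is a minimal sketch under stated assumptions, not the repository's implementation: `fake_asr`, `fake_nmt`, `fake_tts`, `stage`, and `run_pipeline` are hypothetical names, the model calls are replaced by stubs, and only the two bounded queue sizes (256 and 32) are taken from the CLI defaults documented in this README.

```python
import asyncio

# Hypothetical stand-ins for the real models; each one just tags its input.
def fake_asr(pcm: bytes) -> str:
    return f"text({len(pcm)}B)"

def fake_nmt(text: str) -> str:
    return f"ru[{text}]"

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

_STOP = object()  # sentinel that tells each stage to shut down

async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, apply fn, and push the results to outbox."""
    while True:
        item = await inbox.get()
        if item is _STOP:
            await outbox.put(_STOP)  # propagate shutdown downstream
            return
        await outbox.put(fn(item))

async def run_pipeline(chunks: list[bytes]) -> list[bytes]:
    audio_in = asyncio.Queue(maxsize=256)   # mirrors --audio-queue-max
    transcripts = asyncio.Queue()           # unbounded in this sketch
    translations = asyncio.Queue()          # unbounded in this sketch
    audio_out = asyncio.Queue(maxsize=32)   # mirrors --audio-out-queue-max

    workers = [
        asyncio.create_task(stage(fake_asr, audio_in, transcripts)),
        asyncio.create_task(stage(fake_nmt, transcripts, translations)),
        asyncio.create_task(stage(fake_tts, translations, audio_out)),
    ]
    for chunk in chunks:
        await audio_in.put(chunk)
    await audio_in.put(_STOP)

    results = []
    while (item := await audio_out.get()) is not _STOP:
        results.append(item)
    await asyncio.gather(*workers)
    return results

if __name__ == "__main__":
    print(asyncio.run(run_pipeline([b"\x00" * 320, b"\x00" * 320])))
```

Because each stage runs as its own task, a slow stage only delays its own queue while earlier stages keep accepting input, which is the general idea behind a streaming design like this one.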

Requirements

  • Python 3.12
  • CUDA 12.8, cuDNN 9
  • llama-cpp-python built with GPU (CUDA) support
  • onnxruntime with GPU support
  • Model files:
    • ASR: NeMo FastConformer RNN-T ONNX model directory
    • NMT: TranslateGemma 4B GGUF file
    • TTS: XTTSv2 ONNX model directory, BPE vocab, mel normalization stats, reference audio

Installation

  • For CUDA and CUDNN:
# uninstall, remove, and purge previous CUDA and CUDNN installation if they are not the correct versions
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-8 cudnn9-cuda-12
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}' >> ~/.bashrc
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
  • For llama-cpp-python (GGUF):
echo 'export CMAKE_ARGS="-DGGML_CUDA=on"' >> ~/.bashrc
echo 'export CUDACXX=/usr/local/cuda/bin/nvcc' >> ~/.bashrc
echo 'export FORCE_CMAKE=1' >> ~/.bashrc
source ~/.bashrc
sudo apt-get install -y build-essential cmake ninja-build python3-dev
pip uninstall -y llama-cpp-python
pip install --no-cache-dir --force-reinstall llama-cpp-python
  • For this model package:
pip install -r requirements.txt

System Dependencies

# Ubuntu/Debian
sudo apt-get install -y libsndfile1 libportaudio2
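
After installation, a quick check can confirm that both GPU backends are visible from Python. This is a sketch rather than part of the repository: `check_gpu_backends` is a hypothetical helper, and the `llama_supports_gpu_offload` probe is looked up defensively because it may not be exposed by every llama-cpp-python version.

```python
import importlib

def check_gpu_backends() -> dict[str, str]:
    """Report whether each GPU backend imports and appears GPU-enabled."""
    report: dict[str, str] = {}

    try:
        ort = importlib.import_module("onnxruntime")
        providers = ort.get_available_providers()
        if "CUDAExecutionProvider" in providers:
            report["onnxruntime"] = "ok"
        else:
            report["onnxruntime"] = f"no CUDA provider (found: {providers})"
    except Exception as exc:  # missing package or failed native library load
        report["onnxruntime"] = f"unavailable ({type(exc).__name__})"

    try:
        llama_cpp = importlib.import_module("llama_cpp")
        # Low-level binding; guarded in case the installed version lacks it.
        probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
        if probe is None:
            report["llama_cpp"] = "installed (GPU support unknown)"
        else:
            report["llama_cpp"] = "ok" if probe() else "built without GPU offload"
    except Exception as exc:
        report["llama_cpp"] = f"unavailable ({type(exc).__name__})"

    return report

if __name__ == "__main__":
    for name, status in check_gpu_backends().items():
        print(f"{name}: {status}")
```

If either entry is not "ok", the CUDA paths or the llama-cpp-python build flags from the Installation section are the first things to recheck.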

Usage

Start the Server

  • Recommended instance: at least g6.xlarge (L4 GPU, 4 vCPUs, 16 GB RAM). GPU memory usage is around 8.8 GB with all three models (ASR, NMT, XTTSv2) loaded.
  • Total model latency is around 300 ms (ASR 18 ms, NMT 150 ms, XTTSv2 150 ms). A g5.xlarge with an A10G GPU processes about 1.5x faster.
  • The end-to-end delay is still bounded by the 560 ms ASR lookahead (NeMo cache-aware streaming FastConformer RNN-T).
  • The effective delay is therefore about 560 ms, which is near-instantaneous by the standards of simultaneous/streaming speech translation.
python app.py \
  --asr-onnx-path models/asr/nemo-cache-aware-streaming-560ms-onnx/ \
  --nmt-gguf-path models/nmt/translategemma-4b-it-q8_0-gguf/translategemma-4b-it-q8_0.gguf \
  --nmt-n-threads 2 \
  --nmt-use-gpu \
  --streaming-tts-config-path configs/tts/xtts/xtts_streaming_tts_config.json \
  --streaming-tts-pipeline-config-path configs/tts/xtts/xtts_streaming_tts_pipeline_config.json \
  --tts-ref-audio-path audio_ref/male_stewie.mp3 \
  --host 0.0.0.0 \
  --port 8765
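
The latency figures quoted in this section combine as simple arithmetic. The sketch below uses only the numbers given above; reading "1.5x faster" as dividing each model latency by 1.5 is an assumption, and `total_model_latency` is an illustrative helper name.

```python
# Per-model latencies on g6.xlarge (L4), in milliseconds, as quoted above.
L4_LATENCY_MS = {"asr": 18, "nmt": 150, "tts": 150}
ASR_LOOKAHEAD_MS = 560  # lookahead of the cache-aware streaming ASR model

def total_model_latency(latencies_ms: dict[str, float], speedup: float = 1.0) -> float:
    """Sum the per-stage latencies, scaled by a hardware speedup factor."""
    return sum(latencies_ms.values()) / speedup

l4_total = total_model_latency(L4_LATENCY_MS)            # 318 ms, i.e. roughly 300 ms
a10g_total = total_model_latency(L4_LATENCY_MS, 1.5)     # 212 ms under the 1.5x assumption

# Both totals sit below the lookahead, which is why the 560 ms lookahead
# is treated as the dominant term in the delay.
print(l4_total, a10g_total, l4_total < ASR_LOOKAHEAD_MS)
```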

CLI Options

  • TTS model-specific configuration files live in the configs/tts/xtts/ directory and are passed via --streaming-tts-config-path and --streaming-tts-pipeline-config-path.
  • They cover the TTS model location, queue sizes, chunk sizes, number of threads, and GPU usage.

| Flag | Default | Description |
|------|---------|-------------|
| --asr-onnx-path | (required) | ASR ONNX model directory |
| --asr-chunk-ms | 10 | ASR audio chunk duration (ms) |
| --asr-sample-rate | 16000 | ASR expected sample rate (Hz) |
| --nmt-gguf-path | (required) | NMT GGUF model file |
| --nmt-n-threads | 2 | NMT CPU threads |
| --nmt-use-gpu | False | Toggle to use GPU for NMT |
| --streaming-tts-config-path | (required) | Config file for StreamingTTS instantiation |
| --streaming-tts-pipeline-config-path | (required) | Config file for StreamingTTSPipeline instantiation |
| --tts-ref-audio-path | (required) | TTS reference speaker audio |
| --tts-language | ru | TTS target language code |
| --audio-queue-max | 256 | Audio input queue max size |
| --audio-out-queue-max | 32 | Audio output queue max size |
| --host | 0.0.0.0 | Server bind host |
| --port | 8765 | Server port |
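
The flags listed here map naturally onto an argparse parser. The following is a hypothetical reconstruction from the documented flags and defaults, not the actual contents of app.py; `build_parser` is an illustrative name, and treating --nmt-use-gpu as a store_true switch is an assumption based on how it is used in the start command.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Streaming speech translation server (sketch)")
    # ASR
    p.add_argument("--asr-onnx-path", required=True, help="ASR ONNX model directory")
    p.add_argument("--asr-chunk-ms", type=int, default=10, help="ASR audio chunk duration (ms)")
    p.add_argument("--asr-sample-rate", type=int, default=16000, help="ASR expected sample rate")
    # NMT
    p.add_argument("--nmt-gguf-path", required=True, help="NMT GGUF model file")
    p.add_argument("--nmt-n-threads", type=int, default=2, help="NMT CPU threads")
    p.add_argument("--nmt-use-gpu", action="store_true", help="Use GPU for NMT")
    # TTS
    p.add_argument("--streaming-tts-config-path", required=True)
    p.add_argument("--streaming-tts-pipeline-config-path", required=True)
    p.add_argument("--tts-ref-audio-path", required=True, help="TTS reference speaker audio")
    p.add_argument("--tts-language", default="ru", help="TTS target language code")
    # Queues and server
    p.add_argument("--audio-queue-max", type=int, default=256)
    p.add_argument("--audio-out-queue-max", type=int, default=32)
    p.add_argument("--host", default="0.0.0.0", help="Server bind host")
    p.add_argument("--port", type=int, default=8765, help="Server port")
    return p

# Parse only the required flags; everything else falls back to its default.
args = build_parser().parse_args([
    "--asr-onnx-path", "models/asr/nemo-cache-aware-streaming-560ms-onnx/",
    "--nmt-gguf-path", "models/nmt/translategemma-4b-it-q8_0-gguf/translategemma-4b-it-q8_0.gguf",
    "--streaming-tts-config-path", "configs/tts/xtts/xtts_streaming_tts_config.json",
    "--streaming-tts-pipeline-config-path", "configs/tts/xtts/xtts_streaming_tts_pipeline_config.json",
    "--tts-ref-audio-path", "audio_ref/male_stewie.mp3",
])
print(args.asr_chunk_ms, args.nmt_use_gpu, args.port)
```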

Python Client

Captures microphone audio and plays back translated speech:

pip install -r requirements_client.txt
python clients/python_client.py --uri ws://<server_ip_address/localhost>:8765

Web Client

TBD

WebSocket Protocol

| Direction | Type | Format | Description |
|-----------|------|--------|-------------|
| Client → Server | Binary | PCM16 | Raw audio at declared sample rate |
| Client → Server | Text | JSON | {"action": "start", "sample_rate": 16000} |
| Client → Server | Text | JSON | {"action": "stop"} |
| Server → Client | Binary | PCM16 | Synthesized audio at 24kHz |
| Server → Client | Text | JSON | {"type": "transcript", "text": "..."} |
| Server → Client | Text | JSON | {"type": "translation", "text": "..."} |
| Server → Client | Text | JSON | {"type": "status", "status": "started"} |

Docker

TBD

Project Structure

streaming_speech_translation/
├── app.py                              # Main entry point
├── requirements.txt
├── README.md
├── ARCHITECTURE.md
├── Dockerfile
├── models/
│   ├── asr/
│   │   └── nemo-cache-aware-streaming-560ms-onnx/
│   ├── nmt/
│   │   ├── translategemma-4b-it-q8_0-gguf/
│   │   └── translategemma-4b-it-q4_k_m-gguf/
│   └── tts/
│       └── xttsv2-onnx/
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py            # StreamingASR wrapper
│   │   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                  # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                    # Audio utilities
│   ├── nmt/
│   │   ├── streaming_nmt.py            # StreamingNMT wrapper
│   │   ├── streaming_segmenter.py      # Word-group segmentation
│   │   ├── streaming_translation_merger.py
│   │   └── translator_module.py        # TranslateGemma via llama-cpp
│   ├── tts/
│   │   ├── base/
│   │   │   ├── streaming_tts_base.py               # Base class for StreamingTTS wrapper
│   │   │   └── streaming_tts_pipeline_base.py      # Base class for StreamingTTSPipeline wrapper
│   │   ├── factory/
│   │   │   ├── streaming_tts_factory.py            # Factory class for StreamingTTS wrapper
│   │   │   └── streaming_tts_pipeline_factory.py   # Factory class for StreamingTTSPipeline wrapper
│   │   ├── pipeline/
│   │   │   └── xtts/
│   │   │       └── xtts_streaming_pipeline.py      # XTTSv2-specific StreamingTTSPipeline class
│   │   ├── streaming/
│   │   │   └── xtts/
│   │   │       └── xtts_streaming_tts.py           # XTTSv2-specific StreamingTTS class
│   │   └── utils/
│   │       ├── tts_segmenter.py                    # Handler for TTS text input segmenter
│   │       └── xtts/
│   │            ├── xtts_onnx_orchestrator.py      # ONNX sessions handler for XTTSv2 modules
│   │            ├── xtts_tokenizer.py              # XTTSv2 BPE tokenizer
│   │            ├── xtts_tts_warmup.py             # Warmup handler for XTTSv2 ONNX sessions
│   │            └── zh_num2words.py                # XTTSv2 Chinese text normalization
│   ├── pipeline/
│   │   ├── orchestrator.py             # PipelineOrchestrator
│   │   └── config.py                   # PipelineConfig
│   └── server/
│       └── websocket_server.py         # WebSocket server
└── clients/
    ├── python_client.py                # Python CLI client
    └── web_client.html                 # Browser client

See ARCHITECTURE.md for the full concurrency diagram and queue map.

LICENSE and COPYRIGHT

This repository is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:

  • ✅ Research and academic use
  • ✅ Personal experimentation
  • ✅ Open-source contributions
  • ❌ Commercial applications
  • ❌ Production deployment
  • ❌ Monetized services

By: Patrick Lumbantobing

Copyright © VertoX-AI

Citation

If you use this system in your research, please cite:

@misc{vertoxai2026streamingspeechtranslation,
  title={Streaming Speech Translation — VertoX-AI},
  author={Tobing, P. L. and {VertoX-AI}},
  year={2026},
  publisher={HuggingFace},
}

Acknowledgments
