
Model Card for MoshiRAG

MoshiRAG is a compact full-duplex speech language model augmented with asynchronous knowledge retrieval to improve factuality without sacrificing real-time interactivity. Built on top of Moshi, MoshiRAG predicts when a query needs external knowledge, retrieves references within the natural response delay, and grounds answers in stronger knowledge sources.

Model Details

Candle (Rust) version with bf16 precision.

Model Description

MoshiRAG uses a modular front-end/back-end design:

  • The front end is a full-duplex speech model based on Moshi that handles real-time conversation.
  • The back end is an asynchronous retrieval system that runs in parallel and fetches factual information when needed.

The front end keeps listening and speaking continuously. When the model predicts a retrieval trigger token, conversation context is sent to the retrieval back end while the dialogue continues. During this period, the model can produce lightweight pre-RAG content (for example, short acknowledgments or coarse responses) so the interaction stays natural.
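The trigger-and-continue flow described above can be sketched as a producer/consumer pair, with the dialogue loop firing off a retrieval request and continuing to stream while the back end works. This is a minimal illustration, not the actual implementation; the token name `RETRIEVAL_TRIGGER` and the worker structure are assumptions for this sketch.

```python
import queue
import threading

RETRIEVAL_TRIGGER = "<retrieve>"  # hypothetical name for the trigger token


def retrieval_worker(requests: queue.Queue, results: queue.Queue) -> None:
    # Back end: runs in parallel, turning conversation context into reference text.
    while True:
        context = requests.get()
        if context is None:  # shutdown sentinel
            break
        # Placeholder retrieval: a real back end would call an LLM or a search API.
        results.put(f"[reference for: {context}]")


def dialogue_loop(tokens, requests: queue.Queue, results: queue.Queue) -> list:
    # Front end: keeps emitting tokens; never blocks on the retriever.
    transcript = []
    for token in tokens:
        if token == RETRIEVAL_TRIGGER:
            # Fire-and-forget: send the context so far, keep talking (pre-RAG content).
            requests.put(" ".join(transcript))
            continue
        transcript.append(token)
        # Inject any reference that has arrived, without blocking the stream.
        try:
            transcript.append(results.get_nowait())
        except queue.Empty:
            pass
    return transcript
```

The key property is that `dialogue_loop` never waits on the back end: references are picked up opportunistically when they arrive, mirroring how later response segments become grounded while earlier ones stay conversational.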

The back end is text-in/text-out and can be implemented with different retrieval methods (LLM-based retrieval or search-based retrieval, etc). The retrieval back end takes conversation context (derived by combining the text predicted by Moshi inner monologue and the user transcription predicted by a streaming ASR component) as inputs, and then returns the reference text. Once the retrieval is completed, the reference text is encoded and injected back into Moshi as a stream, allowing later response segments to be grounded in external knowledge without interrupting the ongoing conversation.
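Because the back end is text-in/text-out, any retriever that maps a context string to a reference string can be plugged in. The sketch below captures that contract with a toy keyword-overlap retriever; the class and function names are illustrative assumptions, and real systems would align the inner-monologue text and ASR transcript by timestamps rather than simple concatenation.

```python
from typing import Protocol


class RetrievalBackend(Protocol):
    # Text-in/text-out contract: LLM-based, search-based, etc. all fit this shape.
    def retrieve(self, context: str) -> str: ...


def build_context(inner_monologue: list, user_transcript: list) -> str:
    # Combine the model-side text (inner monologue) with the streaming-ASR
    # transcription of the user; concatenation is a simplification.
    return " ".join(user_transcript) + "\n" + " ".join(inner_monologue)


class ToySearchBackend:
    # Toy search-based back end over an in-memory corpus (illustration only).
    def __init__(self, corpus: dict):
        self.corpus = corpus

    def retrieve(self, context: str) -> str:
        words = set(context.lower().split())
        best = max(
            self.corpus,
            key=lambda key: len(words & set(key.lower().split())),
            default="",
        )
        return self.corpus.get(best, "")
```

Swapping in a different `RetrievalBackend` (for example, one backed by a web search API or a frozen LLM) requires no change to the front end, since only the returned reference text is encoded and streamed back into Moshi.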

This repository contains the front-end model only. Please refer to the GitHub repository for more details about the available choices of retrieval back ends.

  • Developed by: Kyutai
  • Model type: Multimodal speech-text foundation model with additional text conditioning
  • Language(s) (NLP): English
  • License: CC-BY 4.0
  • Dependency: MoshiRAG uses frozen components from ARC-Encoder. An additional Streaming ASR model is required to transcribe the user speech so as to provide conversation context to the retrieval back end.

Model Sources

Disclaimers

Out-of-Scope Use

The model is not intended to impersonate other people, nor for malicious use of any kind. This model is for research only; we do not recommend using it to provide advice or to perform any professional duty.

Bias, Risks, and Limitations

MoshiRAG has been trained with a few safeguards intended to limit potentially toxic uses; however, our toxicity analysis shows that its textual generation falls in the middle of the range of existing models. It has some bias toward domains and topics that are over-represented in the training data. Its capabilities are still relatively limited, and it is trained to produce only one voice to avoid impersonation. More time and perspective will be needed to establish its sociotechnical limitations.

Content generated by MoshiRAG may be affected by the retrieval back end. Safety risks and biases are reduced when a properly safeguarded retrieval component is used.

Citation

@misc{chien2026moshirag,
      title={MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models}, 
      author={Chung-Ming Chien and Manu Orsini and Eugene Kharitonov and Neil Zeghidour and Karen Livescu and Alexandre D{\'e}fossez},
      year={2026},
      eprint={2604.12928},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.12928}, 
}

Model Card Authors

Chung-Ming Chien
