Update pipeline tag to any-to-any and add sample usage

This PR improves the model card for Audio-Omni by:
- Updating the `pipeline_tag` to `any-to-any` to accurately reflect its capabilities in understanding, generation, and editing.
- Replacing the placeholder arXiv link (`https://arxiv.org/abs/XXXX.XXXXX`) with the paper page ([2604.10708](https://huggingface.co/papers/2604.10708)) and updating the BibTeX `journal` field to `arXiv:2604.10708`.
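- Adding a `Sample Usage` section that condenses the previous Python API examples into one snippet per task (understanding, generation, editing), with code verified against the official GitHub repository.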
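
For reviewers who want to confirm the metadata change locally, the following is a minimal sketch that reads the YAML front matter of a model card and reports `pipeline_tag` and `tags`. It uses only the standard library and handles just the flat scalars and lists that appear in this README; `read_front_matter` and `SAMPLE_CARD` are hypothetical names for illustration, not part of the Audio-Omni code or the Hub tooling.

```python
def read_front_matter(readme_text: str) -> dict:
    """Parse simple YAML front matter (flat scalars and flat lists only)."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter block at the top of the file

    meta = {}
    current_list_key = None  # key of the list currently being filled, if any
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of the front matter block
        if stripped.startswith("- ") and current_list_key is not None:
            meta[current_list_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            key, value = key.strip(), value.strip()
            if value:
                meta[key] = value
                current_list_key = None
            else:
                meta[key] = []
                current_list_key = key
    return meta


# Abbreviated copy of the updated front matter, used here as sample input.
SAMPLE_CARD = """---
license: cc-by-nc-4.0
pipeline_tag: any-to-any
tags:
- text-to-audio
- text-to-speech
- audio-editing
---
# Audio-Omni
"""

meta = read_front_matter(SAMPLE_CARD)
print(meta["pipeline_tag"])  # any-to-any
print(meta["tags"])
```

A full YAML parser would be the safer choice for cards with nested metadata; this sketch only covers the structure shown in the diff.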

README.md +22 -104

@@ -1,15 +1,15 @@
 ---
 license: cc-by-nc-4.0
-pipeline_tag: text-to-audio
 tags:
-  - text-to-audio
-  - text-to-speech
-  - audio-editing
-  - music
-  - speech
-  - diffusion
-  - multimodal
-  - audio-generation
 ---
 # 🎛️ Audio-Omni
@@ -18,7 +18,7 @@ tags:
 [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/ZeyueT/Audio-Omni)
 [![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://zeyuet.github.io/Audio-Omni/)
-[![arXiv](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/XXXX.XXXXX)
 ## 📖 Overview
@@ -30,12 +30,6 @@ Audio-Omni is the first end-to-end framework that unifies **understanding**, **g
 - **Generation**: Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, Text-to-Speech, Voice Conversion
 - **Editing**: Add, Remove, Extract, Style Transfer
-## 📦 Model Files
-- `Audio-Omni.json` — Model configuration
-- `model.ckpt` — Model checkpoint (~21 GB)
-- `synchformer_state_dict.pth` — Synchformer checkpoint for video conditioning
 ## 🚀 Quick Start
 ### Installation
@@ -53,7 +47,7 @@ conda install -c conda-forge ffmpeg libsndfile
 huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
 ```
-### Python API
 ```python
 from audio_omni import AudioOmni
@@ -61,87 +55,28 @@ import torchaudio
 # Load model
 model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")
-```
-### 1️⃣ Understanding
-```python
-# Audio understanding
 response = model.understand(
     "Describe the sounds in this audio.",
     audio="example.wav"
 )
-# Video understanding
-response = model.understand(
-    "What is happening in this video?",
-    video="example.mp4"
-)
-# Audio + Video understanding
-response = model.understand(
-    "Does the audio match the video?",
-    audio="example.wav",
-    video="example.mp4"
-)
-```
-### 2️⃣ Generation
-```python
-# Text-to-Audio
 audio = model.generate("T2A", prompt="A clock ticking.")
 torchaudio.save("output.wav", audio, model.sample_rate)
-# Text-to-Music
-audio = model.generate(
-    "T2M",
-    prompt="Compose a bright jazz swing instrumental with walking bass."
-)
-torchaudio.save("music.wav", audio, model.sample_rate)
-# Video-to-Audio
-audio = model.generate("V2A", video_path="example.mp4")
-torchaudio.save("v2a_output.wav", audio, model.sample_rate)
-# Text-to-Speech
-audio = model.generate("TTS", prompt="Hello, welcome to Audio-Omni.")
-torchaudio.save("tts_output.wav", audio, model.sample_rate)
-# Text-to-Speech with voice cloning
-audio = model.generate(
-    "TTS",
-    prompt="Hello, welcome to Audio-Omni.",
-    voice_prompt_path="ref_voice.wav",
-    voice_ref_text="This is the reference transcript."
-)
-torchaudio.save("tts_cloned.wav", audio, model.sample_rate)
-```
-### 3️⃣ Editing
-```python
-# Add a sound
 audio = model.edit("Add", "input.wav", desc="skateboarding")
 torchaudio.save("output_add.wav", audio, model.sample_rate)
-# Remove a sound
-audio = model.edit("Remove", "input.wav", desc="female singing")
-torchaudio.save("output_remove.wav", audio, model.sample_rate)
-# Extract a sound
-audio = model.edit("Extract", "input.wav", desc="wood thrush calling")
-torchaudio.save("output_extract.wav", audio, model.sample_rate)
-# Style transfer
-audio = model.edit(
-    "Style Transfer",
-    "input.wav",
-    source_category="playing electric guitar",
-    target_category="playing saxophone"
-)
-torchaudio.save("output_transfer.wav", audio, model.sample_rate)
-```
 ## 🖥️ Gradio Demo
@@ -153,34 +88,17 @@ python run_gradio.py \
     --server-port 7777
 ```
-Visit `http://localhost:7777` to access the web interface.
-## 📚 Documentation
-For detailed documentation, training instructions, and more examples, visit the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).
 ## 📝 Citation
 ```bibtex
 @article{tian2026audioomni,
   title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
   author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
-  journal={arXiv preprint arXiv:submit/7470507},
   year={2026}
 }
 ```
 ## 📄 License
-CC-BY-NC-4.0 (Non-commercial use only)
-Commercial use of the model weights requires explicit written authorization from the authors.
-For commercial licensing inquiries, contact: ztianad@connect.ust.hk
-## 📭 Contact
-- **Zeyue Tian**: ztianad@connect.ust.hk
----
-**For full installation guide, API reference, and advanced usage, see the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).**

 ---
 license: cc-by-nc-4.0
+pipeline_tag: any-to-any
 tags:
+- text-to-audio
+- text-to-speech
+- audio-editing
+- music
+- speech
+- diffusion
+- multimodal
+- audio-generation
 ---
 # 🎛️ Audio-Omni
 [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/ZeyueT/Audio-Omni)
 [![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://zeyuet.github.io/Audio-Omni/)
+[![arXiv](https://img.shields.io/badge/arXiv-Paper-red)](https://huggingface.co/papers/2604.10708)
 ## 📖 Overview
 - **Generation**: Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, Text-to-Speech, Voice Conversion
 - **Editing**: Add, Remove, Extract, Style Transfer
 ## 🚀 Quick Start
 ### Installation
 huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
 ```
+### Sample Usage
 ```python
 from audio_omni import AudioOmni
 import torchaudio
 # Load model
 model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")
+# 1. Understanding
 response = model.understand(
     "Describe the sounds in this audio.",
     audio="example.wav"
 )
+print(response)
+# 2. Generation (Text-to-Audio)
 audio = model.generate("T2A", prompt="A clock ticking.")
 torchaudio.save("output.wav", audio, model.sample_rate)
+# 3. Editing (Add a sound)
 audio = model.edit("Add", "input.wav", desc="skateboarding")
 torchaudio.save("output_add.wav", audio, model.sample_rate)
+```
+## 📦 Model Files
+- `Audio-Omni.json` — Model configuration
+- `model.ckpt` — Model checkpoint (~21 GB)
+- `synchformer_state_dict.pth` — Synchformer checkpoint for video conditioning
 ## 🖥️ Gradio Demo
     --server-port 7777
 ```
 ## 📝 Citation
 ```bibtex
 @article{tian2026audioomni,
   title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
   author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
+  journal={arXiv preprint arXiv:2604.10708},
   year={2026}
 }
 ```
 ## 📄 License
+CC-BY-NC-4.0 (Non-commercial use only). Commercial use of the model weights requires explicit written authorization from the authors. For commercial licensing inquiries, contact: ztianad@connect.ust.hk
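
Because the sample usage loads `model/Audio-Omni.json` and `model/model.ckpt` directly, it can help to check that the download finished before constructing the model. The sketch below is one way to do that, assuming the three files listed under "Model Files" sit at the top level of the `model/` directory used by the `huggingface-cli download` command; `missing_model_files` is a hypothetical helper, not part of the `audio_omni` package.

```python
from pathlib import Path

# File names as listed under "Model Files" in the updated README.
# Their location at the top level of the download directory is an assumption.
EXPECTED_FILES = ("Audio-Omni.json", "model.ckpt", "synchformer_state_dict.pth")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("model")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```

Running the check before `AudioOmni(...)` turns an incomplete download of the roughly 21 GB checkpoint into a clear message rather than a loading error.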