Update pipeline tag to any-to-any and add sample usage
This PR improves the model card for Audio-Omni by:
- Updating the `pipeline_tag` to `any-to-any` to accurately reflect its capabilities in understanding, generation, and editing.
- Updating the placeholder arXiv link to the correct paper ([2604.10708](https://huggingface.co/papers/2604.10708)).
- Adding a sample usage section with Python code snippets verified against the official GitHub repository.
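
To sanity-check the metadata change, a reviewer could parse the README front matter and confirm the new tag. The sketch below is a minimal, standard-library-only illustration. The helper name `read_front_matter` is hypothetical, and it handles only the flat `key: value` and `- item` forms that appear in this card, not full YAML.

```python
def read_front_matter(text):
    """Parse the simple YAML front matter at the top of a model card.

    Hypothetical helper: supports only `key: value` pairs and
    `- item` lists, which is all this card uses.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter ends the front matter
            break
        if stripped.startswith("- ") and isinstance(meta.get(current_key), list):
            # List item belonging to the most recent key with an empty value.
            meta[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            # An empty value means a list follows on the next lines.
            meta[current_key] = value.strip() if value.strip() else []
    return meta


# Abbreviated copy of the updated front matter, for illustration only.
card = """---
license: cc-by-nc-4.0
pipeline_tag: any-to-any
tags:
- text-to-audio
- text-to-speech
---

# Audio-Omni
"""

meta = read_front_matter(card)
print(meta["pipeline_tag"])
print(meta["tags"])
```

For anything beyond a quick check, a real YAML parser would be the safer choice, since this sketch ignores quoting, nesting, and comments.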
README.md
CHANGED
|
@@ -1,15 +1,15 @@
|
|
| 1 |
---
|
| 2 |
license: cc-by-nc-4.0
|
| 3 |
-
pipeline_tag:
|
| 4 |
tags:
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
---
|
| 14 |
|
| 15 |
# Audio-Omni
|
|
@@ -18,7 +18,7 @@ tags:
|
|
| 18 |
|
| 19 |
[GitHub](https://github.com/ZeyueT/Audio-Omni)
|
| 20 |
[Project page](https://zeyuet.github.io/Audio-Omni/)
|
| 21 |
-
[](https://
|
| 22 |
|
| 23 |
## Overview
|
| 24 |
|
|
@@ -30,12 +30,6 @@ Audio-Omni is the first end-to-end framework that unifies **understanding**, **g
|
|
| 30 |
- **Generation**: Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, Text-to-Speech, Voice Conversion
|
| 31 |
- **Editing**: Add, Remove, Extract, Style Transfer
|
| 32 |
|
| 33 |
-
## Model Files
|
| 34 |
-
|
| 35 |
-
- `Audio-Omni.json`: Model configuration
|
| 36 |
-
- `model.ckpt`: Model checkpoint (~21 GB)
|
| 37 |
-
- `synchformer_state_dict.pth`: Synchformer checkpoint for video conditioning
|
| 38 |
-
|
| 39 |
## Quick Start
|
| 40 |
|
| 41 |
### Installation
|
|
@@ -53,7 +47,7 @@ conda install -c conda-forge ffmpeg libsndfile
|
|
| 53 |
huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
|
| 54 |
```
|
| 55 |
|
| 56 |
-
###
|
| 57 |
|
| 58 |
```python
|
| 59 |
from audio_omni import AudioOmni
|
|
@@ -61,87 +55,28 @@ import torchaudio
|
|
| 61 |
|
| 62 |
# Load model
|
| 63 |
model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")
|
| 64 |
-
```
|
| 65 |
-
|
| 66 |
-
### 1. Understanding
|
| 67 |
|
| 68 |
-
|
| 69 |
-
# Audio understanding
|
| 70 |
response = model.understand(
|
| 71 |
"Describe the sounds in this audio.",
|
| 72 |
audio="example.wav"
|
| 73 |
)
|
| 74 |
|
| 75 |
-
#
|
| 76 |
-
response = model.understand(
|
| 77 |
-
"What is happening in this video?",
|
| 78 |
-
video="example.mp4"
|
| 79 |
-
)
|
| 80 |
-
|
| 81 |
-
# Audio + Video understanding
|
| 82 |
-
response = model.understand(
|
| 83 |
-
"Does the audio match the video?",
|
| 84 |
-
audio="example.wav",
|
| 85 |
-
video="example.mp4"
|
| 86 |
-
)
|
| 87 |
-
```
|
| 88 |
-
|
| 89 |
-
### 2. Generation
|
| 90 |
-
|
| 91 |
-
```python
|
| 92 |
-
# Text-to-Audio
|
| 93 |
audio = model.generate("T2A", prompt="A clock ticking.")
|
| 94 |
torchaudio.save("output.wav", audio, model.sample_rate)
|
| 95 |
|
| 96 |
-
#
|
| 97 |
-
audio = model.generate(
|
| 98 |
-
"T2M",
|
| 99 |
-
prompt="Compose a bright jazz swing instrumental with walking bass."
|
| 100 |
-
)
|
| 101 |
-
torchaudio.save("music.wav", audio, model.sample_rate)
|
| 102 |
-
|
| 103 |
-
# Video-to-Audio
|
| 104 |
-
audio = model.generate("V2A", video_path="example.mp4")
|
| 105 |
-
torchaudio.save("v2a_output.wav", audio, model.sample_rate)
|
| 106 |
-
|
| 107 |
-
# Text-to-Speech
|
| 108 |
-
audio = model.generate("TTS", prompt="Hello, welcome to Audio-Omni.")
|
| 109 |
-
torchaudio.save("tts_output.wav", audio, model.sample_rate)
|
| 110 |
-
|
| 111 |
-
# Text-to-Speech with voice cloning
|
| 112 |
-
audio = model.generate(
|
| 113 |
-
"TTS",
|
| 114 |
-
prompt="Hello, welcome to Audio-Omni.",
|
| 115 |
-
voice_prompt_path="ref_voice.wav",
|
| 116 |
-
voice_ref_text="This is the reference transcript."
|
| 117 |
-
)
|
| 118 |
-
torchaudio.save("tts_cloned.wav", audio, model.sample_rate)
|
| 119 |
-
```
|
| 120 |
-
|
| 121 |
-
### 3. Editing
|
| 122 |
-
|
| 123 |
-
```python
|
| 124 |
-
# Add a sound
|
| 125 |
audio = model.edit("Add", "input.wav", desc="skateboarding")
|
| 126 |
torchaudio.save("output_add.wav", audio, model.sample_rate)
|
| 127 |
|
| 128 |
-
#
|
| 129 |
-
audio = model.edit("Remove", "input.wav", desc="female singing")
|
| 130 |
-
torchaudio.save("output_remove.wav", audio, model.sample_rate)
|
| 131 |
-
|
| 132 |
-
# Extract a sound
|
| 133 |
-
audio = model.edit("Extract", "input.wav", desc="wood thrush calling")
|
| 134 |
-
torchaudio.save("output_extract.wav", audio, model.sample_rate)
|
| 135 |
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
"input.wav",
|
| 140 |
-
source_category="playing electric guitar",
|
| 141 |
-
target_category="playing saxophone"
|
| 142 |
-
)
|
| 143 |
-
torchaudio.save("output_transfer.wav", audio, model.sample_rate)
|
| 144 |
-
```
|
| 145 |
|
| 146 |
## Gradio Demo
|
| 147 |
|
|
@@ -153,34 +88,17 @@ python run_gradio.py \
|
|
| 153 |
--server-port 7777
|
| 154 |
```
|
| 155 |
|
| 156 |
-
Visit `http://localhost:7777` to access the web interface.
|
| 157 |
-
|
| 158 |
-
## Documentation
|
| 159 |
-
|
| 160 |
-
For detailed documentation, training instructions, and more examples, visit the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).
|
| 161 |
-
|
| 162 |
## Citation
|
| 163 |
|
| 164 |
```bibtex
|
| 165 |
@article{tian2026audioomni,
|
| 166 |
title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
|
| 167 |
author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
|
| 168 |
-
journal={arXiv preprint arXiv:
|
| 169 |
year={2026}
|
| 170 |
}
|
| 171 |
```
|
| 172 |
|
| 173 |
## License
|
| 174 |
|
| 175 |
-
CC-BY-NC-4.0 (Non-commercial use only)
|
| 176 |
-
|
| 177 |
-
Commercial use of the model weights requires explicit written authorization from the authors.
|
| 178 |
-
For commercial licensing inquiries, contact: ztianad@connect.ust.hk
|
| 179 |
-
|
| 180 |
-
## Contact
|
| 181 |
-
|
| 182 |
-
- **Zeyue Tian**: ztianad@connect.ust.hk
|
| 183 |
-
|
| 184 |
-
---
|
| 185 |
-
|
| 186 |
-
**For full installation guide, API reference, and advanced usage, see the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).**
|
|
|
|
| 1 |
---
|
| 2 |
license: cc-by-nc-4.0
|
| 3 |
+
pipeline_tag: any-to-any
|
| 4 |
tags:
|
| 5 |
+
- text-to-audio
|
| 6 |
+
- text-to-speech
|
| 7 |
+
- audio-editing
|
| 8 |
+
- music
|
| 9 |
+
- speech
|
| 10 |
+
- diffusion
|
| 11 |
+
- multimodal
|
| 12 |
+
- audio-generation
|
| 13 |
---
|
| 14 |
|
| 15 |
# Audio-Omni
|
|
|
|
| 18 |
|
| 19 |
[GitHub](https://github.com/ZeyueT/Audio-Omni)
|
| 20 |
[Project page](https://zeyuet.github.io/Audio-Omni/)
|
| 21 |
+
[Paper](https://huggingface.co/papers/2604.10708)
|
| 22 |
|
| 23 |
## Overview
|
| 24 |
|
|
|
|
| 30 |
- **Generation**: Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, Text-to-Speech, Voice Conversion
|
| 31 |
- **Editing**: Add, Remove, Extract, Style Transfer
|
| 32 |
|
| 33 |
## Quick Start
|
| 34 |
|
| 35 |
### Installation
|
|
|
|
| 47 |
huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
|
| 48 |
```
|
| 49 |
|
| 50 |
+
### Sample Usage
|
| 51 |
|
| 52 |
```python
|
| 53 |
from audio_omni import AudioOmni
|
|
|
|
| 55 |
|
| 56 |
# Load model
|
| 57 |
model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")
|
| 58 |
|
| 59 |
+
# 1. Understanding
|
| 60 |
response = model.understand(
|
| 61 |
"Describe the sounds in this audio.",
|
| 62 |
audio="example.wav"
|
| 63 |
)
|
| 64 |
+
print(response)
|
| 65 |
|
| 66 |
+
# 2. Generation (Text-to-Audio)
|
| 67 |
audio = model.generate("T2A", prompt="A clock ticking.")
|
| 68 |
torchaudio.save("output.wav", audio, model.sample_rate)
|
| 69 |
|
| 70 |
+
# 3. Editing (Add a sound)
|
| 71 |
audio = model.edit("Add", "input.wav", desc="skateboarding")
|
| 72 |
torchaudio.save("output_add.wav", audio, model.sample_rate)
|
| 73 |
+
```
|
| 74 |
|
| 75 |
+
## Model Files
|
| 76 |
|
| 77 |
+
- `Audio-Omni.json`: Model configuration
|
| 78 |
+
- `model.ckpt`: Model checkpoint (~21 GB)
|
| 79 |
+
- `synchformer_state_dict.pth`: Synchformer checkpoint for video conditioning
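
Since the checkpoint is large (~21 GB), it may be worth confirming that the download finished before loading the model. The following is a rough sketch under the assumption that all three files listed above sit directly in the `model/` directory used by the download command; the helper `missing_model_files` is hypothetical and not part of the Audio-Omni package.

```python
from pathlib import Path

# File names taken from the "Model Files" list in the README.
REQUIRED_FILES = (
    "Audio-Omni.json",
    "model.ckpt",
    "synchformer_state_dict.pth",
)


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir.

    Hypothetical helper: it only checks that each file exists,
    not that its contents are complete or valid.
    """
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("model")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All model files present.")
```

If only text or audio inputs are used, the Synchformer checkpoint may not be strictly needed; the README describes it as serving video conditioning, so treating it as required here is an assumption.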
|
| 80 |
|
| 81 |
## Gradio Demo
|
| 82 |
|
|
|
|
| 88 |
--server-port 7777
|
| 89 |
```
|
| 90 |
|
| 91 |
## Citation
|
| 92 |
|
| 93 |
```bibtex
|
| 94 |
@article{tian2026audioomni,
|
| 95 |
title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
|
| 96 |
author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
|
| 97 |
+
journal={arXiv preprint arXiv:2604.10708},
|
| 98 |
year={2026}
|
| 99 |
}
|
| 100 |
```
|
| 101 |
|
| 102 |
## License
|
| 103 |
|
| 104 |
+
CC-BY-NC-4.0 (Non-commercial use only). Commercial use of the model weights requires explicit written authorization from the authors. For commercial licensing inquiries, contact: ztianad@connect.ust.hk