nielsr HF Staff committed on
Commit da55cd7 · verified · 1 Parent(s): 0e1579c

Update pipeline tag to any-to-any and add sample usage

This PR improves the model card for Audio-Omni by:
- Updating the `pipeline_tag` to `any-to-any` to accurately reflect its capabilities in understanding, generation, and editing.
- Updating the placeholder arXiv link to the correct paper ([2604.10708](https://huggingface.co/papers/2604.10708)).
- Adding a sample usage section with Python code snippets verified from the official GitHub repository.
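
As a rough illustration of the first change, the sketch below rewrites the `pipeline_tag` field in a model card's YAML front matter using only the Python standard library. It is not taken from this PR: the helper name `set_pipeline_tag` and the shortened sample card are assumptions made for the example.

```python
import re


def set_pipeline_tag(card_text: str, new_tag: str) -> str:
    """Return card_text with pipeline_tag in its YAML front matter set to new_tag.

    Hypothetical helper for illustration; not part of the Audio-Omni repository.
    """
    # The front matter is the block between the first two '---' lines.
    match = re.match(r"\A---\n(.*?)\n---\n", card_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("model card has no YAML front matter")
    front = match.group(1)
    line = f"pipeline_tag: {new_tag}"
    pattern = re.compile(r"^pipeline_tag:.*$", flags=re.MULTILINE)
    if pattern.search(front):
        # Replace the existing value (a function avoids backslash escapes in new_tag).
        front = pattern.sub(lambda _: line, front, count=1)
    else:
        # Append the key if the card did not declare one.
        front = front + "\n" + line
    return "---\n" + front + "\n---\n" + card_text[match.end():]


# Shortened sample card, assumed for this example only.
card = """---
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
tags:
- text-to-audio
---

# Audio-Omni
"""

updated = set_pipeline_tag(card, "any-to-any")
print(updated)
```

Only the `pipeline_tag` line changes; the `tags` entries and the card body are passed through untouched. For edits against a live repository, the `huggingface_hub` library provides `metadata_update`, which is likely a better fit than string manipulation.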

Files changed (1)
  1. README.md +22 -104
README.md CHANGED
@@ -1,15 +1,15 @@
  ---
  license: cc-by-nc-4.0
- pipeline_tag: text-to-audio
+ pipeline_tag: any-to-any
  tags:
- - text-to-audio
- - text-to-speech
- - audio-editing
- - music
- - speech
- - diffusion
- - multimodal
- - audio-generation
+ - text-to-audio
+ - text-to-speech
+ - audio-editing
+ - music
+ - speech
+ - diffusion
+ - multimodal
+ - audio-generation
  ---
  
  # 🎛️ Audio-Omni
@@ -18,7 +18,7 @@ tags:
  
  [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/ZeyueT/Audio-Omni)
  [![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://zeyuet.github.io/Audio-Omni/)
- [![arXiv](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/XXXX.XXXXX)
+ [![arXiv](https://img.shields.io/badge/arXiv-Paper-red)](https://huggingface.co/papers/2604.10708)
  
  ## 📖 Overview
  
@@ -30,12 +30,6 @@ Audio-Omni is the first end-to-end framework that unifies **understanding**, **g
  - **Generation**: Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, Text-to-Speech, Voice Conversion
  - **Editing**: Add, Remove, Extract, Style Transfer
  
- ## 📦 Model Files
-
- - `Audio-Omni.json` - Model configuration
- - `model.ckpt` - Model checkpoint (~21 GB)
- - `synchformer_state_dict.pth` - Synchformer checkpoint for video conditioning
-
  ## 🚀 Quick Start
  
  ### Installation
@@ -53,7 +47,7 @@ conda install -c conda-forge ffmpeg libsndfile
  huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
  ```
  
- ### Python API
+ ### Sample Usage
  
  ```python
  from audio_omni import AudioOmni
@@ -61,87 +55,28 @@ import torchaudio
  
  # Load model
  model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")
- ```
-
- ### 1️⃣ Understanding
  
- ```python
- # Audio understanding
+ # 1. Understanding
  response = model.understand(
      "Describe the sounds in this audio.",
      audio="example.wav"
  )
+ print(response)
  
- # Video understanding
- response = model.understand(
-     "What is happening in this video?",
-     video="example.mp4"
- )
-
- # Audio + Video understanding
- response = model.understand(
-     "Does the audio match the video?",
-     audio="example.wav",
-     video="example.mp4"
- )
- ```
-
- ### 2️⃣ Generation
-
- ```python
- # Text-to-Audio
+ # 2. Generation (Text-to-Audio)
  audio = model.generate("T2A", prompt="A clock ticking.")
  torchaudio.save("output.wav", audio, model.sample_rate)
  
- # Text-to-Music
- audio = model.generate(
-     "T2M",
-     prompt="Compose a bright jazz swing instrumental with walking bass."
- )
- torchaudio.save("music.wav", audio, model.sample_rate)
-
- # Video-to-Audio
- audio = model.generate("V2A", video_path="example.mp4")
- torchaudio.save("v2a_output.wav", audio, model.sample_rate)
-
- # Text-to-Speech
- audio = model.generate("TTS", prompt="Hello, welcome to Audio-Omni.")
- torchaudio.save("tts_output.wav", audio, model.sample_rate)
-
- # Text-to-Speech with voice cloning
- audio = model.generate(
-     "TTS",
-     prompt="Hello, welcome to Audio-Omni.",
-     voice_prompt_path="ref_voice.wav",
-     voice_ref_text="This is the reference transcript."
- )
- torchaudio.save("tts_cloned.wav", audio, model.sample_rate)
- ```
-
- ### 3️⃣ Editing
-
- ```python
- # Add a sound
+ # 3. Editing (Add a sound)
  audio = model.edit("Add", "input.wav", desc="skateboarding")
  torchaudio.save("output_add.wav", audio, model.sample_rate)
+ ```
  
- # Remove a sound
- audio = model.edit("Remove", "input.wav", desc="female singing")
- torchaudio.save("output_remove.wav", audio, model.sample_rate)
-
- # Extract a sound
- audio = model.edit("Extract", "input.wav", desc="wood thrush calling")
- torchaudio.save("output_extract.wav", audio, model.sample_rate)
+ ## 📦 Model Files
  
- # Style transfer
- audio = model.edit(
-     "Style Transfer",
-     "input.wav",
-     source_category="playing electric guitar",
-     target_category="playing saxophone"
- )
- torchaudio.save("output_transfer.wav", audio, model.sample_rate)
- ```
+ - `Audio-Omni.json` - Model configuration
+ - `model.ckpt` - Model checkpoint (~21 GB)
+ - `synchformer_state_dict.pth` - Synchformer checkpoint for video conditioning
  
  ## 🖥️ Gradio Demo
  
@@ -153,34 +88,17 @@ python run_gradio.py \
  --server-port 7777
  ```
  
- Visit `http://localhost:7777` to access the web interface.
-
- ## 📚 Documentation
-
- For detailed documentation, training instructions, and more examples, visit the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).
-
  ## 📝 Citation
  
  ```bibtex
  @article{tian2026audioomni,
  title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
  author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
- journal={arXiv preprint arXiv:submit/7470507},
+ journal={arXiv preprint arXiv:2604.10708},
  year={2026}
  }
  ```
  
  ## 📄 License
  
- CC-BY-NC-4.0 (Non-commercial use only)
-
- Commercial use of the model weights requires explicit written authorization from the authors.
- For commercial licensing inquiries, contact: ztianad@connect.ust.hk
-
- ## 📭 Contact
-
- - **Zeyue Tian**: ztianad@connect.ust.hk
-
- ---
-
- **For full installation guide, API reference, and advanced usage, see the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).**
+ CC-BY-NC-4.0 (Non-commercial use only). Commercial use of the model weights requires explicit written authorization from the authors. For commercial licensing inquiries, contact: ztianad@connect.ust.hk