Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Paper: arXiv 2406.05629
This model has been pushed to the Hub using the PyTorchModelHubMixin integration: