A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
Abstract
Large Audio Language Models exhibit significant trustworthiness challenges despite performance advances, requiring comprehensive frameworks addressing security vulnerabilities and defensive strategies.
The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state of the art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project is available on GitHub: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.
Community
This survey provides a timely and comprehensive overview of trustworthiness issues in Large Audio Language Models. It clearly identifies the unique risks introduced by continuous acoustic inputs, including cross-modal attacks, acoustic backdoors, privacy leakage, hallucination, and fairness concerns. The proposed roadmap toward defense-in-depth architectures and intrinsic representation engineering is valuable. A stronger empirical comparison of existing LALMs and their defense coverage would further improve the survey. Overall, this is a useful reference for researchers working on trustworthy audio-language intelligence.
Most conversations about Multimodal LLMs and universal auditory intelligence focus purely on model capabilities and performance scaling. In our new comprehensive survey, "A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook", we make a critical argument: for real-world deployment, empirical performance means nothing without intrinsic trustworthiness. The evidence is hard to ignore. Recent benchmarks reveal that the transition to unified end-to-end audio frameworks has dramatically expanded the attack surface.
We evaluate the state-of-the-art landscape across six analytical pillars: Hallucination, Robustness, Safety, Privacy, Fairness, and Authentication. The survey systematically uncovers a profound imbalance between a mature offensive ecosystem and fragmented, reactive defenses. To bridge this chasm, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering.
If you're building real-time full-duplex conversational agents, voice assistants, speech security systems, or anything that interacts with live acoustic data, we hope you'll find something vital here.
Paper: https://arxiv.org/abs/2605.20266
Project: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs
Hugging Face: https://huggingface.co/papers/2605.20266
Huge thanks to my incredible co-authors.