DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems Paper • 2512.06749 • Published 25 days ago • 26
HaluMem: Evaluating Hallucinations in Memory Systems of Agents Paper • 2511.03506 • Published Nov 5, 2025 • 93
GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning Paper • 2505.22661 • Published May 28, 2025 • 1
GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning Paper • 2505.22661 • Published May 28, 2025 • 1
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations Paper • 2504.10481 • Published Apr 14, 2025 • 85
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations Paper • 2504.10481 • Published Apr 14, 2025 • 85