Benchmarks and challenges
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
Paper • 2502.01584 • Published • 9
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
Paper • 2502.05664 • Published • 24
Craw4LLM: Efficient Web Crawling for LLM Pretraining
Paper • 2502.13347 • Published • 30
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Paper • 2504.16427 • Published • 18
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Paper • 2504.16074 • Published • 36
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
Paper • 2504.06148 • Published • 13
DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
Paper • 2504.02882 • Published • 7
Pixels, Patterns, but No Poetry: To See The World like Humans
Paper • 2507.16863 • Published • 68
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
Paper • 2509.01396 • Published • 57
Symbolic Graphics Programming with Large Language Models
Paper • 2509.05208 • Published • 46