prometheus-eval

university

Recent Activity

amphora submitted a paper 1 day ago

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

seungone updated a dataset 23 days ago

prometheus-eval/peerreview-bench

seungone published a dataset about 1 month ago

prometheus-eval/peerreview-bench

submitted a paper to Daily Papers 1 day ago

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Paper • 2605.09063 • Published 4 days ago • 68

updated a dataset 23 days ago

prometheus-eval/peerreview-bench

Viewer • Updated 23 days ago • 27.4k • 490

published a dataset about 1 month ago

prometheus-eval/peerreview-bench

Viewer • Updated 23 days ago • 27.4k • 490

authored a paper about 2 months ago

Safe and Scalable Web Agent Learning via Recreated Websites

Paper • 2603.10505 • Published Mar 11 • 27

submitted a paper to Daily Papers about 2 months ago

Safe and Scalable Web Agent Learning via Recreated Websites

Paper • 2603.10505 • Published Mar 11 • 27

submitted a paper to Daily Papers 3 months ago

Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Paper • 2602.06291 • Published Feb 6 • 24

authored 2 papers 4 months ago

Efficient Long Context Language Model Retrieval with Compression

Paper • 2412.18232 • Published Dec 24, 2024 • 1

Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

Paper • 2601.07226 • Published Jan 12 • 33

submitted a paper to Daily Papers 4 months ago

Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

Paper • 2601.07226 • Published Jan 12 • 33

authored 5 papers 4 months ago

Measuring Sycophancy of Language Models in Multi-turn Dialogues

Paper • 2505.23840 • Published May 28, 2025 • 3

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Paper • 2507.00432 • Published Jul 1, 2025 • 79

OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

Paper • 2508.13141 • Published Aug 18, 2025

VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

Paper • 2509.21451 • Published Sep 25, 2025

SPICE: Self-Play In Corpus Environments Improves Reasoning

Paper • 2510.24684 • Published Oct 28, 2025 • 18

updated a dataset 5 months ago

prometheus-eval/nature_papers_1202

Viewer • Updated Dec 2, 2025 • 31.6k • 725 • 1

published a dataset 5 months ago

prometheus-eval/nature_papers_1202

Viewer • Updated Dec 2, 2025 • 31.6k • 725 • 1

updated a dataset 5 months ago

prometheus-eval/nature_crawled_papers_1202

Viewer • Updated Dec 2, 2025 • 739 • 1

published a dataset 5 months ago

prometheus-eval/nature_crawled_papers_1202

Viewer • Updated Dec 2, 2025 • 739 • 1

authored a paper 5 months ago

RefineBench: Evaluating Refinement Capability of Language Models via Checklists

Paper • 2511.22173 • Published Nov 27, 2025 • 15

updated a dataset 6 months ago

prometheus-eval/nature_papers_1125

Updated Nov 25, 2025 • 522