Position: Science of AI Evaluation Requires Item-level Benchmark Data

Current AI evaluations rely on aggregate scores, which can mask critical validity issues in generative AI systems. The authors argue that item-level benchmark data, meaning the responses and scores recorded for each individual benchmark item rather than a single summary number, is essential for rigorous AI evaluation, enabling fine-grained diagnostics and principled validation. They introduce OpenEval, a collection of item-level data that supports evidence-centered evaluation practices across the AI community.
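
To make the aggregate-versus-item-level distinction concrete, here is a minimal Python sketch. It is not the paper's code, and the record fields (`item_id`, `category`, `correct`) are illustrative assumptions rather than the OpenEval schema; the point is only that the same per-item records that produce an aggregate score also support diagnostics the aggregate alone cannot.

```python
from collections import defaultdict

# Hypothetical item-level records: one entry per benchmark item.
# Field names are illustrative, not the actual OpenEval schema.
results = [
    {"item_id": "q1", "category": "arithmetic", "correct": True},
    {"item_id": "q2", "category": "arithmetic", "correct": True},
    {"item_id": "q3", "category": "commonsense", "correct": False},
    {"item_id": "q4", "category": "commonsense", "correct": False},
    {"item_id": "q5", "category": "arithmetic", "correct": True},
    {"item_id": "q6", "category": "commonsense", "correct": True},
]

# Aggregate score: one number, with no indication of where errors occur.
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"aggregate accuracy: {accuracy:.2f}")

# Item-level diagnostic: the same data broken out per category,
# revealing that errors cluster in one slice of the benchmark.
by_category = defaultdict(list)
for r in results:
    by_category[r["category"]].append(r["correct"])

for category, outcomes in sorted(by_category.items()):
    print(f"{category}: {sum(outcomes)}/{len(outcomes)} correct")
```

Here both views derive from the same records, but only the item-level breakdown shows the commonsense slice driving the errors, which is the kind of validity evidence an aggregate score hides.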