연구2026년 4월 7일

Position: Science of AI Evaluation Requires Item-level Benchmark Data

arXiv:2604.03244v1 Announce Type: new Abstract: AI evaluations have become the primary evidence for deploying generative AI systems across highstakes domains. However, current evaluation paradigms often exhibit systemic validity failures.

이 콘텐츠는 ArXiv AI 원본 기사의 요약입니다. 전문은 원본 사이트에서 확인해주세요.

원문 기사 보기 →