Benchmark: GAIA

GAIA (General AI Assistants) is a benchmark designed to evaluate AI agents on real-world question-answering tasks that require multi-step reasoning, tool use, web browsing, and file manipulation. Questions are organized into three difficulty levels: Level 1 tasks typically require a single tool or a short chain of reasoning, Level 2 tasks demand combining multiple tools and reasoning over several steps, and Level 3 tasks involve long-horizon plans with many intermediate actions. Agents are evaluated on exact-match accuracy against annotated ground-truth answers. Because each question has a unique, verifiable answer, GAIA is well-suited for measuring not only correctness but also the reliability of the problem-solving process — including consistency across repeated runs, calibration of expressed confidence, and robustness to perturbations in task formatting.
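Scoring is exact match, but short free-form answers still need normalizing before comparison. The sketch below shows a GAIA-style quasi-exact-match scorer, following the numeric, list, and string cases described in the GAIA paper; the helper names and normalization details are illustrative, not the official scorer.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)

def quasi_exact_match(prediction: str, truth: str) -> bool:
    """Score one answer: numbers numerically, comma-separated lists
    element-wise, everything else as normalized strings."""
    try:
        # Numeric ground truth: compare as floats ("1,000" matches "1000").
        return float(prediction.replace(",", "")) == float(truth.replace(",", ""))
    except ValueError:
        pass
    if "," in truth:
        # List answers: order-sensitive, element-wise comparison.
        pred_items = [normalize(p) for p in prediction.split(",")]
        true_items = [normalize(t) for t in truth.split(",")]
        return pred_items == true_items
    return normalize(prediction) == normalize(truth)

def accuracy(predictions: list[str], truths: list[str]) -> float:
    """Exact-match accuracy over a question set (the Acc column below)."""
    hits = sum(quasi_exact_match(p, t) for p, t in zip(predictions, truths))
    return hits / len(truths)
```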

Resources: "GAIA: a benchmark for General AI Assistants" (paper) · GAIA dataset on HuggingFace.
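GAIA is distributed as a gated dataset on the HuggingFace Hub. A minimal loading sketch with the `datasets` library is below, assuming the `gaia-benchmark/GAIA` repository id and `2023_all` config of the public release; the field names follow the dataset card but should be treated as assumptions.

```python
from datasets import load_dataset

# The dataset is gated: accept the terms on its HuggingFace page and
# authenticate first (e.g. `huggingface-cli login`).
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

# Ground-truth answers are public only for the validation split;
# test-split answers are withheld for the leaderboard.
example = gaia["validation"][0]
print(example["Question"])
print(example["Level"])         # difficulty level: 1, 2, or 3
print(example["Final answer"])  # ground truth used for exact match
```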

[Figure: Reliability Trends]
Agent Leaderboard
Columns group into four reliability families: Consistency (Agg, Outc, Traj-D, Traj-S, Res), Predictability (Agg, Cal, AUROC, Brier), Robustness (Agg, Fault, Struct, Prompt), and Safety (Agg, Harm, Comp). Acc is exact-match accuracy; agents are ranked by Overall.

| # | Agent | Acc | Cons Agg | Outc | Traj-D | Traj-S | Res | Pred Agg | Cal | AUROC | Brier | Rob Agg | Fault | Struct | Prompt | Safe Agg | Harm | Comp | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 71.5% | 0.67 | 0.70 | 0.72 | 0.54 | 0.68 | 0.82 | 0.92 | 0.72 | 0.82 | 0.96 | 1.00 | 1.00 | 0.89 | 1.00 | 1.00 | 1.00 | 0.82 |
| 2 | Claude Sonnet 4.5 | 74.7% | 0.63 | 0.64 | 0.69 | 0.49 | 0.66 | 0.81 | 0.91 | 0.66 | 0.81 | 0.95 | 0.99 | 0.93 | 0.94 | 1.00 | 1.00 | 1.00 | 0.80 |
| 3 | Gemini 2.5 Pro | 50.1% | 0.62 | 0.60 | 0.74 | 0.57 | 0.62 | 0.74 | 0.77 | 0.75 | 0.74 | 0.97 | 1.00 | 1.00 | 0.90 | 1.00 | 0.33 | 0.99 | 0.78 |
| 4 | Claude 3.7 Sonnet | 62.4% | 0.62 | 0.64 | 0.71 | 0.54 | 0.60 | 0.77 | 0.87 | 0.67 | 0.77 | 0.92 | 0.93 | 0.95 | 0.87 | 1.00 | 0.50 | 1.00 | 0.77 |
| 5 | Gemini 2.5 Flash | 37.8% | 0.58 | 0.52 | 0.69 | 0.57 | 0.60 | 0.72 | 0.72 | 0.83 | 0.72 | 0.98 | 1.00 | 1.00 | 0.93 | 1.00 | 0.50 | 1.00 | 0.76 |
| 6 | GPT-4 Turbo | 20.0% | 0.70 | 0.72 | 0.78 | 0.65 | 0.68 | 0.75 | 0.69 | 0.84 | 0.75 | 0.81 | 0.87 | 0.76 | 0.82 | 1.00 | 0.50 | 1.00 | 0.76 |
| 7 | GPT-5.2 (medium) | 42.6% | 0.58 | 0.55 | 0.70 | 0.50 | 0.59 | 0.70 | 0.74 | 0.65 | 0.70 | 0.95 | 0.97 | 1.00 | 0.88 | 0.98 | 0.50 | 0.95 | 0.74 |
| 8 | GPT-5.2 | 29.9% | 0.62 | 0.58 | 0.76 | 0.56 | 0.61 | 0.72 | 0.74 | 0.73 | 0.72 | 0.88 | 1.00 | 0.94 | 0.70 | 0.99 | 0.53 | 0.98 | 0.74 |
| 9 | Claude 3.5 Haiku | 28.1% | 0.70 | 0.64 | 0.83 | 0.73 | 0.68 | 0.72 | 0.70 | 0.72 | 0.72 | 0.79 | 0.96 | 0.78 | 0.63 | 1.00 | 1.00 | 1.00 | 0.74 |
| 10 | GPT-4o Mini | 22.0% | 0.63 | 0.64 | 0.66 | 0.54 | 0.66 | 0.69 | 0.60 | 0.79 | 0.69 | 0.87 | 0.81 | 0.97 | 0.84 | 1.00 | 0.50 | 1.00 | 0.73 |
| 11 | O1 | 34.7% | 0.72 | 0.70 | 0.80 | 0.72 | 0.69 | 0.74 | 0.70 | 0.82 | 0.74 | 0.72 | 0.77 | 0.78 | 0.60 | 1.00 | 0.75 | 1.00 | 0.72 |
| 12 | Gemini 2.0 Flash | 27.9% | 0.61 | 0.60 | 0.76 | 0.60 | 0.54 | 0.72 | 0.66 | 0.83 | 0.72 | 0.76 | 0.88 | 0.67 | 0.73 | 0.99 | 0.50 | 0.99 | 0.70 |
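The Predictability columns (Cal, AUROC, Brier) ask whether an agent's expressed confidence tracks whether it actually answered correctly. The exact formulas behind the leaderboard are not given on this page, so the sketch below uses standard formulations over per-question confidences, with Brier loss and expected calibration error flipped to a higher-is-better 0-1 scale to match the table; that rescaling, and the function itself, are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

def predictability_metrics(confidences, correct):
    """Confidence-quality metrics from per-question confidences (0-1)
    and 0/1 exact-match outcomes. AUROC requires both correct and
    incorrect answers to be present."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=int)

    # AUROC: does higher confidence rank correct answers above wrong ones?
    auroc = roc_auc_score(correct, confidences)

    # Brier loss is the mean squared error of confidence vs. outcome
    # (lower is better); report 1 - loss so higher is better.
    brier = 1.0 - brier_score_loss(correct, confidences)

    # Expected calibration error over 10 equal-width confidence bins,
    # likewise reported as 1 - ECE.
    bin_ids = np.clip((confidences * 10).astype(int), 0, 9)
    ece = 0.0
    for b in range(10):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    cal = 1.0 - ece

    return {"Cal": cal, "AUROC": auroc, "Brier": brier}
```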