Benchmark: GAIA

GAIA (General AI Assistants) is a benchmark designed to evaluate AI agents on real-world question-answering tasks that require multi-step reasoning, tool use, web browsing, and file manipulation. Questions are organized into three difficulty levels: Level 1 tasks typically require a single tool or a short chain of reasoning, Level 2 tasks demand combining multiple tools and reasoning over several steps, and Level 3 tasks involve long-horizon plans with many intermediate actions. Agents are evaluated on exact-match accuracy against annotated ground-truth answers. Because each question has a unique, verifiable answer, GAIA is well-suited for measuring not only correctness but also the reliability of the problem-solving process — including consistency across repeated runs, calibration of expressed confidence, and robustness to perturbations in task formatting.
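The scoring ideas in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GAIA's official scorer: the `normalize` rule (lowercasing and collapsing whitespace), the function names, and the sample data are all hypothetical. The Brier score shown is the standard mean squared error between stated confidence and correctness; how the leaderboard rescales it is not specified here.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed rule, not the official one)."""
    return " ".join(answer.strip().lower().split())


def exact_match(prediction: str, truth: str) -> bool:
    """True when the normalized prediction equals the normalized ground truth."""
    return normalize(prediction) == normalize(truth)


def accuracy(predictions: dict[str, str], truths: dict[str, str]) -> float:
    """Fraction of questions answered correctly; missing predictions count as wrong."""
    correct = sum(
        exact_match(predictions.get(qid, ""), truth) for qid, truth in truths.items()
    )
    return correct / len(truths)


def outcome_consistency(run_answers: list[str]) -> float:
    """Share of repeated runs on one question that agree with the most common answer."""
    counts = Counter(normalize(answer) for answer in run_answers)
    return counts.most_common(1)[0][1] / len(run_answers)


def brier_score(confidences: list[float], outcomes: list[bool]) -> float:
    """Mean squared gap between stated confidence and correctness (lower is better)."""
    squared = [(c - float(o)) ** 2 for c, o in zip(confidences, outcomes)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    truths = {"q1": "Paris", "q2": "42"}
    predictions = {"q1": " paris ", "q2": "41"}
    print(accuracy(predictions, truths))
    print(outcome_consistency(["42", "42", "41", "42"]))
    print(brier_score([0.9, 0.6], [True, False]))
```

Because every question has a single verifiable answer, the same `exact_match` check can serve both the accuracy figure and the per-question agreement across repeated runs.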

Paper: "GAIA: a benchmark for General AI Assistants". Dataset: HuggingFace.

Agent Leaderboard
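| # | Agent | Acc | Reliability | Consistency Agg | Consistency Outc | Consistency Traj-D | Consistency Traj-S | Consistency Res | Predictability Agg | Predictability Cal | Predictability AUROC | Predictability Brier | Robustness Agg | Robustness Fault | Robustness Struct | Robustness Prompt | Safety Agg | Safety Harm | Safety Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 68.5% | 0.85 | 0.81 | 0.84 | 0.89 | 0.75 | 0.77 | 0.84 | 0.97 | 0.80 | 0.84 | 0.91 | 0.99 | 0.98 | 0.77 | 1.00 | 1.00 | 1.00 |
| 2 | Claude Opus 4.7 | 73.3% | 0.84 | 0.81 | 0.84 | 0.87 | 0.74 | 0.77 | 0.73 | 0.76 | 0.63 | 0.73 | 0.99 | 1.00 | 1.00 | 0.96 | 1.00 | 0.50 | 0.99 |
| 3 | Gemini 3.5 Flash | 79.2% | 0.84 | 0.75 | 0.84 | 0.82 | 0.61 | 0.69 | 0.80 | 0.79 | 0.57 | 0.80 | 0.96 | 1.00 | 1.00 | 0.88 | 1.00 | 1.00 | 1.00 |
| 4 | Gemini 3.1 Pro | 76.2% | 0.82 | 0.76 | 0.83 | 0.81 | 0.61 | 0.73 | 0.78 | 0.79 | 0.72 | 0.78 | 0.94 | 0.99 | 0.96 | 0.86 | 1.00 | 0.50 | 1.00 |
| 5 | Claude 3.5 Haiku | 25.7% | 0.82 | 0.80 | 0.82 | 0.89 | 0.76 | 0.75 | 0.72 | 0.67 | 0.76 | 0.72 | 0.94 | 1.00 | 1.00 | 0.82 | 1.00 | 0.50 | 1.00 |
| 6 | Claude Sonnet 4 | 54.7% | 0.82 | 0.78 | 0.76 | 0.87 | 0.75 | 0.76 | 0.73 | 0.77 | 0.71 | 0.73 | 0.94 | 1.00 | 1.00 | 0.83 | 1.00 | 1.00 | 1.00 |
| 7 | O1 | 34.4% | 0.79 | 0.68 | 0.58 | 0.85 | 0.78 | 0.65 | 0.68 | 0.66 | 0.76 | 0.68 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 |
| 8 | GPT-5.5 | 62.8% | 0.79 | 0.61 | 0.60 | 0.71 | 0.53 | 0.61 | 0.80 | 0.88 | 0.76 | 0.80 | 0.95 | 1.00 | 0.96 | 0.91 | 1.00 | 0.50 | 0.99 |
| 9 | Gemini 2.5 Pro | 52.5% | 0.78 | 0.72 | 0.73 | 0.87 | 0.71 | 0.65 | 0.71 | 0.71 | 0.75 | 0.71 | 0.91 | 0.95 | 0.97 | 0.82 | 1.00 | 0.50 | 1.00 |
| 10 | GPT-4o Mini | 26.3% | 0.76 | 0.73 | 0.75 | 0.87 | 0.72 | 0.65 | 0.58 | 0.52 | 0.72 | 0.58 | 0.97 | 1.00 | 1.00 | 0.92 | 1.00 | 1.00 | 1.00 |
| 11 | GPT-4 Turbo | 30.8% | 0.76 | 0.76 | 0.73 | 0.90 | 0.79 | 0.71 | 0.64 | 0.60 | 0.75 | 0.64 | 0.87 | 1.00 | 0.91 | 0.71 | 1.00 | 1.00 | 1.00 |
| 12 | Gemini 2.5 Flash | 46.7% | 0.74 | 0.70 | 0.73 | 0.82 | 0.64 | 0.64 | 0.64 | 0.63 | 0.77 | 0.64 | 0.87 | 0.97 | 0.96 | 0.69 | 1.00 | 0.50 | 1.00 |
| 13 | GPT-5.2 | 33.2% | 0.72 | 0.59 | 0.58 | 0.77 | 0.54 | 0.54 | 0.78 | 0.81 | 0.80 | 0.78 | 0.78 | 0.62 | 1.00 | 0.73 | 1.00 | 0.50 | 1.00 |
| 14 | GPT-5.2 (medium) | 31.8% | 0.72 | 0.62 | 0.65 | 0.74 | 0.56 | 0.56 | 0.61 | 0.61 | 0.61 | 0.61 | 0.91 | 1.00 | 1.00 | 0.74 | 0.99 | 0.50 | 0.98 |
| 15 | Claude 3 Haiku | 12.9% | 0.69 | 0.71 | 0.84 | 0.78 | 0.55 | 0.61 | 0.49 | 0.38 | 0.73 | 0.49 | 0.86 | 1.00 | 0.94 | 0.66 | 0.99 | 0.50 | 0.99 |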
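The page does not document how the overall Reliability score is derived. The rows above are, however, numerically consistent with it being the unweighted mean of the Consistency, Predictability, and Robustness aggregates, with Safety left out. The sketch below checks that reading against three rows copied from the table. Treat the formula as an observation from the published numbers, not a confirmed definition; the function names and the 0.01 tolerance (which allows for two-decimal rounding) are assumptions.

```python
# Aggregate columns copied from three leaderboard rows; the averaging rule
# tested here is an inferred pattern, not a documented formula.
rows = {
    "Claude Opus 4.5": {
        "reliability": 0.85, "consistency": 0.81,
        "predictability": 0.84, "robustness": 0.91,
    },
    "O1": {
        "reliability": 0.79, "consistency": 0.68,
        "predictability": 0.68, "robustness": 1.00,
    },
    "Claude 3 Haiku": {
        "reliability": 0.69, "consistency": 0.71,
        "predictability": 0.49, "robustness": 0.86,
    },
}


def mean_of_aggregates(row: dict[str, float]) -> float:
    """Unweighted mean of the three dimension aggregates (Safety excluded)."""
    return (row["consistency"] + row["predictability"] + row["robustness"]) / 3


def matches_reported(row: dict[str, float], tol: float = 0.01) -> bool:
    """True when the recomputed mean is within `tol` of the reported score."""
    return abs(mean_of_aggregates(row) - row["reliability"]) <= tol


if __name__ == "__main__":
    for agent, row in rows.items():
        recomputed = mean_of_aggregates(row)
        print(f"{agent}: recomputed={recomputed:.3f} "
              f"reported={row['reliability']:.2f} match={matches_reported(row)}")
```

Under this reading, a high accuracy does not imply a high reliability score: Gemini 3.5 Flash has the highest accuracy in the table (79.2%) but ranks third on Reliability, while Claude 3.5 Haiku reaches 0.82 Reliability at 25.7% accuracy.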