Benchmark: GAIA
GAIA (General AI Assistants) is a benchmark that evaluates AI agents on real-world question-answering tasks requiring multi-step reasoning, tool use, web browsing, and file manipulation. Questions fall into three difficulty levels: Level 1 tasks typically require a single tool or a short chain of reasoning, Level 2 tasks combine multiple tools across several reasoning steps, and Level 3 tasks involve long-horizon plans with many intermediate actions. Agents are scored by exact-match accuracy against annotated ground-truth answers. Because each question has a unique, verifiable answer, GAIA is well suited to measuring not only correctness but also the reliability of the problem-solving process, including consistency across repeated runs, calibration of expressed confidence, and robustness to perturbations in task formatting.
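
To make the scoring idea concrete, here is a minimal sketch of exact-match accuracy and of one simple way to quantify consistency across repeated runs. The `normalize` rule and the `outcome_agreement` definition are illustrative assumptions; they are not GAIA's official scorer, nor necessarily how the Consistency columns on this page are computed.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(answer.strip().lower().split())


def exact_match_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of predictions that equal the ground truth after normalization."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("at least one question is required")
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def outcome_agreement(runs: list[list[str]], ground_truth: list[str]) -> float:
    """Fraction of questions on which every run has the same pass/fail outcome.

    This is a hypothetical consistency measure chosen for illustration.
    """
    if not runs or not ground_truth:
        raise ValueError("at least one run and one question are required")
    stable = 0
    for i, gold in enumerate(ground_truth):
        # Collect the set of pass/fail outcomes for question i across all runs.
        outcomes = {normalize(run[i]) == normalize(gold) for run in runs}
        stable += len(outcomes) == 1
    return stable / len(ground_truth)


if __name__ == "__main__":
    gold = ["Paris", "42", "blue whale"]
    run_a = ["paris", "42", "Blue  Whale"]
    run_b = ["Paris", "41", "blue whale"]
    print(exact_match_accuracy(run_a, gold))
    print(exact_match_accuracy(run_b, gold))
    print(outcome_agreement([run_a, run_b], gold))
```

Separating per-run accuracy from cross-run agreement makes it possible for an agent to score well on one and poorly on the other, which is the distinction the leaderboard's Acc and Consistency columns draw.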
Links: GAIA: a benchmark for General AI Assistants (paper), HuggingFace Dataset
Reliability Trends
Agent Leaderboard
| # | Agent | Acc | Consistency: Agg | Consistency: Outc | Consistency: Traj-D | Consistency: Traj-S | Consistency: Res | Predictability: Agg | Predictability: Cal | Predictability: AUROC | Predictability: Brier | Robustness: Agg | Robustness: Fault | Robustness: Struct | Robustness: Prompt | Safety: Agg | Safety: Harm | Safety: Comp | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 71.5% | 0.67 | 0.70 | 0.72 | 0.54 | 0.68 | 0.82 | 0.92 | 0.72 | 0.82 | 0.96 | 1.00 | 1.00 | 0.89 | 1.00 | 1.00 | 1.00 | 0.82 |
| 2 | | 74.7% | 0.63 | 0.64 | 0.69 | 0.49 | 0.66 | 0.81 | 0.91 | 0.66 | 0.81 | 0.95 | 0.99 | 0.93 | 0.94 | 1.00 | 1.00 | 1.00 | 0.80 |
| 3 | | 50.1% | 0.62 | 0.60 | 0.74 | 0.57 | 0.62 | 0.74 | 0.77 | 0.75 | 0.74 | 0.97 | 1.00 | 1.00 | 0.90 | 1.00 | 0.33 | 0.99 | 0.78 |
| 4 | | 62.4% | 0.62 | 0.64 | 0.71 | 0.54 | 0.60 | 0.77 | 0.87 | 0.67 | 0.77 | 0.92 | 0.93 | 0.95 | 0.87 | 1.00 | 0.50 | 1.00 | 0.77 |
| 5 | | 37.8% | 0.58 | 0.52 | 0.69 | 0.57 | 0.60 | 0.72 | 0.72 | 0.83 | 0.72 | 0.98 | 1.00 | 1.00 | 0.93 | 1.00 | 0.50 | 1.00 | 0.76 |
| 6 | | 20.0% | 0.70 | 0.72 | 0.78 | 0.65 | 0.68 | 0.75 | 0.69 | 0.84 | 0.75 | 0.81 | 0.87 | 0.76 | 0.82 | 1.00 | 0.50 | 1.00 | 0.76 |
| 7 | | 42.6% | 0.58 | 0.55 | 0.70 | 0.50 | 0.59 | 0.70 | 0.74 | 0.65 | 0.70 | 0.95 | 0.97 | 1.00 | 0.88 | 0.98 | 0.50 | 0.95 | 0.74 |
| 8 | | 29.9% | 0.62 | 0.58 | 0.76 | 0.56 | 0.61 | 0.72 | 0.74 | 0.73 | 0.72 | 0.88 | 1.00 | 0.94 | 0.70 | 0.99 | 0.53 | 0.98 | 0.74 |
| 9 | | 28.1% | 0.70 | 0.64 | 0.83 | 0.73 | 0.68 | 0.72 | 0.70 | 0.72 | 0.72 | 0.79 | 0.96 | 0.78 | 0.63 | 1.00 | 1.00 | 1.00 | 0.74 |
| 10 | | 22.0% | 0.63 | 0.64 | 0.66 | 0.54 | 0.66 | 0.69 | 0.60 | 0.79 | 0.69 | 0.87 | 0.81 | 0.97 | 0.84 | 1.00 | 0.50 | 1.00 | 0.73 |
| 11 | | 34.7% | 0.72 | 0.70 | 0.80 | 0.72 | 0.69 | 0.74 | 0.70 | 0.82 | 0.74 | 0.72 | 0.77 | 0.78 | 0.60 | 1.00 | 0.75 | 1.00 | 0.72 |
| 12 | | 27.9% | 0.61 | 0.60 | 0.76 | 0.60 | 0.54 | 0.72 | 0.66 | 0.83 | 0.72 | 0.76 | 0.88 | 0.67 | 0.73 | 0.99 | 0.50 | 0.99 | 0.70 |
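
The Predictability columns (Cal, AUROC, Brier) relate an agent's expressed confidence to whether its answer was correct. The page does not state how these columns are computed or scaled, so the sketch below only shows the textbook definitions of the Brier score and AUROC in plain Python. The function names and the sample data are hypothetical. Note that the standard Brier score is lower-is-better, so a leaderboard showing it on a higher-is-better scale presumably applies some transformation that is not specified here.

```python
def brier_score(confidences: list[float], correct: list[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def auroc(confidences: list[float], correct: list[bool]) -> float:
    """Probability that a correct answer gets higher confidence than an incorrect one.

    Ties count as half a win. Uses the O(n^2) pairwise definition for clarity.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical confidences and outcomes for four questions.
    conf = [0.9, 0.5, 0.3, 0.6]
    outcome = [True, True, False, False]
    print(brier_score(conf, outcome))
    print(auroc(conf, outcome))
```

The two metrics answer different questions: AUROC depends only on how confidences are ranked, while the Brier score also penalizes confidences that are ranked correctly but numerically far from the outcome.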