GAIA Benchmark

GAIA is a benchmark for General AI Assistants that requires a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency. It contains 450 questions with unambiguous answers, requiring different levels of tooling and autonomy to solve. The questions are divided into three levels: Level 1 should be breakable by very good LLMs, while Level 3 indicates a strong jump in model capabilities. We evaluate on the public validation set of 165 questions.

Paper: GAIA: a benchmark for General AI Assistants (Mialon et al., 2023)
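Because every question has a single unambiguous answer, scoring reduces to a quasi-exact match between the agent's final answer and the ground truth. The sketch below illustrates that idea in Python (numbers compared numerically, comma-separated lists element-wise, strings after normalization); it is an approximation for illustration only, not GAIA's reference scorer, which ships with the benchmark.

```python
# Minimal sketch of quasi-exact-match scoring in the spirit of GAIA's
# evaluation. NOT the official scorer.
def normalize(s: str) -> str:
    """Lowercase, trim, and collapse whitespace."""
    return " ".join(s.strip().lower().split())

def to_float(s: str):
    """Parse a numeric answer, tolerating $ signs, commas, and %."""
    try:
        return float(s.strip().replace(",", "").replace("$", "").rstrip("%"))
    except ValueError:
        return None

def score(prediction: str, truth: str) -> bool:
    t = to_float(truth)
    if t is not None:  # numeric ground truth: compare as numbers
        return to_float(prediction) == t
    if "," in truth:   # comma-separated list: compare element-wise
        return ([normalize(x) for x in prediction.split(",")]
                == [normalize(x) for x in truth.split(",")])
    return normalize(prediction) == normalize(truth)
```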

Total Questions: 450
Questions in Public Validation Set: 165
Agents Evaluated: 8

Key Features of GAIA

Multi-Level Evaluation

Tasks are organized into three difficulty levels, testing increasingly complex cognitive abilities.

Diverse Task Types

Covers a wide range of tasks from basic reasoning to complex problem-solving and creative generation.

GAIA Leaderboard

| Rank | Agent (Model) | Accuracy | Level 1 | Level 2 | Level 3 | Cost (USD) | Runs |
|------|---------------|----------|---------|---------|---------|------------|------|
| 1 | claude-3-5-sonnet-20241022 | 57.58% | 67.92% | 59.30% | 30.77% | $260.19 | 1 |
| 2 | claude-3-7-sonnet-20250219 | 56.36% | 69.81% | 54.65% | 34.62% | $409.01 | 1 |
| 3 | o1-preview-2024-09-12 | 56.36% | 69.81% | 55.81% | 30.77% | $641.52 | 1 |
| 4 | o3-mini-2025-01-31 (med.) | 49.70% | 60.38% | 51.16% | 23.08% | $47.72 | 1 |
| 5 | o1-mini-2024-09-12 | 36.97% | 52.83% | 34.88% | 11.54% | $59.25 | 1 |
| 6 | gpt-4o-2024-11-20 | 34.55% | 47.17% | 31.40% | 19.23% | $209.12 | 1 |
| 7 | gpt-4o-mini-2024-07-18 | 13.94% (-0.61/+0.61) | 28.30% | 9.30% | 0.00% | $18.38 (-0.63/+0.63) | 2 |
| 8 | Meta-Llama-3.1-405B-Instruct-Turbo | 12.12% | 20.75% | 8.14% | 7.69% | $128.78 | 1 |

Per-run traces for each agent are available for download on the leaderboard page.

Accuracy vs. Cost Frontier for GAIA

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
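For reference, the dashed frontier can be computed with a single scan over agents sorted by cost. The sketch below uses the (cost, accuracy) pairs from the leaderboard above; the `pareto_frontier` helper is illustrative, not part of the leaderboard's code.

```python
# Sketch: compute the accuracy-vs-cost Pareto frontier from leaderboard rows.
agents = [
    ("claude-3-5-sonnet-20241022", 260.19, 57.58),
    ("claude-3-7-sonnet-20250219", 409.01, 56.36),
    ("o1-preview-2024-09-12", 641.52, 56.36),
    ("o3-mini-2025-01-31", 47.72, 49.70),
    ("o1-mini-2024-09-12", 59.25, 36.97),
    ("gpt-4o-2024-11-20", 209.12, 34.55),
    ("gpt-4o-mini-2024-07-18", 18.38, 13.94),
    ("Meta-Llama-3.1-405B-Instruct-Turbo", 128.78, 12.12),
]

def pareto_frontier(points):
    """Keep agents not dominated by any cheaper-and-better agent.

    An agent is on the frontier if no other agent achieves higher
    accuracy at equal or lower cost.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by cost ascending; break ties by higher accuracy first.
    for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:  # strictly improves on every cheaper agent
            frontier.append((name, cost, acc))
            best_acc = acc
    return frontier

for name, cost, acc in pareto_frontier(agents):
    print(f"{name}: {acc:.2f}% at ${cost:.2f}")
```

On the numbers above, this yields gpt-4o-mini-2024-07-18, o3-mini-2025-01-31, and claude-3-5-sonnet-20241022 as the frontier points.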

Heatmap for GAIA

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of reruns of the same agent in which a task was solved. The "any agent" performance indicates how saturated the benchmark is and gives a sense of overall progress.
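A sketch of how these two quantities can be computed, assuming a hypothetical `results` mapping of the form {agent: {task: [success flag per rerun]}} (the structure and names are illustrative, not the leaderboard's actual data format):

```python
from statistics import mean

# Toy results: one success flag per rerun of each (agent, task) pair.
results = {
    "agent_a": {"task_1": [True, True], "task_2": [False, True]},
    "agent_b": {"task_1": [False, False], "task_2": [True, True]},
}

# Heatmap cell: fraction of reruns in which an agent solved a task.
solve_fraction = {
    agent: {task: mean(runs) for task, runs in tasks.items()}
    for agent, tasks in results.items()
}

# "Any agent" saturation: a task counts as solved if at least one agent
# solved it in at least one rerun; average over all tasks.
task_ids = {t for tasks in results.values() for t in tasks}
any_agent = mean(
    any(any(results[a].get(t, [])) for a in results) for t in task_ids
)

print(solve_fraction)
print(f"any-agent solve rate: {any_agent:.0%}")
```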

Failure Analysis (Experimental)

Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps understand common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.

The interactive view shows two panels for the selected agent: Failure Categories and Distribution of Failures.

Token Pricing Configuration

Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.

Each active model has configurable input and output prices (USD per 1M tokens):

together/meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
anthropic/claude-3-5-sonnet-20241022
anthropic/claude-3-7-sonnet-20250219
openai/gpt-4o-2024-11-20
openai/gpt-4o-mini-2024-07-18
openai/o1-mini-2024-09-12
openai/o1-preview-2024-09-12
openai/o3-mini-2025-01-31
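To make the calculation concrete, here is a sketch of how per-1M-token prices combine with a run's token usage into the dollar costs shown in the leaderboard. The prices and token counts below are placeholders, not values taken from the leaderboard.

```python
# Sketch: turn per-1M-token prices and token usage into a run's cost.
prices = {  # USD per 1M tokens: (input, output); placeholder values
    "openai/gpt-4o-mini-2024-07-18": (0.15, 0.60),
}
usage = {  # tokens consumed by a run: (input, output); placeholder values
    "openai/gpt-4o-mini-2024-07-18": (40_000_000, 6_000_000),
}

def run_cost(model: str) -> float:
    in_tokens, out_tokens = usage[model]
    in_price, out_price = prices[model]
    # Price is quoted per 1M tokens, so divide by 1e6.
    return in_tokens * in_price / 1e6 + out_tokens * out_price / 1e6

print(f"${run_cost('openai/gpt-4o-mini-2024-07-18'):.2f}")
```

Editing a price in the configuration re-runs this arithmetic over each agent's recorded token usage, which is why the cost column and the cost axis of the frontier plot update together.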

Additional Resources

Getting Started

Want to evaluate your agent on GAIA? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of GAIA tasks and their requirements:

View Tasks