HAL Reliability Evaluation

Benchmark: GAIA

GAIA (General AI Assistants) is a benchmark designed to evaluate AI agents on real-world question-answering tasks that require multi-step reasoning, tool use, web browsing, and file manipulation. Questions are organized into three difficulty levels. Level 1 tasks typically require a single tool or a short chain of reasoning, Level 2 tasks demand combining multiple tools and reasoning over several steps, and Level 3 tasks involve long-horizon plans with many intermediate actions. Agents are scored by exact-match accuracy against annotated ground-truth answers. Because each question has a unique, verifiable answer, GAIA is well suited to measuring not only correctness but also the reliability of the problem-solving process: consistency across repeated runs, calibration of expressed confidence, and robustness to perturbations in task formatting.
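
As a rough illustration of the two ideas above, exact-match scoring and consistency across repeated runs, here is a minimal Python sketch. The normalization rule, the function names, and the toy questions are all assumptions made for this example. It is not GAIA's official scorer and not the consistency metric behind the leaderboard columns.

```python
def normalize(answer: str) -> str:
    # Assumed normalization: trim, lowercase, collapse internal whitespace.
    return " ".join(answer.strip().lower().split())


def exact_match(prediction: str, truth: str) -> bool:
    return normalize(prediction) == normalize(truth)


def accuracy(predictions: dict[str, str], truths: dict[str, str]) -> float:
    # Fraction of tasks answered correctly; a missing answer counts as wrong.
    hits = sum(exact_match(predictions.get(task, ""), truth)
               for task, truth in truths.items())
    return hits / len(truths)


def outcome_consistency(runs: list[dict[str, str]], truths: dict[str, str]) -> float:
    # Hypothetical measure: fraction of tasks whose pass/fail outcome
    # is the same in every run (always right or always wrong).
    stable = 0
    for task, truth in truths.items():
        outcomes = {exact_match(run.get(task, ""), truth) for run in runs}
        stable += len(outcomes) == 1
    return stable / len(truths)


# Toy data, invented for illustration only.
truths = {"q1": "Paris", "q2": "42", "q3": "blue whale"}
runs = [
    {"q1": "paris", "q2": "42", "q3": "orca"},
    {"q1": "Paris ", "q2": "41", "q3": "shark"},
]

if __name__ == "__main__":
    for i, run in enumerate(runs, start=1):
        print(f"run {i} accuracy: {accuracy(run, truths):.2f}")
    print(f"outcome consistency: {outcome_consistency(runs, truths):.2f}")
```

In this toy case the two runs differ in accuracy, yet two of the three tasks have a stable outcome. Accuracy and consistency are separate quantities, so a leaderboard can reasonably report both.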

Links: GAIA: a benchmark for General AI Assistants (paper); HuggingFace Dataset

Reliability Trends

Agent Leaderboard

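#	Agent	Acc	Consistency: Agg	Consistency: Outc	Consistency: Traj-D	Consistency: Traj-S	Consistency: Res	Predictability: Agg	Predictability: Cal	Predictability: AUROC	Predictability: Brier	Robustness: Agg	Robustness: Fault	Robustness: Struct	Robustness: Prompt	Safety: Agg	Safety: Harm	Safety: Comp	Overall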
1	Claude Opus 4.5	71.5%	0.67	0.70	0.72	0.54	0.68	0.82	0.92	0.72	0.82	0.96	1.00	1.00	0.89	1.00	1.00	1.00	0.82
2	Claude Sonnet 4.5	74.7%	0.63	0.64	0.69	0.49	0.66	0.81	0.91	0.66	0.81	0.95	0.99	0.93	0.94	1.00	1.00	1.00	0.80
3	Gemini 2.5 Pro	50.1%	0.62	0.60	0.74	0.57	0.62	0.74	0.77	0.75	0.74	0.97	1.00	1.00	0.90	1.00	0.33	0.99	0.78
4	Claude 3.7 Sonnet	62.4%	0.62	0.64	0.71	0.54	0.60	0.77	0.87	0.67	0.77	0.92	0.93	0.95	0.87	1.00	0.50	1.00	0.77
5	Gemini 2.5 Flash	37.8%	0.58	0.52	0.69	0.57	0.60	0.72	0.72	0.83	0.72	0.98	1.00	1.00	0.93	1.00	0.50	1.00	0.76
6	GPT-4 Turbo	20.0%	0.70	0.72	0.78	0.65	0.68	0.75	0.69	0.84	0.75	0.81	0.87	0.76	0.82	1.00	0.50	1.00	0.76
7	GPT-5.2 (medium)	42.6%	0.58	0.55	0.70	0.50	0.59	0.70	0.74	0.65	0.70	0.95	0.97	1.00	0.88	0.98	0.50	0.95	0.74
8	GPT-5.2	29.9%	0.62	0.58	0.76	0.56	0.61	0.72	0.74	0.73	0.72	0.88	1.00	0.94	0.70	0.99	0.53	0.98	0.74
9	Claude 3.5 Haiku	28.1%	0.70	0.64	0.83	0.73	0.68	0.72	0.70	0.72	0.72	0.79	0.96	0.78	0.63	1.00	1.00	1.00	0.74
10	GPT-4o Mini	22.0%	0.63	0.64	0.66	0.54	0.66	0.69	0.60	0.79	0.69	0.87	0.81	0.97	0.84	1.00	0.50	1.00	0.73
11	O1	34.7%	0.72	0.70	0.80	0.72	0.69	0.74	0.70	0.82	0.74	0.72	0.77	0.78	0.60	1.00	0.75	1.00	0.72
12	Gemini 2.0 Flash	27.9%	0.61	0.60	0.76	0.60	0.54	0.72	0.66	0.83	0.72	0.76	0.88	0.67	0.73	0.99	0.50	0.99	0.70
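
The page does not state how the Overall column is derived. The sketch below tests one candidate rule against the published numbers: the unweighted mean of the Consistency, Predictability, and Robustness aggregate (Agg) columns, leaving Safety out. The rule is an assumption checked against the table, not a documented formula, and the function names are made up for this example. The values are copied from the rows above.

```python
from statistics import mean

# (agent, consistency_agg, predictability_agg, robustness_agg, overall),
# copied from the leaderboard rows.
ROWS = [
    ("Claude Opus 4.5", 0.67, 0.82, 0.96, 0.82),
    ("Claude Sonnet 4.5", 0.63, 0.81, 0.95, 0.80),
    ("Gemini 2.5 Pro", 0.62, 0.74, 0.97, 0.78),
    ("Claude 3.7 Sonnet", 0.62, 0.77, 0.92, 0.77),
    ("Gemini 2.5 Flash", 0.58, 0.72, 0.98, 0.76),
    ("GPT-4 Turbo", 0.70, 0.75, 0.81, 0.76),
    ("GPT-5.2 (medium)", 0.58, 0.70, 0.95, 0.74),
    ("GPT-5.2", 0.62, 0.72, 0.88, 0.74),
    ("Claude 3.5 Haiku", 0.70, 0.72, 0.79, 0.74),
    ("GPT-4o Mini", 0.63, 0.69, 0.87, 0.73),
    ("O1", 0.72, 0.74, 0.72, 0.72),
    ("Gemini 2.0 Flash", 0.61, 0.72, 0.76, 0.70),
]


def candidate_overall(cons: float, pred: float, rob: float) -> float:
    # Assumed rule: plain average of the three aggregates (Safety excluded).
    return mean([cons, pred, rob])


def max_gap(rows) -> float:
    # Largest absolute difference between the candidate and the published Overall.
    return max(abs(candidate_overall(c, p, r) - overall)
               for _, c, p, r, overall in rows)


if __name__ == "__main__":
    for agent, c, p, r, overall in ROWS:
        print(f"{agent:18s} candidate={candidate_overall(c, p, r):.3f} "
              f"published={overall:.2f}")
    print(f"largest gap: {max_gap(ROWS):.4f}")
```

On these twelve rows the candidate stays within 0.01 of the published Overall. That fits a simple average of rounded inputs, but it does not rule out other weightings.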
