HAL Reliability Evaluation

Benchmark: GAIA

GAIA (General AI Assistants) is a benchmark designed to evaluate AI agents on real-world question-answering tasks that require multi-step reasoning, tool use, web browsing, and file manipulation. Questions are organized into three difficulty levels. Level 1 tasks typically require a single tool or a short chain of reasoning. Level 2 tasks demand combining multiple tools and reasoning over several steps. Level 3 tasks involve long-horizon plans with many intermediate actions. Agents are evaluated on exact-match accuracy against annotated ground-truth answers. Because each question has a unique, verifiable answer, GAIA is well suited to measuring not only correctness but also the reliability of the problem-solving process, including consistency across repeated runs, calibration of expressed confidence, and robustness to perturbations in task formatting.
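As a rough illustration of exact-match scoring and of consistency across repeated runs, here is a minimal Python sketch. The normalization rule (lowercasing and collapsing whitespace) and the `outcome_agreement` helper are assumptions made for this example. They are not GAIA's official scorer or the consistency metric reported on this page.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Assumed normalization: trim, lowercase, collapse internal whitespace.
    # The official GAIA scorer applies its own rules; this is a simplification.
    return " ".join(answer.strip().lower().split())


def exact_match(prediction: str, truth: str) -> bool:
    # True when the normalized prediction equals the normalized ground truth.
    return normalize(prediction) == normalize(truth)


def accuracy(predictions, truths):
    # Fraction of questions answered with an exact match.
    pairs = list(zip(predictions, truths))
    return sum(exact_match(p, t) for p, t in pairs) / len(pairs)


def outcome_agreement(run_answers):
    # Hypothetical consistency proxy: share of repeated runs on one question
    # that return the most common normalized answer.
    counts = Counter(normalize(a) for a in run_answers)
    return counts.most_common(1)[0][1] / len(run_answers)


if __name__ == "__main__":
    print(accuracy(["Paris", " 42 "], ["paris", "41"]))
    print(outcome_agreement(["Paris", "paris", "Lyon"]))
```

Under these assumptions, an agent can have the same accuracy as another while agreeing with itself less often across runs, which is the gap the separate reliability columns are meant to expose.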

Paper: GAIA: a benchmark for General AI Assistants | Dataset: HuggingFace


Reliability Trends

Agent Leaderboard

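#	Agent	Acc	Reliability	Consistency/Agg	Consistency/Outc	Consistency/Traj-D	Consistency/Traj-S	Consistency/Res	Predictability/Agg	Predictability/Cal	Predictability/AUROC	Predictability/Brier	Robustness/Agg	Robustness/Fault	Robustness/Struct	Robustness/Prompt	Safety/Agg	Safety/Harm	Safety/Comp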
1	Claude Opus 4.5	68.5%	0.85	0.81	0.84	0.89	0.75	0.77	0.84	0.97	0.80	0.84	0.91	0.99	0.98	0.77	1.00	1.00	1.00
2	Claude Opus 4.7	73.3%	0.84	0.81	0.84	0.87	0.74	0.77	0.73	0.76	0.63	0.73	0.99	1.00	1.00	0.96	1.00	0.50	0.99
3	Gemini 3.5 Flash	79.2%	0.84	0.75	0.84	0.82	0.61	0.69	0.80	0.79	0.57	0.80	0.96	1.00	1.00	0.88	1.00	1.00	1.00
4	Gemini 3.1 Pro	76.2%	0.82	0.76	0.83	0.81	0.61	0.73	0.78	0.79	0.72	0.78	0.94	0.99	0.96	0.86	1.00	0.50	1.00
5	Claude 3.5 Haiku	25.7%	0.82	0.80	0.82	0.89	0.76	0.75	0.72	0.67	0.76	0.72	0.94	1.00	1.00	0.82	1.00	0.50	1.00
6	Claude Sonnet 4	54.7%	0.82	0.78	0.76	0.87	0.75	0.76	0.73	0.77	0.71	0.73	0.94	1.00	1.00	0.83	1.00	1.00	1.00
7	O1	34.4%	0.79	0.68	0.58	0.85	0.78	0.65	0.68	0.66	0.76	0.68	1.00	1.00	1.00	1.00	1.00	0.50	1.00
8	GPT-5.5	62.8%	0.79	0.61	0.60	0.71	0.53	0.61	0.80	0.88	0.76	0.80	0.95	1.00	0.96	0.91	1.00	0.50	0.99
9	Gemini 2.5 Pro	52.5%	0.78	0.72	0.73	0.87	0.71	0.65	0.71	0.71	0.75	0.71	0.91	0.95	0.97	0.82	1.00	0.50	1.00
10	GPT-4o Mini	26.3%	0.76	0.73	0.75	0.87	0.72	0.65	0.58	0.52	0.72	0.58	0.97	1.00	1.00	0.92	1.00	1.00	1.00
11	GPT-4 Turbo	30.8%	0.76	0.76	0.73	0.90	0.79	0.71	0.64	0.60	0.75	0.64	0.87	1.00	0.91	0.71	1.00	1.00	1.00
12	Gemini 2.5 Flash	46.7%	0.74	0.70	0.73	0.82	0.64	0.64	0.64	0.63	0.77	0.64	0.87	0.97	0.96	0.69	1.00	0.50	1.00
13	GPT-5.2	33.2%	0.72	0.59	0.58	0.77	0.54	0.54	0.78	0.81	0.80	0.78	0.78	0.62	1.00	0.73	1.00	0.50	1.00
14	GPT-5.2 (medium)	31.8%	0.72	0.62	0.65	0.74	0.56	0.56	0.61	0.61	0.61	0.61	0.91	1.00	1.00	0.74	0.99	0.50	0.98
15	Claude 3 Haiku	12.9%	0.69	0.71	0.84	0.78	0.55	0.61	0.49	0.38	0.73	0.49	0.86	1.00	0.94	0.66	0.99	0.50	0.99
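
The overall Reliability score appears to track the Consistency, Predictability, and Robustness aggregate (Agg) columns. The sketch below checks, for a sample of rows copied from the table, how far Reliability is from the plain mean of those three aggregates. This is an observation about the published, rounded numbers and not a documented formula. The function names are hypothetical, and the Safety aggregate is deliberately left out of the check.

```python
# Sample rows copied from the leaderboard:
# (agent, reliability, consistency_agg, predictability_agg, robustness_agg)
ROWS = [
    ("Claude Opus 4.5", 0.85, 0.81, 0.84, 0.91),
    ("Claude Opus 4.7", 0.84, 0.81, 0.73, 0.99),
    ("Gemini 3.5 Flash", 0.84, 0.75, 0.80, 0.96),
    ("O1", 0.79, 0.68, 0.68, 1.00),
    ("Claude 3 Haiku", 0.69, 0.71, 0.49, 0.86),
]


def mean_of_dimensions(consistency, predictability, robustness):
    # Assumed aggregation: unweighted mean of the three dimension scores.
    return (consistency + predictability + robustness) / 3


def max_gap(rows):
    # Largest absolute difference between the reported Reliability value
    # and the assumed unweighted mean, across the given rows.
    return max(
        abs(reliability - mean_of_dimensions(c, p, r))
        for _, reliability, c, p, r in rows
    )


if __name__ == "__main__":
    for agent, reliability, c, p, r in ROWS:
        print(f"{agent}: reported {reliability:.2f}, "
              f"mean {mean_of_dimensions(c, p, r):.3f}")
    print(f"max gap: {max_gap(ROWS):.4f}")
```

For these sampled rows the gap stays below 0.01, which is within what rounding to two decimals could explain. Note that accuracy is not part of this mean: Claude 3.5 Haiku (25.7% accuracy) and Gemini 3.1 Pro (76.2% accuracy) both report a Reliability of 0.82.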
