HAL Reliability Evaluation

Robustness on GAIA

How well does the agent maintain accuracy when inputs are perturbed (faults, structural changes, prompt rewording)?

#	Agent	Accuracy	Reliability	Robustness (Agg)	Fault	Structural	Prompt
1	Gemini 2.5 Flash	37.8%	0.76	0.98	1.00	1.00	0.93
2	Gemini 2.5 Pro	50.1%	0.78	0.97	1.00	1.00	0.90
3	Claude Opus 4.5	71.5%	0.82	0.96	1.00	1.00	0.89
4	Claude Sonnet 4.5	74.7%	0.80	0.95	0.99	0.93	0.94
5	GPT-5.2 (medium)	42.6%	0.74	0.95	0.97	1.00	0.88
6	Claude 3.7 Sonnet	62.4%	0.77	0.92	0.93	0.95	0.87
7	GPT-5.2	29.9%	0.74	0.88	1.00	0.94	0.70
8	GPT-4o Mini	22.0%	0.73	0.87	0.81	0.97	0.84
9	GPT-4 Turbo	20.0%	0.76	0.81	0.87	0.76	0.82
10	Claude 3.5 Haiku	28.1%	0.74	0.79	0.96	0.78	0.63
11	Gemini 2.0 Flash	27.9%	0.70	0.76	0.88	0.67	0.73
12	O1	34.7%	0.72	0.72	0.77	0.78	0.60
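
The page does not say how the aggregate robustness score is derived from the three perturbation scores (fault, structural, prompt). The displayed values look consistent with a plain unweighted mean, so the sketch below checks that reading against the table. This is an assumption inferred from the rounded numbers shown here, not a definition taken from the benchmark, and the names `ROWS`, `mean_of_subscores`, and `max_gap` are illustrative only.

```python
# Rows copied from the table above: (agent, aggregate, fault, structural, prompt).
ROWS = [
    ("Gemini 2.5 Flash", 0.98, 1.00, 1.00, 0.93),
    ("Gemini 2.5 Pro", 0.97, 1.00, 1.00, 0.90),
    ("Claude Opus 4.5", 0.96, 1.00, 1.00, 0.89),
    ("Claude Sonnet 4.5", 0.95, 0.99, 0.93, 0.94),
    ("GPT-5.2 (medium)", 0.95, 0.97, 1.00, 0.88),
    ("Claude 3.7 Sonnet", 0.92, 0.93, 0.95, 0.87),
    ("GPT-5.2", 0.88, 1.00, 0.94, 0.70),
    ("GPT-4o Mini", 0.87, 0.81, 0.97, 0.84),
    ("GPT-4 Turbo", 0.81, 0.87, 0.76, 0.82),
    ("Claude 3.5 Haiku", 0.79, 0.96, 0.78, 0.63),
    ("Gemini 2.0 Flash", 0.76, 0.88, 0.67, 0.73),
    ("O1", 0.72, 0.77, 0.78, 0.60),
]


def mean_of_subscores(fault, structural, prompt):
    """Unweighted mean of the three sub-scores (assumed aggregation rule)."""
    return (fault + structural + prompt) / 3


def max_gap(rows):
    """Largest absolute difference between the listed aggregate and the mean."""
    return max(
        abs(agg - mean_of_subscores(fault, structural, prompt))
        for _, agg, fault, structural, prompt in rows
    )


if __name__ == "__main__":
    for name, agg, fault, structural, prompt in ROWS:
        mean = mean_of_subscores(fault, structural, prompt)
        print(f"{name:20s} listed={agg:.2f}  mean={mean:.4f}")
    print(f"largest gap: {max_gap(ROWS):.4f}")
```

Every row agrees with the unweighted mean to within 0.01. Because the sub-scores are themselves rounded to two decimals, small gaps are expected, and this check cannot rule out a different aggregation rule.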