HAL Reliability Evaluation

Robustness on GAIA

How well does the agent maintain accuracy when inputs are perturbed (faults, structural changes, prompt rewording)?

Rank	Agent	Accuracy	Robustness (aggregate)	Fault	Structural	Prompt	Overall
1	Gemini 2.5 Flash	37.8%	0.98	1.00	1.00	0.93	0.76
2	Gemini 2.5 Pro	50.1%	0.97	1.00	1.00	0.90	0.78
3	Claude Opus 4.5	71.5%	0.96	1.00	1.00	0.89	0.82
4	Claude Sonnet 4.5	74.7%	0.95	0.99	0.93	0.94	0.80
5	GPT-5.2 (medium)	42.6%	0.95	0.97	1.00	0.88	0.74
6	Claude 3.7 Sonnet	62.4%	0.92	0.93	0.95	0.87	0.77
7	GPT-5.2	29.9%	0.88	1.00	0.94	0.70	0.74
8	GPT-4o Mini	22.0%	0.87	0.81	0.97	0.84	0.73
9	GPT-4 Turbo	20.0%	0.81	0.87	0.76	0.82	0.76
10	Claude 3.5 Haiku	28.1%	0.79	0.96	0.78	0.63	0.74
11	Gemini 2.0 Flash	27.9%	0.76	0.88	0.67	0.73	0.70
12	O1	34.7%	0.72	0.77	0.78	0.60	0.72
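
The page does not state how the aggregate robustness score is derived from the three sub-scores. The sketch below tests one plausible reading: that the aggregate is the unweighted mean of the fault, structural, and prompt scores. This formula is an assumption inferred from the table, not a documented definition, and the names ROWS, mean_of_subscores, and max_deviation are hypothetical. The Overall column is not modeled here because nothing in the table indicates how it is computed.

```python
# Rows copied from the table above: (agent, aggregate, fault, structural, prompt).
ROWS = [
    ("Gemini 2.5 Flash", 0.98, 1.00, 1.00, 0.93),
    ("Gemini 2.5 Pro", 0.97, 1.00, 1.00, 0.90),
    ("Claude Opus 4.5", 0.96, 1.00, 1.00, 0.89),
    ("Claude Sonnet 4.5", 0.95, 0.99, 0.93, 0.94),
    ("GPT-5.2 (medium)", 0.95, 0.97, 1.00, 0.88),
    ("Claude 3.7 Sonnet", 0.92, 0.93, 0.95, 0.87),
    ("GPT-5.2", 0.88, 1.00, 0.94, 0.70),
    ("GPT-4o Mini", 0.87, 0.81, 0.97, 0.84),
    ("GPT-4 Turbo", 0.81, 0.87, 0.76, 0.82),
    ("Claude 3.5 Haiku", 0.79, 0.96, 0.78, 0.63),
    ("Gemini 2.0 Flash", 0.76, 0.88, 0.67, 0.73),
    ("O1", 0.72, 0.77, 0.78, 0.60),
]


def mean_of_subscores(fault, structural, prompt):
    """Unweighted mean of the three sub-scores (ASSUMED formula, not from the source)."""
    return (fault + structural + prompt) / 3


def max_deviation(rows):
    """Largest absolute gap between the reported aggregate and the assumed mean."""
    return max(
        abs(aggregate - mean_of_subscores(fault, structural, prompt))
        for _, aggregate, fault, structural, prompt in rows
    )


if __name__ == "__main__":
    for agent, aggregate, fault, structural, prompt in ROWS:
        estimate = mean_of_subscores(fault, structural, prompt)
        print(f"{agent:20s} reported={aggregate:.2f} mean={estimate:.3f}")
    print(f"max deviation: {max_deviation(ROWS):.4f}")
```

Under this assumption every reported aggregate lies within 0.01 of the computed mean. That is consistent with a simple average taken over unrounded sub-scores, but it does not rule out other weightings.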