Robustness on τ-bench (airline, original)

How well does the agent maintain accuracy when its inputs are perturbed? Three perturbation types are scored: injected faults (Fault), structural changes (Struct), and prompt rewording (Prompt).
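For every row in the leaderboard below, the aggregate robustness score matches the unweighted mean of the three sub-metrics to two decimals. The page does not show how each sub-metric itself is computed; a plausible reading is the fraction of baseline accuracy retained under that perturbation type. The sketch below follows that reading; the accuracy values, the ratio formula, and the cap at 1.0 are illustrative assumptions, not the leaderboard's published method.

```python
from statistics import mean

# Illustrative numbers only: a real harness would measure these by re-running
# the τ-bench airline tasks under each perturbation type.
baseline_acc = 0.544          # accuracy on unperturbed tasks

perturbed_acc = {
    "fault": 0.528,   # accuracy with injected faults
    "struct": 0.544,  # accuracy with structural changes to inputs
    "prompt": 0.528,  # accuracy with reworded prompts
}

# Assumed sub-metric: fraction of baseline accuracy retained under each
# perturbation, capped at 1.0 to match the table's maximum.
robustness = {
    kind: min(1.0, acc / baseline_acc)
    for kind, acc in perturbed_acc.items()
}

# The "Robustness (Agg)" column is consistent with an unweighted mean
# of the three sub-metrics for every row in the table.
robustness_agg = mean(robustness.values())

print({k: round(v, 2) for k, v in robustness.items()})  # {'fault': 0.97, 'struct': 1.0, 'prompt': 0.97}
print(round(robustness_agg, 2))                         # 0.98
```

Under this reading, a sub-metric of 1.00 means the agent lost no accuracy under that perturbation type.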

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
| # | Agent | Baseline Acc | Robustness (Agg) | Fault | Struct | Prompt | Overall |
|---|-------|--------------|------------------|-------|--------|--------|---------|
| 1 | Claude Sonnet 4.5 | 54.4% | 0.98 | 0.97 | 1.00 | 0.97 | 0.79 |
| 2 | GPT-4 Turbo | 35.6% | 0.98 | 0.98 | 0.96 | 1.00 | 0.69 |
| 3 | GPT-5.2 | 42.0% | 0.97 | 1.00 | 1.00 | 0.92 | 0.75 |
| 4 | Claude 3.7 Sonnet | 43.6% | 0.96 | 1.00 | 1.00 | 0.89 | 0.72 |
| 5 | Gemini 2.5 Pro | 52.8% | 0.96 | 0.97 | 0.98 | 0.92 | 0.74 |
| 6 | Gemini 2.5 Flash | 47.2% | 0.95 | 1.00 | 0.97 | 0.88 | 0.70 |
| 7 | GPT-5.2 (xhigh) | 51.6% | 0.95 | 1.00 | 1.00 | 0.84 | 0.76 |
| 8 | Gemini 2.0 Flash | 32.0% | 0.94 | 0.98 | 1.00 | 0.85 | 0.67 |
| 9 | Gemini 3.0 Pro | 58.8% | 0.94 | 1.00 | 0.92 | 0.91 | 0.77 |
| 10 | O1 | 49.6% | 0.94 | 0.95 | 1.00 | 0.86 | 0.76 |
| 11 | Claude Opus 4.5 | 58.4% | 0.93 | 0.95 | 0.96 | 0.89 | 0.80 |
| 12 | GPT-4o Mini | 21.3% | 0.92 | 1.00 | 0.84 | 0.91 | 0.67 |
| 13 | Claude 3.5 Haiku | 29.6% | 0.88 | 0.86 | 1.00 | 0.77 | 0.68 |