Robustness on τ-bench (airline, clean)

How well does the agent maintain accuracy when its inputs are perturbed (fault injections, structural changes, prompt rewording)?

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
| Rank | Agent | Acc | Robustness (Agg) | Fault | Struct | Prompt | Overall |
|------|-------|-----|------------------|-------|--------|--------|---------|
| 1 | Claude Sonnet 4.5 | 78.5% | 0.99 | 1.00 | 1.00 | 0.98 | 0.85 |
| 2 | Gemini 2.0 Flash | 44.6% | 0.98 | 0.98 | 1.00 | 0.98 | 0.70 |
| 3 | Gemini 3.0 Pro | 80.8% | 0.98 | 1.00 | 1.00 | 0.95 | 0.85 |
| 4 | Claude 3.7 Sonnet | 56.2% | 0.97 | 1.00 | 1.00 | 0.91 | 0.75 |
| 5 | GPT-5.2 (xhigh) | 67.7% | 0.96 | 1.00 | 1.00 | 0.89 | 0.81 |
| 6 | Gemini 2.5 Flash | 65.4% | 0.95 | 1.00 | 1.00 | 0.86 | 0.73 |
| 7 | GPT-5.2 | 59.2% | 0.95 | 1.00 | 1.00 | 0.84 | 0.80 |
| 8 | Claude Opus 4.5 | 83.1% | 0.94 | 0.97 | 0.93 | 0.93 | 0.88 |
| 9 | O1 | 72.3% | 0.93 | 0.95 | 1.00 | 0.85 | 0.80 |
| 10 | Gemini 2.5 Pro | 73.8% | 0.93 | 0.98 | 0.83 | 0.97 | 0.79 |
| 11 | GPT-4 Turbo | 50.0% | 0.92 | 0.95 | 0.85 | 0.95 | 0.71 |
| 12 | GPT-4o Mini | 32.1% | 0.91 | 1.00 | 0.72 | 1.00 | 0.69 |
| 13 | Claude 3.5 Haiku | 42.3% | 0.85 | 0.81 | 1.00 | 0.74 | 0.68 |
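The aggregate column appears to be a simple mean of the three per-perturbation sub-scores (Fault, Struct, Prompt), rounded to two decimals; this formula is an inference from the table values, not a stated definition. A minimal sketch under that assumption:

```python
def robustness_agg(fault: float, struct: float, prompt: float) -> float:
    """Aggregate robustness as the mean of the three sub-scores.

    NOTE: this formula is inferred from the leaderboard values
    (it reproduces every row above), not an official definition.
    """
    return round((fault + struct + prompt) / 3, 2)

# Example rows from the table:
print(robustness_agg(1.00, 1.00, 0.98))  # Claude Sonnet 4.5 -> 0.99
print(robustness_agg(0.97, 0.93, 0.93))  # Claude Opus 4.5   -> 0.94
print(robustness_agg(0.81, 1.00, 0.74))  # Claude 3.5 Haiku  -> 0.85
```

Note that the aggregate is independent of baseline accuracy: a low-accuracy agent (e.g. GPT-4o Mini at 32.1%) can still score a high robustness aggregate if its accuracy degrades little under perturbation.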