Robustness on τ-bench (airline, clean)

How well does the agent maintain accuracy when its inputs are perturbed (fault injection, structural changes, prompt rewording)?
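One natural way to score this, sketched below as an assumption (the page's full metric definition is not reproduced here), is the fraction of baseline accuracy an agent retains under a given perturbation, capped at 1.0; the function name `retained_accuracy` and the zero-baseline convention are hypothetical choices for illustration:

```python
def retained_accuracy(baseline_acc: float, perturbed_acc: float) -> float:
    """Fraction of clean-run accuracy retained under a perturbation, capped at 1.0.

    Hypothetical sketch; the benchmark's actual definition may differ.
    """
    if baseline_acc == 0:
        # Degenerate case (nothing to lose): treat as fully robust by convention.
        return 1.0
    return min(perturbed_acc / baseline_acc, 1.0)

# Example: an agent at 80% clean accuracy that drops to 76% under fault
# injection retains 95% of its baseline performance.
print(round(retained_accuracy(0.80, 0.76), 2))
```

A ratio (rather than an absolute accuracy delta) keeps the score comparable across agents with very different baseline accuracies, which matters on a leaderboard where baselines range from roughly 32% to 83%.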

[Charts omitted: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
| Rank | Agent | Baseline Acc | Reliability | Robustness (Agg) | Fault | Struct | Prompt |
|------|-------|--------------|-------------|------------------|-------|--------|--------|
| 1 | Claude Sonnet 4.5 | 78.5% | 0.85 | 0.99 | 1.00 | 1.00 | 0.98 |
| 2 | Gemini 2.0 Flash | 44.6% | 0.70 | 0.98 | 0.98 | 1.00 | 0.98 |
| 3 | Gemini 3.0 Pro | 80.8% | 0.85 | 0.98 | 1.00 | 1.00 | 0.95 |
| 4 | Claude 3.7 Sonnet | 56.2% | 0.75 | 0.97 | 1.00 | 1.00 | 0.91 |
| 5 | GPT-5.2 (xhigh) | 67.7% | 0.81 | 0.96 | 1.00 | 1.00 | 0.89 |
| 6 | Gemini 2.5 Flash | 65.4% | 0.73 | 0.95 | 1.00 | 1.00 | 0.86 |
| 7 | GPT-5.2 | 59.2% | 0.80 | 0.95 | 1.00 | 1.00 | 0.84 |
| 8 | Claude Opus 4.5 | 83.1% | 0.88 | 0.94 | 0.97 | 0.93 | 0.93 |
| 9 | O1 | 72.3% | 0.80 | 0.93 | 0.95 | 1.00 | 0.85 |
| 10 | Gemini 2.5 Pro | 73.8% | 0.79 | 0.93 | 0.98 | 0.83 | 0.97 |
| 11 | GPT-4 Turbo | 50.0% | 0.71 | 0.92 | 0.95 | 0.85 | 0.95 |
| 12 | GPT-4o Mini | 32.1% | 0.69 | 0.91 | 1.00 | 0.72 | 1.00 |
| 13 | Claude 3.5 Haiku | 42.3% | 0.68 | 0.85 | 0.81 | 1.00 | 0.74 |
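The aggregate robustness column is consistent with a simple mean of the three perturbation sub-scores: for every row above, averaging Fault, Struct, and Prompt and rounding to two decimals reproduces the Agg value (e.g. Claude Opus 4.5: mean(0.97, 0.93, 0.93) ≈ 0.94). A minimal sketch, assuming that unweighted-mean aggregation:

```python
def robustness_agg(fault: float, struct: float, prompt: float) -> float:
    """Aggregate robustness as the unweighted mean of the three sub-scores.

    Assumption inferred from the leaderboard rows; the benchmark may weight
    perturbation types differently.
    """
    return (fault + struct + prompt) / 3

# Spot-check against two rows of the table:
print(round(robustness_agg(0.97, 0.93, 0.93), 2))  # Claude Opus 4.5 -> 0.94
print(round(robustness_agg(0.81, 1.00, 0.74), 2))  # Claude 3.5 Haiku -> 0.85
```

Note that the aggregate alone can hide which perturbation type hurts: GPT-4o Mini and GPT-4 Turbo have nearly identical Agg scores (0.91 vs 0.92), but the former fails almost entirely on structural changes (0.72) while the latter degrades more evenly.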