HAL Reliability Evaluation

Robustness on τ-bench (airline, clean)

How well does the agent maintain accuracy when inputs are perturbed (faults, structural changes, prompt rewording)?

Rank	Agent	Accuracy	Reliability	Robustness (aggregate)	Fault	Structural	Prompt
1	Claude 3 Haiku	20.5%	0.70	0.98	1.00	0.94	1.00
2	O1	66.2%	0.81	0.98	1.00	1.00	0.93
3	Gemini 3.5 Flash	80.8%	0.86	0.97	1.00	1.00	0.92
4	GPT-5.5	79.5%	0.89	0.97	1.00	0.92	1.00
5	Claude Opus 4.5	80.8%	0.88	0.97	1.00	0.98	0.92
6	Claude Sonnet 4	78.2%	0.86	0.96	0.99	0.98	0.92
7	GPT-5.2 (medium)	67.9%	0.82	0.95	1.00	0.85	1.00
8	Gemini 3.1 Pro	82.1%	0.86	0.94	0.98	1.00	0.84
9	Gemini 2.5 Pro	71.8%	0.81	0.93	1.00	1.00	0.79
10	Claude 3.5 Haiku	29.5%	0.71	0.93	1.00	0.78	1.00
11	GPT-4o Mini	29.5%	0.67	0.93	1.00	0.91	0.87
12	GPT-5.2	60.3%	0.81	0.92	1.00	0.89	0.87
13	Claude Opus 4.7	84.6%	0.89	0.91	0.91	1.00	0.83
14	Gemini 2.5 Flash	59.0%	0.72	0.82	0.78	0.85	0.83
15	GPT-4 Turbo	57.7%	0.72	0.81	0.83	0.73	0.87
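
The page does not say how the aggregate robustness column is derived from the three sub-scores. In the rows above it matches the unweighted mean of Fault, Struct, and Prompt to within rounding, so the sketch below checks that reading. The mean formula is an assumption inferred from the table, not a definition taken from the source, and the names `ROWS`, `aggregate`, and `mismatches` are hypothetical.

```python
from statistics import mean

# (agent, fault, struct, prompt, reported aggregate), copied from the table above.
ROWS = [
    ("Claude 3 Haiku", 1.00, 0.94, 1.00, 0.98),
    ("O1", 1.00, 1.00, 0.93, 0.98),
    ("Gemini 3.5 Flash", 1.00, 1.00, 0.92, 0.97),
    ("GPT-5.5", 1.00, 0.92, 1.00, 0.97),
    ("Claude Opus 4.5", 1.00, 0.98, 0.92, 0.97),
    ("Claude Sonnet 4", 0.99, 0.98, 0.92, 0.96),
    ("GPT-5.2 (medium)", 1.00, 0.85, 1.00, 0.95),
    ("Gemini 3.1 Pro", 0.98, 1.00, 0.84, 0.94),
    ("Gemini 2.5 Pro", 1.00, 1.00, 0.79, 0.93),
    ("Claude 3.5 Haiku", 1.00, 0.78, 1.00, 0.93),
    ("GPT-4o Mini", 1.00, 0.91, 0.87, 0.93),
    ("GPT-5.2", 1.00, 0.89, 0.87, 0.92),
    ("Claude Opus 4.7", 0.91, 1.00, 0.83, 0.91),
    ("Gemini 2.5 Flash", 0.78, 0.85, 0.83, 0.82),
    ("GPT-4 Turbo", 0.83, 0.73, 0.87, 0.81),
]


def aggregate(fault, struct, prompt):
    """Unweighted mean of the three sub-scores (assumed, not confirmed by the source)."""
    return mean([fault, struct, prompt])


def mismatches(rows, tol=0.005):
    """Return agents whose reported aggregate differs from the mean by more than rounding error."""
    return [
        agent
        for agent, fault, struct, prompt, reported in rows
        if abs(aggregate(fault, struct, prompt) - reported) > tol + 1e-9
    ]


if __name__ == "__main__":
    # An empty list means every reported aggregate is consistent with the mean.
    print(mismatches(ROWS))
```

For the fifteen rows listed, the check returns an empty list. The displayed sub-scores are themselves rounded to two decimals, so this only shows consistency with a simple mean; it does not rule out a different weighting on the unrounded values.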