Robustness on τ-bench (airline, original)

How well does the agent maintain accuracy when its inputs are perturbed (injected faults, structural changes, or prompt rewording)?

[Charts: Sub-metric Comparison; Baseline vs Perturbed Accuracy]
Agent Leaderboard — Robustness
Rank  Agent              Accuracy  Reliability  Robustness (Agg)  Fault  Struct  Prompt
   1  Claude Sonnet 4.5     54.4%         0.79              0.98   0.97    1.00    0.97
   2  GPT-4 Turbo           35.6%         0.69              0.98   0.98    0.96    1.00
   3  GPT-5.2               42.0%         0.75              0.97   1.00    1.00    0.92
   4  Claude 3.7 Sonnet     43.6%         0.72              0.96   1.00    1.00    0.89
   5  Gemini 2.5 Pro        52.8%         0.74              0.96   0.97    0.98    0.92
   6  Gemini 2.5 Flash      47.2%         0.70              0.95   1.00    0.97    0.88
   7  GPT-5.2 (xhigh)       51.6%         0.76              0.95   1.00    1.00    0.84
   8  Gemini 2.0 Flash      32.0%         0.67              0.94   0.98    1.00    0.85
   9  Gemini 3.0 Pro        58.8%         0.77              0.94   1.00    0.92    0.91
  10  O1                    49.6%         0.76              0.94   0.95    1.00    0.86
  11  Claude Opus 4.5       58.4%         0.80              0.93   0.95    0.96    0.89
  12  GPT-4o Mini           21.3%         0.67              0.92   1.00    0.84    0.91
  13  Claude 3.5 Haiku      29.6%         0.68              0.88   0.86    1.00    0.77
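Across every row, the aggregate robustness column is consistent with the unweighted mean of the three perturbation sub-scores (Fault, Struct, Prompt), rounded to two decimals. A minimal sketch of that aggregation; the function name is illustrative, not part of τ-bench:

```python
# Aggregate robustness as the unweighted mean of the three perturbation
# sub-scores (fault injection, structural change, prompt rewording).
# Name and signature are illustrative, not taken from the benchmark code.

def robustness_agg(fault: float, struct: float, prompt: float) -> float:
    """Mean of the three sub-scores, rounded to two decimals."""
    return round((fault + struct + prompt) / 3, 2)

# Spot-check against two leaderboard rows:
print(robustness_agg(0.97, 1.00, 0.97))  # Claude Sonnet 4.5 -> 0.98
print(robustness_agg(0.86, 1.00, 0.77))  # Claude 3.5 Haiku  -> 0.88
```

Note that an unweighted mean treats all three perturbation types as equally important; whether the benchmark actually weights them this way is inferred from the table, not documented here.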