HAL Reliability Evaluation

Safety on τ-bench (airline, clean)

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

#	Agent	Accuracy	Reliability	Safety (Agg.)	Harm	Compliance
1	Gemini 3.5 Flash	80.8%	0.86	0.99	0.75	0.97
2	Claude Opus 4.5	80.8%	0.88	0.98	0.50	0.96
3	Gemini 3.1 Pro	82.1%	0.86	0.97	0.50	0.94
4	GPT-5.2	60.3%	0.81	0.97	0.38	0.95
5	GPT-5.5	79.5%	0.89	0.96	0.40	0.94
6	Claude Sonnet 4	78.2%	0.86	0.96	0.42	0.92
7	Claude Opus 4.7	84.6%	0.89	0.94	0.44	0.88
8	GPT-5.2 (medium)	67.9%	0.82	0.94	0.44	0.88
9	Gemini 2.5 Pro	71.8%	0.81	0.92	0.33	0.88
10	O1	66.2%	0.81	0.91	0.46	0.83
11	Gemini 2.5 Flash	59.0%	0.72	0.90	0.27	0.86
12	GPT-4 Turbo	57.7%	0.72	0.87	0.38	0.79
13	GPT-4o Mini	29.5%	0.67	0.85	0.45	0.72
14	Claude 3.5 Haiku	29.5%	0.71	0.81	0.40	0.68
15	Claude 3 Haiku	20.5%	0.70	0.55	0.30	0.36
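
The rows above appear to be ranked by the aggregate safety score (the fifth column) rather than by accuracy or reliability. A minimal sketch of how one might check that ordering on a tab-separated export of the table follows. The sample rows are copied from the table; the column positions and the helper names (`load_rows`, `column_values`, `is_descending`) are assumptions made for illustration and are not part of any HAL tooling.

```python
import csv
import io

# Top three rows copied from the leaderboard (tab-separated, header omitted).
SAMPLE = (
    "1\tGemini 3.5 Flash\t80.8%\t0.86\t0.99\t0.75\t0.97\n"
    "2\tClaude Opus 4.5\t80.8%\t0.88\t0.98\t0.50\t0.96\n"
    "3\tGemini 3.1 Pro\t82.1%\t0.86\t0.97\t0.50\t0.94\n"
)

# Assumed zero-based column positions, read off the table layout.
RELIABILITY_COL = 3
SAFETY_AGG_COL = 4


def load_rows(text):
    """Parse tab-separated text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def column_values(rows, index):
    """Return one column as floats, dropping a trailing percent sign if present."""
    return [float(row[index].rstrip("%")) for row in rows]


def is_descending(values):
    """True when each value is greater than or equal to the one after it."""
    return all(a >= b for a, b in zip(values, values[1:]))


rows = load_rows(SAMPLE)
print(is_descending(column_values(rows, SAFETY_AGG_COL)))
print(is_descending(column_values(rows, RELIABILITY_COL)))
```

On these three rows the first check prints `True` and the second prints `False`, which is consistent with the ranking following the aggregate safety column and not reliability.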