HAL Reliability Evaluation

Safety on τ-bench (airline, clean)

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

#	Agent	Accuracy	Safety (aggregate)	Harm	Compliance	Overall
1	Claude Sonnet 4.5	78.5%	1.00	0.50	0.99	0.85
2	Claude Opus 4.5	83.1%	0.99	0.50	0.98	0.88
3	Gemini 3.0 Pro	80.8%	0.98	0.25	0.97	0.85
4	GPT-5.2 (xhigh)	67.7%	0.95	0.40	0.92	0.81
5	GPT-5.2	59.2%	0.95	0.30	0.92	0.80
6	Gemini 2.5 Pro	73.8%	0.93	0.40	0.88	0.79
7	O1	72.3%	0.93	0.50	0.86	0.80
8	Gemini 2.5 Flash	65.4%	0.93	0.37	0.88	0.73
9	Claude 3.7 Sonnet	56.2%	0.90	0.46	0.82	0.75
10	GPT-4 Turbo	50.0%	0.87	0.43	0.78	0.71
11	Gemini 2.0 Flash	44.6%	0.87	0.41	0.78	0.70
12	GPT-4o Mini	32.1%	0.81	0.40	0.69	0.69
13	Claude 3.5 Haiku	42.3%	0.81	0.42	0.67	0.68
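
For readers who want to check the ordering themselves, here is a minimal Python sketch (not part of any HAL tooling) that loads the rows above and confirms that the ranking follows the aggregate safety column rather than accuracy. The field names and the helpers load_rows and is_non_increasing are hypothetical, chosen only for illustration; the values are copied from the table as published.

```python
import csv
import io

# Illustrative labels for the seven columns; these are NOT official identifiers.
FIELDS = ["rank", "agent", "accuracy", "safety_agg", "harm", "compliance", "overall"]

# Data rows copied from the table above, tab-separated, header row omitted.
ROWS_TSV = (
    "1\tClaude Sonnet 4.5\t78.5%\t1.00\t0.50\t0.99\t0.85\n"
    "2\tClaude Opus 4.5\t83.1%\t0.99\t0.50\t0.98\t0.88\n"
    "3\tGemini 3.0 Pro\t80.8%\t0.98\t0.25\t0.97\t0.85\n"
    "4\tGPT-5.2 (xhigh)\t67.7%\t0.95\t0.40\t0.92\t0.81\n"
    "5\tGPT-5.2\t59.2%\t0.95\t0.30\t0.92\t0.80\n"
    "6\tGemini 2.5 Pro\t73.8%\t0.93\t0.40\t0.88\t0.79\n"
    "7\tO1\t72.3%\t0.93\t0.50\t0.86\t0.80\n"
    "8\tGemini 2.5 Flash\t65.4%\t0.93\t0.37\t0.88\t0.73\n"
    "9\tClaude 3.7 Sonnet\t56.2%\t0.90\t0.46\t0.82\t0.75\n"
    "10\tGPT-4 Turbo\t50.0%\t0.87\t0.43\t0.78\t0.71\n"
    "11\tGemini 2.0 Flash\t44.6%\t0.87\t0.41\t0.78\t0.70\n"
    "12\tGPT-4o Mini\t32.1%\t0.81\t0.40\t0.69\t0.69\n"
    "13\tClaude 3.5 Haiku\t42.3%\t0.81\t0.42\t0.67\t0.68\n"
)


def load_rows(tsv_text):
    """Parse tab-separated rows into dicts with numeric fields converted."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=FIELDS, delimiter="\t")
    rows = []
    for rec in reader:
        rows.append({
            "rank": int(rec["rank"]),
            "agent": rec["agent"],
            # Accuracy is published as a percentage; store it as a fraction.
            "accuracy": float(rec["accuracy"].rstrip("%")) / 100.0,
            "safety_agg": float(rec["safety_agg"]),
            "harm": float(rec["harm"]),
            "compliance": float(rec["compliance"]),
            "overall": float(rec["overall"]),
        })
    return rows


def is_non_increasing(rows, key):
    """Return True if the values of `key` never increase from one row to the next."""
    values = [row[key] for row in rows]
    return all(a >= b for a, b in zip(values, values[1:]))


rows = load_rows(ROWS_TSV)
print(is_non_increasing(rows, "safety_agg"))  # True
print(is_non_increasing(rows, "accuracy"))    # False
print(max(rows, key=lambda r: r["accuracy"])["agent"])  # Claude Opus 4.5
```

The check only uses the rounded values shown in the table, so it cannot explain how rows with the same displayed aggregate score (for example 0.95 or 0.93) are ordered relative to each other.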