Benchmark: τ-bench (airline, clean)
τ-bench (airline, clean) evaluates tool-augmented conversational agents on realistic customer-service scenarios in an airline domain. It is a curated 26-task subset of the original 50-task benchmark that excludes tasks with identified grading or specification issues: incorrect answer keys (e.g., type errors in expected values, wrong passenger details), answer keys that contradict the policy instructions (e.g., cancelling flights that policy forbids cancelling, or issuing certificates without the required preconditions), ambiguous or underspecified task descriptions, and tasks referencing past dates that make the requested actions impossible. In each task, a simulated user brings a specific request, and the agent must converse with the user while calling backend API tools to resolve it. Tasks range in complexity from simple single-action lookups to multi-turn dialogues requiring policy adherence and disambiguation. All metrics are computed from scratch on this curated subset, giving a more reliable estimate of agent performance and reliability. This clean version is used in the main results and aggregate scores.
References:
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
- GitHub Repository
Compare with original dataset (all 50 tasks) →
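For orientation, the sketch below shows one way a curated-subset accuracy like the "Acc" column could be computed from per-trial pass/fail records. It is a minimal illustration, not the official harness: the excluded task IDs and the result format are assumptions, not the actual curated list or the τ-bench data structures.

```python
# Minimal sketch (not the official harness): filter the original airline tasks
# down to a curated subset and score average pass rate over repeated trials.
# EXCLUDED_TASK_IDS and the per-trial record format are illustrative assumptions.
from collections import defaultdict

EXCLUDED_TASK_IDS = {3, 7, 11}  # placeholder IDs for tasks flagged with grading/spec issues

def curated_accuracy(trial_results):
    """trial_results: iterable of (task_id, success: bool) over one or more runs."""
    per_task = defaultdict(list)
    for task_id, success in trial_results:
        if task_id in EXCLUDED_TASK_IDS:
            continue  # drop tasks excluded from the clean subset
        per_task[task_id].append(success)
    # Average success rate per task, then average across the curated subset.
    task_rates = [sum(runs) / len(runs) for runs in per_task.values()]
    return sum(task_rates) / len(task_rates) if task_rates else 0.0

# Example: three curated tasks, two runs each.
results = [(1, True), (1, True), (2, True), (2, False), (4, False), (4, False)]
print(f"curated accuracy: {curated_accuracy(results):.1%}")  # -> 50.0%
```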
Reliability Trends
Agent Leaderboard
| # | Agent | Acc | Consistency: Agg | Consistency: Outc | Consistency: Traj-D | Consistency: Traj-S | Consistency: Res | Predictability: Agg | Predictability: Cal | Predictability: AUROC | Predictability: Brier | Robustness: Agg | Robustness: Fault | Robustness: Struct | Robustness: Prompt | Safety: Agg | Safety: Harm | Safety: Comp | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 83.1% | 0.82 | 0.77 | 0.88 | 0.79 | 0.85 | 0.87 | 0.93 | 0.68 | 0.87 | 0.94 | 0.97 | 0.93 | 0.93 | 0.99 | 0.50 | 0.98 | 0.88 |
| 2 | | 78.5% | 0.72 | 0.50 | 0.85 | 0.77 | 0.85 | 0.84 | 0.90 | 0.68 | 0.84 | 0.99 | 1.00 | 1.00 | 0.98 | 1.00 | 0.50 | 0.99 | 0.85 |
| 3 | | 80.8% | 0.76 | 0.65 | 0.85 | 0.76 | 0.82 | 0.81 | 0.82 | 0.52 | 0.81 | 0.98 | 1.00 | 1.00 | 0.95 | 0.98 | 0.25 | 0.97 | 0.85 |
| 4 | | 67.7% | 0.70 | 0.54 | 0.85 | 0.73 | 0.77 | 0.78 | 0.81 | 0.75 | 0.78 | 0.96 | 1.00 | 1.00 | 0.89 | 0.95 | 0.40 | 0.92 | 0.81 |
| 5 | | 72.3% | 0.73 | 0.58 | 0.87 | 0.75 | 0.80 | 0.75 | 0.77 | 0.45 | 0.75 | 0.93 | 0.95 | 1.00 | 0.85 | 0.93 | 0.50 | 0.86 | 0.80 |
| 6 | | 59.2% | 0.76 | 0.65 | 0.86 | 0.76 | 0.83 | 0.68 | 0.71 | 0.62 | 0.68 | 0.95 | 1.00 | 1.00 | 0.84 | 0.95 | 0.30 | 0.92 | 0.80 |
| 7 | | 73.8% | 0.68 | 0.46 | 0.84 | 0.73 | 0.81 | 0.77 | 0.77 | 0.70 | 0.77 | 0.93 | 0.98 | 0.83 | 0.97 | 0.93 | 0.40 | 0.88 | 0.79 |
| 8 | | 56.2% | 0.65 | 0.35 | 0.84 | 0.74 | 0.81 | 0.63 | 0.65 | 0.48 | 0.63 | 0.97 | 1.00 | 1.00 | 0.91 | 0.90 | 0.46 | 0.82 | 0.75 |
| 9 | | 65.4% | 0.62 | 0.31 | 0.85 | 0.72 | 0.77 | 0.61 | 0.62 | 0.53 | 0.61 | 0.95 | 1.00 | 1.00 | 0.86 | 0.93 | 0.37 | 0.88 | 0.73 |
| 10 | | 50.0% | 0.72 | 0.54 | 0.85 | 0.73 | 0.83 | 0.51 | 0.52 | 0.45 | 0.51 | 0.92 | 0.95 | 0.85 | 0.95 | 0.87 | 0.43 | 0.78 | 0.71 |
| 11 | | 44.6% | 0.65 | 0.35 | 0.89 | 0.73 | 0.80 | 0.48 | 0.46 | 0.56 | 0.48 | 0.98 | 0.98 | 1.00 | 0.98 | 0.87 | 0.41 | 0.78 | 0.70 |
| 12 | | 32.1% | 0.76 | 0.69 | 0.84 | 0.73 | 0.80 | 0.41 | 0.39 | 0.48 | 0.41 | 0.91 | 1.00 | 0.72 | 1.00 | 0.81 | 0.40 | 0.69 | 0.69 |
| 13 | | 42.3% | 0.66 | 0.42 | 0.82 | 0.70 | 0.80 | 0.53 | 0.53 | 0.42 | 0.53 | 0.85 | 0.81 | 1.00 | 0.74 | 0.81 | 0.42 | 0.67 | 0.68 |
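The Predictability columns report calibration-related measures (Cal, AUROC, Brier). Assuming these follow the standard definitions over per-task self-reported confidence versus actual success, they could be computed roughly as sketched below; the leaderboard's exact formulas and aggregation are not given here, and since the columns appear to be higher-is-better, a complement or rescaling of the raw Brier loss may be applied.

```python
# Hedged sketch of standard predictability metrics, assuming each task run yields
# a self-reported confidence in [0, 1] and a binary success label. The exact
# definitions and aggregation used by the leaderboard may differ.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def predictability_metrics(confidences, successes, n_bins=10):
    conf = np.asarray(confidences, dtype=float)
    succ = np.asarray(successes, dtype=int)

    auroc = roc_auc_score(succ, conf)     # how well confidences rank successes above failures
    brier = brier_score_loss(succ, conf)  # mean squared error of confidences (lower is better)

    # Expected calibration error (ECE) over equal-width confidence bins;
    # reporting 1 - ECE as a "Cal" score is an assumption, not the leaderboard's formula.
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - succ[mask].mean())
    return {"Cal": 1.0 - ece, "AUROC": auroc, "Brier": brier}

# Example with a handful of runs.
print(predictability_metrics([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))
```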