Benchmark: τ-bench (airline, original)
τ-bench (airline, original) is the full 50-task version of the τ-bench airline benchmark. It evaluates tool-augmented conversational agents on realistic customer-service scenarios in an airline domain. Each task presents a simulated user with a specific request, such as rebooking a flight, changing a seat, or processing a refund, and the agent must converse with the user while calling backend API tools (e.g., flight search, booking modification) to resolve it. Note that 24 of the 50 tasks have known issues in grading or task specification (e.g., incorrect answer keys, policy contradictions, ambiguous descriptions). For a cleaner evaluation, see τ-bench (airline), which excludes these problematic tasks.
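To make the interaction pattern concrete, here is a minimal sketch of the tool-agent-user loop in Python. Everything in it is a schematic stand-in, not the actual τ-bench code: `AirlineDB`, `change_seat`, `run_episode`, `toy_agent`, and `toy_user` are hypothetical names, and grading is reduced to an assertion on the final backend state (τ-bench grades tasks by comparing the final database state against the task's expected outcome).

```python
"""Minimal sketch of a tau-bench-style episode (illustrative, not the real API).

The agent alternates between replying to a simulated user and calling backend
airline tools; the final backend state is what gets graded.
"""
from dataclasses import dataclass, field

@dataclass
class AirlineDB:
    """Toy stand-in for the backend database that the tools mutate."""
    bookings: dict = field(default_factory=lambda: {"BK1": {"seat": "12A"}})

def change_seat(db: AirlineDB, booking_id: str, seat: str) -> str:
    db.bookings[booking_id]["seat"] = seat
    return f"Seat on {booking_id} changed to {seat}."

# Real tasks expose many more tools (flight search, booking modification, refunds).
TOOLS = {"change_seat": change_seat}

def run_episode(agent_step, user_reply, db: AirlineDB, max_turns: int = 30) -> AirlineDB:
    """Alternate user messages and agent actions until the user is satisfied."""
    message = user_reply(None)  # simulated user opens with the task request
    for _ in range(max_turns):
        action = agent_step(message)  # either {"tool": ..., "args": ...} or {"say": ...}
        if "tool" in action:
            message = TOOLS[action["tool"]](db, **action["args"])  # tool observation
        else:
            message = user_reply(action["say"])
            if message == "###STOP###":  # sentinel: the user considers the request resolved
                break
    return db  # grading compares this final state against the expected outcome

# Scripted toy run: the user wants booking BK1 moved to seat 14C.
def toy_user(agent_utterance):
    return "Please move booking BK1 to seat 14C." if agent_utterance is None else "###STOP###"

def toy_agent(message):
    if message.startswith("Please move"):
        return {"tool": "change_seat", "args": {"booking_id": "BK1", "seat": "14C"}}
    return {"say": "Done! Anything else I can help with?"}

final_db = run_episode(toy_agent, toy_user, AirlineDB())
assert final_db.bookings["BK1"]["seat"] == "14C"
```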
References
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- SABER: Small Actions, Big Errors -- Safeguarding Mutating Steps in LLM Agents
- GitHub Repository
Compare with the curated subset (26 tasks remaining after the 24 with grading/specification issues are removed) →
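For anyone reproducing the curated split, the filtering step itself is trivial; the sketch below uses a placeholder `EXCLUDED_TASK_IDS` set, since the actual 24 problematic task IDs are documented with the curated subset, not invented here.

```python
# Hypothetical sketch of deriving the curated subset from the original 50 tasks.
# EXCLUDED_TASK_IDS is a placeholder: the real list of 24 problematic task IDs
# is documented alongside the curated benchmark.
EXCLUDED_TASK_IDS = {3, 7, 11}  # placeholder values; 24 IDs in the real list

def curated_subset(tasks: list[dict]) -> list[dict]:
    """Drop tasks with known grading or specification issues."""
    return [t for t in tasks if t["id"] not in EXCLUDED_TASK_IDS]

all_tasks = [{"id": i} for i in range(50)]  # the original 50 airline tasks
clean_tasks = curated_subset(all_tasks)     # 26 tasks remain with the real ID list
```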
[Chart: Reliability Trends]
Agent Leaderboard
| # | Agent | Acc | Cons. Agg | Cons. Outc | Cons. Traj-D | Cons. Traj-S | Cons. Res | Pred. Agg | Pred. Cal | Pred. AUROC | Pred. Brier | Rob. Agg | Rob. Fault | Rob. Struct | Rob. Prompt | Safety Agg | Safety Harm | Safety Comp | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 58.4% | 0.79 | 0.68 | 0.88 | 0.78 | 0.85 | 0.69 | 0.71 | 0.67 | 0.69 | 0.93 | 0.95 | 0.96 | 0.89 | 0.97 | 0.46 | 0.94 | 0.80 |
| 2 | | 54.4% | 0.73 | 0.54 | 0.86 | 0.77 | 0.85 | 0.66 | 0.68 | 0.61 | 0.66 | 0.98 | 0.97 | 1.00 | 0.97 | 0.97 | 0.50 | 0.94 | 0.79 |
| 3 | | 58.8% | 0.76 | 0.62 | 0.86 | 0.77 | 0.84 | 0.60 | 0.60 | 0.52 | 0.60 | 0.94 | 1.00 | 0.92 | 0.91 | 0.96 | 0.21 | 0.95 | 0.77 |
| 4 | | 49.6% | 0.69 | 0.48 | 0.86 | 0.75 | 0.80 | 0.64 | 0.74 | 0.50 | 0.64 | 0.94 | 0.95 | 1.00 | 0.86 | 0.89 | 0.40 | 0.81 | 0.76 |
| 5 | | 51.6% | 0.67 | 0.44 | 0.85 | 0.76 | 0.76 | 0.65 | 0.68 | 0.62 | 0.65 | 0.95 | 1.00 | 1.00 | 0.84 | 0.94 | 0.43 | 0.89 | 0.76 |
| 6 | | 42.0% | 0.72 | 0.52 | 0.86 | 0.76 | 0.82 | 0.55 | 0.55 | 0.56 | 0.55 | 0.97 | 1.00 | 1.00 | 0.92 | 0.93 | 0.36 | 0.89 | 0.75 |
| 7 | | 52.8% | 0.69 | 0.48 | 0.84 | 0.72 | 0.80 | 0.56 | 0.56 | 0.58 | 0.56 | 0.96 | 0.97 | 0.98 | 0.92 | 0.89 | 0.40 | 0.81 | 0.74 |
| 8 | | 43.6% | 0.67 | 0.40 | 0.82 | 0.72 | 0.82 | 0.54 | 0.54 | 0.53 | 0.54 | 0.96 | 1.00 | 1.00 | 0.89 | 0.88 | 0.45 | 0.79 | 0.72 |
| 9 | | 47.2% | 0.64 | 0.38 | 0.85 | 0.70 | 0.76 | 0.52 | 0.52 | 0.54 | 0.52 | 0.95 | 1.00 | 0.97 | 0.88 | 0.89 | 0.39 | 0.82 | 0.70 |
| 10 | | 35.6% | 0.72 | 0.52 | 0.85 | 0.73 | 0.84 | 0.38 | 0.38 | 0.47 | 0.38 | 0.98 | 0.98 | 0.96 | 1.00 | 0.85 | 0.44 | 0.72 | 0.69 |
| 11 | | 29.6% | 0.70 | 0.54 | 0.83 | 0.71 | 0.80 | 0.46 | 0.45 | 0.44 | 0.46 | 0.88 | 0.86 | 1.00 | 0.77 | 0.77 | 0.41 | 0.61 | 0.68 |
| 12 | | 32.0% | 0.68 | 0.44 | 0.87 | 0.73 | 0.81 | 0.38 | 0.36 | 0.61 | 0.38 | 0.94 | 0.98 | 1.00 | 0.85 | 0.82 | 0.37 | 0.72 | 0.67 |
| 13 | | 21.3% | 0.76 | 0.72 | 0.83 | 0.72 | 0.80 | 0.32 | 0.29 | 0.48 | 0.32 | 0.92 | 1.00 | 0.84 | 0.91 | 0.76 | 0.41 | 0.59 | 0.67 |
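The Predictability sub-columns (Cal, AUROC, Brier) measure how well an agent's self-reported confidence predicts whether a task actually passed. AUROC and the Brier score are the standard quantities for this; the sketch below computes both from toy per-task data with scikit-learn. Note that the leaderboard reports higher-is-better values throughout, so its Brier column is presumably a rescaled score (e.g., 1 - Brier); that mapping, like the exact Cal and Agg formulas, is not specified here and is only an assumption.

```python
# Sketch: standard predictability metrics over per-task confidence/outcome pairs.
# This illustrates the raw quantities only, not the leaderboard's exact pipeline.
from sklearn.metrics import brier_score_loss, roc_auc_score

confidences = [0.9, 0.8, 0.3, 0.7, 0.2]  # agent's self-reported P(task passed), toy values
outcomes    = [1,   1,   0,   1,   0]    # graded results: 1 = pass, 0 = fail

auroc = roc_auc_score(outcomes, confidences)     # 1.0 = confidence ranks passes perfectly
brier = brier_score_loss(outcomes, confidences)  # mean squared error of confidence; 0 is best
print(f"AUROC = {auroc:.2f}, Brier = {brier:.2f}, 1 - Brier = {1 - brier:.2f}")
```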