Consistency on τ-bench (airline, clean)

How reproducible are the agent's answers and trajectories across repeated runs on the same task?
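To make the idea concrete, one natural outcome-level definition (an illustrative assumption here, not necessarily the benchmark's exact formula) is the mean pairwise agreement of a task's pass/fail outcome across repeated runs:

```python
from itertools import combinations

def outcome_consistency(outcomes: list[bool]) -> float:
    """Fraction of run pairs that agree on the task's pass/fail outcome.

    `outcomes` holds one boolean per repeated run of the same task.
    Returns 1.0 when every run agrees, lower when runs disagree.
    (Illustrative definition; the benchmark's exact formula may differ.)
    """
    pairs = list(combinations(range(len(outcomes)), 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent
    agree = sum(outcomes[i] == outcomes[j] for i, j in pairs)
    return agree / len(pairs)

# 4 runs alternating pass/fail: only 2 of 6 pairs agree
print(outcome_consistency([True, False, True, False]))  # ≈ 0.33
```

A per-agent score would then average this quantity over all tasks; the trajectory and resource sub-metrics apply the same "agreement across runs" idea to action sequences and tool/resource usage rather than final outcomes.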

[Chart: Sub-metric Comparison]
[Chart: Resource Consistency Breakdown]
[Heatmap: Per-Agent Task Outcome Consistency — each cell is a task; color indicates how consistent that task's outcome is across repeated runs.]

Agent Leaderboard — Consistency
Acc = task accuracy. Consistency sub-metrics: Agg (aggregate), Outc (outcome), Traj-D and Traj-S (trajectory, two variants), Res (resource); Overall is the combined score.

 #  Agent               Acc    Agg   Outc  Traj-D  Traj-S  Res   Overall
 1  Claude Opus 4.5     83.1%  0.82  0.77  0.88    0.79    0.85  0.88
 2  GPT-5.2             59.2%  0.76  0.65  0.86    0.76    0.83  0.80
 3  Gemini 3.0 Pro      80.8%  0.76  0.65  0.85    0.76    0.82  0.85
 4  GPT-4o Mini         32.1%  0.76  0.69  0.84    0.73    0.80  0.69
 5  O1                  72.3%  0.73  0.58  0.87    0.75    0.80  0.80
 6  Claude Sonnet 4.5   78.5%  0.72  0.50  0.85    0.77    0.85  0.85
 7  GPT-4 Turbo         50.0%  0.72  0.54  0.85    0.73    0.83  0.71
 8  GPT-5.2 (xhigh)     67.7%  0.70  0.54  0.85    0.73    0.77  0.81
 9  Gemini 2.5 Pro      73.8%  0.68  0.46  0.84    0.73    0.81  0.79
10  Claude 3.5 Haiku    42.3%  0.66  0.42  0.82    0.70    0.80  0.68
11  Gemini 2.0 Flash    44.6%  0.65  0.35  0.89    0.73    0.80  0.70
12  Claude 3.7 Sonnet   56.2%  0.65  0.35  0.84    0.74    0.81  0.75
13  Gemini 2.5 Flash    65.4%  0.62  0.31  0.85    0.72    0.77  0.73