Consistency on τ-bench (airline, original)

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

[Interactive panels: Sub-metric Comparison · Resource Consistency Breakdown · Per-Agent Task Outcome Consistency (a heatmap in which each cell is a task, colored by that task's outcome consistency across runs).]

Agent Leaderboard — Consistency
| # | Agent | Accuracy | Consistency (agg) | Outcome | Traj-D | Traj-S | Resource | Overall |
|---|-------|----------|-------------------|---------|--------|--------|----------|---------|
| 1 | Claude Opus 4.5 | 58.4% | 0.79 | 0.68 | 0.88 | 0.78 | 0.85 | 0.80 |
| 2 | GPT-4o Mini | 21.3% | 0.76 | 0.72 | 0.83 | 0.72 | 0.80 | 0.67 |
| 3 | Gemini 3.0 Pro | 58.8% | 0.76 | 0.62 | 0.86 | 0.77 | 0.84 | 0.77 |
| 4 | Claude Sonnet 4.5 | 54.4% | 0.73 | 0.54 | 0.86 | 0.77 | 0.85 | 0.79 |
| 5 | GPT-5.2 | 42.0% | 0.72 | 0.52 | 0.86 | 0.76 | 0.82 | 0.75 |
| 6 | GPT-4 Turbo | 35.6% | 0.72 | 0.52 | 0.85 | 0.73 | 0.84 | 0.69 |
| 7 | Claude 3.5 Haiku | 29.6% | 0.70 | 0.54 | 0.83 | 0.71 | 0.80 | 0.68 |
| 8 | O1 | 49.6% | 0.69 | 0.48 | 0.86 | 0.75 | 0.80 | 0.76 |
| 9 | Gemini 2.5 Pro | 52.8% | 0.69 | 0.48 | 0.84 | 0.72 | 0.80 | 0.74 |
| 10 | Gemini 2.0 Flash | 32.0% | 0.68 | 0.44 | 0.87 | 0.73 | 0.81 | 0.67 |
| 11 | GPT-5.2 (xhigh) | 51.6% | 0.67 | 0.44 | 0.85 | 0.76 | 0.76 | 0.76 |
| 12 | Claude 3.7 Sonnet | 43.6% | 0.67 | 0.40 | 0.82 | 0.72 | 0.82 | 0.72 |
| 13 | Gemini 2.5 Flash | 47.2% | 0.64 | 0.38 | 0.85 | 0.70 | 0.76 | 0.70 |
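The Outcome column above scores how often repeated runs of the same task agree on the final pass/fail result. The exact formula is not stated on this page, so the sketch below assumes a simple pairwise-agreement definition over k runs; the function name and the boolean pass/fail encoding are illustrative, not the benchmark's actual implementation.

```python
from itertools import combinations

def outcome_consistency(outcomes):
    """Fraction of run pairs whose task outcome (pass/fail) agrees.

    `outcomes` is one boolean per repeated run of a single task.
    A task whose runs all pass, or all fail, scores 1.0.
    """
    pairs = list(combinations(outcomes, 2))
    if not pairs:  # a single run is trivially self-consistent
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Four runs of one task: three passes, one failure.
# Of the 6 run pairs, the 3 pass/pass pairs agree -> 3/6 = 0.5
print(outcome_consistency([True, True, True, False]))
```

Averaging this per-task score over all tasks would yield a leaderboard-style outcome figure; a fully consistent but always-failing agent can still score 1.0, which is why accuracy and consistency are reported as separate columns.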