HAL Reliability Evaluation

Consistency on τ-bench (airline, clean)

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

Figure: per-task heatmap. Each cell represents a task; color shows outcome consistency across runs.

#	Agent	Acc	Reliability	Consistency Agg	Outc	Traj-D	Traj-S	Res
1	Claude Opus 4.7	84.6%	0.89	0.88	0.90	0.94	0.82	0.86
2	GPT-5.5	79.5%	0.89	0.84	0.83	0.88	0.82	0.84
3	Claude Opus 4.5	80.8%	0.88	0.84	0.83	0.86	0.81	0.84
4	Gemini 3.1 Pro	82.1%	0.86	0.81	0.79	0.82	0.74	0.85
5	GPT-5.2	60.3%	0.81	0.80	0.79	0.86	0.75	0.81
6	Gemini 3.5 Flash	80.8%	0.86	0.78	0.86	0.75	0.65	0.78
7	GPT-5.2 (medium)	67.9%	0.82	0.76	0.69	0.84	0.77	0.79
8	Claude Sonnet 4	78.2%	0.86	0.76	0.62	0.86	0.78	0.84
9	GPT-4 Turbo	57.7%	0.72	0.76	0.62	0.87	0.74	0.86
10	Gemini 2.5 Flash	59.0%	0.72	0.75	0.73	0.83	0.70	0.76
11	Claude 3 Haiku	20.5%	0.70	0.74	0.76	0.76	0.61	0.78
12	GPT-4o Mini	29.5%	0.67	0.74	0.66	0.82	0.70	0.80
13	Gemini 2.5 Pro	71.8%	0.81	0.74	0.66	0.85	0.72	0.77
14	Claude 3.5 Haiku	29.5%	0.71	0.73	0.56	0.86	0.75	0.82
15	O1	66.2%	0.81	0.71	0.52	0.87	0.72	0.82
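The page does not say how the Consistency Agg column is derived from the four sub-scores. One reading that matches the sampled rows below to two decimals is: average Traj-D and Traj-S into a single trajectory score, then take the mean of Outc, that trajectory score, and Res. The sketch below checks that reading. It is an inference from the table values, not the benchmark's stated formula, and the name `consistency_agg` is hypothetical.

```python
# Assumption: Consistency Agg = mean(Outc, mean(Traj-D, Traj-S), Res).
# This is inferred from the table above, not taken from a published definition.

def consistency_agg(outc: float, traj_d: float, traj_s: float, res: float) -> float:
    """Hypothetical aggregate: the two trajectory scores are averaged first,
    so trajectory consistency carries the same weight as outcome and resource."""
    traj = (traj_d + traj_s) / 2
    return (outc + traj + res) / 3


# (agent, reported Consistency Agg, Outc, Traj-D, Traj-S, Res), copied from the table.
ROWS = [
    ("Claude Opus 4.7", 0.88, 0.90, 0.94, 0.82, 0.86),
    ("Gemini 3.1 Pro", 0.81, 0.79, 0.82, 0.74, 0.85),
    ("O1", 0.71, 0.52, 0.87, 0.72, 0.82),
]

for agent, reported, outc, traj_d, traj_s, res in ROWS:
    estimate = consistency_agg(outc, traj_d, traj_s, res)
    # Sub-scores are already rounded to two decimals, so allow half a unit
    # in the last place when comparing against the reported value.
    matches = abs(estimate - reported) < 0.005
    print(f"{agent}: reported={reported:.2f} estimate={estimate:.4f} match={matches}")
```

A flat mean of all four sub-scores gives 0.80 for Gemini 3.1 Pro, not the reported 0.81, which is why the sketch averages the two trajectory columns first.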