Consistency on τ-bench (airline, original)

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

Sub-metric Comparison
Resource Consistency Breakdown
Per-Agent Task Outcome Consistency

Each cell represents a task; color shows outcome consistency across repeated runs.

Agent Leaderboard — Consistency
 #  Agent              Acc    Reliability  Consistency (Agg)  Outc  Traj-D  Traj-S  Res
 1  Claude Opus 4.5    58.4%  0.80         0.79               0.68  0.88    0.78    0.85
 2  GPT-4o Mini        21.3%  0.67         0.76               0.72  0.83    0.72    0.80
 3  Gemini 3.0 Pro     58.8%  0.77         0.76               0.62  0.86    0.77    0.84
 4  Claude Sonnet 4.5  54.4%  0.79         0.73               0.54  0.86    0.77    0.85
 5  GPT-5.2            42.0%  0.75         0.72               0.52  0.86    0.76    0.82
 6  GPT-4 Turbo        35.6%  0.69         0.72               0.52  0.85    0.73    0.84
 7  Claude 3.5 Haiku   29.6%  0.68         0.70               0.54  0.83    0.71    0.80
 8  O1                 49.6%  0.76         0.69               0.48  0.86    0.75    0.80
 9  Gemini 2.5 Pro     52.8%  0.74         0.69               0.48  0.84    0.72    0.80
10  Gemini 2.0 Flash   32.0%  0.67         0.68               0.44  0.87    0.73    0.81
11  GPT-5.2 (xhigh)    51.6%  0.76         0.67               0.44  0.85    0.76    0.76
12  Claude 3.7 Sonnet  43.6%  0.72         0.67               0.40  0.82    0.72    0.82
13  Gemini 2.5 Flash   47.2%  0.70         0.64               0.38  0.85    0.70    0.76
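A per-task outcome-consistency score like the one visualized above can be sketched as mean pairwise agreement of binary pass/fail outcomes across repeated runs. This is an assumed formulation for illustration; τ-bench's exact consistency metric, and the `outcome_consistency` / `aggregate_consistency` helpers below, are not taken from the leaderboard itself.

```python
from itertools import combinations


def outcome_consistency(outcomes):
    """Mean pairwise agreement of one task's pass/fail outcomes across runs.

    Assumed formulation: fraction of run pairs that agree on the outcome.
    Returns 1.0 for a single run (nothing to disagree with).
    """
    pairs = list(combinations(outcomes, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def aggregate_consistency(per_task_outcomes):
    """Average the per-task scores to get one number per agent."""
    scores = [outcome_consistency(o) for o in per_task_outcomes]
    return sum(scores) / len(scores)


# Hypothetical agent: 3 tasks, 4 runs each (1 = pass, 0 = fail).
runs = [
    [1, 1, 1, 1],  # fully consistent task  -> 1.0
    [1, 0, 1, 0],  # highly inconsistent    -> 2/6
    [1, 1, 1, 0],  # one flaky run          -> 3/6
]
print(aggregate_consistency(runs))
```

Under this formulation a task that always passes and a task that always fails both score 1.0, which matches the dashboard's framing: consistency measures reproducibility, not accuracy.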