Consistency on τ-bench (airline, clean)

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

[Dashboard panels: Sub-metric Comparison · Resource Consistency Breakdown · Per-Agent Task Outcome Consistency. In the per-agent panel, each cell is a task and its color encodes outcome consistency across runs.]

Agent Leaderboard — Consistency
| # | Agent | Accuracy | Reliability | Consistency (agg) | Outcome | Traj-D | Traj-S | Resource |
|---|-------|----------|-------------|-------------------|---------|--------|--------|----------|
| 1 | Claude Opus 4.5 | 83.1% | 0.88 | 0.82 | 0.77 | 0.88 | 0.79 | 0.85 |
| 2 | GPT-5.2 | 59.2% | 0.80 | 0.76 | 0.65 | 0.86 | 0.76 | 0.83 |
| 3 | Gemini 3.0 Pro | 80.8% | 0.85 | 0.76 | 0.65 | 0.85 | 0.76 | 0.82 |
| 4 | GPT-4o Mini | 32.1% | 0.69 | 0.76 | 0.69 | 0.84 | 0.73 | 0.80 |
| 5 | O1 | 72.3% | 0.80 | 0.73 | 0.58 | 0.87 | 0.75 | 0.80 |
| 6 | Claude Sonnet 4.5 | 78.5% | 0.85 | 0.72 | 0.50 | 0.85 | 0.77 | 0.85 |
| 7 | GPT-4 Turbo | 50.0% | 0.71 | 0.72 | 0.54 | 0.85 | 0.73 | 0.83 |
| 8 | GPT-5.2 (xhigh) | 67.7% | 0.81 | 0.70 | 0.54 | 0.85 | 0.73 | 0.77 |
| 9 | Gemini 2.5 Pro | 73.8% | 0.79 | 0.68 | 0.46 | 0.84 | 0.73 | 0.81 |
| 10 | Claude 3.5 Haiku | 42.3% | 0.68 | 0.66 | 0.42 | 0.82 | 0.70 | 0.80 |
| 11 | Gemini 2.0 Flash | 44.6% | 0.70 | 0.65 | 0.35 | 0.89 | 0.73 | 0.80 |
| 12 | Claude 3.7 Sonnet | 56.2% | 0.75 | 0.65 | 0.35 | 0.84 | 0.74 | 0.81 |
| 13 | Gemini 2.5 Flash | 65.4% | 0.73 | 0.62 | 0.31 | 0.85 | 0.72 | 0.77 |

Columns: Accuracy and Reliability are the agent's overall scores on this benchmark split; Consistency (agg) aggregates the four sub-metrics to its right (Outcome, Traj-D, Traj-S, Resource), which score how stable the final task outcome, the trajectory, and resource usage are across repeated runs. Rows are sorted by aggregate consistency.
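The exact formulas behind these sub-metrics are not given on this page. As a minimal sketch of one common way to score outcome consistency, the snippet below treats it as the pairwise agreement rate: the fraction of run pairs on the same task that produced the same outcome, averaged over tasks. The function names and the pass/fail outcome encoding are illustrative assumptions, not the benchmark's actual implementation.

```python
from collections import Counter

def outcome_consistency(outcomes):
    """Pairwise agreement rate for one task's repeated runs.

    `outcomes` holds one outcome label per run, e.g. ["pass", "pass", "fail"].
    Returns 1.0 when every run agrees, 0.0 when no two runs agree.
    """
    n = len(outcomes)
    if n < 2:
        return 1.0  # a single run is trivially consistent with itself
    # Count ordered pairs of runs sharing the same outcome label.
    same = sum(c * (c - 1) for c in Counter(outcomes).values())
    return same / (n * (n - 1))

def agent_outcome_consistency(runs_by_task):
    """Agent-level score: mean per-task consistency over all tasks."""
    scores = [outcome_consistency(o) for o in runs_by_task.values()]
    return sum(scores) / len(scores)

# Example: perfectly stable on task "a", a coin flip on task "b".
score = agent_outcome_consistency({
    "a": ["pass", "pass", "pass", "pass"],
    "b": ["pass", "fail", "pass", "fail"],
})
```

Under this reading, an agent can have low accuracy yet high outcome consistency (it fails the same tasks the same way every run), which matches patterns in the table such as GPT-4o Mini's low accuracy but mid-pack consistency.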