Consistency on τ-bench (airline, original)

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

[Interactive panels: Sub-metric Comparison · Resource Consistency Breakdown · Per-Agent Task Outcome Consistency (a heatmap in which each cell is a task, colored by that task's outcome consistency across runs).]

Agent Leaderboard — Consistency
| # | Agent | Accuracy | Consistency (agg) | Outcome | Traj-D | Traj-S | Resource | Overall |
|---|-------|----------|-------------------|---------|--------|--------|----------|---------|
| 1 | Claude Opus 4.5 | 58.4% | 0.79 | 0.68 | 0.88 | 0.78 | 0.85 | 0.80 |
| 2 | GPT-4o Mini | 21.3% | 0.76 | 0.72 | 0.83 | 0.72 | 0.80 | 0.67 |
| 3 | Gemini 3.0 Pro | 58.8% | 0.76 | 0.62 | 0.86 | 0.77 | 0.84 | 0.77 |
| 4 | Claude Sonnet 4.5 | 54.4% | 0.73 | 0.54 | 0.86 | 0.77 | 0.85 | 0.79 |
| 5 | GPT-5.2 | 42.0% | 0.72 | 0.52 | 0.86 | 0.76 | 0.82 | 0.75 |
| 6 | GPT-4 Turbo | 35.6% | 0.72 | 0.52 | 0.85 | 0.73 | 0.84 | 0.69 |
| 7 | Claude 3.5 Haiku | 29.6% | 0.70 | 0.54 | 0.83 | 0.71 | 0.80 | 0.68 |
| 8 | O1 | 49.6% | 0.69 | 0.48 | 0.86 | 0.75 | 0.80 | 0.76 |
| 9 | Gemini 2.5 Pro | 52.8% | 0.69 | 0.48 | 0.84 | 0.72 | 0.80 | 0.74 |
| 10 | Gemini 2.0 Flash | 32.0% | 0.68 | 0.44 | 0.87 | 0.73 | 0.81 | 0.67 |
| 11 | GPT-5.2 (xhigh) | 51.6% | 0.67 | 0.44 | 0.85 | 0.76 | 0.76 | 0.76 |
| 12 | Claude 3.7 Sonnet | 43.6% | 0.67 | 0.40 | 0.82 | 0.72 | 0.82 | 0.72 |
| 13 | Gemini 2.5 Flash | 47.2% | 0.64 | 0.38 | 0.85 | 0.70 | 0.76 | 0.70 |
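The Outcome column above scores how often repeated runs of the same task agree on the final pass/fail result. The exact formula is not stated on this page, so the sketch below assumes a simple pairwise-agreement definition over k runs; the function name and the boolean pass/fail encoding are illustrative, not the benchmark's actual implementation.

```python
from itertools import combinations

def outcome_consistency(outcomes):
    """Fraction of run pairs whose task outcome (pass/fail) agrees.

    `outcomes` is one boolean per repeated run of a single task.
    A task whose runs all pass, or all fail, scores 1.0.
    """
    pairs = list(combinations(outcomes, 2))
    if not pairs:  # a single run is trivially self-consistent
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Four runs of one task: three passes, one failure.
# Of the 6 run pairs, the 3 pass/pass pairs agree -> 3/6 = 0.5
print(outcome_consistency([True, True, True, False]))
```

Averaging this per-task score over all tasks would yield a leaderboard-style outcome figure; a fully consistent but always-failing agent can still score 1.0, which is why accuracy and consistency are reported as separate columns.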