Consistency on τ-bench (airline, clean)

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

Figures: Sub-metric Comparison, Resource Consistency Breakdown, and Per-Agent Task Outcome Consistency. In the per-agent figure, each cell represents a task and its color shows outcome consistency across runs.
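
To make the idea of per-task outcome consistency concrete, here is a minimal Python sketch. It is not the benchmark's definition, which this page does not state. It assumes one simple stand-in measure: the fraction of runs that agree with the majority pass/fail outcome for a task. The function name `outcome_consistency` and the example task IDs are hypothetical.

```python
from collections import Counter
from typing import Sequence


def outcome_consistency(outcomes: Sequence[bool]) -> float:
    """Fraction of runs that share the most common outcome.

    ASSUMPTION: an illustrative measure only, not the metric used by
    the leaderboard. 1.0 means every run ended the same way; 0.5 is
    the minimum for binary pass/fail outcomes.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    # most_common(1) returns [(outcome, count)] for the majority outcome.
    majority_count = Counter(outcomes).most_common(1)[0][1]
    return majority_count / len(outcomes)


# Hypothetical pass/fail results for repeated runs on two tasks.
runs_by_task = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
}

for task_id, outcomes in runs_by_task.items():
    print(task_id, outcome_consistency(outcomes))
```

Under this assumed measure, a task the agent always fails scores as high as one it always passes, which is why consistency is reported separately from accuracy.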

Agent Leaderboard — Consistency
#   Agent             Acc    Reliability  Consistency Agg  Outc  Traj-D  Traj-S  Res
1   Claude Opus 4.7   84.6%  0.89         0.88             0.90  0.94    0.82    0.86
2   GPT-5.5           79.5%  0.89         0.84             0.83  0.88    0.82    0.84
3   Claude Opus 4.5   80.8%  0.88         0.84             0.83  0.86    0.81    0.84
4   Gemini 3.1 Pro    82.1%  0.86         0.81             0.79  0.82    0.74    0.85
5   GPT-5.2           60.3%  0.81         0.80             0.79  0.86    0.75    0.81
6   Gemini 3.5 Flash  80.8%  0.86         0.78             0.86  0.75    0.65    0.78
7   GPT-5.2 (medium)  67.9%  0.82         0.76             0.69  0.84    0.77    0.79
8   Claude Sonnet 4   78.2%  0.86         0.76             0.62  0.86    0.78    0.84
9   GPT-4 Turbo       57.7%  0.72         0.76             0.62  0.87    0.74    0.86
10  Gemini 2.5 Flash  59.0%  0.72         0.75             0.73  0.83    0.70    0.76
11  Claude 3 Haiku    20.5%  0.70         0.74             0.76  0.76    0.61    0.78
12  GPT-4o Mini       29.5%  0.67         0.74             0.66  0.82    0.70    0.80
13  Gemini 2.5 Pro    71.8%  0.81         0.74             0.66  0.85    0.72    0.77
14  Claude 3.5 Haiku  29.5%  0.71         0.73             0.56  0.86    0.75    0.82
15  O1                66.2%  0.81         0.71             0.52  0.87    0.72    0.82
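
The ordering of the leaderboard can be checked mechanically. The Python sketch below parses the rows listed above and tests which column the ranking follows. The field names (`acc`, `consistency_agg`, and so on) and the reading of "Consistency Agg" as a single column are assumptions of this sketch, inferred from each row carrying one percentage followed by six scores.

```python
import re

# Rows copied from the leaderboard: rank, agent, Acc, then six scores.
ROWS = """\
1 Claude Opus 4.7 84.6% 0.89 0.88 0.90 0.94 0.82 0.86
2 GPT-5.5 79.5% 0.89 0.84 0.83 0.88 0.82 0.84
3 Claude Opus 4.5 80.8% 0.88 0.84 0.83 0.86 0.81 0.84
4 Gemini 3.1 Pro 82.1% 0.86 0.81 0.79 0.82 0.74 0.85
5 GPT-5.2 60.3% 0.81 0.80 0.79 0.86 0.75 0.81
6 Gemini 3.5 Flash 80.8% 0.86 0.78 0.86 0.75 0.65 0.78
7 GPT-5.2 (medium) 67.9% 0.82 0.76 0.69 0.84 0.77 0.79
8 Claude Sonnet 4 78.2% 0.86 0.76 0.62 0.86 0.78 0.84
9 GPT-4 Turbo 57.7% 0.72 0.76 0.62 0.87 0.74 0.86
10 Gemini 2.5 Flash 59.0% 0.72 0.75 0.73 0.83 0.70 0.76
11 Claude 3 Haiku 20.5% 0.70 0.74 0.76 0.76 0.61 0.78
12 GPT-4o Mini 29.5% 0.67 0.74 0.66 0.82 0.70 0.80
13 Gemini 2.5 Pro 71.8% 0.81 0.74 0.66 0.85 0.72 0.77
14 Claude 3.5 Haiku 29.5% 0.71 0.73 0.56 0.86 0.75 0.82
15 O1 66.2% 0.81 0.71 0.52 0.87 0.72 0.82
"""

# ASSUMPTION: field names chosen for this sketch, in header order.
COLUMNS = ["acc", "reliability", "consistency_agg",
           "outc", "traj_d", "traj_s", "res"]

# rank, agent name (may contain spaces and digits), Acc%, six scores.
ROW_RE = re.compile(r"^(\d+)\s+(.+?)\s+([\d.]+)%((?:\s+\d\.\d+){6})$")


def parse_rows(text):
    """Turn the flattened leaderboard text into a list of dicts."""
    agents = []
    for line in text.strip().splitlines():
        match = ROW_RE.match(line.strip())
        if match is None:
            raise ValueError(f"unparseable row: {line!r}")
        rank, name, acc, rest = match.groups()
        values = [float(acc)] + [float(x) for x in rest.split()]
        agents.append({"rank": int(rank), "agent": name,
                       **dict(zip(COLUMNS, values))})
    return agents


def is_sorted_desc(agents, column):
    """True if `column` never increases from one rank to the next."""
    values = [a[column] for a in agents]
    return all(x >= y for x, y in zip(values, values[1:]))


agents = parse_rows(ROWS)
for column in COLUMNS:
    print(column, is_sorted_desc(agents, column))
```

Only `consistency_agg` is non-increasing down the ranks, which supports reading the leaderboard as sorted by the aggregate consistency score and not by accuracy or reliability.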