Compare: τ-bench (airline, original) vs τ-bench (airline, clean)

Side-by-side comparison of all reliability metrics on the original 50-task dataset versus the curated 26-task clean subset. The clean version excludes tasks with identified grading and specification issues and is used in the main results.

All Metrics (Original / Clean)
Each cell shows Original / Clean. Sub-metrics are grouped as follows: Consistency (Outc, Traj-D, Traj-S, Res), Predictability (Cal, AUROC, Brier), Robustness (Fault, Struct, Prompt), Safety (Harm, Comp, Safety); Accuracy and Overall are single scores.

| Agent | Accuracy | Outc | Traj-D | Traj-S | Res | Cal | AUROC | Brier | Fault | Struct | Prompt | Harm | Comp | Safety | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | 0.58 / 0.83 | 0.68 / 0.77 | 0.88 / 0.88 | 0.78 / 0.79 | 0.85 / 0.85 | 0.71 / 0.93 | 0.67 / 0.68 | 0.69 / 0.87 | 0.95 / 0.97 | 0.96 / 0.93 | 0.89 / 0.93 | 0.46 / 0.50 | 0.94 / 0.98 | 0.97 / 0.99 | 0.80 / 0.88 |
| Claude Sonnet 4.5 | 0.54 / 0.78 | 0.54 / 0.50 | 0.86 / 0.85 | 0.77 / 0.77 | 0.85 / 0.85 | 0.68 / 0.90 | 0.61 / 0.68 | 0.66 / 0.84 | 0.97 / 1.00 | 1.00 / 1.00 | 0.97 / 0.98 | 0.50 / 0.50 | 0.94 / 0.99 | 0.97 / 1.00 | 0.79 / 0.85 |
| Gemini 3.0 Pro | 0.59 / 0.81 | 0.62 / 0.65 | 0.86 / 0.85 | 0.77 / 0.76 | 0.84 / 0.82 | 0.60 / 0.82 | 0.52 / 0.52 | 0.60 / 0.81 | 1.00 / 1.00 | 0.92 / 1.00 | 0.91 / 0.95 | 0.21 / 0.25 | 0.95 / 0.97 | 0.96 / 0.98 | 0.77 / 0.85 |
| O1 | 0.50 / 0.72 | 0.48 / 0.58 | 0.86 / 0.87 | 0.75 / 0.75 | 0.80 / 0.80 | 0.74 / 0.77 | 0.50 / 0.45 | 0.64 / 0.75 | 0.95 / 0.95 | 1.00 / 1.00 | 0.86 / 0.85 | 0.40 / 0.50 | 0.81 / 0.86 | 0.89 / 0.93 | 0.76 / 0.80 |
| GPT-5.2 (xhigh) | 0.52 / 0.68 | 0.44 / 0.54 | 0.85 / 0.85 | 0.76 / 0.73 | 0.76 / 0.77 | 0.68 / 0.81 | 0.62 / 0.75 | 0.65 / 0.78 | 1.00 / 1.00 | 1.00 / 1.00 | 0.84 / 0.89 | 0.43 / 0.40 | 0.89 / 0.92 | 0.94 / 0.95 | 0.76 / 0.81 |
| GPT-5.2 | 0.42 / 0.59 | 0.52 / 0.65 | 0.86 / 0.86 | 0.76 / 0.76 | 0.82 / 0.83 | 0.55 / 0.71 | 0.56 / 0.62 | 0.55 / 0.68 | 1.00 / 1.00 | 1.00 / 1.00 | 0.92 / 0.84 | 0.36 / 0.30 | 0.89 / 0.92 | 0.93 / 0.95 | 0.75 / 0.80 |
| Gemini 2.5 Pro | 0.53 / 0.74 | 0.48 / 0.46 | 0.84 / 0.84 | 0.72 / 0.73 | 0.80 / 0.81 | 0.56 / 0.77 | 0.58 / 0.70 | 0.56 / 0.77 | 0.97 / 0.98 | 0.98 / 0.83 | 0.92 / 0.97 | 0.40 / 0.40 | 0.81 / 0.88 | 0.89 / 0.93 | 0.74 / 0.79 |
| Claude 3.7 Sonnet | 0.44 / 0.56 | 0.40 / 0.35 | 0.82 / 0.84 | 0.72 / 0.74 | 0.82 / 0.81 | 0.54 / 0.65 | 0.53 / 0.48 | 0.54 / 0.63 | 1.00 / 1.00 | 1.00 / 1.00 | 0.89 / 0.91 | 0.45 / 0.46 | 0.79 / 0.82 | 0.88 / 0.90 | 0.72 / 0.75 |
| Gemini 2.5 Flash | 0.47 / 0.65 | 0.38 / 0.31 | 0.85 / 0.85 | 0.70 / 0.72 | 0.76 / 0.77 | 0.52 / 0.62 | 0.54 / 0.53 | 0.52 / 0.61 | 1.00 / 1.00 | 0.97 / 1.00 | 0.88 / 0.86 | 0.39 / 0.37 | 0.82 / 0.88 | 0.89 / 0.93 | 0.70 / 0.73 |
| GPT-4 Turbo | 0.36 / 0.50 | 0.52 / 0.54 | 0.85 / 0.85 | 0.73 / 0.73 | 0.84 / 0.83 | 0.38 / 0.52 | 0.47 / 0.45 | 0.38 / 0.51 | 0.98 / 0.95 | 0.96 / 0.85 | 1.00 / 0.95 | 0.44 / 0.43 | 0.72 / 0.78 | 0.85 / 0.87 | 0.69 / 0.71 |
| Claude 3.5 Haiku | 0.30 / 0.42 | 0.54 / 0.42 | 0.83 / 0.82 | 0.71 / 0.70 | 0.80 / 0.80 | 0.45 / 0.53 | 0.44 / 0.42 | 0.46 / 0.53 | 0.86 / 0.81 | 1.00 / 1.00 | 0.77 / 0.74 | 0.41 / 0.42 | 0.61 / 0.67 | 0.77 / 0.81 | 0.68 / 0.68 |
| Gemini 2.0 Flash | 0.32 / 0.45 | 0.44 / 0.35 | 0.87 / 0.89 | 0.73 / 0.73 | 0.81 / 0.80 | 0.36 / 0.46 | 0.61 / 0.56 | 0.38 / 0.48 | 0.98 / 0.98 | 1.00 / 1.00 | 0.85 / 0.98 | 0.37 / 0.41 | 0.72 / 0.78 | 0.82 / 0.87 | 0.67 / 0.70 |
| GPT-4o Mini | 0.21 / 0.32 | 0.72 / 0.69 | 0.83 / 0.84 | 0.72 / 0.73 | 0.80 / 0.80 | 0.29 / 0.39 | 0.48 / 0.48 | 0.32 / 0.41 | 1.00 / 1.00 | 0.84 / 0.72 | 0.91 / 1.00 | 0.41 / 0.40 | 0.59 / 0.69 | 0.76 / 0.81 | 0.67 / 0.69 |
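To sanity-check the Original → Clean shifts, a minimal Python sketch is below. The two example rows and their values are copied from the table above; the dictionary layout and printed format are purely illustrative assumptions, not part of the benchmark's released tooling, and the dict can be extended to the remaining agents and metrics in the same way.

```python
# Minimal sketch: compute Original -> Clean deltas for two headline metrics.
# Values below are copied from the comparison table; layout is illustrative only.
scores = {
    # agent: {metric: (original, clean)}
    "Claude Opus 4.5": {"Accuracy": (0.58, 0.83), "Overall": (0.80, 0.88)},
    "GPT-4o Mini":     {"Accuracy": (0.21, 0.32), "Overall": (0.67, 0.69)},
}

for agent, metrics in scores.items():
    for metric, (orig, clean) in metrics.items():
        delta = clean - orig  # positive = higher score on the clean subset
        print(f"{agent:<18} {metric:<8} {orig:.2f} -> {clean:.2f} ({delta:+.2f})")
```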