Compare: τ-bench (airline, original) vs τ-bench (airline, clean)
Side-by-side comparison of all reliability metrics on the original 50-task dataset versus the curated 26-task clean subset. The clean version excludes tasks with identified grading and specification issues and is used in the main results.
[Charts: Accuracy, Overall Reliability, Consistency, Predictability, Robustness, Safety]
All Metrics (Original / Clean)
| Agent | Accuracy (Orig) | Accuracy (Clean) | Consistency: Outc (Orig) | Consistency: Outc (Clean) | Consistency: Traj-D (Orig) | Consistency: Traj-D (Clean) | Consistency: Traj-S (Orig) | Consistency: Traj-S (Clean) | Consistency: Res (Orig) | Consistency: Res (Clean) | Predictability: Cal (Orig) | Predictability: Cal (Clean) | Predictability: AUROC (Orig) | Predictability: AUROC (Clean) | Predictability: Brier (Orig) | Predictability: Brier (Clean) | Robustness: Fault (Orig) | Robustness: Fault (Clean) | Robustness: Struct (Orig) | Robustness: Struct (Clean) | Robustness: Prompt (Orig) | Robustness: Prompt (Clean) | Safety: Harm (Orig) | Safety: Harm (Clean) | Safety: Comp (Orig) | Safety: Comp (Clean) | Safety: Safety (Orig) | Safety: Safety (Clean) | Overall (Orig) | Overall (Clean) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.58 | 0.83 | 0.68 | 0.77 | 0.88 | 0.88 | 0.78 | 0.79 | 0.85 | 0.85 | 0.71 | 0.93 | 0.67 | 0.68 | 0.69 | 0.87 | 0.95 | 0.97 | 0.96 | 0.93 | 0.89 | 0.93 | 0.46 | 0.50 | 0.94 | 0.98 | 0.97 | 0.99 | 0.80 | 0.88 |
| | 0.54 | 0.78 | 0.54 | 0.50 | 0.86 | 0.85 | 0.77 | 0.77 | 0.85 | 0.85 | 0.68 | 0.90 | 0.61 | 0.68 | 0.66 | 0.84 | 0.97 | 1.00 | 1.00 | 1.00 | 0.97 | 0.98 | 0.50 | 0.50 | 0.94 | 0.99 | 0.97 | 1.00 | 0.79 | 0.85 |
| | 0.59 | 0.81 | 0.62 | 0.65 | 0.86 | 0.85 | 0.77 | 0.76 | 0.84 | 0.82 | 0.60 | 0.82 | 0.52 | 0.52 | 0.60 | 0.81 | 1.00 | 1.00 | 0.92 | 1.00 | 0.91 | 0.95 | 0.21 | 0.25 | 0.95 | 0.97 | 0.96 | 0.98 | 0.77 | 0.85 |
| | 0.50 | 0.72 | 0.48 | 0.58 | 0.86 | 0.87 | 0.75 | 0.75 | 0.80 | 0.80 | 0.74 | 0.77 | 0.50 | 0.45 | 0.64 | 0.75 | 0.95 | 0.95 | 1.00 | 1.00 | 0.86 | 0.85 | 0.40 | 0.50 | 0.81 | 0.86 | 0.89 | 0.93 | 0.76 | 0.80 |
| | 0.52 | 0.68 | 0.44 | 0.54 | 0.85 | 0.85 | 0.76 | 0.73 | 0.76 | 0.77 | 0.68 | 0.81 | 0.62 | 0.75 | 0.65 | 0.78 | 1.00 | 1.00 | 1.00 | 1.00 | 0.84 | 0.89 | 0.43 | 0.40 | 0.89 | 0.92 | 0.94 | 0.95 | 0.76 | 0.81 |
| | 0.42 | 0.59 | 0.52 | 0.65 | 0.86 | 0.86 | 0.76 | 0.76 | 0.82 | 0.83 | 0.55 | 0.71 | 0.56 | 0.62 | 0.55 | 0.68 | 1.00 | 1.00 | 1.00 | 1.00 | 0.92 | 0.84 | 0.36 | 0.30 | 0.89 | 0.92 | 0.93 | 0.95 | 0.75 | 0.80 |
| | 0.53 | 0.74 | 0.48 | 0.46 | 0.84 | 0.84 | 0.72 | 0.73 | 0.80 | 0.81 | 0.56 | 0.77 | 0.58 | 0.70 | 0.56 | 0.77 | 0.97 | 0.98 | 0.98 | 0.83 | 0.92 | 0.97 | 0.40 | 0.40 | 0.81 | 0.88 | 0.89 | 0.93 | 0.74 | 0.79 |
| | 0.44 | 0.56 | 0.40 | 0.35 | 0.82 | 0.84 | 0.72 | 0.74 | 0.82 | 0.81 | 0.54 | 0.65 | 0.53 | 0.48 | 0.54 | 0.63 | 1.00 | 1.00 | 1.00 | 1.00 | 0.89 | 0.91 | 0.45 | 0.46 | 0.79 | 0.82 | 0.88 | 0.90 | 0.72 | 0.75 |
| | 0.47 | 0.65 | 0.38 | 0.31 | 0.85 | 0.85 | 0.70 | 0.72 | 0.76 | 0.77 | 0.52 | 0.62 | 0.54 | 0.53 | 0.52 | 0.61 | 1.00 | 1.00 | 0.97 | 1.00 | 0.88 | 0.86 | 0.39 | 0.37 | 0.82 | 0.88 | 0.89 | 0.93 | 0.70 | 0.73 |
| | 0.36 | 0.50 | 0.52 | 0.54 | 0.85 | 0.85 | 0.73 | 0.73 | 0.84 | 0.83 | 0.38 | 0.52 | 0.47 | 0.45 | 0.38 | 0.51 | 0.98 | 0.95 | 0.96 | 0.85 | 1.00 | 0.95 | 0.44 | 0.43 | 0.72 | 0.78 | 0.85 | 0.87 | 0.69 | 0.71 |
| | 0.30 | 0.42 | 0.54 | 0.42 | 0.83 | 0.82 | 0.71 | 0.70 | 0.80 | 0.80 | 0.45 | 0.53 | 0.44 | 0.42 | 0.46 | 0.53 | 0.86 | 0.81 | 1.00 | 1.00 | 0.77 | 0.74 | 0.41 | 0.42 | 0.61 | 0.67 | 0.77 | 0.81 | 0.68 | 0.68 |
| | 0.32 | 0.45 | 0.44 | 0.35 | 0.87 | 0.89 | 0.73 | 0.73 | 0.81 | 0.80 | 0.36 | 0.46 | 0.61 | 0.56 | 0.38 | 0.48 | 0.98 | 0.98 | 1.00 | 1.00 | 0.85 | 0.98 | 0.37 | 0.41 | 0.72 | 0.78 | 0.82 | 0.87 | 0.67 | 0.70 |
| | 0.21 | 0.32 | 0.72 | 0.69 | 0.83 | 0.84 | 0.72 | 0.73 | 0.80 | 0.80 | 0.29 | 0.39 | 0.48 | 0.48 | 0.32 | 0.41 | 1.00 | 1.00 | 0.84 | 0.72 | 0.91 | 1.00 | 0.41 | 0.40 | 0.59 | 0.69 | 0.76 | 0.81 | 0.67 | 0.69 |
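
Because every metric appears as an Orig/Clean pair, the shift from the original dataset to the clean subset can be read off as a per-metric difference. The following is a minimal sketch of that arithmetic, not part of the benchmark tooling: the names `METRICS`, `parse_row`, and `deltas` are hypothetical, and the parser assumes a row holds exactly 15 value pairs in the column order of the table above, with the agent cell blank.

```python
# Metric labels in table order (assumed: group name plus sub-metric label).
METRICS = [
    "Accuracy",
    "Consistency: Outc", "Consistency: Traj-D",
    "Consistency: Traj-S", "Consistency: Res",
    "Predictability: Cal", "Predictability: AUROC", "Predictability: Brier",
    "Robustness: Fault", "Robustness: Struct", "Robustness: Prompt",
    "Safety: Harm", "Safety: Comp", "Safety: Safety",
    "Overall",
]


def parse_row(line):
    """Parse one pipe-delimited data row into {metric: (orig, clean)}."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # Skip empty cells; this assumes the agent cell is blank, as in the table.
    values = [float(c) for c in cells if c]
    if len(values) != 2 * len(METRICS):
        raise ValueError(f"expected {2 * len(METRICS)} values, got {len(values)}")
    return {m: (values[2 * i], values[2 * i + 1]) for i, m in enumerate(METRICS)}


def deltas(row):
    """Return Clean minus Orig for each metric, rounded to two decimals."""
    return {m: round(clean - orig, 2) for m, (orig, clean) in row.items()}


# First data row of the table, copied verbatim.
first_row = (
    "| 0.58 | 0.83 | 0.68 | 0.77 | 0.88 | 0.88 | 0.78 | 0.79 | 0.85 | 0.85 "
    "| 0.71 | 0.93 | 0.67 | 0.68 | 0.69 | 0.87 | 0.95 | 0.97 | 0.96 | 0.93 "
    "| 0.89 | 0.93 | 0.46 | 0.50 | 0.94 | 0.98 | 0.97 | 0.99 | 0.80 | 0.88 | |"
)

row = parse_row(first_row)
d = deltas(row)

# Print metrics sorted by how much they changed on the clean subset.
for metric, change in sorted(d.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{metric:<24} {change:+.2f}")
```

Sorting by the difference shows at a glance which metrics are most sensitive to the excluded tasks and which barely move. A negative value means the score is lower on the clean subset than on the original dataset.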