Compare: τ-bench (airline, original) vs τ-bench (airline, clean)
Side-by-side comparison of all reliability metrics on the original 50-task dataset versus the curated 26-task clean subset. The clean version excludes tasks with identified grading and specification issues and is used in the main results.
[Figure panels: Accuracy, Overall Reliability, Consistency, Predictability, Robustness, Safety]
All Metrics (Original / Clean)
| Agent | Accuracy | Consistency: Outc | Consistency: Traj-D | Consistency: Traj-S | Consistency: Res | Predictability: Cal | Predictability: AUROC | Predictability: Brier | Robustness: Fault | Robustness: Struct | Robustness: Prompt | Safety: Harm | Safety: Comp | Safety: Safety | Reliability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.58 / 0.81 | 0.68 / 0.83 | 0.88 / 0.86 | 0.78 / 0.81 | 0.85 / 0.84 | 0.71 / 0.90 | 0.67 / 0.70 | 0.69 / 0.85 | 0.95 / 1.00 | 0.96 / 0.98 | 0.89 / 0.92 | 0.46 / 0.50 | 0.94 / 0.96 | 0.97 / 0.98 | 0.80 / 0.88 |
| | 0.50 / 0.66 | 0.48 / 0.52 | 0.86 / 0.87 | 0.75 / 0.72 | 0.80 / 0.82 | 0.74 / 0.83 | 0.50 / 0.52 | 0.64 / 0.74 | 0.95 / 1.00 | 1.00 / 1.00 | 0.86 / 0.93 | 0.40 / 0.46 | 0.81 / 0.83 | 0.89 / 0.91 | 0.76 / 0.81 |
| | 0.42 / 0.60 | 0.52 / 0.79 | 0.86 / 0.86 | 0.76 / 0.75 | 0.82 / 0.81 | 0.55 / 0.73 | 0.56 / 0.57 | 0.55 / 0.69 | 1.00 / 1.00 | 1.00 / 0.89 | 0.92 / 0.87 | 0.36 / 0.38 | 0.89 / 0.95 | 0.93 / 0.97 | 0.75 / 0.81 |
| | 0.53 / 0.72 | 0.48 / 0.66 | 0.84 / 0.85 | 0.72 / 0.72 | 0.80 / 0.77 | 0.56 / 0.77 | 0.58 / 0.67 | 0.56 / 0.78 | 0.97 / 1.00 | 0.98 / 1.00 | 0.92 / 0.79 | 0.40 / 0.33 | 0.81 / 0.88 | 0.89 / 0.92 | 0.74 / 0.81 |
| | 0.47 / 0.59 | 0.38 / 0.73 | 0.85 / 0.83 | 0.70 / 0.70 | 0.76 / 0.76 | 0.52 / 0.60 | 0.54 / 0.53 | 0.52 / 0.60 | 1.00 / 0.78 | 0.97 / 0.85 | 0.88 / 0.83 | 0.39 / 0.27 | 0.82 / 0.86 | 0.89 / 0.90 | 0.70 / 0.72 |
| | 0.36 / 0.58 | 0.52 / 0.62 | 0.85 / 0.87 | 0.73 / 0.74 | 0.84 / 0.86 | 0.38 / 0.59 | 0.47 / 0.46 | 0.38 / 0.58 | 0.98 / 0.83 | 0.96 / 0.73 | 1.00 / 0.87 | 0.44 / 0.38 | 0.72 / 0.79 | 0.85 / 0.87 | 0.69 / 0.72 |
| | 0.30 / 0.29 | 0.54 / 0.56 | 0.83 / 0.86 | 0.71 / 0.75 | 0.80 / 0.82 | 0.45 / 0.46 | 0.44 / 0.42 | 0.46 / 0.47 | 0.86 / 1.00 | 1.00 / 0.78 | 0.77 / 1.00 | 0.41 / 0.40 | 0.61 / 0.68 | 0.77 / 0.81 | 0.68 / 0.71 |
| | 0.21 / 0.29 | 0.72 / 0.66 | 0.83 / 0.82 | 0.72 / 0.70 | 0.80 / 0.80 | 0.29 / 0.34 | 0.48 / 0.36 | 0.32 / 0.35 | 1.00 / 1.00 | 0.84 / 0.91 | 0.91 / 0.87 | 0.41 / 0.45 | 0.59 / 0.72 | 0.76 / 0.85 | 0.67 / 0.67 |
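
To summarize how much the clean subset shifts the headline numbers, the following is a minimal sketch that averages the Clean minus Orig difference over the eight rows for the Accuracy and Reliability columns. The values are copied from the table above in row order; the variable names and the `mean_delta` helper are illustrative assumptions, not part of the benchmark's tooling.

```python
# (orig, clean) pairs copied from the table, one pair per row, in table order.
accuracy = [
    (0.58, 0.81), (0.50, 0.66), (0.42, 0.60), (0.53, 0.72),
    (0.47, 0.59), (0.36, 0.58), (0.30, 0.29), (0.21, 0.29),
]
reliability = [
    (0.80, 0.88), (0.76, 0.81), (0.75, 0.81), (0.74, 0.81),
    (0.70, 0.72), (0.69, 0.72), (0.68, 0.71), (0.67, 0.67),
]


def mean_delta(pairs):
    """Average of (clean - orig) across rows (hypothetical helper)."""
    return sum(clean - orig for orig, clean in pairs) / len(pairs)


acc_shift = mean_delta(accuracy)
rel_shift = mean_delta(reliability)

# Count rows whose accuracy is lower on the clean subset than on the original.
acc_drops = sum(1 for orig, clean in accuracy if clean < orig)

print(f"mean accuracy shift (clean - orig):    {acc_shift:+.3f}")
print(f"mean reliability shift (clean - orig): {rel_shift:+.3f}")
print(f"rows with lower clean accuracy:        {acc_drops}")
```

On these values, accuracy rises by about 0.15 on average while overall reliability rises by about 0.04, and only one row has lower accuracy on the clean subset. An unweighted mean over rows is just one reasonable summary; per-row differences can be read directly from the table.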