Accuracy is not enough.

Rising accuracy scores suggest rapid progress, but agents still fail unpredictably in practice. A single success metric obscures whether agents behave consistently across runs, withstand perturbations, fail predictably, or respect safety constraints. We evaluate 14 agents across 2 benchmarks on twelve metrics spanning four reliability dimensions — and find that recent capability gains have yielded only small improvements in reliability.

Agents Evaluated

Benchmarks

Reliability Dimensions

Paper Key Findings Methodology Cite This Work

Reliability Trends

Agent Leaderboard

#	Agent	Acc	Consistency					Predictability				Robustness				Safety			Overall
#	Agent	Acc	Agg	Outc	Traj-D	Traj-S	Res	Agg	Cal	AUROC	Brier	Agg	Fault	Struct	Prompt	Agg	Harm	Comp	Overall
1	Gemini 3.0 Pro	80.8%	0.76	0.65	0.85	0.76	0.82	0.81	0.82	0.52	0.81	0.98	1.00	1.00	0.95	0.98	0.25	0.97	0.85
2	Claude Opus 4.5	77.3%	0.74	0.73	0.80	0.67	0.76	0.84	0.93	0.70	0.84	0.95	0.98	0.96	0.91	0.99	0.75	0.99	0.85
3	Claude Sonnet 4.5	76.6%	0.68	0.57	0.77	0.63	0.76	0.83	0.90	0.67	0.83	0.97	1.00	0.97	0.96	1.00	0.75	1.00	0.83
4	GPT-5.2 (xhigh)	67.7%	0.70	0.54	0.85	0.73	0.77	0.78	0.81	0.75	0.78	0.96	1.00	1.00	0.89	0.95	0.40	0.92	0.81
5	Gemini 2.5 Pro	62.0%	0.65	0.53	0.79	0.65	0.71	0.76	0.77	0.73	0.76	0.95	0.99	0.92	0.93	0.96	0.37	0.94	0.79
6	GPT-5.2	44.6%	0.69	0.62	0.81	0.66	0.72	0.70	0.72	0.68	0.70	0.91	1.00	0.97	0.77	0.97	0.41	0.95	0.77
7	O1	53.5%	0.72	0.64	0.83	0.73	0.75	0.74	0.73	0.64	0.74	0.82	0.86	0.89	0.72	0.97	0.62	0.93	0.76
8	Claude 3.7 Sonnet	59.3%	0.64	0.49	0.78	0.64	0.71	0.70	0.76	0.58	0.70	0.94	0.96	0.98	0.89	0.95	0.48	0.91	0.76
9	Gemini 2.5 Flash	51.6%	0.60	0.41	0.77	0.64	0.69	0.67	0.67	0.68	0.67	0.97	1.00	1.00	0.90	0.96	0.43	0.94	0.74
10	GPT-5.2 (medium)	42.6%	0.58	0.55	0.70	0.50	0.59	0.70	0.74	0.65	0.70	0.95	0.97	1.00	0.88	0.98	0.50	0.95	0.74
11	GPT-4 Turbo	35.0%	0.71	0.63	0.82	0.69	0.75	0.63	0.60	0.64	0.63	0.87	0.91	0.80	0.88	0.94	0.47	0.89	0.74
12	GPT-4o Mini	27.0%	0.70	0.66	0.75	0.64	0.73	0.55	0.49	0.64	0.55	0.89	0.90	0.85	0.92	0.91	0.45	0.85	0.71
13	Claude 3.5 Haiku	35.2%	0.68	0.53	0.83	0.72	0.74	0.63	0.62	0.57	0.63	0.82	0.88	0.89	0.68	0.90	0.71	0.83	0.71
14	Gemini 2.0 Flash	36.2%	0.63	0.47	0.82	0.66	0.67	0.60	0.56	0.70	0.60	0.87	0.93	0.84	0.85	0.93	0.45	0.88	0.70

Benchmarks

GAIA

Real-world question-answering tasks requiring multi-step reasoning, tool use, and web browsing.

View detailed reliability analysis →

τ-bench (airline, clean)

Curated 26-task subset of the airline domain with grading and specification issues removed. Used in main results.

View detailed reliability analysis →

Key Findings

Reliability Lags Behind Accuracy Improvements

Despite 18 months of model development, overall reliability shows only small improvements over time while accuracy steadily climbs. Improving raw task performance is insufficient for building dependable AI agents — reliability requires targeted attention beyond capability scaling alone.

Reliability improvements are also disproportionate across evaluation scenarios: highly structured environments show moderate gains, while open-ended tasks show barely any improvement, even among the latest models.

Outcome and Resource Consistency Remain Low

Agents that can solve a task often fail to do so consistently. The gap between capability (pass@k) and reliability (pass^k) is substantial across all models. Resource consistency is similarly low, with high variance in token and compute usage across runs — agents allocate effort unpredictably.

A 'what but not when' pattern emerges: agents achieve substantially higher distribution consistency than sequence consistency, indicating they reliably select similar action types across runs but vary in execution order. Improving reliability requires not just better action selection but more stable planning and execution.

Calibration Improves, but Discrimination Stagnates

Calibration — the alignment between predicted confidence and actual accuracy — has improved noticeably in recent frontier models. However, discrimination — the ability to distinguish tasks the agent will solve from those it won't — shows divergent trends across benchmarks and has in some cases worsened.

Improvements in calibration alone do not guarantee reliable failure identification. An agent may express well-calibrated confidence yet still fail to distinguish correct from incorrect predictions. Both sub-metrics must be measured independently.

Robustness Saturates, but Prompt Sensitivity Distinguishes Models

Fault robustness and structural robustness show ceiling effects across most models — agents handle genuine technical failures gracefully. In contrast, prompt robustness remains a key differentiator: sensitivity to superficial instruction paraphrasing varies substantially across models.

This pattern is counterintuitive: models tolerate real infrastructure faults but remain vulnerable to surface-level variations in how tasks are specified — a critical concern for real-world deployment where user instructions naturally vary.

Reliability Does Not Scale Uniformly with Capability

While calibration, robustness, and safety generally improve with model size, consistency often exhibits an inverse pattern: smaller models frequently achieve equal or higher consistency than their larger counterparts. Reasoning models are generally more reliable, but their reliability does not improve as quickly as their accuracy.

Larger models have more solution paths available, which increases run-to-run variability. This suggests that scaling alone will not solve the reliability problem — targeted architectural and training interventions are needed.

Safety Improves, but High-Severity Violations Persist

The most recent frontier models exhibit significantly lower overall violation rates. However, financial accuracy violations — incorrect charges and refunds — remain the most prevalent failure mode. Even infrequent high-severity failures can carry significant costs and represent critical blockers for deployment.

Benchmark quality also matters: safety and predictability improve almost universally when evaluated on a verified task subset with grading errors removed, underscoring the importance of clean evaluation data.

Reliability Gains Are Disproportionate Across Benchmarks

Reliability profiles are highly task-type dependent. An agent that is reliable on open-ended multi-step reasoning may struggle on structured customer-service tasks, and vice versa. Dimension-level scores vary substantially across benchmarks for the same agent.

This highlights the need for multi-benchmark evaluation. Single-benchmark reliability scores can be misleading — agents must be tested across diverse task structures to build a complete picture of their reliability.

Recommendations

Evaluate with Dynamic, Multi-Run Protocols

Single-run accuracy on fixed benchmarks provides a misleadingly narrow view of capability. Use multi-run protocols to assess variance across identical tasks, multi-condition protocols to systematically perturb user inputs, and temporal re-evaluation at regular intervals to detect silent degradation.

Current benchmarks are too static. Generative benchmarks with parameterized test sets (renaming fields, reordering responses, injecting faults) would provide more realistic and robust evaluations.

Design Agents Explicitly for Reliability

Calibration and safety have improved noticeably — evidence that intentional optimization works. In contrast, consistency and discrimination show little progress, suggesting they are not yet explicit optimization targets. Make reliability dimensions measurable and actionable in agent development.

Capability-oriented evaluation alone misses actionable optimization targets. Use reliability metrics to identify which dimensions lack progress and need targeted attention.

Use Reliability Metrics for Deployment Governance

Treat reliability as a deployment prerequisite, similar to aviation safety standards. Set minimum thresholds for consistency and safety before production deployment, implement incident reporting, and use multi-dimensional reliability metrics to guide change management decisions.

Organizations should require reliability certification before deployment, not just capability assessment. Diverse contributions through dimension-specific optimization become possible with clear measurement.

Distinguish Automation vs. Augmentation Use Cases

Reliability requirements differ fundamentally by use case. For augmentation (coding assistants, copilots), moderate reliability may suffice since humans review output. For automation (customer service, database management), reliability is a hard prerequisite — 90% success with unpredictable 10% failures is unacceptable.

As the field pushes toward greater agent autonomy, the reliability bar rises significantly. Deployment standards should be context-aware and scale with the level of autonomous action.