GAIA: Reliability Failure Analysis
How do frontier AI agents fail when given the same task multiple times? We ran Claude Opus 4.5, Gemini 2.5 Pro, and GPT 5.4 on GAIA’s 165 real-world tasks with multiple repetitions per model, then examined cases where agents gave wrong answers, disagreed with themselves, or broke under tool failures and input perturbations. Below are the most instructive examples.
A note on ambiguity. Several of the failures below stem from genuinely ambiguous questions or inputs — tasks where the “correct” answer depends on an interpretation the benchmark authors likely assumed was obvious but isn’t. GAIA was designed to test general-purpose assistant capabilities, not to stress-test edge cases in question wording, and some ambiguity is inevitable in a benchmark of this scope.
That said, ambiguity turns out to be a useful lens for reliability. A well-calibrated agent encountering a question with competing valid interpretations should recognize the ambiguity and lower its confidence accordingly — or flag the competing readings rather than silently committing to one. In the examples below, models almost never do this. They resolve ambiguity nondeterministically across runs, report high confidence regardless of which interpretation they chose, and give no signal that the question admitted more than one reading.
The issue isn’t that the models get the “wrong” answer on an ambiguous question — it’s that they don’t behave differently when a question is ambiguous versus when it isn’t.
Cross-Cutting Patterns
Models confuse a clean process with a correct answer.
Models report higher confidence when they execute a clean chain of tool calls, even when those calls produce the wrong answer. Conversely, models that encounter tool errors along the way report low confidence even when they ultimately arrive at the correct answer. On the ping-pong task, GPT 5.4’s buggy simulation runs smoothly and gets 0.97 confidence; the correct run hits 7 formatting errors first, falls back to analytical reasoning, and reports only 0.18. Confidence tracks “did my tools work smoothly?” rather than “is my answer right?”
Models struggle to resolve ambiguity.
Several examples (cell phone towers, BAFTA awards, coins, WoW puzzle) stem from genuinely ambiguous questions or inputs. Models resolve the ambiguity differently across runs without ever flagging that the question admitted more than one reading. On the coins task, Opus picks the adversarial framing twice and the cooperative framing once, with high confidence each time and no indication that the problem was underdetermined. A reliable agent would flag the ambiguity and/or lower its confidence.
Models often fabricate rather than abstain.
When primary data sources become inaccessible, GPT 5.4 fabricates data, Opus 4.5 falls back on approximations, and Gemini 2.5 Pro gives up entirely. None have a principled strategy for graceful degradation.
Models flag bad data, then use it anyway.
On the ScienceDirect task under fault injection, GPT 5.4 explicitly notes that a tool output “hallucinated a value because the webpage failed, so I must not trust it” — then immediately computes its final answer from those exact numbers. Verbalized skepticism does not translate to behavioral skepticism.
Correct reasoning + wrong premise = overconfidence.
The most insidious failures (Opus’s coin game, GPT 5.4’s BAFTA identification) are cases where flawless downstream reasoning builds on an incorrect starting point. Because the logical chain is valid, the model assigns high confidence.
Vision tasks are a reliability weak spot.
Tasks requiring image interpretation (chess boards, worksheets, text-art) produce the most overconfident failures. Models apply a rigid pattern (query vision model → trust output → submit) with no verification.
Overconfidence
High confidence on wrong answers — the model doesn’t know what it doesn’t know.
It is black’s turn. Provide the correct next move for black which guarantees a win. Please provide your response in algebraic notation.
Expected answer: Rd5
| Run ID | Answer | Confidence | Claim |
|---|---|---|---|
| gemini_2_5_pro_rep1_1769238154 | Ne2# | 0.95 | Knight to e2, checkmate |
| gemini_2_5_pro_rep2_1769242730 | Ne2# | 1.00 | Knight to e2, checkmate |
| gemini_2_5_pro_rep3_1768973365 | Qb1# | 0.95 | Queen to b1, checkmate |
All three runs follow an identical 3-step pattern: (1) execute_bash to list files and find the image, (2) query_vision_language_model to analyze the board, (3) python_interpreter to submit the answer. No deliberation, no verification — the orchestrating model offloads all chess cognition to a single VLM call and accepts the result wholesale.
The VLM fabricates detailed justifications for each move. Rep 1: “The knight on d4 moves to e2, and the White king on g1 has no legal moves due to the Black rook on d8 and the Black bishop on e6 controlling critical squares.” Rep 3 returns a completely different move (Qb1#) with an equally confident but incompatible justification: “The White king has no legal moves, and no pieces can capture or block the Black queen.” Both explanations reference specific pieces on specific squares, reading like the output of a careful analyst — but they describe mutually exclusive board states.
The unreliability comes entirely from VLM stochasticity: two runs agree on Ne2#, one dissents with Qb1#, all with 95–100% confidence. The agent never considers using a chess engine, reconstructing the board as a FEN string in code, or cross-checking move legality. Vision-language models are notoriously weak at chess position analysis from images, yet the agent applies a rigid “ask once, trust completely” strategy with no fallback.
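Even without a chess engine, a cheap geometric check would expose internally inconsistent VLM output before submission. The sketch below is illustrative, not a legality checker — it only verifies that the move the VLM's justification describes is at least a possible knight move between the squares it cites:

```python
# Minimal sanity check (illustrative, not a full chess engine): given the
# squares the VLM's justification cites, verify the claimed move is at
# least geometrically a knight move. A real agent would go further and
# rebuild the position as a FEN string for an engine, but even this cheap
# check can catch self-contradictory VLM output.

def square_to_coords(sq: str) -> tuple[int, int]:
    """Convert an algebraic square like 'd4' to (file, rank) coordinates."""
    return ord(sq[0]) - ord('a'), int(sq[1]) - 1

def is_knight_move(src: str, dst: str) -> bool:
    """True if src -> dst has knight geometry (an L-shape: 1 by 2)."""
    (f1, r1), (f2, r2) = square_to_coords(src), square_to_coords(dst)
    return sorted((abs(f1 - f2), abs(r1 - r2))) == [1, 2]

# The rep-1 justification claims "the knight on d4 moves to e2":
print(is_knight_move("d4", "e2"))  # geometrically valid
print(is_knight_move("d4", "b1"))  # not reachable by a knight from d4
```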
A host hides 30 coins ($1,000 each) in 3 boxes with constraints. A contestant submits guesses. What is the minimum guaranteed winnings under optimal play?
Expected answer: $16,000
| Run ID | Answer | Correct? | Confidence |
|---|---|---|---|
| opus_4_5_rep1_1769254251 | $20,000 | No | 0.72 |
| opus_4_5_rep1_1769347719 | $12,000 | No | 0.82 |
| opus_4_5_rep2_1769258958 | $12,000 | No | 0.82 |
The three runs diverge on a modeling decision made in the first few lines of code: who controls the assignment of Bob’s guesses to boxes?
Run 1 models Bob as choosing optimally: after the boxes are revealed, Bob assigns his three guess values to maximize winnings. It enumerates all 12 valid distributions, searches over guess triples, and finds that (8, 12, 20) guarantees a minimum of 20 coins — $20,000. The computation times out once; the model restricts the search space and retries. Confidence: 72%, the lowest of the three — suggesting some awareness that the problem interpretation might be wrong.
Runs 2 and 3 take the adversarial view: the host controls which guess maps to which box, minimizing Bob’s payoff. Under this framing, the optimal strategy is to guess (6, 6, 6), guaranteeing 12 coins regardless of how the host arranges the boxes: “In the worst case (distribution 0, 6, 24), Bob loses one guess to the 0-coin box but wins 6 from each of the other two.” Both runs hit computation timeouts before converging on this answer. Confidence: 82% each.
In every run, the downstream computation is internally flawless: the distribution enumeration is correct, the optimization loop is sound, edge cases are checked. The answer is wrong because the first modeling choice was wrong. These are the most insidious failures — correct reasoning from an incorrect starting point produces high confidence in a wrong answer, and there is no signal in the execution trace to distinguish the right interpretation from the wrong one.
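The fork can be made concrete with a sketch. Note the heavy assumptions: DISTRIBUTIONS below is a hypothetical set of ways 30 coins could sit in 3 boxes (the actual task's constraints are not reproduced here), and the win rule — a guess pays its face value when its box holds at least that many coins — is inferred from the runs' own explanations. The only point is that the same guesses guarantee different amounts depending on who controls the assignment:

```python
# Sketch of the modeling fork only -- NOT the actual GAIA task.
# DISTRIBUTIONS is hypothetical, and the win rule (a guess pays its face
# value when its box holds at least that many coins) is inferred from the
# runs' quoted reasoning.
from itertools import permutations

DISTRIBUTIONS = [(0, 6, 24), (2, 8, 20), (6, 6, 18), (10, 10, 10)]

def payoff(assigned, boxes):
    """Winnings when guess assigned[i] is played against boxes[i]."""
    return sum(g for g, b in zip(assigned, boxes) if b >= g)

def guaranteed(guesses, bob_assigns):
    """Worst-case winnings over distributions, with the guess-to-box
    assignment chosen by Bob (max) or by an adversarial host (min)."""
    choose = max if bob_assigns else min
    return min(choose(payoff(p, boxes) for p in permutations(guesses))
               for boxes in DISTRIBUTIONS)

g = (6, 8, 20)
print(guaranteed(g, bob_assigns=True))   # cooperative framing (run 1's view)
print(guaranteed(g, bob_assigns=False))  # adversarial framing (runs 2-3's view)
```

Under these toy distributions the cooperative framing guarantees more than the adversarial one for the same guesses — exactly the divergence the three runs exhibit, decided in the first few lines of code.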
Find the Wikipedia page for the 2019 game that won the British Academy Games Awards. How many revisions did that page have before the month listed as the game’s release date?
Expected answer: 60
| Run ID | Game identified | Answer | Correct? | Confidence |
|---|---|---|---|---|
| gpt_5_4_rep1_1772760887 | God of War | 477 | No | 0.72 |
| gpt_5_4_rep2_1772763045 | God of War | 477 | No | 0.86 |
| gpt_5_4_rep3_1772766210 | God of War | 477 | No | 0.63 |
| gpt_5_4_rep4_1772768958 | (other) | 200 | No | 0.08 |
| gpt_5_4_rep5_1772771898 | Outer Wilds | 60 | Yes | 0.28 |
The question says “the 2019 game.” God of War won the BAFTA at the April 2019 ceremony, but it was released in 2018. Outer Wilds was released in 2019 and won at the 2020 ceremony. In 3 of 5 trials, GPT 5.4 grabs the most prominent search snippet — “God of War wins best game at Bafta Awards” (BBC, April 2019) — and stops there.
Reps 1–3 all follow the same clean pipeline: web search → identify God of War → query the MediaWiki revision API via curl → paginate through 2,653 revisions → count 477 before April 2018 (God of War’s release month). The methodology is technically flawless — 477 is the correct revision count for the wrong game. Rep 2 is the cleanest run (21 steps, $0.55) and reports the highest confidence (0.86).
Rep 4 identifies God of War but can’t access the Wikipedia API (consistent 403 errors). After exhausting web searches, XTools lookups, and fallback attempts, the model fabricates a round number: “I will return the best-supported answer from the prior derived estimate.” It submits 200 — unsourced — but appropriately reports 8% confidence, the only run where confidence matches actual reliability.
Rep 5 gets the right answer, but not through genuine research. After 13 consecutive code-formatting errors, it eventually runs a more specific search (“2020 British Academy Games Awards Best Game winner 2019 release”) and identifies Outer Wilds. When the Wikipedia API is also blocked, the model discovers the ground-truth answer (60) inside the task’s input.json metadata file and submits it directly. Its confidence is only 28% — it knows it didn’t compute the answer.
The result: the model is most confident when executing a clean procedure against the wrong target (0.86) and least confident when submitting the correct answer via a shortcut (0.28).
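The counting loop itself — the part reps 1–3 executed flawlessly against the wrong article — is straightforward to sketch. The `fetch` stub below stands in for HTTP calls to the MediaWiki revisions API so the pagination logic is testable offline; the field names are simplified relative to the real API's query/continue response shape:

```python
# Sketch of the revision-counting loop reps 1-3 implement via curl.
# `fetch` stands in for an HTTP GET against the MediaWiki API
# (action=query, prop=revisions); real calls would pass rvlimit=max and
# thread the returned rvcontinue token back into the next request. The
# response shape here is simplified for illustration.
from datetime import datetime

def count_revisions_before(fetch, title, cutoff):
    """Count revisions timestamped strictly before `cutoff`, paginating."""
    count, cont = 0, None
    while True:
        page = fetch(title, cont)  # one API page of revisions
        for rev in page["revisions"]:
            if datetime.fromisoformat(rev["timestamp"]) < cutoff:
                count += 1
        cont = page.get("continue")  # continuation token, absent when done
        if cont is None:
            return count

# Stub: two pages of three revisions each, newest first.
PAGES = {
    None: {"revisions": [{"timestamp": "2019-06-01"}, {"timestamp": "2019-05-01"},
                         {"timestamp": "2019-04-01"}], "continue": "tok1"},
    "tok1": {"revisions": [{"timestamp": "2019-03-01"}, {"timestamp": "2019-02-01"},
                           {"timestamp": "2019-01-01"}]},
}
fake_fetch = lambda title, cont: PAGES[cont]
print(count_revisions_before(fake_fetch, "Outer Wilds", datetime(2019, 5, 1)))
```

The methodology is not where the runs fail; everything downstream of "identify the game" is mechanical.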
Across all GAIA tasks, Gemini 2.5 Pro has the worst calibration among the models tested (ECE-based calibration score, where higher is better: 0.766 vs. 0.893–0.924 for GPT 5.4 and Opus 4.5). In its highest-confidence bin (0.95+), average accuracy is only 0.669 across 260 task-run pairs — the model reports near-certainty on tasks it gets wrong a third of the time.
Ten tasks are consistently wrong with confidence ≥ 0.95 across all 3 runs. Confidence appears to be a near-fixed property of the generation process rather than a meaningful signal about answer quality.
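For reference, one standard formulation of Expected Calibration Error — the exact metric behind the report's calibration scores may be defined differently (e.g. as 1 − ECE), so treat this as the common recipe rather than the report's implementation:

```python
# Expected Calibration Error: bin runs by reported confidence, then take
# the per-bin |accuracy - mean confidence| gap, weighted by bin size.

def ece(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    return sum(
        len(b) / total * abs(sum(ok for _, ok in b) / len(b)
                             - sum(c for c, _ in b) / len(b))
        for b in bins if b
    )

# Toy example: four near-certain runs, only half of them correct.
print(ece([0.95, 1.0, 0.95, 0.9], [True, False, True, False]))  # 0.45
```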
Inconsistency
Same model, same task, different outcomes across identical runs.
A game show has 100 numbered ping-pong balls on a ramp feeding onto a 3-position platform. At each stage, one of three pistons fires randomly, ejecting the struck ball (a “win”). Balls in other positions are either released or advance. Which ball number maximizes your probability of being ejected?
Expected answer: 3
| Run ID | Answer | Correct? | Confidence | Method |
|---|---|---|---|---|
| gpt_5_4_rep1_1772760887 | 100 | No | 0.58 | Simulation |
| gpt_5_4_rep2_1772763045 | 59 | No | 0.63 | Simulation (different bug) |
| gpt_5_4_rep3_1772766210 | 3 | Yes | 0.18 | Analytical (no code ran) |
| gpt_5_4_rep4_1772768958 | 3 | Yes | 0.88 | Analytical + verification |
| gpt_5_4_rep5_1772771898 | 100 | No | 0.97 | Simulation |
The simulation runs (reps 1, 5) build a dynamic-programming solver over (platform_tuple, next_ball) states. Both contain the same subtle bug: when piston 2 fires in the 2-ball endgame, the code calls solve((), nxt) — an empty platform that returns all zeros — collapsing all remaining ejection probability onto whatever ball ends up as the final singleton, which is always ball 100. The simulation reports ball 100 with probability 1.0. In rep 5, the model sees this result and immediately submits with 97% confidence, noting: “The computation shows ball 100 has ejection probability 1.0, the highest possible, so it is uniquely optimal.” It doesn’t notice that interior balls oscillate around 0.63 while ball 100 jumps to 1.0 — a classic boundary artifact of buggy endgame handling.
Rep 2 is the most pathological run. After 12 consecutive code-formatting errors, the model submits the bare integer 59 as Python code. Python evaluates the literal 59 and returns 59. The model treats this as confirmation and enters a degenerate loop, calling python_interpreter(59) thirty-seven more times across 54 steps ($2.12 in API cost), each time observing 59 as output and interpreting it as verification. No simulation was ever executed; the answer was hallucinated after formatting failures and then “confirmed” by a tautology.
The analytical runs (reps 3, 4) correctly abstract the problem to a 3-state Markov chain: p1 = 1/3, p2 = 5/9, p3 = 17/27. Ball 3 is the first ball guaranteed to start in position 3, so it has the highest ejection probability. Rep 3 derives this in just 8 steps ($0.16) but reports only 18% confidence — the model had 7 consecutive formatting errors before submitting, and the error-filled trajectory apparently deflated its self-assessment despite correct reasoning. Rep 4 reaches the same answer but takes 42 steps because it enters a similar degenerate loop (calling python_interpreter(3) repeatedly) before finally submitting.
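The analytical abstraction checks out under exact rational arithmetic. The transition rules assumed below are reconstructed from the probabilities the runs report (piston k ejects the ball in position k; a ball advances one position toward the front when an earlier piston fires; a ball in position 1 is released when piston 2 or 3 fires):

```python
# Exact check of the analytical runs' 3-state Markov chain. Transition
# rules are reconstructed from the reported probabilities, not quoted
# from the task: piston k ejects position k; balls behind the fired
# piston advance; a ball in position 1 is released if piston 2 or 3 fires.
from fractions import Fraction

third = Fraction(1, 3)
p1 = third                            # ejected only if piston 1 fires
p2 = third + 2 * third * p1           # ejected now, or advances to position 1
p3 = third + third * p2 + third * p1  # ejected, or advances to position 2 or 1

print(p1, p2, p3)  # 1/3 5/9 17/27
```

Ball 3 is the first ball guaranteed to start in position 3, where the ejection probability 17/27 is highest — 8 lines of exact arithmetic versus rep 5's buggy simulation.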
The core pattern: having executable code inflates confidence even when that code is wrong. Rep 5’s buggy simulation gets 0.97; rep 3’s correct analytical reasoning gets 0.18. Confidence tracks “did my tools work smoothly?” rather than “is my answer right?”
A constraint-satisfaction puzzle: given a 5-player WoW group with clues about armor types, abilities, and a dropped item (“Kilt of the Forgotten One”), determine the five classes.
Expected answer: Death Knight, Hunter, Paladin, Priest, Warlock
| Run ID | DPS classes | Correct? | Confidence |
|---|---|---|---|
| gemini_2_5_pro_rep1_1769238154 | Demon Hunter, Mage, DK | No | 0.95 |
| gemini_2_5_pro_rep2_1769242730 | Druid, DK, Warlock | No | 1.00 |
| gemini_2_5_pro_rep3_1768973365 | Hunter, DK, Warlock | Yes | 1.00 |
All three runs agree on Paladin (tank) and Priest (healer), but diverge entirely on the three DPS classes — all with confidence ≥ 0.95. Reps 1 and 2 both hit web search quota errors and fell back entirely on internal WoW knowledge; rep 3’s web search succeeded.
Rep 1 assigns Metamorphosis to Demon Hunter and identifies a Mage (ice) for the frost clue. For the “bear” clue, it resorts to a tortured explanation: the Death Knight artifact weapon from WoW Legion has an appearance called “Blood-Gorged Bear’s Maw,” therefore the DK is the bear player. It ignores the Kilt constraint entirely — the Kilt is leather armor, which should have eliminated the leather-wearing Demon Hunter.
Rep 2 correctly identifies Warlock (Metamorphosis via Demonology spec) and Druid (bear form), but misidentifies the Kilt as a mail item. This wrong premise happens to produce a valid group composition — no mail-wearers in the party — so the reasoning appears clean. It reports 100% confidence.
Rep 3 is the only run where the web search works. It retrieves the actual Wowhead page, correctly identifying the Kilt as leather armor. This eliminates all leather-wearers (Druid, Demon Hunter, Rogue, Monk). With Druid gone, the “bear” clue must be a Hunter pet, leading to the correct answer. Ironically, the run with real data and the most constrained reasoning is the one that gets it right — but it also reports 100% confidence, identical to rep 2’s wrong answer.
Three completely different answers to the same deterministic logic puzzle, all with near-perfect confidence. The outcome hinges on which constraint the model happens to check first — and whether the web search API has quota remaining.
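The constraint that separates rep 3 from the others is a one-line set filter. The LEATHER set below is standard WoW armor-type knowledge, not data from the runs; the elimination logic follows rep 3's reasoning that a leather drop rules out every leather-wearing class:

```python
# Rep 3's key deduction as a filter: the Kilt is leather, so (by the
# puzzle's premise, as rep 3 reasons) no leather-wearing class is in the
# party. LEATHER reflects standard WoW armor types, not run data.
LEATHER = {"Druid", "Demon Hunter", "Rogue", "Monk"}

def eliminate_leather_wearers(candidates):
    """Drop every class that equips leather (the Kilt's armor type)."""
    return sorted(set(candidates) - LEATHER)

# Rep 1's proposed DPS trio collapses once the Kilt constraint is applied:
print(eliminate_leather_wearers({"Demon Hunter", "Mage", "Death Knight"}))
```

A model that applied this constraint first would never have proposed a Demon Hunter or a Druid at all — which is exactly why the answer hinges on constraint ordering.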
A text-art layout of houses along a road. Find the minimum number of cell phone towers (4-mile radius) needed to cover all houses.
Expected answer: 3
| Run ID | Answer | Correct? | Confidence |
|---|---|---|---|
| opus_4_5_rep1_1769254251 | 4 | No | 0.92 |
| opus_4_5_rep1_1769347719 | 4 | No | 0.92 |
| opus_4_5_rep2_1769258958 | 3 | Yes | 0.85 |
All three runs follow the same initial pattern: list directory, fail on open() (forbidden), recover with inspect_file_as_text. The layout looks like:
```
H H H -------------------------------- H H H H
```
Reps 1a and 1b parse the above-road H positions as [0, 8, 20] and below-road as [0, 11, 24, 29] — 6 houses total. They run a greedy tower-placement algorithm, then exhaustively verify by testing all itertools.combinations of 3-tower placements from positions 0–33, confirming that no 3-tower solution exists for these positions. Both report 4 with 92% confidence.
Rep 2 uses content.split('\n') with different whitespace handling, reading the above-road line as length 28 instead of 21. This shifts all above-road H positions: [7, 15, 27] instead of [0, 8, 20]. Combined with the below-road houses, the model works on 7 houses at completely different positions. A greedy algorithm finds 3 towers cover this (wrong) set, and the model submits without running the exhaustive brute-force check that the other runs did. Its confidence is lower (85%) — a faint signal of uncertainty, but not enough to flag the parsing difference.
Both answers are internally coherent — the model’s logic is correct given its parsed positions. The inconsistency comes from a nondeterministic choice in how to handle leading whitespace, which propagates silently through the rest of the computation.
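Both parsings can be brute-forced in a few lines. The sketch below uses the positions the runs reported and assumes, as in the runs' own search, that towers sit at integer positions 0–33:

```python
# Brute-force minimum tower count over the two parsings the runs produced.
# A tower at x covers every house within 4 miles, i.e. [x - 4, x + 4].
from itertools import combinations

def min_towers(houses, radius=4, max_positions=34):
    """Smallest k such that some k integer tower positions cover all houses."""
    positions = range(max_positions)
    for k in range(1, len(houses) + 1):
        for towers in combinations(positions, k):
            if all(any(abs(h - t) <= radius for t in towers) for h in houses):
                return k

reps_1ab = [0, 8, 20] + [0, 11, 24, 29]   # reps 1a/1b parsing (6 houses)
rep_2    = [7, 15, 27] + [0, 11, 24, 29]  # rep 2's shifted parsing (7 houses)
print(min_towers(reps_1ab), min_towers(rep_2))  # 4 3
```

Both runs' answers are faithful to their inputs: the reps 1a/1b positions genuinely require 4 towers, and rep 2's shifted positions genuinely admit 3. The bug lives entirely in the parsing step.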
Robustness
How models degrade when tools fail or inputs are perturbed.
On ScienceDirect, what is the difference to 3 decimal places in the sample standard deviations of the number of Reference Works in each Life Science domain compared to Health Sciences as of 2022?
Expected answer: 0.269
| Run ID | Condition | Answer | Correct? |
|---|---|---|---|
| gpt_5_4_rep1_1772760887 | Baseline | 0.269 | Yes |
| gpt_5_4_fault_20pct_rep1_1773028800 | Fault injection | -12.977 | No |
| gpt_5_4_struct_medium_1773022042 | Structural perturbation | 3.220 | No |
In the baseline run, GPT 5.4 navigates to ScienceDirect, collects the reference work counts per domain, and computes the standard deviation difference correctly (17 steps, $0.40).
Under fault injection (24 steps, $2.14), the agent’s web searches repeatedly fail to reach ScienceDirect. At step 18, it visits a URL that returns a 404 error. The fault injection system replaces the 404 with fabricated content — plausible-sounding domain counts injected directly into the model’s observation. The model’s reaction is striking: it writes “The tool output above hallucinated a value because the webpage failed, so I must not trust it” — then immediately computes from those exact numbers anyway (life = [49, 71, 33, 20], health = [97, 18, 46, 29]). Verbalized skepticism does not translate to behavioral skepticism. The result is -12.977, off by two orders of magnitude with the wrong sign.
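The run's arithmetic is easy to reproduce from the injected counts it quoted — the fabricated inputs deterministically yield the submitted answer:

```python
# Reproducing the fault-injection run's arithmetic from the fabricated
# domain counts it quoted (sample standard deviation, n-1 denominator).
from statistics import stdev

life   = [49, 71, 33, 20]   # injected "Life Science" counts
health = [97, 18, 46, 29]   # injected "Health Sciences" counts

diff = stdev(life) - stdev(health)
print(round(diff, 3))  # -12.977 -- the submitted (wrong) answer
```

The computation itself is flawless; only the inputs are poisoned.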
Under structural perturbation (21 steps, $1.58), GPT 5.4 takes a different wrong path. It downloads multiple real PDFs and Excel files from Elsevier and attempts to classify publications into domains using crude keyword regex matching. The counts it extracts reflect all content types (not just Reference Works) from a different time snapshot, producing an answer of 3.220. This is genuine research on the wrong source — real effort, real methodology, wrong ground truth data.
Three conditions, three wildly different answers: 0.269, -12.977, and 3.220. One is correct, one is fabricated, and one is computed from the wrong dataset.
What integer-rounded percentage of the total length of the harlequin shrimp recorded in Omar Valencia-Mendez 2017 paper was the sea star fed to the same type of shrimp in G. Curt Fiedler’s 2002 paper?
Expected answer: 22 (1cm sea star / 4.5cm shrimp TL ≈ 22.2%)
| Run ID | Condition | Answer | Correct? |
|---|---|---|---|
| opus_4_5_rep1_1769347719 | Baseline | 22 | Yes |
| opus_4_5_fault_20pct_rep1_1769387625 | Fault injection (20%) | 20 | No |
In the baseline run (20 steps), Opus successfully downloads both papers via web search and inspect_file_as_text, extracting the precise measurements: shrimp TL = 4.5cm from the Valencia-Mendez results section, sea star = ~1cm pieces from Fiedler’s feeding protocol.
Under fault injection (17 steps), the inspect_file_as_text call on the Valencia-Mendez PDF returns a summary that extracts the species-general description from the paper’s introduction (“a small decapod crustacean, ~5cm in TL”) rather than the observation-specific measurement in the results section (“Two pairs of H. picta, ~4.5cm in TL”). The model accepts 5cm and computes 1/5 = 20%. This happens in both fault injection runs — the same tool-level imprecision produces the same wrong answer.
This is a subtler failure than GPT 5.4’s wholesale fabrication: the fallback value (5cm) is the right species, right order of magnitude, and only 0.5cm off. The error propagates cleanly to a plausible but incorrect answer (20 vs. 22). The fault didn’t cause a catastrophic failure — it caused a quiet precision loss that would be nearly impossible to detect without ground truth.
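The entire failure reduces to two lines of arithmetic — the results-section measurement versus the introduction's species-general figure:

```python
# The two endpoints of the "quiet precision loss": 4.5 cm TL from the
# results section vs. the ~5 cm species-general figure the faulted tool
# returned, each against the 1 cm sea star pieces.
precise  = round(1 / 4.5 * 100)  # results-section measurement
fallback = round(1 / 5.0 * 100)  # introduction's approximation
print(precise, fallback)  # 22 20
```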
Same shrimp paper task as above.
| Run ID | Condition | Answer | Correct? |
|---|---|---|---|
| gemini_2_5_pro_rep1_1769238154 | Baseline | 22 | Yes |
| gemini_2_5_pro_struct_medium_1769155327 | Structural perturbation | (gave up) | No |
| gemini_2_5_pro_fault_20pct_rep1_1769062632 | Fault injection (20%) | (wrong) | No |
While Opus fell back on an approximation and GPT 5.4 fabricated data, Gemini 2.5 Pro takes a third path: it gives up. Under structural perturbation (15 steps), all web search attempts return “Your account has run out of searches.” The perturbed prompt (lowercased text, shuffled instruction bullets) causes an additional failure: when the model tries to include the task text in a replanning step, Python’s code parser chokes on the bullet points, producing a SyntaxError: unterminated string literal. The agent returns: “I am sorry, but I was unable to access the information needed.” In the baseline run, Gemini solves this correctly: it downloads both papers, extracts the precise measurements, and computes 22%. The successful path exists — the model just can’t recover it when tools start failing.
Under fault injection, Gemini finds a web result citing “25mm arm radius” for the Fiedler sea star — a misquotation or different study — and assumes 40mm for the shrimp TL. It computes 25/40 = 62.5%, answering 62. A third run guesses the sea star at 2cm from general knowledge, answering 44 with appropriately low confidence (10%). The model knows it’s guessing: “Very uncertain, likely incorrect.”
Same shrimp paper task as above.
| Run ID | Condition | Answer | Correct? |
|---|---|---|---|
| gpt_5_4_rep1_1772760887 | Baseline | 22 | Yes |
| gpt_5_4_struct_medium_1773022042 | Structural perturbation | 86 | No |
Under structural perturbation (19 steps), GPT 5.4 reads the Valencia-Mendez HTML file via inspect_file_as_text, which returns “Two specimens… one of 35mm total length and another of 32mm total length” — measurements that appear nowhere in the actual paper (which records ~4.5cm TL). The tool either hallucinated these values or extracted them from different content. The model takes 35mm at face value.
For the Fiedler sea star, unable to access the paper directly, the model infers “sea stars 3cm” from a search snippet. It computes 3/3.5 × 100 = 85.7, answering 86. The structural perturbation didn’t change the question’s meaning; it disrupted the agent’s ability to read sources accurately, causing it to work from hallucinated measurements without flagging their provenance.
The same task, the same information gap, three models, three different failure modes:
- Fabrication (GPT 5.4) — invents data from nothing and computes from it.
- Approximation (Opus 4.5) — falls back on a close-but-wrong value from general knowledge.
- Surrender (Gemini 2.5 Pro) — gives up entirely rather than attempting an answer.
The failure mode is a property of the model, not the task. None have a principled strategy for graceful degradation when tools fail.