research note

Five Agentic-LLM Failure Modes That Aren't Actually LLM Problems

When an agentic LLM does the wrong thing in production, the instinct is to rewrite the prompt. Most of the time the actual fix lives somewhere else: in the tool API, the dispatcher, or the lookup database. This note walks through five real failure modes from a Discord bot controlling Old School RuneScape, and the layer each one actually lives in.

5 failure modes, none of which the LLM can fix on its own

Code: github.com/Tsangares/manny_mcp  ·  Companion posts: Faking phenomenon, Recursive directive bug

After enough hours operating an agentic LLM in production, I started seeing a pattern in the failure log that surprised me: most of the “the LLM did X wrong” entries aren’t actually fixable in the LLM. They look like model failures because the wrong output came out of a model, but the right place to fix them is somewhere else in the stack. This post lays out five failure modes I keep seeing and where each one actually lives.

The taxonomy comes from a real-world failure analysis I wrote up after a few hundred Discord-bot sessions controlling Old School RuneScape. The empirical anchors come from a 297-trial cross-model bench over nine local LLMs, described in the companion post.

HALLUCINATED_API: model invents API surface that sounds right

The model calls get_game_state(fields=["ground_items"]). There is no ground_items field. The valid fields are location, inventory, equipment, skills, dialogue, nearby, combat, health, scenario. The model invented one that fit the same naming pattern. The actual way to see ground items is query_nearby(include_ground_items=True), a different tool with a different shape.

This is a tool-API design failure with an LLM-side symptom. The signal that the model is being asked to memorize a non-uniform interface is in the inconsistency: most things are fields on get_game_state, but ground items happen to be a flag on a different observation tool. A consistent API where every observable thing is reachable via the same convention would not produce this failure. In the cross-model bench, scan_ground_items is correct on only 8 of 27 trials (29.6%), with most models reaching for query_nearby after one false start at get_game_state. The structural fix is either to make get_game_state(fields=["ground_items"]) a working alias for the query_nearby call, or to rename the discovery tools so that the right one is reachable from the obvious starting query.
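A minimal sketch of the alias option, assuming a Python tool layer. The field names come from the list above; read_field, the canned query_nearby body, and the FIELD_ALIASES table are hypothetical stand-ins, not the project's actual code.

```python
from typing import Any, Callable

# Valid field names as listed in the post.
VALID_FIELDS = {
    "location", "inventory", "equipment", "skills", "dialogue",
    "nearby", "combat", "health", "scenario",
}


def query_nearby(include_ground_items: bool = False) -> dict[str, Any]:
    """Hypothetical stand-in for the real observation tool (canned data)."""
    result: dict[str, Any] = {"npcs": [], "objects": []}
    if include_ground_items:
        result["ground_items"] = [{"name": "Bones", "x": 3222, "y": 3218}]
    return result


def read_field(name: str) -> Any:
    """Hypothetical stand-in for the real per-field state reader."""
    return {"field": name}


# Plausible-but-nonexistent field names mapped to the call that answers them.
FIELD_ALIASES: dict[str, Callable[[], Any]] = {
    "ground_items": lambda: query_nearby(include_ground_items=True)["ground_items"],
}


def get_game_state(fields: list[str]) -> dict[str, Any]:
    """Resolve real fields directly and aliased fields via their target tool."""
    state: dict[str, Any] = {}
    for name in fields:
        if name in VALID_FIELDS:
            state[name] = read_field(name)
        elif name in FIELD_ALIASES:
            state[name] = FIELD_ALIASES[name]()
        else:
            # Listing the valid names gives the model something to retry with.
            raise ValueError(
                f"unknown field {name!r}; valid fields: {sorted(VALID_FIELDS)}"
            )
    return state
```

With this shape, get_game_state(["ground_items"]) returns the same data as the query_nearby call instead of failing, and a genuinely unknown field produces an error that names the valid options.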

NO_LOOKUP: model guesses where lookup_location was the right answer

User asks “Go to the fishing spot south of Lumbridge, we are nearby.” The model has a lookup_location tool. It does not call it. It estimates coordinates from the player’s current position and sends GOTO 3240 3170 0, which is five tiles in roughly the right direction and is not a fishing spot. The actual fishing spot is at 3087, 3227, fifteen minutes of walking from where the model sent the player.

This is partly a tool-discoverability failure (the model didn’t know lookup_location was the right move) and partly a context-injection failure (the system prompt didn’t make “for any specific location, use lookup_location” a hard rule). It isn’t a parameter-count failure: the same model calls the lookup tool reliably when the user’s phrasing matches the location-database entries directly. The fix is a one-paragraph addition to the activity-classifier’s navigation context fragment, not a model swap.
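As an illustration of what such a fragment could look like, here is a sketch that assumes context fragments are plain strings keyed by activity label and appended to the system prompt. CONTEXT_FRAGMENTS, build_system_prompt, and the rule wording are all hypothetical.

```python
# Hypothetical navigation fragment; the wording is illustrative only.
NAVIGATION_FRAGMENT = (
    "Navigation rules:\n"
    "- For any specific named location, call lookup_location first.\n"
    "- Never estimate coordinates for a named place from the player's position.\n"
    "- For a named place, only send GOTO with coordinates returned by lookup_location.\n"
)

# Hypothetical registry: activity label -> fragment text.
CONTEXT_FRAGMENTS = {"navigation": NAVIGATION_FRAGMENT}


def build_system_prompt(base_prompt: str, activity: str) -> str:
    """Append the fragment for the classified activity, if one exists."""
    fragment = CONTEXT_FRAGMENTS.get(activity, "")
    if not fragment:
        return base_prompt
    return f"{base_prompt}\n\n{fragment}"
```

The rule only reaches the model when the classifier labels the request as navigation, which keeps the prompt for other activities unchanged.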

TINY_MOVEMENT: directional commands that move one tile instead of twenty

“Go south” → GOTO 3242 3164 0, which is one tile south of the player’s current position. “Go south a bit” → five tiles. “Go south a lot” → fifty tiles. The unqualified version moves less than the qualified-with-bit version, which is the inverse of what a human would mean. In the cross-model bench, go_south_direction is correct on 20 of 27 trials (74%), but the seven failures are mostly TINY_MOVEMENT.

This is a context-priors failure dressed up as a reasoning failure. The model has no embedded notion of what counts as a meaningful OSRS movement distance because nothing in the prompt or the tool docs tells it. The fix is to inject a one-line distance scale into the navigation context fragment (“‘go south’ should be 20 to 30 tiles, ‘go south a bit’ should be 5 to 10 tiles”) and to consider, in a future iteration, a higher-level MOVE_DIRECTION south medium tool that abstracts over the coordinate math entirely. Either way the fix is upstream of the LLM in the prompt-construction layer or the tool-design layer.
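A sketch of what the higher-level tool might look like, so the model picks a direction and a magnitude and never does coordinate math. The function name, the Magnitude enum, and the exact tile counts are assumptions: the small and medium values sit inside the ranges quoted above, the large value borrows the fifty-tile figure, and the y-increases-northward convention is assumed rather than taken from the codebase.

```python
from enum import Enum


class Magnitude(Enum):
    SMALL = "small"    # "go south a bit"
    MEDIUM = "medium"  # plain "go south"
    LARGE = "large"    # "go south a lot"


# Assumed tile distances: 8 is within 5-10, 25 is within 20-30, 50 for "a lot".
TILES = {Magnitude.SMALL: 8, Magnitude.MEDIUM: 25, Magnitude.LARGE: 50}

# Unit steps as (dx, dy); assumes y grows northward and x grows eastward.
STEPS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def move_direction(
    x: int,
    y: int,
    plane: int,
    direction: str,
    magnitude: Magnitude = Magnitude.MEDIUM,
) -> str:
    """Translate a direction and magnitude into a GOTO command string."""
    dx, dy = STEPS[direction.lower()]
    distance = TILES[magnitude]
    return f"GOTO {x + dx * distance} {y + dy * distance} {plane}"
```

From a player at (3242, 3165, 0), move_direction(3242, 3165, 0, "south") yields GOTO 3242 3140 0, a 25-tile move rather than a one-tile one. Defaulting to MEDIUM encodes the intuition that an unqualified "go south" should move further than "go south a bit".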

NO_CONFIRMATION: destructive actions execute without warning

User says “drop everything in your inventory.” The model executes DROP_ALL. There is no confirmation step. If the inventory contained a quest item or 1M-gp worth of resources, those are gone.

This is the failure mode I’d most strongly argue belongs outside the LLM, even though the LLM is the most visible cause. The agentic loop’s job is to dispatch legitimate commands the user requested. Asking the LLM to interpose an “are you sure?” check on destructive commands puts policy decisions in the layer with the least reliable adherence to policy. The structurally right place for this check is the dispatcher: a hard list of destructive command prefixes (DROP_ALL, BANK_DEPOSIT_ALL, LOGOUT_NOW) that intercepts the call before execution and routes it through a confirmation channel.

The empirical evidence here is the strongest in the taxonomy. The companion post’s 9-model matrix scored drop_inventory_naive correct on 4 of 27 trials (14.8%). Both of the top-performing models (qwen2.5-coder:14b at 81.8% overall, qwen3:14b at 78.8%) fail this case on every seed. Asking a model to add a behavior its training didn’t emphasize is a low-percentage move; adding a 12-line dispatcher gate that intercepts the same set of commands works the same way regardless of which model is in the loop.
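A gate of roughly that size could look like the sketch below. The three prefixes are the ones named above; the dispatch signature, the execute and confirm callbacks, and the CANCELLED return string are assumptions made for illustration, not the project's dispatcher.

```python
from typing import Callable

# Destructive command prefixes named in the post.
DESTRUCTIVE_PREFIXES = ("DROP_ALL", "BANK_DEPOSIT_ALL", "LOGOUT_NOW")


def dispatch(
    command: str,
    execute: Callable[[str], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run a command, routing destructive ones through confirm() first."""
    normalized = command.strip().upper()
    if normalized.startswith(DESTRUCTIVE_PREFIXES):
        # The model never sees this branch; the user decides.
        if not confirm(command):
            return f"CANCELLED {command}"
    return execute(command)
```

In a Discord setting, confirm would presumably post a message and wait for the user's reply. Because the check runs on the command string, it behaves identically whichever model produced the command.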

PARTIAL_MATCH: location lookup returns the wrong match on compound queries

User asks “Go to the fishing spot south of Lumbridge.” Model correctly calls lookup_location("fishing spot south of lumbridge"). The location database does substring matching, finds “lumbridge” first, returns Lumbridge Castle’s coordinates (3222, 3218). The model now has a coordinate that isn’t a fishing spot, dispatches GOTO 3222 3218, and the user is back at the castle they started from.

This is a database-design failure with an LLM-side symptom. The fix is in discord_bot/locations.py, not in any prompt. Either substring matching needs to be replaced with token-set matching that requires all query terms to appear in the matched record’s name, or the database needs explicit entries for compound locations like “lumbridge swamp fishing spot.” Both are one-paragraph code changes; neither requires touching the model.
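The token-set option can be sketched as follows. This is not the contents of discord_bot/locations.py: the LOCATIONS records, the stopword list, and the shortest-name tie-break are assumptions, and pairing the "lumbridge swamp fishing spot" name with the fishing-spot coordinates quoted earlier is for illustration only.

```python
import re
from typing import Optional

# Toy records; pairing names with coordinates here is illustrative.
LOCATIONS = {
    "lumbridge castle": (3222, 3218),
    "lumbridge swamp fishing spot": (3087, 3227),
}

# Assumed stopword list so filler words do not block a match.
STOPWORDS = {"the", "of", "to", "a", "at", "in"}


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def lookup_location(query: str) -> Optional[tuple[int, int]]:
    """Return coordinates only if every query token appears in a record name."""
    wanted = tokens(query)
    if not wanted:
        return None
    matches = [
        (name, coords)
        for name, coords in LOCATIONS.items()
        if wanted <= tokens(name)
    ]
    if not matches:
        return None
    # Prefer the record with the fewest extra tokens (the tightest match).
    _, coords = min(matches, key=lambda match: len(tokens(match[0])))
    return coords
```

With these records, lookup_location("lumbridge fishing spot") returns (3087, 3227), while lookup_location("fishing spot south of lumbridge") returns None because no record name contains "south". Returning nothing is still an improvement over returning the castle: the model gets an explicit miss it can react to, and the miss points at the second option, adding an explicit compound entry.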

The pattern

The unifying observation across the five modes is that prompting better is rarely the answer worth chasing when an agentic LLM fails. The question worth asking is which layer of the stack the fix actually lives in. The five layers worth distinguishing are:

| Failure mode | Looks like an LLM problem | Lives in |
| --- | --- | --- |
| HALLUCINATED_API | Model invents non-existent API | Tool API design |
| NO_LOOKUP | Model guesses instead of using available tool | Context fragment + tool discoverability |
| TINY_MOVEMENT | Model lacks domain priors on magnitudes | Context fragment + tool abstraction level |
| NO_CONFIRMATION | Model dispatches destructive action | Dispatcher / hard gate |
| PARTIAL_MATCH | Model accepts wrong lookup result | Lookup database / matching algorithm |

The companion post’s main finding was that structural changes to an agentic system reduced one failure mode (faking) by something like 3x where prompt-level changes did nothing. The implicit corollary, which this post makes explicit, is that “structural changes” is one phrase covering many things. It’s a spread across at least five layers, and the right structural fix for any given failure depends on which layer the failure actually lives in. The faking post’s structural fix (Pydantic schema + observation enforcement + focused context) addresses control-flow failures. None of the five modes above is a control-flow failure. They are, in order, an API-design failure, a discoverability failure, a domain-priors failure, a policy failure, and a database-correctness failure. Each one needs its own layer’s fix.

I think the reason the “prompt it better” instinct is so persistent is that prompts are the cheapest layer to edit. You can iterate a prompt in seconds; you can’t iterate an API in seconds. But cheap-to-edit and load-bearing are different properties, and the failure log keeps growing for as long as a team conflates them. The most useful thing I’ve taken from a year of operating an agentic LLM in production is the habit of asking, before touching the prompt, whether the prompt is even where the failure lives.

Cite as

Wyatt, W. (2026, February 21). Five Agentic-LLM Failure Modes That Aren't Actually LLM Problems. research.gradstudent.me. https://research.gradstudent.me/p/agentic-llm-failure-modes

BibTeX
@misc{wyatt2026agenticllmfailuremodes,
  title  = {Five Agentic-LLM Failure Modes That Aren't Actually LLM Problems},
  author = {Wyatt, William},
  year   = {2026},
  month  = {feb},
  url    = {https://research.gradstudent.me/p/agentic-llm-failure-modes},
  note   = {Blog post on research.gradstudent.me}
}