research.gradstudent.me

date Feb 2026

read 6 min

tags agentic-systems · llm-evals · failure-analysis

Five Agentic-LLM Failure Modes That Aren't Actually LLM Problems

When an agentic LLM does the wrong thing in production, the instinct is to rewrite the prompt. Most of the time the fix lives somewhere else: in the tool API, the dispatcher, or the lookup database. This post walks through five real failure modes from a Discord bot controlling Old School RuneScape, and the layer each one actually lives in.

date Feb 2026

read 13 min

tags llm-evals · agentic-systems · tool-use · qwen

Why qwen2.5:14b Pretends to Execute Commands, and What Actually Fixed It

When you run a tool-using LLM, "faking" is when it tells you it did something without ever calling the tool. A Discord bot that faked 14 of 38 commands went to 16 of 16 real tool calls after three structural changes. Adding "THIS IS LIVE" to the system prompt did nothing; the prompt was never the problem.

date Jan 2026

read 7 min

tags agentic-systems · llm-evals · observability · gemini

How a Self-Referential Field Cost $26 in Two Hours of Autonomous Gameplay

Autonomous agents on metered LLM APIs can burn real money if their prompts grow without anyone watching. My two-hour OSRS-bot session billed $26 because one Python field was both read and overwritten in the same operation; the user message grew to 82,000 chars before I noticed. Token tracking returned zero the whole time, so the cost stayed invisible until the invoice arrived.

date Jan 2026

read 10 min

tags llm-evals · infrastructure · experimental-design

How I Ran 31,638 LLM Responses to Score Reasoning Mode

Behavioral economists want to know whether LLMs reason like participants or just retrieve the textbook answer. I ran Qwen 2.5 14B through 283 system prompts × 60 trials of Prisoner's Dilemma, scoring all 31,638 free-text reasoning notes deterministically. This post covers how the prompt set grew, how 31k responses get scored without a model judge, and two methodological holes I almost shipped.

date Aug 2025

read 13 min

tags llm-evals · behavioral-economics · experimental-design

What Replicating Charness-Jackson Taught Me About LLM "Behavior"

We swapped 96 humans for GPT-4o-mini in a 2009 risk-and-responsibility experiment from behavioral economics. The baselines matched almost exactly, but under the responsibility treatment the model swung 32 percentage points in the opposite direction from the human pattern. The more robustness checks I ran, the less I trusted the headline finding.