research.gradstudent.me
Notes on LLM measurement infrastructure and experimental design.
-
Five Agentic-LLM Failure Modes That Aren't Actually LLM Problems
When an agentic LLM does the wrong thing in production, the instinct is to rewrite the prompt. Most of the time the actual fix lives somewhere else, in the tool API, the dispatcher, or the lookup database. Five real failure modes from a Discord bot controlling Old School RuneScape, and the layer each one actually lives in.
-
Why qwen2.5:14b Pretends to Execute Commands, and What Actually Fixed It
When you run a tool-using LLM, "faking" is when it tells you it did something without ever calling the tool. A Discord bot doing this on 14 of 38 commands went to 16 of 16 after three structural changes. Adding "THIS IS LIVE" to the system prompt did nothing; the prompt was never the problem.
-
How a Self-Referential Field Cost $26 in Two Hours of Autonomous Gameplay
Autonomous agents on metered LLM APIs can burn real money if their prompts grow without anyone watching. My two-hour OSRS-bot session billed $26 because one Python field was both read and overwritten in the same operation; the user message grew to 82,000 chars before I noticed. Token tracking returned zero the whole time, so the cost stayed invisible until the invoice arrived.
-
How I Ran 31,638 LLM Responses to Score Reasoning Mode
Behavioral economists want to know whether LLMs reason like participants or just retrieve the textbook answer. I ran Qwen 2.5 14B through 283 system prompts × 60 trials of Prisoner's Dilemma, scoring all 31,638 free-text reasoning notes deterministically. This post covers how the prompt set grew, how 31k responses get scored without a model judge, and two methodological holes I almost shipped.
-
What Replicating Charness-Jackson Taught Me About LLM "Behavior"
We swapped 96 humans for GPT-4o-mini in a 2009 risk-and-responsibility experiment from behavioral economics. The baselines matched almost exactly, but under the responsibility treatment the model swung 32 percentage points in the wrong direction from the human pattern. The more robustness checks I ran, the less I trusted the headline finding.