experimental-design
2 entries under this tag.
-
How I Ran 31,638 LLM Responses to Score Reasoning Mode
Behavioral economists want to know whether LLMs reason like human participants or just retrieve the textbook answer. I ran Qwen 2.5 14B through 283 system prompts × 60 trials of the Prisoner's Dilemma, scoring all 31,638 free-text reasoning notes deterministically. This post covers how the prompt set grew, how 31k responses get scored without a model judge, and two methodological holes I almost shipped.
-
What Replicating Charness-Jackson Taught Me About LLM "Behavior"
I swapped 96 humans for GPT-4o-mini in a 2009 risk-and-responsibility experiment from behavioral economics. The baselines matched almost exactly, but under the responsibility treatment the model swung 32 percentage points in the opposite direction to the human pattern. The more robustness checks I ran, the less I trusted the headline finding.