How I Ran 31,638 LLM Responses to Score Reasoning Mode

Behavioral economists want to know whether LLMs reason like human participants or just retrieve the textbook answer. I ran Qwen 2.5 14B through 283 system prompts × 60 trials of the Prisoner's Dilemma, scoring all 31,638 free-text reasoning notes deterministically. This post covers how the prompt set grew, how 31k responses get scored without a model judge, and two methodological holes I almost shipped.