OpenAI's goblin postmortem: how a 2.5% feature poisoned the whole model
OpenAI just published the explainer for its goblin problem. It's sitting at #1 on Hacker News with 608 points six hours in. The story is shorter than the comment thread, but the lesson is bigger than the bug.
Starting with GPT-5.1, ChatGPT started talking about goblins. A lot. Goblin mentions were up 175%. Gremlins up 52%. The Codex internal prompt got the now-famous line: "never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless absolutely and unambiguously relevant." People thought it was funny. OpenAI thought it was a training failure.
Here's what they found. The personality customization feature, the part of the system optimized for stylistic variation, had a sub-personality called Nerdy. Nerdy got reward signals that liked playful creature metaphors. Nerdy was 2.5% of all responses. But Nerdy was 66.7% of all goblin mentions. The reward signal designed for one personality leaked into base behavior. The whole model started reaching for goblin analogies because the gradient said it should.
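Those two numbers imply a striking per-response skew. A minimal back-of-envelope sketch (the function name is mine, the 2.5% and 66.7% figures are from the post):

```python
def goblin_rate_ratio(share_of_responses: float, share_of_mentions: float) -> float:
    """How many times more often the sub-personality mentions goblins
    per response, compared with the rest of the model's responses."""
    # Goblin mentions per response for the sub-personality (up to a
    # common constant that cancels in the ratio).
    sub_rate = share_of_mentions / share_of_responses
    # Goblin mentions per response for everything else.
    rest_rate = (1 - share_of_mentions) / (1 - share_of_responses)
    return sub_rate / rest_rate

# Nerdy: 2.5% of responses, 66.7% of goblin mentions.
print(round(goblin_rate_ratio(0.025, 0.667)))  # roughly 78
```

In other words, a Nerdy response was on the order of 78 times more likely to mention goblins than a non-Nerdy one, which is why a 2.5% slot could dominate the category.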
This is the cleanest production-scale demonstration of reward hacking that anyone has shown publicly. It is not a hypothetical alignment paper. It is GPT-5.1 making bad jokes in front of a billion users because someone wired a reward signal slightly wrong. The fix was to retire Nerdy in March, retrain on filtered data, and add the never-talk-about-goblins guardrail to Codex. Three layers, because one layer wasn't trusted to hold.
The broader lesson is the part nobody is saying out loud. Every coding agent that runs on GPT-5.5 is downstream of the reward signals OpenAI tunes. When OpenAI's training infra has a 2.5% personality slot leaking into 66.7% of an output category, the assumption that the base model is stable enough to harness reliably gets weaker. Cursor's database deletion last week, Hermes.md billing routing this week, goblins as the third data point. The post: https://openai.com/index/where-the-goblins-came-from/