May 9, 2026 · deep-dive

Deep Dive: Autoresearch went from demo to production economics in one quarter

Autoresearch went from a Karpathy weekend prototype to a production economics argument in one quarter. That's not a vibe; it's a curve. The original 630-line autoresearch.py landed on March 11 and crossed 8K stars in 3 days. Today it sits at 79,939 stars, with Cursor's SDK, OpenAI's Codex /goal, Anthropic's Outcomes feature, and Shopify's internal CI all running variants of the same pattern in production.

Here's what changed in 60 days.

The first thing is that the receipts are now production-grade, not demo. Tobi Lutke pointed pi-autoresearch at Shopify's templating engine and got 53% faster rendering with 61% fewer memory allocations, on a codebase the team had been optimizing by hand for years. Then Shopify ran it on the rest of the stack: unit tests 300x faster, React component mounting 20% faster, CI build time cut 65%, pnpm sped up. None of those are demo numbers; they're the kind a working CTO ships in a board deck. The original framing was "give an AI a metric and let it self-improve until it wins." Three weeks later it's "your unit test suite is now 300x faster, here's the diff."

The second thing is that the loop generalized faster than anyone expected. The original autoresearch.py was an ML-research-specific harness — code edits, training runs, evaluation, keep what works. By April the community had ported the same loop pattern into about 40 different domains. A trading agent optimizing prompts against rolling Sharpe ratio instead of model loss. A genealogy researcher iteratively expanding family history. A Spring Boot service that grew from 119 lines to 950 in 5 autonomous cycles. Mayank's catalog of forks ran past five OS variants and a dozen vertical applications before the catalog itself stopped being maintainable. The pattern is brutally simple — you need a measurable target, a way to edit the inputs, and a verifier — and almost everything fits that shape if you look hard enough.
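To make that shape concrete, here is a minimal sketch of the loop: not Karpathy's actual autoresearch.py, just the propose/verify/keep skeleton it popularized. propose_edit and evaluate are hypothetical stand-ins for whatever your domain supplies (an LLM call, a training run, a backtest).

```python
# Minimal sketch of the autoresearch loop, assuming lower score = better.
# propose_edit() and evaluate() are hypothetical hooks, not Karpathy's code.
import random

def propose_edit(candidate: str) -> str:
    """Hypothetical mutation step: in practice, an LLM edits the candidate."""
    return f"{candidate} | tweak-{random.randint(0, 999)}"

def evaluate(candidate: str) -> float:
    """Hypothetical verifier: loss, wallclock, Sharpe, attack success rate."""
    return random.random()  # stand-in; a real verifier runs the experiment

def autoresearch_loop(seed: str, budget: int = 200) -> tuple[str, float]:
    best, best_score = seed, evaluate(seed)
    for _ in range(budget):
        trial = propose_edit(best)     # edit the inputs
        score = evaluate(trial)        # measure against the target
        if score < best_score:         # keep what works, discard the rest
            best, best_score = trial, score
    return best, best_score

if __name__ == "__main__":
    winner, score = autoresearch_loop("baseline config")
    print(f"best {score:.4f}: {winner}")
```

Every port in the list above is, under this reading, a different evaluate(): Sharpe ratio for the trading agent, coverage for the genealogy run, passing tests for the Spring Boot service.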

The third thing is that vendors started shipping the loop as a first-class product feature. Cursor's /orchestrate SDK landed on May 8 with recursive sub-agent spawning, and Cursor reported a 20% token reduction in its internal auto-research pipeline plus an 80% reduction in backend cold starts. That's the vendor running its own architecture on itself first. OpenAI's Codex shipped /goal, which users explicitly describe as "the feature that fixed the auto research issue codex had." Before /goal, Codex stopped after a few turns and required manual queueing; after /goal, users are running 10-15 hour autonomous tasks, with $500+ in API spend producing 90+ commits on a single ticket. Anthropic's Outcomes, announced at their conference, is the Anthropic-side productized version of the same idea: autoresearch wrapped as a task-completion guarantee. Three vendors, same shape.
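The recursive sub-agent idea is easy to picture as code. A hedged sketch of the shape, assuming nothing about Cursor's actual /orchestrate API: a goal either executes directly or gets decomposed into subgoals, each of which spawns its own sub-agent tree, with a depth cap.

```python
# Hedged sketch of recursive sub-agent spawning, the shape behind an
# /orchestrate-style SDK; not Cursor's actual API. spawn_agent() and
# decompose() are hypothetical hooks.

def spawn_agent(goal: str) -> str:
    """Hypothetical: run a single agent to completion on one goal."""
    return f"result<{goal}>"

def decompose(goal: str) -> list[str]:
    """Hypothetical: ask a model to split a goal into subgoals.
    Returns [] when the goal is small enough to execute directly."""
    if len(goal) < 24:
        return []
    mid = len(goal) // 2
    return [goal[:mid], goal[mid:]]

def orchestrate(goal: str, depth: int = 0, max_depth: int = 3) -> str:
    subgoals = decompose(goal) if depth < max_depth else []
    if not subgoals:
        return spawn_agent(goal)  # leaf: do the work directly
    # recurse: each subgoal gets its own sub-agent subtree
    results = [orchestrate(s, depth + 1, max_depth) for s in subgoals]
    return spawn_agent("merge: " + " | ".join(results))

if __name__ == "__main__":
    print(orchestrate("refactor the billing service and backfill tests"))
```

The depth cap is the interesting design choice: without it, recursive spawning is an unbounded token faucet, which is presumably why the vendors ship it with orchestration rather than as a raw primitive.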

The fourth thing is that academic research caught up. The Auto Research with Specialist Agents paper (Ning, Li, Zeng, Kang, Xiong) on arXiv ran an empirical loop in which specialist agents create trials of code edits and evaluations, iterating over auditable trajectories. The receipts: significant improvements on Parameter Golf validation, NanoChat-D12 CORE, and CIFAR-10 Airbench96 wallclock, with no human proposals or intervention. Prof. Jie Ding at the University of Minnesota told a conference audience that he left 3 AI agents alone with a research problem overnight and they came back with 72 peer-reviewed papers. Romovpa demonstrated that autoresearch can discover SOTA white-box adversarial attacks on LLMs by giving Claude 30+ existing GCG-like algorithms and a compute cluster; Claude combined them into new methods that outperform all existing ones.

The fifth thing is meta-loops on top of loops. Sam Hogan's HALO (Hierarchical Agent Loop Optimizer) is open source: an RLM-based recursive self-improvement framework that analyzes execution traces and suggests harness changes. On the AppWorld benchmark with Sonnet 4.6, it went from 73.7 to 89.5, a gain of 15.8 points. The trace feedback surfaced hallucinated tool calls, redundant arguments, refusal loops, and semantic-correctness failures, and each issue mapped cleanly to a prompt update. They then fed those findings into Cursor (Opus 4.6) and looped on harness updates until the score plateaued. So the architecture is an AI improving an AI's harness, using a third AI to write the patches. The meta-loop is the autoresearch loop applied to the autoresearch loop's own configuration.
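Structurally, the meta-loop is just the first sketch with the harness prompt as the candidate and the inner loop's benchmark score as the verifier. A hedged sketch follows; this is not HALO's actual code, and all three hooks are hypothetical stand-ins.

```python
# Hedged sketch of a HALO-style meta-loop; not HALO's actual code.
# The candidate being edited is the harness prompt itself, and the
# verifier is the benchmark score the inner loop achieves under it.
import random

def run_inner_loop(harness_prompt: str) -> float:
    """Hypothetical: run the full inner autoresearch loop and return
    its benchmark score (think AppWorld). Stubbed with noise here."""
    return min(100.0, 70.0 + 0.5 * harness_prompt.count("# patch") + random.random())

def collect_traces(harness_prompt: str) -> list[str]:
    """Hypothetical: mine execution traces for failure modes."""
    return ["hallucinated tool call", "refusal loop", "redundant arguments"]

def suggest_harness_change(prompt: str, traces: list[str]) -> str:
    """Hypothetical: an LLM maps a trace issue to a prompt patch."""
    return prompt + "\n# patch: guard against " + random.choice(traces)

def meta_loop(prompt: str, rounds: int = 10) -> tuple[str, float]:
    best, best_score = prompt, run_inner_loop(prompt)
    for _ in range(rounds):
        trial = suggest_harness_change(best, collect_traces(best))
        score = run_inner_loop(trial)
        if score <= best_score:   # plateau: stop editing the harness
            break
        best, best_score = trial, score
    return best, best_score
```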

Now here's the real take. The reason this matters more than the chat-vs-agent narrative is that it changes who gets to do research-grade work. Pre-autoresearch, the people who could meaningfully iterate ML hyperparameters or trading strategies or system optimizations were the people with budget for a full-time team running experiments. Post-autoresearch, anyone with a $100/month Claude Code subscription and a measurable target can run hundreds of experiments overnight. That collapses the experimentation budget by 100-1000x. Karpathy explicitly said "human-out-of-loop" is the next frontier — the harness handles the loop, the human picks the goal and reviews the output.

The skeptics are right about one thing. ATELICINVEST's post listed real problems: people spending 50M tokens to make a wedding dashboard, 100M to organize an inbox, parallel tasks producing slop because product direction goes incoherent, engineers blissfully unaware of hallucinated features in their AI output. The token-burn-as-virtue narrative has a real failure mode. But that's not autoresearch's fault — that's the absence of a measurable target. Autoresearch in its strict shape requires a verifier. If you can't define what success looks like, you don't have an autoresearch problem, you have a chat problem.

The pattern that consistently works is: well-defined task, simple verifier, deep search space, low verification cost. Karpathy's original autoresearch.py optimized model loss because loss is trivial to measure and the search space (code edits) is huge. Shopify's wins were on tasks like rendering speed where wallclock is the verifier. Trading agents work where Sharpe ratio is the verifier. Adversarial attack discovery works because attack success rate is the verifier. Open-ended product strategy doesn't work because there's no verifier, and you end up with confident wrong answers wrapped in pretty markdown.
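This gives you a cheap litmus test: if you can write the verifier as a short function, you are on the working side of the pattern. Here is a sketch of a wallclock verifier of the kind behind the rendering-speed wins; render() is a hypothetical stand-in for whatever code the loop is editing.

```python
# Sketch of a wallclock verifier: deterministic to define, cheap to run,
# unambiguous to score. render() is a hypothetical stand-in workload.
import time

def render(template: str, n: int = 100_000) -> None:
    """Hypothetical workload: the code under optimization."""
    for i in range(n):
        template.format(i=i)

def verify_wallclock(trials: int = 5) -> float:
    """Score = best-of-N wall time in seconds; lower is better."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        render("row {i}")
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    print(f"{verify_wallclock():.4f}s")
```

Best-of-N rather than mean keeps the score stable against OS noise, which matters because an optimizing loop will happily exploit any jitter in its verifier.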

That's the bet for the next 60 days. The distribution is going to bifurcate. There will be a class of work — code optimization, security research, mathematical proof exploration, factor research in finance, hyperparameter tuning, prompt optimization, browser-skill compilation — where autoresearch loops will run continuously in the background and human decisions will be confined to picking the goal and approving the output. And there will be another class of work — product direction, business strategy, design taste — where the same loops will produce confidently wrong artifacts at industrial scale, and the people running them will not realize because there's no verifier saying it's wrong.

The teams that figure out which side of that line their work falls on will compound. The ones who put product strategy on an autoresearch loop will produce slop at unprecedented volume.

One quarter from now, the question won't be whether autoresearch works. The receipts already exist. The question will be whether your verifier is real.

Tools to try this week if you haven't yet: Karpathy's pi-autoresearch (the 630-line reference implementation), Cursor's /orchestrate SDK (recursive sub-agent spawning), Codex /goal (long-horizon autonomous tasks), HALO (open-source meta-loop), DeepClaude (Claude Code agent loop running on DeepSeek V4 Pro for 17x cost reduction). Pick a task with a measurable target. Spend $50 in tokens. See whether the loop reaches the metric. That's the experiment. The results will tell you whether your work actually has a verifier.
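If you run that $50 experiment, put the cap in the harness itself rather than watching a billing dashboard. A minimal sketch, assuming you can read token counts off each response; the pricing constants are illustrative, not any vendor's actual rates.

```python
# Budget-capped wrapper for the loop: stop at the spend limit no matter
# what the metric is doing. estimate_cost() uses illustrative rates, not
# any vendor's real pricing; step() is your own propose/verify cycle.
from typing import Callable

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return prompt_tokens * 3e-6 + completion_tokens * 15e-6  # illustrative

def run_with_budget(
    step: Callable[[], tuple[float, int, int]], budget_usd: float = 50.0
) -> tuple[float, float]:
    """step() returns (score, prompt_tokens, completion_tokens)."""
    spent, best = 0.0, float("inf")
    while spent < budget_usd:
        score, p_tok, c_tok = step()
        spent += estimate_cost(p_tok, c_tok)
        best = min(best, score)
    return best, spent
```

If the loop hits the metric before it hits the cap, your work has a verifier. If it burns the cap without moving the metric, you learned that for $50 instead of 50M tokens.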

The economics already changed. The taste hasn't caught up.
"""
← Previous
Ops Log: 2026-05-10
← Back to all articles

Comments

Loading...
>_