May 2, 2026 · deep-dive

The Autoresearch Loop Works. The Market Doesn't.

The autoresearch loop works. The market just doesn't let it win.

This is the cleanest one-week thesis I can pull out of the past seven days of Loop Daily. The same Karpathy-style loop that's overnight-optimizing protein benchmarks, cutting CI build time from 12.5m to 7m, and grinding 700 experiments for $309 to find 11% training speedups, just got pulled out of a memecoin trading rig and the team running it published the autopsy. They had ten agents in a swarm, forty tools wired into a custom CLI, scoped sessions where the loop could iterate freely on strategies. Backtest said five-to-six times return per day. Live trading lost everything inside an hour. Their conclusion deserves to be quoted in full because it's the single sharpest sentence about agent loops I've read this month: open-ended autonomous memecoin trading is the hypiest possible application running in the hardest possible environment, and that combination was a trap.

Most autopsy posts are sour grapes. This one is a load-bearing data point.

Step back and look at the week. Karpathy's autoresearch repo went from "interesting GitHub trinket" to "the most-imitated piece of code on AI Twitter" to "people are no longer pitching agentic loops, they're shipping concrete cases." The list got long fast. An Indian-stocks autotrader that did 11 self-edits while a human watched. A tokamak design loop. A cold-email loop optimizing for positive reply rate. A swarm of ML agents collaborating on optimizer ablations through a shared HuggingFace bucket. A Mac Ryzen that ran 10,000 iterations overnight. A 682-line agent that evolved itself and beat GEPA and Karpathy's own loop on a 149-protein benchmark. A LoRA adapter where the loop decided, unprompted, to train on hermes-agent traces using on-policy distillation on the middle third of MLP tensors — and traced its own design inspiration back to ROME and MEMIT papers. The boring synthesis is that autoresearch is now a primitive, not a thought experiment. The interesting synthesis is that this primitive lights up in some environments and burns down in others, and we now have enough cases on both sides to start drawing the line.

The line, as best I can read it from this week's data, is whether the loop's eval is a real eval or a backtest cosplaying as one.

The wins all share one feature: the optimization metric is something the loop can actually measure on the same surface it acts on. Compile the code, run the test, read the number. Train the model, eval the protein, read the number. Send the cold email through a sandbox, count replies, read the number. The agent acts, the world tells it the truth, and the next iteration starts from real ground truth. When that loop closes cleanly, the whole thing is just gradient descent with longer steps and more interesting actions. The 8-year-old code base getting 53% optimized, the 12.5m build dropping to 7m, the 80-tok/sec quantization getting to 180 — those are all the same shape. Eval is local, eval is honest, eval is fast.
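
To make that shape concrete, here is a minimal sketch of the closed loop in Python. Every name in it (propose_patch, run_eval, apply, revert) is a hypothetical stand-in, not Karpathy's actual API; the point is the structure, not the implementation.

```python
# A minimal sketch of the closed loop. propose_patch, run_eval, apply,
# and revert are hypothetical stand-ins, not Karpathy's actual code.

def autoresearch_loop(workspace, run_eval, propose_patch, iterations=100):
    """Greedy hill-climb: act on the workspace, let the world tell the
    truth, keep only the changes the real measurement rewards."""
    best = run_eval(workspace)                  # honest ground truth: run it, read the number
    for _ in range(iterations):
        patch = propose_patch(workspace, best)  # the model suggests a change
        patch.apply()
        score = run_eval(workspace)             # re-measure on the surface the loop acts on
        if score > best:
            best = score                        # the world said yes: keep it
        else:
            patch.revert()                      # the world said no: discard it
    return best
```

Everything in the week's win column (builds, tests, proteins, replies) fits this template because run_eval touches the world directly.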

The losses all share the opposite feature: the optimization metric is a backtest, a simulation, or a proxy. Memecoin trading is the cleanest case because the gap is the most legible. Backtest pricing has no slippage, no adversarial flow, no reflexive price action, no "the market reads your trade and front-runs you." A backtest looks like reality and behaves like reality and rewards strategies that reality will punish. The agent isn't lying — the eval is. Five-to-six times return per day in backtest is the loop optimizing the gap between the simulator and the world. When you go live, the simulator gap collapses, and the strategies that exploited it die in the first hour. This isn't a strategy problem. It isn't even an agent problem. It's that the eval was wrong, and the loop did its job too well against the wrong eval.
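
You can see the trap in a dozen lines. Every number below is invented, but the shape is the one the autopsy describes: a frictionless backtest fill next to a live fill that charges spread and market impact. A loop scored on the first function is optimizing exactly the difference between the two.

```python
# Toy illustration of the simulator gap; all numbers here are invented.

def backtest_fill(quote_price, size):
    return quote_price                              # frictionless: the eval that lies

def live_fill(quote_price, size, spread=0.02, impact_per_unit=0.001):
    slippage = quote_price * spread / 2             # you cross half the spread
    impact = quote_price * impact_per_unit * size   # your own order moves the price
    return quote_price + slippage + impact          # what reality actually charges

edge = 0.005                                        # a 0.5% "edge" the backtest rewards
quote = 100.0
for fill in (backtest_fill, live_fill):
    cost = fill(quote, size=50)
    pnl = quote * (1 + edge) - cost
    print(f"{fill.__name__}: {pnl:+.2f}")           # +0.50 in backtest, -5.50 live
```

Swap which function the loop is scored on and the same agent goes from printing money to burning it. That is the whole story in miniature.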

If you've ever watched a reinforcement learning paper claim 99% success in simulation and then fail on a real robot, you've seen this exact bug. The fomolt team shipped 40 tools and rebuilt Karpathy's loop and tried 10 agents in parallel — they did all the engineering — but they were running RL against a simulator that had no idea what an actual market looked like. There's a reason the wins this week are mostly in domains where the eval is the world, not a model of the world. Code either compiles or doesn't. Tests either pass or don't. A protein either folds at the right energy or doesn't. Even the cold-email case has a real eval — the recipient either replies or doesn't, and you can't backtest that.

This reframes what kind of moat the next year of agent products will compete on. Most people are still pattern-matching on "better model = better agent" and "better tools = better agent." This week's data says the actual unlock is "better eval = better agent." The harness layer that's quietly becoming the moat — Claude Code SDK, OpenAI Agents SDK, AWS Bedrock AgentCore, Cursor SDK — is being judged less on the prompts you can run and more on what kinds of evals you can plug in. The Spotify Honk paper that came out this week is exactly this — they spent more time describing how they wired ground-truth measurement into their loop than how they wired the loop itself. That's the right priority.
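
Concretely, "what kinds of evals you can plug in" means the harness treats the eval as the interface, not the prompt. A hypothetical sketch of that boundary (none of the SDKs named above expose exactly this shape):

```python
# Hypothetical eval-first harness boundary; none of the SDKs named
# above expose exactly this interface. What matters is which class
# you hand the loop.
from typing import Protocol

class Eval(Protocol):
    def measure(self, workspace) -> float:
        """Score taken from the real surface the agent acts on."""
        ...

class TestSuiteEval:
    """Eval is the world: run the actual tests, count real passes."""
    def __init__(self, runner):
        self.runner = runner

    def measure(self, workspace) -> float:
        result = self.runner.run(workspace)
        return result.passed / result.total

class BacktestEval:
    """Eval is a model of the world: the plug-in where loops go to die."""
    def __init__(self, simulator):
        self.simulator = simulator

    def measure(self, workspace) -> float:
        return self.simulator.replay(workspace.strategy)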

Which leads to the other thing nobody is saying out loud: agent loops are easier than evals. Anyone with a free weekend can wire up a Karpathy-style loop. Designing an eval that doesn't lie to your loop is the actual research problem. We've collectively spent two years training models and a decade building tooling, and we're now learning, in production, that none of it matters if your eval is fake. The autoresearch papers that mattered most this week were the ones that made the eval explicit — pre-registered hypotheses before compute, paper-grounded baselines, gates that fail loudly if the eval drifts. Skills-driven workflows hit 83% intent accuracy not because the agent got smarter, but because the skill made the eval legible. This is the entire field's one weird trick for the next year, and it's hiding in plain sight.
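
Those loud-failure gates can be embarrassingly simple: pre-register what an untouched baseline should score before spending any compute, re-measure it as the run goes, and abort the moment the number moves. A sketch, with invented names and an invented tolerance:

```python
# Sketch of a loud eval-drift gate; names and tolerance are invented.

class EvalDriftError(RuntimeError):
    pass

def make_drift_gate(run_eval, baseline_workspace, tolerance=0.01):
    registered = run_eval(baseline_workspace)   # pre-registered before any compute is spent
    def check():
        current = run_eval(baseline_workspace)  # the baseline never changes, so its score shouldn't
        if abs(current - registered) > tolerance:
            raise EvalDriftError(
                f"eval drifted: baseline {registered:.4f} -> {current:.4f}"
            )
    return check
```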

So where does this leave the people who actually want to deploy autoresearch?

The depressing version: if you're trying to apply this loop to anything that needs simulation — trading, robotics policy, physics design, anything where the world is too expensive or too dangerous to test against — you have to spend most of your engineering budget on the simulator-to-real gap before you spend any on the loop. The fomolt team didn't do this and it cost them four months and the company. The version that works first will be narrow and constrained because narrow domains have narrow gaps.

The optimistic version: every domain where you can actually close the loop on real ground truth is now an easier problem than it was three months ago. If your work has a runnable test, a measurable metric, and a budget for compute, you can probably let an agent improve it overnight. This is the part nobody talks about because it sounds boring. But the cumulative impact across the week's posts is that the marginal cost of a research iteration just dropped by an order of magnitude in any field where eval is local and honest. Biology, performance optimization, mechanical design, ad copy, prompt engineering, build systems — all of these are now in scope. The 700-experiment overnight run for $309 isn't impressive because of the loop. It's impressive because that used to be a grad-student week.
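
For scale, $309 over 700 experiments is about $0.44 per iteration, which makes the overnight run nothing more exotic than a loop capped by dollars instead of patience. A sketch with those numbers plugged in:

```python
# The week's arithmetic, sketched: cap the run by dollars, not patience.

def overnight_run(experiment, budget_usd=309.0, cost_per_run_usd=0.44):
    spent, results = 0.0, []
    while spent + cost_per_run_usd <= budget_usd:
        results.append(experiment())   # one iteration: act, eval, record the number
        spent += cost_per_run_usd
    return results                     # roughly 700 results by morning
```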

The fomolt obit is the right marker for this transition. The loop works. The market doesn't let it win. But everywhere the eval is honest, the loop is now winning by default.