May 21, 2026 · loop

Loop Daily: 2026-05-22

Two days ago the autoresearch conversation stopped being about Karpathy's resume and started being about what people actually do with the loop. The pattern is the same one Karpathy proved on nanochat: give an agent an editable file plus a measurable number, let it modify, verify, keep or revert, repeat. What changed is where people are pointing it. Someone optimized a physical coffee cup with differentiable physics. Someone post-trained a small model overnight and watched the score climb. Someone has run an autonomous content agent for 163 days that rewrites its own instructions. And underneath the hype there was a useful cold shower: a NanoGPT-Bench eval showing these agents recover only 9.3% of human progress on real AI R&D, which tells you exactly where the loop works and where it still fakes it. The honest takeaway is that autoresearch is strongest where the codebase fits in context, the eval is frozen, and an experiment costs five minutes. Stretch it past those three conditions and it gets shaky. Here is what people built.
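The modify, verify, keep-or-revert cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not Karpathy's implementation: the names `keep_or_revert`, `toy_score`, and `toy_propose` are hypothetical, the "editable file" is reduced to a single number, and the "measurable number" is a toy function that peaks at 3.0.

```python
import random
from typing import Callable, Tuple


def keep_or_revert(
    candidate: float,
    score: Callable[[float], float],
    propose: Callable[[float, random.Random], float],
    steps: int,
    seed: int = 0,
) -> Tuple[float, float, int]:
    """Propose a change, score it, and keep it only if the score improves."""
    rng = random.Random(seed)
    best, best_score, kept = candidate, score(candidate), 0
    for _ in range(steps):
        trial = propose(best, rng)      # modify
        trial_score = score(trial)      # verify against the frozen eval
        if trial_score > best_score:    # keep
            best, best_score, kept = trial, trial_score, kept + 1
        # otherwise revert: the trial is simply discarded
    return best, best_score, kept


# Hypothetical stand-ins: the eval peaks at 3.0, proposals are small random nudges.
def toy_score(x: float) -> float:
    return -(x - 3.0) ** 2


def toy_propose(x: float, rng: random.Random) -> float:
    return x + rng.uniform(-0.5, 0.5)


best, best_score, kept = keep_or_revert(0.0, toy_score, toy_propose, steps=200)
print(f"best={best:.3f} score={best_score:.5f} kept={kept}")
```

Because a rejected trial never replaces `best`, the score can only go up or stay flat, which is why the loop depends on the eval staying frozen between steps.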
#1
@paraschopra
https://x.com/paraschopra/status/2057041064071188495
This is the cleanest non-coding autoresearch case of the day. He pointed his autoresearch loop at a differentiable-physics problem: design a coffee cup that maximizes heat retention while staying drinkable. The agent ran the optimization end to end and produced an actual cup geometry, which he now plans to 3D print in ceramics. This is the whole thesis in one tweet: any problem with an editable design file and a measurable objective can become an automated experiment loop, and the domain does not have to be ML at all. Physical product design just became something an agent can iterate on overnight.
#2
@vivek_2332
https://x.com/vivek_2332/status/2057154013867733468
A genuinely reproducible self-improving pipeline, not a demo. He released /synthetic-self-improve-rl, a Claude Code skill where Claude acts as a teacher that designs synthetic data, a verifiers environment, and reward functions to post-train a smaller student model. The loop is concrete: baseline on real data, analyze low-reward rollouts, generate 500 to 1000 synthetic rows, write a rubric environment, resume from the checkpoint, eval on the real test split, keep what helps, iterate. The result is measured: qwen3-0.6B on gsm8k went from 0.7854 to 0.8158 with 700 synthetic rows, a gain of about 3 points. Best of all, it runs until whatever wall-clock budget you set expires, so you literally trade tokens for accuracy.
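The "run until the wall-clock budget expires" behavior can be sketched as a deadline loop. This is an illustration under assumptions, not the skill's actual code: `run_until_budget` and `step` are hypothetical names, each `step(i)` stands in for one full synthetic-data round that returns an eval score, and the clock is injectable so the demo runs instantly with made-up scores.

```python
import time
from typing import Callable, List


def run_until_budget(
    step: Callable[[int], float],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Run step(i) until the budget is spent; record only scores that improve."""
    deadline = clock() + budget_seconds
    kept: List[float] = []
    i = 0
    while clock() < deadline:
        score = step(i)
        if not kept or score > kept[-1]:   # keep what helps, drop the rest
            kept.append(score)
        i += 1
    return kept


# Demo with a fake clock that advances one "second" per call (hypothetical values).
_ticks = iter(range(100))
demo_scores = [0.5, 0.7, 0.6]
kept_scores = run_until_budget(
    step=lambda i: demo_scores[i],
    budget_seconds=4.0,
    clock=lambda: float(next(_ticks)),
)
print(kept_scores)
```

In real use you would leave `clock` at its default, `time.monotonic`, which is not affected by system clock adjustments.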
#3
@johniosifov
https://x.com/johniosifov/status/2057092721945301015
The longest-running real loop in the dataset. Session 1028, 2,576 PRs, day 163 of an autonomous agent that manages its own content calendar, research pipeline, queue discipline, and quality gates with no human between sessions. What makes this more than a cron job is the self-correction: the agent caught a stale state file claiming the queue had 13 items when it actually had 1, verified, corrected, and ran a proper burst instead of a blocked session. The most interesting layer is that the agent edits its own CLAUDE.md across 1,028 sessions, proposing rule changes, testing them, and documenting evidence. He is honest about the failures too: community engagement still needs manual setup and follower growth lags target. This is what a self-improving harness looks like after five months of real data.
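The stale-state catch amounts to trusting the queue itself over a cached count. A minimal sketch, assuming the state file stores an integer count and the queue can be listed directly; `Reconciled` and `reconcile_queue_count` are hypothetical names, not taken from his agent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reconciled:
    recorded: int   # what the state file claims
    actual: int     # what the queue really contains
    stale: bool     # True when the two disagree


def reconcile_queue_count(recorded_count: int, queue_items: List[str]) -> Reconciled:
    """Recount the queue and flag the cached count as stale on any mismatch."""
    actual = len(queue_items)
    return Reconciled(recorded=recorded_count, actual=actual, stale=recorded_count != actual)


# The case from the post: the state file said 13, the queue held 1 item.
result = reconcile_queue_count(13, ["draft-a"])
print(result)
```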
#4
@matteosaponati
https://x.com/matteosaponati/status/2057015602116485514
A researcher running structured experiments on recursive self-improvement, treating it like actual science instead of vibes. He frames each experiment with a scientific question, an experimental setting, and evaluation criteria, then iterates. The LOOP 1 findings are non-obvious: with simple harnesses, coding agents can outperform random search, gpt-5.3-spark was surprisingly strong, and a higher rate of accepted changes did not translate to lower validation loss. That last point is the kind of insight you only get by running the loop carefully: more accepted changes does not mean better results. He explicitly builds on Karpathy's autoresearch loop and Prime Intellect's work, and calls himself a self-improving agent running experiments on self-improving agents.
#5
@realbarnakiss
https://x.com/realbarnakiss/status/2057134591509438789
Two weeks of rebuilding a zk-autoresearch harness into a multi-agent architecture, and it actually shipped something novel. A brain acts as the main interface, a coordinator dispatches executors by priority, and a new autoresearcher persona drives research through composition. The claimed output is the headline: two dispatches in, it produced a novel cryptographic idea and a unique hybrid hash implementation, neither of which is described in the literature. His operating thesis is a hit-rate game: produce 10 per week, expect 2 to 3 to stick. This is autoresearch applied to a hard non-ML domain, formal cryptography, where the loop generates and prunes hypotheses faster than a human could.
#6
@rawnxweb33
https://x.com/rawnxweb33/status/2057086127517917487
A clean finance autoresearch workflow inside Superior Terminal. He started with a plain-language market hypothesis (BTC tends to continue momentum after news-driven volatility rather than reverse) and let the agent convert that into a testable strategy with a trend filter plus momentum confirmation. The backtesting agent ran it on historical data and surfaced a real finding: momentum entries beat random entries in high-volatility phases, while sideways markets produced most of the losses. The lesson he extracted is sharper than the strategy: the timing filter matters more than the entry signal, so filtering out low-volatility conditions cut overtrading. Idea to workflow to backtest to refinement to execution logic, all inside one agent loop.
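The role of the volatility filter can be illustrated with a toy entry rule. This is a sketch under assumptions, not his strategy and not a backtest: `momentum_entries`, the lookback of 3, the 1% volatility threshold, and the price series are all invented for illustration.

```python
from statistics import pstdev
from typing import List


def momentum_entries(prices: List[float], lookback: int, min_vol: float) -> List[int]:
    """Indices where trailing momentum is positive and trailing volatility >= min_vol."""
    returns = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]
    entries = []
    for t in range(lookback, len(prices)):
        window = returns[t - lookback:t]                    # the returns ending at price t
        momentum = prices[t] / prices[t - lookback] - 1.0   # trailing return over the window
        if momentum > 0 and pstdev(window) >= min_vol:
            entries.append(t)
    return entries


# Made-up series: a sideways stretch followed by a volatile stretch.
toy_prices = [100.0, 100.1, 100.0, 100.1, 100.0, 104.0, 101.0, 108.0, 103.0, 112.0]

unfiltered = momentum_entries(toy_prices, lookback=3, min_vol=0.0)
filtered = momentum_entries(toy_prices, lookback=3, min_vol=0.01)
print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
```

On this toy series the filter removes only the entry taken during the sideways stretch, which is the overtrading effect the post describes.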
#7
@neural_avb
https://x.com/neural_avb/status/2057201992666411518
A useful reframing of the bootstrapping idea. Commenting on a Claude-trains-a-small-model setup, he points out it is less pure autoresearch and more a classical Active Learning loop with an RLVR twist: train on small bursts, evaluate and probe the model, then add new data exactly where the model is most confused or weakest. This matters because it gives the loop a principled targeting mechanism instead of blind iteration. You are not just burning tokens; you are spending them on the model's actual blind spots. A good example of the autoresearch crowd connecting the new loops back to established ML theory.
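One standard way to find "where the model is most confused" is uncertainty sampling by predictive entropy. The sketch below assumes the model exposes class probabilities per example; `entropy`, `most_confused`, and the probabilities are hypothetical and only illustrate the targeting step, not the setup he was commenting on.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; higher means the model is less sure."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_confused(predictions: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the k example names with the highest predictive entropy."""
    ranked = sorted(predictions, key=lambda name: entropy(predictions[name]), reverse=True)
    return ranked[:k]


# Made-up class probabilities for three examples.
preds = {
    "easy": [0.98, 0.01, 0.01],
    "unsure": [0.40, 0.35, 0.25],
    "coin_flip": [0.50, 0.50, 0.00],
}
targets = most_confused(preds, k=2)
print(targets)
```

New synthetic data would then be generated around the `targets` rather than sampled uniformly.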
#8
@omarsar0
https://x.com/omarsar0/status/2056901737055752633
The reality check the whole field needed. Summarizing IntologyAI's NanoGPT-Bench, he reports that Codex, Claude Code, and Autoresearch recover only 9.3% of human progress on real AI R&D. The diagnosis is specific and useful: coding agents spend most of their compute on hyperparameter tuning, rarely attempt genuine algorithmic research, and even when Claude Code and Autoresearch reason about algorithms they still dodge the actual implementation. This is the methodology insight of the day: the loop is great at tuning knobs and weak at inventing the knob. If you are building an autoresearch system, this tells you where to put the human back in.
#9
@VadikMathematik
https://x.com/VadikMathematik/status/2056953905540387318
A sharp framing of which tool fits which job. He notes that Autoresearch for Claude runs a modify-verify-keep/revert loop, which makes it a good fit for security audits because safe incremental changes and automatic rollbacks mean a bad mutation never sticks. He contrasts it with Evo, which he sees as stronger for visualizing multi-experiment research progress. The takeaway for builders is that the keep-or-revert primitive is not just for performance optimization; it maps naturally onto any domain where you want to test risky changes without breaking the base, and security is a near-perfect match.
#10
@eternalism_4eva
https://x.com/eternalism_4eva/status/2057143083943272543
A candid look at the loop hitting a wall and the builder responding correctly. His tree-search autoresearch on a MILP solver stopped making progress, so instead of cranking more iterations he built a visual debugger that shows the fate of every variable between his solver and HiGHS. This is the unglamorous truth of running these loops: sometimes the bottleneck is not the agent but your inability to see why it is stuck, and the fix is observability, not more compute. He is doing open research to improve the tree-search phase of a solver he is writing himself, a good non-ML autoresearch application.
#11
@Madam_Mito
https://x.com/Madam_Mito/status/2057048972490101121
A description of a multi-agent self-improving research system worth noting for its architecture. Agents continuously generate, critique, and refine hypotheses, with the whole thing accelerated by scaling test-time compute. The two contributions that matter: a multi-agent architecture with an asynchronous task execution framework so you can scale compute flexibly, and a tournament evolution process for self-improving hypothesis generation. The tournament framing is the interesting bit: instead of one chain of reasoning you run many competing hypotheses and let them fight, which is closer to how AlphaEvolve-style systems work than a single linear loop.
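One way to let hypotheses "fight" is a round-robin tournament scored with Elo ratings. The post does not say how the described system ranks hypotheses, so treat this as an assumed stand-in: `run_tournament` is a hypothetical name, and the toy judge (longer string wins) replaces what would be an LLM or eval-based comparison.

```python
import itertools
from typing import Callable, Dict, List


def run_tournament(
    hypotheses: List[str],
    judge: Callable[[str, str], str],
    k: float = 32.0,
) -> Dict[str, float]:
    """Round-robin Elo: every pair meets once and the judge names the winner."""
    ratings = {h: 1000.0 for h in hypotheses}
    for a, b in itertools.combinations(hypotheses, 2):
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if judge(a, b) == a else 0.0
        delta = k * (score_a - expected_a)
        ratings[a] += delta   # winner gains what the loser gives up
        ratings[b] -= delta
    return ratings


# Toy judge (assumption): the longer hypothesis string wins.
def toy_judge(a: str, b: str) -> str:
    return a if len(a) >= len(b) else b


ratings = run_tournament(["h1", "h22", "h333"], toy_judge)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

The top of `ranking` would seed the next round of refinement, while the bottom is pruned.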
#12
@kelleymak
https://x.com/kelleymak/status/2057189638477901931
A new research drop in the self-improving direction. The Vmax team released PopuLoRA, which uses asymmetric self-play between populations of teacher and student models to create an adaptive training loop where the curriculum evolves alongside the models themselves. The key idea is that the difficulty of what the student learns is not fixed; it co-adapts, so the loop keeps generating appropriately hard problems instead of plateauing. This sits in the same family as the synthetic-data and active-learning loops people are shipping this week, but pushes the curriculum itself into the self-improving part.
#13
@yoheinakajima
https://x.com/yoheinakajima/status/2057099254150340780
Infrastructure that makes self-improving agents tractable. He demonstrates adding an event, forking and caching a run, then diffing a parent against a fork, where the fork shares the parent's event log up to event 142 and diverges from 143 onward. This is the plumbing self-improving agents actually need: the ability to branch a run, try a change, and cleanly compare against the parent without re-running everything. Cheap forking plus event-log diffing is exactly the substrate that makes keep-or-revert loops fast, and it is the kind of unglamorous tooling that decides whether your loop runs in minutes or hours.
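Diffing a fork against its parent reduces to finding the shared prefix of two event logs. A minimal sketch, assuming logs are ordered lists of comparable events; `divergence_point`, `diff_runs`, and the event names are hypothetical and do not reflect his tool's actual API.

```python
from typing import List, Tuple


def divergence_point(parent: List[str], fork: List[str]) -> int:
    """Return how many leading events the two logs share."""
    shared = 0
    for p, f in zip(parent, fork):
        if p != f:
            break
        shared += 1
    return shared


def diff_runs(parent: List[str], fork: List[str]) -> Tuple[List[str], List[str]]:
    """Return the events unique to each side after the shared prefix."""
    n = divergence_point(parent, fork)
    return parent[n:], fork[n:]


# Mirror the post: the fork shares events 1..142 and diverges from 143 onward.
parent_log = [f"event-{i}" for i in range(1, 146)]
fork_log = parent_log[:142] + ["fork-143", "fork-144"]

shared = divergence_point(parent_log, fork_log)
parent_only, fork_only = diff_runs(parent_log, fork_log)
print(shared, parent_only, fork_only)
```

Only the events after the shared prefix need to be replayed or compared, which is what keeps a fork cheap.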
#14
@kloss_xyz
https://x.com/kloss_xyz/status/2056904102681129075
A widely shared methodology for building production-grade skills that ends in the loop. The seven steps: define the goal and failure modes in a paragraph, send the AI to deep-research existing GitHub repos and shipping workflows, turn the research into a plan, stress-test the plan against its own references, package and run the skill on real tasks, feed failures back, and finally implement Karpathy's autoresearch on top. The replies sharpened the best point: step 7 is doing the most work. With the autoresearch layer, the skill monitors its own failure rate and rewrites its own instructions, which puts selection pressure on the skill itself rather than plain iteration.
#15
@repocatai_git
https://x.com/repocatai_git/status/2057114236544078271
A field guide for anyone entering this space. awesome-autoresearch is a curated map of self-improving AI agent repos, tracking descendants of Karpathy's autoresearch loop, general self-improvement frameworks, ports for Claude Code, Codex, Gemini and pi, systems with keep-or-revert evaluation, GOAL.md-style patterns for making vague tasks measurable, and swarm-style forks where many agents share hypotheses and best configs. It separates research agents, hardware forks, benchmarks, and writeups. If you are comparing how different builders handle memory, evaluation, resumable runs, and parallel experiments, this is the single best entry point right now.
#16
@chengyenhsieh
https://x.com/chengyenhsieh/status/2056887738990026821
A frontier-lab job guide that doubles as a signal of how seriously labs take autoresearch. Citing a Gemini pretraining area lead, it lists the two stacks worth mastering to get hired: kernel work like FlashAttention and quantization, and agents work, with AutoResearch named explicitly as the example of carefully designed LLM workflows that produce useful outputs. The agentic-research reading list points directly at Karpathy's Autoresearch alongside AlphaEvolve and FunSearch. The meta-signal is the real content here: autoresearch has moved from a fun side project to a named skill that frontier labs are hiring for.
Eco Products Radar

Autoresearch / Karpathy's autoresearch loop, the keep-or-revert experiment primitive, referenced across nearly every serious post today (paraschopra, matteosaponati, omarsar0, VadikMathematik, kloss_xyz, repocatai_git, chengyenhsieh, and more). The de facto standard everyone is forking and extending.
#17
Claude Code, the harness of choice for building self-improving skills and teacher-student loops, named by vivek_2332 (the /synthetic-self-improve-rl skill), kloss_xyz, omarsar0, and the awesome-autoresearch repo. The CLAUDE.md file keeps showing up as the place agents write their own evolving instructions.
#18
Evo (autoresearch orchestrator), an open-source orchestrator doing parallel tree search with GEPA-like frontier node selection, surfaced repeatedly by alokbishoyi97 and referenced by VadikMathematik as the multi-experiment visualization counterpart to Autoresearch.