May 26, 2026 · loop

Loop Daily: May 27, 2026

Karpathy tweeted "autoresearch" and a market fell out of the sky. Today made it obvious that the idea has stopped being a clever demo and started becoming infrastructure, with startups, a benchmark war, and a scramble for the name. But the more interesting signal is where the loop is escaping to. It's leaving ML training and showing up in robotics navigation, endometriosis research, software shipping, and skill optimization. The shape underneath all of it is the same: take any problem that has an editable file and a measurable score, wrap it in a build-evaluate-keep-or-revert loop, and let it run. Here's who pushed that idea forward today.
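The build-evaluate-keep-or-revert shape can be sketched in a few lines of Python. This is a toy illustration, not code from Karpathy's repo: the "editable file" is a four-character string, and `keep_or_revert`, `toy_score`, `toy_propose`, and `TARGET` are all hypothetical names invented for the example.

```python
import random
from typing import Callable


def keep_or_revert(
    initial: str,
    propose: Callable[[str], str],
    score: Callable[[str], float],
    steps: int,
) -> tuple[str, float, list[float]]:
    """Greedy loop: keep a proposed edit only if it strictly improves the score."""
    best, best_score = initial, score(initial)
    history = [best_score]
    for _ in range(steps):
        candidate = propose(best)           # build: edit the current best version
        candidate_score = score(candidate)  # evaluate: run the measurable score
        if candidate_score > best_score:    # keep the change...
            best, best_score = candidate, candidate_score
        # ...otherwise revert, i.e. leave `best` untouched
        history.append(best_score)
    return best, best_score, history


# Toy stand-ins (hypothetical): the "file" is a 4-character string and the
# score counts positions that match a fixed target.
TARGET = "loop"
rng = random.Random(0)  # seeded so runs are repeatable


def toy_score(text: str) -> float:
    return float(sum(a == b for a, b in zip(text, TARGET)))


def toy_propose(text: str) -> str:
    i = rng.randrange(len(text))
    return text[:i] + rng.choice("lop") + text[i + 1:]


best, best_score, history = keep_or_revert("xxxx", toy_propose, toy_score, steps=200)
print(best, best_score)
```

Because a change is kept only when the score goes up, the recorded history never decreases. A real loop would swap the string for a file in a repo and the toy score for a benchmark or test run.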
💡#1
@yacinelearning
https://x.com/yacinelearning/status/2058943549521985892
The most-watched autoresearch conversation of the day: a 90-minute deep dive with the Paradigma team on the infrastructure underneath autoresearch. The core thesis is that the DAG is the right substrate for autonomous research: it's how you let agents run experiments, share context between them, and build large public research graphs without the whole thing degenerating into "bad DAGs." This is the part everyone hand-waves past, the boring plumbing of how an agent keeps track of thousands of experiments, and it's exactly what determines whether autoresearch works at scale or collapses under its own context.
💡#2
@JoseCSancho
https://x.com/JoseCSancho/status/2059005860790055163
The sharpest market read of the day. He points out Karpathy's autoresearch repo has ~80k stars and ~11.7k forks yet nobody is fighting for the commercial brand, while Shopify already proved the playbook internally (53% Liquid speedup, 300x unit-test wins, CI builds down 65%) and Google published a $2/hour Cloud Run how-to. His argument: the open gap is the Karpathy loop applied to non-ML domains (RAG retrieval, prompt suites, build-time optimization, trading research), and there is a 12-month window before the big agent labs absorb it. Whether or not you act on it, it's the clearest framing of where the value sits.
💡#3
@Montreal_AI
https://x.com/Montreal_AI/status/2058731326186852382
Points to an ICML 2026 paper, Self-play SWE-RL, that is the loop in its purest form. A coding agent generates its own curriculum in sandboxed repos with zero human-labeled issues: one LLM policy plays bug injector, another plays bug solver, and every failed repair becomes a harder bug that evolves the curriculum. The results are the headline: +10.4 on SWE-bench Verified and +7.8 on SWE-Bench Pro, beating the human-data RL baseline. This is AlphaZero's self-play pointed at software engineering, and it suggests the data bottleneck might just be a choice.
💡#4
@SatyKrish
https://x.com/SatyKrish/status/2059024373734924448
A clean, fully specified self-improvement loop built on Karpathy's repo, aimed at skill files. It runs a SKILL.md against fixed training sections, scores the output with an LLM-judge subagent against a per-skill rubric, then keeps or reverts the change based on the aggregate score, iterating on a dedicated git branch that doubles as the experiment log. The keep-or-revert-on-a-branch mechanic is the whole trick: it's version control as the substrate for automated self-improvement. He notes SkillOpt does something similar.
💡#5
@AgnOps
https://x.com/AgnOps/status/2058956568566042791
The hard numbers behind the skill-optimization loop. SkillOpt from Microsoft Research was run across 6 benchmarks, 7 target models, and 3 execution harnesses, 52 cells in total, and was best or tied on every single one. Concretely: +23.5 points for GPT-5.5 in direct chat, +24.8 inside a Codex agentic loop, +19.1 inside Claude Code, beating human-written skills, one-shot LLM, TextGrad, GEPA and EvoSkill per cell. When an automated loop beats human experts on every cell of a 52-cell grid, the "humans write the skills" era is on notice.
💡#6
@alokbishoyi97
https://x.com/alokbishoyi97/status/2058933449508241547
Building evo, an open autoresearch orchestrator that plugs into any repo, auto-discovers metrics worth optimizing, and runs the loop in parallel. The part that matters for real use: it sets up gates so the agents can't introduce unintended consequences, works with whatever agent you already use, and distributes tasks across your cloud infra or runs locally. The gates are the adult supervision autoresearch has been missing, the thing that turns "let it run overnight" from reckless into deployable.
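One way to picture a gate is as a guard metric that must not regress while the loop chases its target. The sketch below is an assumption about the general idea, not evo's actual API: `Gate`, `passes_gates`, `accept`, and the metric names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str      # guard metric name; higher is assumed to be better
    max_drop: float  # largest regression tolerated relative to the baseline


def passes_gates(
    baseline: dict[str, float], candidate: dict[str, float], gates: list[Gate]
) -> bool:
    """True if no guard metric regresses by more than its allowed drop."""
    return all(candidate[g.metric] >= baseline[g.metric] - g.max_drop for g in gates)


def accept(
    baseline: dict[str, float],
    candidate: dict[str, float],
    target: str,
    gates: list[Gate],
) -> bool:
    """Keep a change only if the target improves and every gate holds."""
    return candidate[target] > baseline[target] and passes_gates(baseline, candidate, gates)


# Hypothetical measurements: optimize throughput without breaking tests or accuracy.
gates = [Gate("tests_passed", 0.0), Gate("accuracy", 0.01)]
baseline = {"throughput": 100.0, "tests_passed": 1.0, "accuracy": 0.90}
fast_but_broken = {"throughput": 140.0, "tests_passed": 0.0, "accuracy": 0.90}
fast_and_safe = {"throughput": 120.0, "tests_passed": 1.0, "accuracy": 0.90}

print(accept(baseline, fast_but_broken, "throughput", gates))  # False
print(accept(baseline, fast_and_safe, "throughput", gates))    # True
```

The faster candidate is rejected because it breaks the test gate, which is the "unintended consequence" case a score-only loop would happily keep.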
💡#7
@vivekchand19
https://x.com/vivekchand19/status/2059037833403511235
Introduces FLYWHEEL.md, an MIT-licensed single file that takes Karpathy's overnight-experiment loop and points it at shipping real software, with humans kept at the gates that matter. He frames an emerging canon of agent files: AGENTS.md (what to do), SOUL.md (who to be), FLYWHEEL.md (how to ship and how to know you did). Each pipeline stage declares "done when ___" and whether the agent proceeds or waits for a human. It's a clean attempt to make the autonomous loop auditable, one file that describes how the whole wheel turns.
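The "done when ___, then proceed or wait" idea can be modelled as a small data structure. This is a sketch of the concept only and does not reproduce the FLYWHEEL.md format: `Stage`, `advance`, the stage names, and the state keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    done_when: Callable[[dict], bool]  # the stage's "done when ___" check
    needs_human: bool                  # True: wait for sign-off; False: proceed


def advance(stages: list[Stage], state: dict) -> tuple[list[str], str]:
    """Walk the stages in order; return (completed stage names, status)."""
    completed: list[str] = []
    for stage in stages:
        if not stage.done_when(state):
            return completed, f"blocked: {stage.name} not done"
        if stage.needs_human and stage.name not in state.get("approved", set()):
            return completed, f"waiting: {stage.name} needs human approval"
        completed.append(stage.name)
    return completed, "shipped"


# Hypothetical pipeline: the agent builds and deploys, a human gates the review.
stages = [
    Stage("build", lambda s: s["tests_pass"], needs_human=False),
    Stage("review", lambda s: s["diff_lines"] <= 200, needs_human=True),
    Stage("deploy", lambda s: s["tests_pass"], needs_human=False),
]

state = {"tests_pass": True, "diff_lines": 40, "approved": set()}
print(advance(stages, state))   # stops at the human gate

state["approved"].add("review")
print(advance(stages, state))   # all three stages complete
```

Writing the gates down as data is what makes the loop auditable: the status string says exactly which stage stopped the wheel and why.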
💡#8
@lesh_bla
https://x.com/lesh_bla/status/2058924147158421513
The cleanest non-ML use case of the day. He wrote a basic navigation algorithm (loop closure, relocalization), established ground truth for scoring with AprilTags, then ran autoresearch until the score was perfect. This is the template stripped to its bones: a problem in code, an objective verifier, and a loop left to grind. Robotics perception, tuned by an agent overnight instead of by a grad student over a month.
💡#9
@romir_jain
https://x.com/romir_jain/status/2058871287989379113
A concrete, cheap optimization loop from Jina AI's Han Xiao: Claude Opus 4 writes Python programs over a frozen embedding API, a harness scores each one, and a long-horizon memory tracks what's already been tried. The numbers are the point: 259 programs across 90 generations for a total cost of about $30. That's the whole pitch for autoresearch in one line: ninety generations of machine-driven iteration for the price of lunch.
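One simple reading of "a memory that tracks what's already been tried" is a store keyed by a hash of each program's source, so the harness never pays to score the same program twice. The post does not say how the real memory works; `TriedMemory`, `evaluate_once`, and `toy_harness` below are hypothetical.

```python
import hashlib
from typing import Callable


class TriedMemory:
    """Remembers every scored program so a repeat is never re-evaluated."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}

    @staticmethod
    def _key(source: str) -> str:
        # Strip trailing whitespace so trivially different copies share a key.
        normalized = "\n".join(line.rstrip() for line in source.strip().splitlines())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def seen(self, source: str) -> bool:
        return self._key(source) in self._scores

    def record(self, source: str, score: float) -> None:
        self._scores[self._key(source)] = score

    def __len__(self) -> int:
        return len(self._scores)


def evaluate_once(
    memory: TriedMemory, source: str, harness: Callable[[str], float]
) -> bool:
    """Score `source` unless it was already tried; return True if it was new."""
    if memory.seen(source):
        return False
    memory.record(source, harness(source))
    return True


calls: list[str] = []


def toy_harness(source: str) -> float:
    # Placeholder scorer (hypothetical): records the call and returns a dummy score.
    calls.append(source)
    return float(source.count("return"))


memory = TriedMemory()
first = "def f(x):\n    return x\n"
same_with_spaces = "def f(x):   \n    return x"
print(evaluate_once(memory, first, toy_harness))             # True
print(evaluate_once(memory, same_with_spaces, toy_harness))  # False
```

Exact-match hashing only catches literal repeats. A longer-horizon memory would presumably also summarize which ideas failed, which this sketch does not attempt.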
💡#10
@filipwojda
https://x.com/filipwojda/status/2058999632655429674
Short, but a real signal of spread: he's running autoresearch over endometriosis. The loop is leaving code and ML entirely and showing up in medical research, pointed at a condition that's notoriously under-studied. This is the future the optimists keep describing: anyone with a measurable question and a corpus can spin up a tireless research agent, no lab required.
💡#11
@morgymcg
https://x.com/morgymcg/status/2058928240106914295
Hard-won lessons from actually building an autoresearch agent, matching what an auto-kernel-generation report found independently: clear experiment management is critical, and you must avoid dragging research and bug-fixes back into the main context. His sharpest insight is that adding sub-agents to fill a specific observed reasoning gap, in this case the agent never inspecting the input data, beats adding generic capacity. The failures of these loops are mostly context-management failures, not intelligence failures.
💡#12
@samrexford
https://x.com/samrexford/status/2058737184447091168
Shipped /autodev, a Claude skill that implements the autoresearch loop for general development: a continuous build-evaluate-iterate cycle where the agent writes code, runs local tests, commits, and keeps going until it hits an explicit definition of done. It's a small thing, but it's the loop packaged as a one-command skill anyone can drop in, which is how an idea goes from Karpathy's repo to everyone's terminal.
💡#13
@mourginakis
https://x.com/mourginakis/status/2058768439892934828
A precise reward-shaping idea for anyone running these loops: agents often produce unreadable, contrived solutions, so feed deterministic metrics from a linter like ruff back into the autoresearch loop, or add something like log(lines-of-code) to the loss to penalize bloat. It's a one-sentence fix to a real failure mode. The loop optimizes exactly what you measure, so if you only measure correctness you get correct slop. Measure readability too.
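A minimal sketch of that shaping, assuming the loop minimizes a loss: add a weighted log(lines-of-code) term plus a weighted lint-violation count. The function names and weights are hypothetical, and the violation count is passed in as a plain integer rather than read from a real linter run.

```python
import math


def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor pure comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def shaped_loss(
    task_loss: float,
    source: str,
    lint_violations: int = 0,
    loc_weight: float = 0.1,   # hypothetical weight on the bloat penalty
    lint_weight: float = 0.05,  # hypothetical weight per lint violation
) -> float:
    """Task loss plus penalties for bloat and linter findings."""
    loc = max(count_code_lines(source), 1)  # guard against log(0)
    return task_loss + loc_weight * math.log(loc) + lint_weight * lint_violations


concise = "def total(xs):\n    return sum(xs)\n"
bloated = (
    "def total(xs):\n"
    "    result = 0\n"
    "    index = 0\n"
    "    while index < len(xs):\n"
    "        result = result + xs[index]\n"
    "        index = index + 1\n"
    "    return result\n"
)

# Both are equally correct (task_loss 0.0), but the concise one now wins.
print(shaped_loss(0.0, concise) < shaped_loss(0.0, bloated))  # True
```

The logarithm keeps the penalty gentle: doubling the length of a solution costs the same fixed amount whether it is short or long, so the loop is nudged toward brevity without being pushed into code golf.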
💡#14
@Moustafa_Awad
https://x.com/Moustafa_Awad/status/2058932854814384419
A tight description of the practical agent loop that actually holds up in production: do the work, turn the repeatable parts into skills, add evals and integration tests, then let cron handle the check-backs. His point cuts against the planning fetish: reliability comes from the boring loop wrapped around the agent, not from a smarter plan up front. This is the unglamorous version of autoresearch that most people will actually run.
📡 Eco Products Radar

Karpathy's autoresearch repo (~80k stars, ~11.7k forks) is the gravitational center of the entire conversation, cited as the inspiration or base for nearly every project here, from skill-improvement loops to FLYWHEEL.md to overnight model experiments.

evo (@alokbishoyi97) is the open orchestrator gaining the most traction: plug autoresearch into any repo, auto-suggest metrics, run loops in parallel with gates, and distribute across cloud or local machines. It's the "make autoresearch deployable for normal people" play.

Paradigma is the infrastructure bet, treating the DAG as the unit of autonomous research and building the experiment-tracking substrate the rest of the field is hand-waving past.

SkillOpt (Microsoft Research) is the benchmark heavyweight, an automated skill-optimization loop that was best-or-tied across all 52 model-benchmark-harness cells, beating human-written skills and prior methods like TextGrad and GEPA.

FLYWHEEL.md (@vivekchand19) is the new artifact to watch, a single MIT-licensed file that makes the autonomous ship-loop auditable, joining AGENTS.md and SOUL.md in the emerging agent-repo canon.