May 25, 2026 | loop

Loop Daily: 2026-05-26

A quieter, more honest day for the autoresearch crowd. The loudest single data point cut against the hype: a fresh benchmark says today's coding agents recover under 10 percent of human AI-R&D progress, mostly because they tune hyperparameters and dodge the actual algorithmic work. But underneath that sobering number, the real loops are getting concrete: an agent rewriting train.py and grinding through a dozen experiments an hour, a researcher loop that automates jailbreak testing, and the first big surveys drawing a hard line between where autonomous research helps and where it quietly breaks. The theme of the day is calibration: less "can it do everything," more "here is exactly what it can and cannot do."
💡 #1
@rohit4verse
https://x.com/rohit4verse/status/2058841697333948858
The cleanest live autoresearch loop of the day. Instead of a fixed hyperparameter sweep, the agent rewrites arbitrary parts of train.py, so the search space is whatever it can invent and safely implement, not a predefined grid. It clocks roughly 12 runs an hour with a full audit trail written to results.tsv. That last detail matters: an autonomous experiment loop is only trustworthy if every run leaves a paper trail you can reconstruct later.
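The paper-trail idea is easy to sketch. The snippet below is a minimal illustration, not the loop from the post: the column names, `log_run`, and `load_runs` are all hypothetical, and the only detail taken from the source is that each run is recorded in a file called results.tsv.

```python
import csv
import os
import time

# Hypothetical column layout; the post only says results.tsv holds the audit trail.
FIELDS = ["run_id", "timestamp", "change_summary", "val_loss", "kept"]


def log_run(path, run_id, change_summary, val_loss, kept):
    """Append one experiment record to a TSV file, writing the header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow({
            "run_id": run_id,
            "timestamp": int(time.time()),
            "change_summary": change_summary,  # what the agent changed in train.py
            "val_loss": f"{val_loss:.4f}",
            "kept": int(kept),  # 1 if the change was kept, 0 if reverted
        })


def load_runs(path):
    """Read every logged run back as a list of dicts, in the order they were written."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

Appending one row per run, rather than rewriting the file, means a crashed or reverted experiment still leaves its record behind, which is the property that makes the trail reconstructable.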
💡 #2
@omarsar0
https://x.com/omarsar0/status/2056901737055752633
The reality check the whole field needed. On the NanoGPT-Bench eval, Codex, Claude Code, and Autoresearch recover only 9.3 percent of human progress on real AI R&D. The breakdown is the interesting part: coding agents spend most of their compute on hyperparameter tuning and rarely even attempt algorithmic research, and while Claude Code and Autoresearch reason more about algorithmic ideas, they still dodge the implementation. Self-improving agents are real, but this is the number that says how far the autonomy actually goes right now.
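For intuition on what "recovering a share of human progress" can mean, here is one plausible way to compute such a number. This is an assumption, not the benchmark's documented metric: it treats the score as lower-is-better, and `fraction_recovered` and the example values are made up for illustration.

```python
def fraction_recovered(baseline, human_best, agent_best):
    """Share of the human improvement over the baseline that the agent also achieved.

    Assumes a lower-is-better metric (for example, a loss or a time-to-target).
    This is an illustrative definition, not the benchmark's official one.
    """
    human_gain = baseline - human_best
    if human_gain <= 0:
        raise ValueError("human_best must improve on the baseline")
    return (baseline - agent_best) / human_gain


# Made-up numbers: humans moved the metric from 10.0 to 4.0, the agent reached 9.4.
share = fraction_recovered(baseline=10.0, human_best=4.0, agent_best=9.4)
print(f"{share:.1%} of human progress recovered")
```

Under a definition like this, an agent that only nudges the baseline through tuning scores close to zero even if every run "improves" something, which is consistent with the picture the post describes.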
💡 #3
@neural_avb
https://x.com/neural_avb/status/2057201992666411518
A genuinely self-improving loop applied to training a small language model. Claude is bootstrapped to run it, but the structure is a classical active-learning loop with an RLVR (reinforcement learning with verifiable rewards) twist: train the model on small bursts of data, evaluate and probe it, then add new data exactly where the model is weakest or most confused, and repeat. He credits auto-research as the inspiration but reframes it in active-learning terms, which is a more useful mental model than autonomous magic.
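The train, probe, and add-data-where-weakest cycle can be sketched in a few lines. Everything here is a toy stand-in under stated assumptions: the "model" is just a counter of examples seen per topic, `evaluate` is a fake probe, and the RLVR part is not modeled at all. It shows the shape of the loop, not the author's implementation.

```python
from collections import Counter


def train(model, batch):
    """Toy training burst: count how many examples the model has seen per topic."""
    model.update(topic for topic, _ in batch)


def evaluate(model, topics):
    """Toy probe: the score per topic rises with examples seen, capped at 1.0."""
    return {t: min(1.0, model[t] / 10) for t in topics}


def pick_weakest(scores):
    """Return the lowest-scoring topic; ties are broken alphabetically."""
    return min(sorted(scores), key=lambda t: scores[t])


def active_learning_loop(pool, topics, rounds, burst_size):
    """Repeatedly probe the model, then train on a small burst from its weakest topic."""
    model = Counter()
    history = []
    for _ in range(rounds):
        scores = evaluate(model, topics)
        target = pick_weakest(scores)
        # Draw the next burst only from the topic where the model scored lowest.
        take = min(burst_size, len(pool[target]))
        batch = [(target, pool[target].pop()) for _ in range(take)]
        train(model, batch)
        history.append(target)
    return model, history


# Hypothetical data pool: 20 placeholder examples for each of three topics.
pool = {t: [f"{t}-example-{i}" for i in range(20)] for t in ("algebra", "grammar", "logic")}
model, history = active_learning_loop(pool, list(pool), rounds=6, burst_size=2)
print(history)
```

In a real setup the probe would be an actual evaluation set and the pool would be generated or retrieved on demand; the sketch also does nothing special when a topic's pool runs dry, which a real loop would need to handle.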
💡 #4
@HuggingPapers
https://x.com/HuggingPapers/status/2056783143139725339
The first survey to map the whole auto-research landscape: 250-plus papers covering AI across the complete research lifecycle, from idea generation to dissemination. Its key contribution is drawing a sharp boundary between reliable assistance and unreliable autonomy, even as systems now crank out full papers for around 15 dollars. If you want the map before you wander into building your own research loop, this is it.
💡 #5
@wildmindai
https://x.com/wildmindai/status/2057416041358032938
A companion to the survey: an Awesome AI Auto-Research guide, a comprehensive walkthrough of automating the scientific research lifecycle, paired with a GitHub repo collecting papers and code on agentic AI research. The framing is the shift everyone's circling, moving AI from a simple assistant to an autonomous researcher, with the references to actually go build it.
💡 #6
@tom_doerr
https://x.com/tom_doerr/status/2058758398854795494
A concrete researcher-agent loop pointed at security: it automates LLM jailbreak experiments. Instead of a human manually probing a model for weaknesses, the loop runs the experiments itself and iterates. It's a small but telling example of autoresearch escaping the ML-training niche into red-teaming, where the experiment-iterate-record cycle maps cleanly onto adversarial testing.
💡 #7
@Vizzyy_01
https://x.com/Vizzyy_01/status/2058466337702248852
The non-coding angle on agentic loops. Why pay a creative agency to slowly map a three-month, multi-channel distribution strategy, he asks, when you can drop an application link into an agentic loop and have the whole distribution framework live in minutes? It's a marketing-operations framing of the same idea: the loop as a way to compress a multi-week professional deliverable into a single autonomous run.
💡 #8
@michelleefang
https://x.com/michelleefang/status/2059019530467422365
A signal that autoresearch is becoming its own ecosystem: an Autoresearch Systems Hackathon with Modal, OpenAI, Raindrop and Antler. When the infra players (Modal for compute, OpenAI for models) start co-sponsoring hackathons around the category, it's a sign the loop is moving from individual experiments to a shared tooling stack people are building businesses on.
📡 Eco Products Radar

A thin but pointed day. The center of gravity is still Karpathy's autoresearch project as the reference design, with Claude Code and Codex as the agents people actually run the loops on, and Modal showing up as the compute layer underneath the hackathon scene.

Autoresearch (Karpathy's project): the reference design everyone benchmarks and builds against
Claude Code: one of the two agents people run their research and experiment loops on
Codex: the other default loop runner, measured head-to-head on NanoGPT-Bench
Modal: the serverless compute layer surfacing under the autoresearch hackathon stack