Loop Daily: 2026-06-06
Autoresearch stopped being Karpathy's demo this week and became a thing people point at everything. The headline cases are hard to believe when lined up together: a fleet of agents authoring three survey papers that average 8.5 out of 10 on peer review with zero human-written paragraphs, an open-source agent that rewrites its own weights for a reported 502 percent improvement on a biology task, and a contributor with no quantum background posting the biggest gain on a cryptography benchmark purely through better harness design. The loop is also leaking out of the lab: into cold outbound sales, into hunting DMT vectors inside a model, into a generated video game, into a live ecommerce store. The one refrain underneath all of it is that the loop closes everything except research taste, and that is exactly where the frontier now sits.
#1
@victor207755822
https://x.com/victor207755822/status/2062585403136508400
The Deli AutoResearch project shipped three complete survey papers this week, one brand-new and two updated, with not a single human-written paragraph. The numbers are the story: 941 references across 190 pages and an average peer-review score of 8.5 out of 10, up from 6.0 after fourteen rounds of AI-driven revision, all in about thirty-eight hours of runtime. This is autoresearch pointed straight at scientific writing, and the bottleneck the author names is telling: it is no longer writing quality but research taste. The next target is hypothesis generation and novelty detection for fully original work.
#2
@AGIHouseSF
https://x.com/AGIHouseSF/status/2062597443745919352
SIA, an open-source agent out of recent research, does something that still sounds like science fiction: it rewrites both its own harness and the underlying model weights. The reported results are not modest: a 56.6 percent gain on LawBench, a 502 percent improvement on single-cell RNA denoising, and a 91.9 percent runtime reduction on GPU kernels. This is the recursive-self-improvement loop with the safety rails off, with an agent editing the two things we usually treat as fixed. The crowd is excited enough to throw an emergency hackathon for it.
#3
@0xkydo
https://x.com/0xkydo/status/2062565216919908360
Sixty hours after launching the ecdsa.fail challenge, the most surprising result was who topped the board. The biggest single improvement came from someone who knew far less about quantum computing and elliptic curves than the original author, and whose own autoresearch runs had been plateauing for days. Over a weekend he improved the benchmark result by about fifty percent, with no domain knowledge, purely on the back of a tighter prompt system and a better harness around the agent. Watching harness design beat domain depth in real time is the quietly important lesson of the week.
#4
@dair_ai
https://x.com/dair_ai/status/2062570078705688777
A new benchmark called AutoLab asks a sharp question: can an agent keep improving an artifact for hours under a strict wall-clock budget, the way real research and engineering actually work? It hands seventeen frontier models thirty-six expert-curated tasks, each starting from a correct but deliberately suboptimal baseline. The dominant predictor of success was not the quality of the first attempt but persistence: repeatedly benchmarking, editing, and folding in feedback. Claude Opus 4.6 sustained that loop well, while most other models quit early or burned the budget while making almost no progress.
#5
@brandon_ai
https://x.com/brandon_ai/status/2062664461660696915
Karpathy shipped autoresearch as a loop that makes an AI system improve itself: fixed metric, one variable, repeat. This builder took that exact loop and ported it to cold outbound sales, calling it AutoGTM and open-sourcing it under MIT. It is a small but important proof that the loop is domain-agnostic: the same mechanism that optimizes a kernel can optimize an email sequence. Any problem with an editable file and a measurable score is fair game.
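The "fixed metric, one variable, repeat" pattern can be sketched in a few lines of Python. This is an illustration only, not Karpathy's or AutoGTM's code: the `score` function, the `propose` step, and the toy parameter dictionary are all assumptions standing in for a real editable file and a real measurement.

```python
import random

def score(params):
    # Hypothetical fixed metric (higher is better), peaking at x=3, y=-1.
    return -((params["x"] - 3.0) ** 2) - ((params["y"] + 1.0) ** 2)

def propose(params, rng):
    # Change exactly one variable per iteration.
    key = rng.choice(sorted(params))
    candidate = dict(params)
    candidate[key] += rng.uniform(-0.5, 0.5)
    return candidate

def autoresearch_loop(params, steps, seed=0):
    rng = random.Random(seed)
    best, best_score = dict(params), score(params)
    for _ in range(steps):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        # Keep a change only if it improves the fixed metric; otherwise revert.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

if __name__ == "__main__":
    start = {"x": 0.0, "y": 0.0}
    best, best_score = autoresearch_loop(start, steps=500)
    print(best, best_score)
```

In a real port, `propose` would be an agent editing a file and `score` would be whatever the domain can measure, such as kernel runtime or reply rate on an email sequence.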
#6
@pj4533
https://x.com/pj4533/status/2062667492959404454
Here is the strangest application of the week, and the most fun. The author is using an autoresearch loop to hill-climb toward the injected activation vector that makes an LLM report the maximum number of DMT-like phenomenological features, running it on Gemma-3-12b. It sounds like a party trick, but it is real interpretability work: using the loop to search a model's activation space for a target behavior. The point that lands is that the loop turns weird, hard-to-specify research questions into something you can just optimize toward.
#7
@sambarrowclough
https://x.com/sambarrowclough/status/2062588293905084787
Seven months into a project, the author finally shipped it, and one of the things the team did was run a flavour of Karpathy's autoresearch to improve concrete product dimensions: answer correctness, course-creation time, and removal of duplicate questions. This is the unglamorous, real version of the loop: not chasing a SOTA paper but tuning the metrics of a shipping education product. It is exactly the kind of grunt-work optimization the labs say agents can do perfectly, applied in the wild.
#8
@matteosaponati
https://x.com/matteosaponati/status/2062540779977924706
The author is running a disciplined personal program: experimenting with coding agents for autoresearch loops, running batches of experiments every week, and documenting the results as he goes. This week's stress test is devious: he puts the agents in environments where the evaluations always return random Gaussian noise no matter what the agent does. It is a clever probe of whether an autoresearch agent can tell signal from luck, which is exactly the failure mode that wrecks naive optimization loops.
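A small Python simulation shows why a pure-noise evaluator is such an effective trap. It is a sketch under stated assumptions, not the author's setup: `noisy_eval`, the 200-step budget, and the re-evaluation count are all invented for illustration. A loop that keeps the best single score will always "find" a winner, and only fresh, repeated evaluations reveal that nothing improved.

```python
import random
import statistics

def noisy_eval(_candidate, rng):
    # The evaluation ignores the candidate entirely: pure N(0, 1) noise.
    return rng.gauss(0.0, 1.0)

def naive_loop(steps, rng):
    # Keep whichever candidate scored highest on a single evaluation.
    best_id, best_score = 0, noisy_eval(0, rng)
    for candidate in range(1, steps):
        s = noisy_eval(candidate, rng)
        if s > best_score:
            best_id, best_score = candidate, s
    return best_id, best_score

def reevaluate(candidate, rng, repeats=200):
    # Average fresh, independent evaluations of the supposed winner.
    return statistics.fmean([noisy_eval(candidate, rng) for _ in range(repeats)])

rng = random.Random(0)
best_id, claimed = naive_loop(200, rng)
confirmed = reevaluate(best_id, rng)
print(f"claimed={claimed:.2f} confirmed={confirmed:.2f}")
```

The claimed score is just the maximum of 200 noise draws, so it looks like progress; the confirmed score hovers near zero. An agent that re-measures before accepting a change has a chance of noticing the difference.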
#9
@DanKornas
https://x.com/DanKornas/status/2062587935606911137
Prompt jailbreak experiments get messy fast, so the author turned them into a loop. Jailbreak Autoresearch is a small autoresearch harness for prompt experiments with separate target, researcher and scorer models. It compares header and footer harnesses against one fixed test body, scores each response against a rubric, and saves the full experiment trail in SQLite. It runs baseline, seeded, evolve-best and recombine strategies and permutes model roles, all under an MIT license. It is a clean template for turning any fuzzy prompt-tuning task into a reproducible search.
#10
@gauthampai
https://x.com/gauthampai/status/2062642566978478181
The author makes the case for building a prompt-to-DAG workflow generator yourself: take a prompt and turn it into a workflow with a clean separation between deterministic and stochastic stages, typed inputs and outputs, step-by-step control, and on-the-fly improvement. To demonstrate, he built a DAG workflow for Karpathy's autoresearch project, where the blue stages skip the LLM entirely and the orange ones are the stochastic calls. The insight is that the orchestration around the loop matters as much as the loop itself; structure is where reliability comes from.
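The separation of deterministic and stochastic stages can be sketched with the standard library's `graphlib`. This is not the author's generator: the `Stage` type, the stage names, and the stubbed LLM call are hypothetical, chosen only to show a DAG in which each stage declares its dependencies and whether it would call a model.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class Stage:
    name: str
    run: Callable[[Dict[str, object]], object]
    deps: Tuple[str, ...] = ()
    stochastic: bool = False  # True marks a stage that would call an LLM

def run_dag(stages):
    by_name = {s.name: s for s in stages}
    graph = {s.name: set(s.deps) for s in stages}
    results, trace = {}, []
    # static_order() yields each stage only after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        stage = by_name[name]
        inputs = {dep: results[dep] for dep in stage.deps}
        results[name] = stage.run(inputs)
        trace.append((name, "stochastic" if stage.stochastic else "deterministic"))
    return results, trace

# Hypothetical stages; the "LLM" stage is a fixed stub so the sketch runs offline.
stages = [
    Stage("load_metric", lambda _: 0.50),
    Stage("propose_edit", lambda _: "lr=0.01", deps=("load_metric",), stochastic=True),
    Stage("run_experiment",
          lambda inp: 0.62 if inp["propose_edit"] == "lr=0.01" else 0.50,
          deps=("propose_edit",)),
    Stage("keep_or_revert",
          lambda inp: inp["run_experiment"] > inp["load_metric"],
          deps=("load_metric", "run_experiment")),
]

results, trace = run_dag(stages)
print(trace)
print("keep change:", results["keep_or_revert"])
```

Because only `propose_edit` is marked stochastic, everything else can be tested, cached, and replayed like ordinary code.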
#11
@ModernGrindTech
https://x.com/ModernGrindTech/status/2062675020803916234
The author cuts to the heart of self-improving agents: with 3,900 skills in his own agent repo, the inflection point was not the quantity of skills. It was the moment the loop started writing new skills from its own session feedback, overnight, without him. That is the line between a static toolbox and a system that compounds: the agent notices what it struggled with and authors the fix while you sleep. This is the part that actually scales.
#12
@LeoYu926
https://x.com/LeoYu926/status/2062420061537886664
An operator running AI agents on live ecommerce (Shopee Thailand plus Pinterest and Facebook) confirms what the researchers keep saying: the agent loop is the easy part. Ninety percent of his time goes into the harness: what the agent can't touch, how context carries across sessions, and which rules need the why spelled out so the agent doesn't reason its way around them. The part everyone skips is session persistence, because every new session starts blank and someone has to build the bridge. This is the unsexy reality of running loops in production.
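One minimal form of that bridge is a state file that each session reads on start and rewrites on exit. The sketch below is an assumption about how such a bridge could look, not the operator's implementation; the file layout, field names, and functions are hypothetical.

```python
import json
from pathlib import Path

def load_state(path: Path) -> dict:
    # A new session starts blank unless a previous one left state behind.
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"sessions": 0, "notes": []}

def save_state(path: Path, state: dict) -> None:
    # Write to a temporary file first, then rename, so a crash mid-write
    # cannot leave a half-written state file for the next session.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)

def run_session(path: Path, note: str) -> dict:
    state = load_state(path)
    state["sessions"] += 1
    # Notes carry rules together with their reasons into the next session.
    state["notes"].append(note)
    save_state(path, state)
    return state
```

Calling `run_session(Path("agent_state.json"), "Do not edit live prices: changes need human sign-off")` twice leaves a file with two sessions and both notes, which the next session can load into its context.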
#13
@QuchengG
https://x.com/QuchengG/status/2062368462497042813
The author built Gongent, a builder-and-adversary agent loop, and set a new state of the art on ProgramBench: three perfect 100 percent reconstructions of black-box CLIs, where every prior public entry, including the top-ranked gpt-5.5-xhigh, managed only one. The builder is a vanilla mini-swe-agent with zero per-task tuning, so all the lift comes from the loop itself: an adversary synthesizes thousands of tests grounded in the gold binary, then a byte-exact compare-and-fix loop iterates until convergence. It is a clean demonstration that the loop, not the base prompt, is where the performance lives.
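The adversary plus byte-exact compare-and-fix idea can be illustrated with a toy in Python. Nothing here is Gongent's code: `gold` is a stand-in function for the black-box binary, the three `attempt_*` functions stand in for successive builder revisions, and the random test generator is a much cruder adversary than the one described.

```python
import random

def gold(data: bytes) -> bytes:
    # Stand-in for the black-box binary: uppercase ASCII, then reverse.
    return data.upper()[::-1]

# Hypothetical successive builder attempts, each closer to the gold behavior.
def attempt_1(data: bytes) -> bytes:
    return data[::-1]

def attempt_2(data: bytes) -> bytes:
    return data.upper()

def attempt_3(data: bytes) -> bytes:
    return data.upper()[::-1]

def adversary_tests(n, seed=0):
    # Generate many random inputs; outputs are checked against gold.
    rng = random.Random(seed)
    alphabet = b"abcXYZ019 "
    return [bytes(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
            for _ in range(n)]

def first_mismatch(candidate, tests):
    # Byte-exact comparison: any differing output counts as a failure.
    for t in tests:
        if candidate(t) != gold(t):
            return t
    return None

def compare_and_fix(attempts, tests):
    # Iterate through revisions until one matches gold on every test.
    for round_no, candidate in enumerate(attempts, start=1):
        if first_mismatch(candidate, tests) is None:
            return round_no, candidate
    return None, None

tests = adversary_tests(1000)
round_no, winner = compare_and_fix([attempt_1, attempt_2, attempt_3], tests)
print("converged in round", round_no)
```

In the real loop the failing input from `first_mismatch` would be fed back to the builder as the next thing to fix, rather than stepping through a prewritten list of attempts.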
#14
@willemhelmet
https://x.com/willemhelmet/status/2062557704313352699
The author made a video game out of world models. Inspired by an article on the technique, he built his own World Model Harness using LingBot-World for real-time frame generation paired with a VLM, creating an agentic loop where the user actually interacts with the generated environment as it is produced. This is the loop applied somewhere completely outside research or coding: a playable, generated world held together by an agent observing and acting frame by frame. It is a glimpse of how generative agents leak into entertainment.
#15
@nathancgy4
https://x.com/nathancgy4/status/2062621453892378860
His first vibe-check for any new LLM is now a long prompt asking it to come up with model-architecture ideas, because that surfaces the model's taste most directly at a time when most coding tasks no longer separate one model from another. He is bullish on more open-ended evals like this, and names autoresearch and iterative kernel optimization as the two best tasks of this shape. His nuance is worth keeping: autoresearch excites him precisely because its most fundamental component bottoms out in raw model intelligence, which means good pretraining stays vital. Right now it feels more like a benchmark than a tool.
#16
@MOkradze
https://x.com/MOkradze/status/2062520033465798823
A short but sharp design rule for self-improving systems: let agents learn repeated work, but make the learning reviewable before it changes future runs. Self-improving tools are useful, he writes, while silent self-modifying tools are how you get weird failures. As more people wire up loops that rewrite their own skills overnight, this is the guardrail that keeps the compounding from quietly going off the rails: a human-readable diff between what the agent learned and what it is allowed to keep.
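A reviewable learning step can be as simple as producing a unified diff and applying it only on approval. The sketch below uses Python's standard `difflib`; the skill text, function names, and the boolean approval flag are illustrative assumptions rather than anything from the original post.

```python
import difflib

def propose_change(current: str, learned: str) -> str:
    # Render the agent's proposed edit as a human-readable unified diff.
    return "".join(difflib.unified_diff(
        current.splitlines(keepends=True),
        learned.splitlines(keepends=True),
        fromfile="skill (current)",
        tofile="skill (proposed)",
    ))

def apply_if_approved(current: str, learned: str, approved: bool) -> str:
    # Future runs see the learned version only after explicit approval.
    return learned if approved else current

current_skill = "step: fetch page\nstep: parse table\n"
learned_skill = current_skill + "step: retry once on timeout\n"

diff = propose_change(current_skill, learned_skill)
print(diff)

# Without approval the skill is unchanged; with approval it is updated.
unchanged = apply_if_approved(current_skill, learned_skill, approved=False)
updated = apply_if_approved(current_skill, learned_skill, approved=True)
```

The important property is that the diff exists as an artifact before anything changes, so a weird failure can be traced to a specific approved edit.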
Eco Products Radar
autoresearch (Karpathy): the framework everything else this week is built on, forked, or ported from; cited in nearly every case.
Deli AutoResearch: the open-source skill set behind the LLM-authored survey papers; the most concrete evidence that autoresearch can produce publishable-grade output.
EVO: Alok Bishoyi's open-source autoresearch orchestrator that runs parallel experiments and keeps only gated, metric-improving changes; integrates with Claude Code and Cursor.
Hermes Agent: Nous Research's self-improving local agent; the most-named vehicle for the overnight skill-writing loop people keep describing.