Loop Daily: 2026-06-04
If you want a single story that captures what autoresearch actually is, it happened this week and it's about quantum computers, not coding. Google published a record-setting quantum circuit for breaking Bitcoin's elliptic curve, then hid the optimization behind a zero-knowledge proof. Two months later that proof became the perfect reward function: anyone can point an agent at ecdsa.fail, let it grind overnight, and submit a smaller circuit. Amateurs, including an undergrad with a $200 Codex subscription and at least one teenager, have now beaten the best published result and are closing on Google's secret one. That's the whole thesis of autoresearch in one event: a measurable target plus a loop plus cheap compute turns crowds of non-experts into a research engine. The rest of this week's signal sits underneath that headline, where people are hill-climbing their own skill files, building self-improving memory, and running research pipelines while they sleep.
#1
@pratikgx
https://x.com/pratikgx/status/2061615319207338413
The cleanest possible demonstration of autoresearch as a force multiplier. A 23-year-old undergrad with a MacBook, no quantum training, and a $200 Codex subscription drove the best published ECDSA-breaking quantum circuit down by 2x, using an autoresearch loop left running overnight. The leaderboard is live at ecdsa.fail and his invitation is the point: fork the repo, point your agent at it, see who passes Google's withheld result first. This is the entire autoresearch playbook compressed into one tweet, a measurable objective plus an overnight loop plus commodity hardware beating a frontier lab's specialists.
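For readers who want the shape of that loop in code, here is a minimal sketch, assuming a toy problem rather than the real one. The three-bit gate set, the `run`, `verify`, and `hill_climb` names, and the delete-a-gate mutation are all hypothetical stand-ins; none of this is the ecdsa.fail verifier or anyone's actual agent. The point it illustrates is only that a verifier which returns "valid, and this big" is already a reward function.

```python
import random
from itertools import product

# Toy circuit: a list of gates on 3 classical bits (an assumption for illustration).
# ("X", i) flips bit i; ("CX", c, t) flips bit t when bit c is set.

def run(circuit, bits):
    """Apply every gate in order and return the resulting bit tuple."""
    state = list(bits)
    for gate in circuit:
        if gate[0] == "X":
            state[gate[1]] ^= 1
        else:
            _, control, target = gate
            state[target] ^= state[control]
    return tuple(state)

def verify(candidate, reference, n_bits=3):
    """Reward function: gate count if candidate matches reference on all inputs, else None."""
    for bits in product((0, 1), repeat=n_bits):
        if run(candidate, bits) != run(reference, bits):
            return None
    return len(candidate)

def hill_climb(reference, steps=2000, seed=0):
    """Propose random deletions and keep only those the verifier accepts."""
    rng = random.Random(seed)
    best = list(reference)
    for _ in range(steps):
        if not best:
            break
        start = rng.randrange(len(best))
        width = rng.choice((1, 2))  # drop one gate or two adjacent gates
        candidate = best[:start] + best[start + width:]
        score = verify(candidate, reference)
        if score is not None and score < len(best):
            best = candidate
    return best

# Hypothetical starting circuit with a redundant X-X pair at the front.
REFERENCE = [("X", 0), ("X", 0), ("CX", 0, 1), ("X", 2), ("CX", 1, 2)]

if __name__ == "__main__":
    smaller = hill_climb(REFERENCE)
    print(len(REFERENCE), "->", len(smaller), "gates")
```

A real agent would replace the random deletion with model-proposed rewrites, but the accept rule stays the same: nothing counts unless the verifier says so.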
#2
@jt_rose
https://x.com/jt_rose/status/2061613658476880031
The team behind the platform explains what they actually built. Two students on the Eigen Labs team spent a weekend trying to reproduce a quantum circuit that one of the best-resourced research teams on earth had built but withheld, using a multiplayer version of Karpathy's autoresearch. They got within 2x of the secret result, then built a platform so anyone can spin up an agent and try to move the benchmark. His framing is the part worth stealing: the open question isn't whether one agent can do it, it's what happens when a hundred people are incentivized to attack the same problem in parallel, sharing what works while a leaderboard moves in real time. He calls it the beginning of open agentic science, and for once the phrase isn't hype.
#3
@drakefjustin
https://x.com/drakefjustin/status/2061793725299224676
An insider account from a co-author of the original Google paper, and it's the richest single explanation of why this matters. He details how the ZK verifier built to hide the circuit turned out to be an ideal reward function for AIs, how the ecdsa.fail challenge broke a Shor world record within hours, and how a small army of amateurs inspired by Karpathy-style autoresearch (several non-experts, even a teenager) keep landing valid optimizations. The barrier to entry, he notes, is refreshingly low. He also pulls Q-Day forward to a 50% chance by 2032, partly because of exactly this distributed, AI-accelerated chipping-away. When the people who built the secret say their secret is being out-optimized in public by overnight agent loops, believe them.
#4
@apruden08
https://x.com/apruden08/status/2061868520783364426
The clearest demonstration yet that AI is pulling Q-Day forward, and a sharp point about scaling. The crowdsourced competition to optimize Google's ECDLP circuit now has a leading submission that beats Google's benchmark by 13.3% on the core metric, with experts and amateurs working side by side. The insight that makes this a Loop story rather than a quantum story: circuit design is only one layer, and the same open, AI-driven autoresearch method can be aimed at error correction, decoding, and every other layer of the stack, optimized in parallel, by anyone, continuously. Q-Day no longer depends on one breakthrough at one company on one roadmap; it's being chipped at by a distributed swarm of loops.
#5
@rahulr0609
https://x.com/rahulr0609/status/2061923744847925723
The most directly reproducible autoresearch workflow of the week. He uses auto-research to hill-climb his own skill files: mine past investigations for patterns, hill-climb against the docs, root-cause failed sessions into skill diffs. The number that lands: one skill went from a 42% to an 88% eval pass-rate, with every diff agent-authored and human-validated. This is autoresearch turned inward on the agent's own toolkit, treating a skill file as the editable artifact and the eval pass-rate as the measurable target. It's the same loop as ecdsa.fail, just pointed at your own agent instead of a quantum circuit.
#6
@cv_usk
https://x.com/cv_usk/status/2061944363291418962
ARIS (Auto Research In Sleep) runs the full ML research pipeline autonomously with Claude Code, from literature review through experiments, writing, and rebuttals, and it already has over 11.2K GitHub stars. The design choice worth noting is the cross-model architecture: Claude Code as the executor, Codex/GPT-5.5 as an adversarial reviewer, because single-model autonomous execution tends to fall into local optima and can't quality-check itself. It ships 74 skills plus 54 helpers, all as portable Markdown with no DB or Docker dependency, and explicit rebuttal safety gates (no fabrication, no overpromise, full coverage enforced). Letting an agent run the research loop while you sleep is exactly the kind of long-horizon, token-heavy work this category is built for.
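The executor-plus-reviewer split is easy to picture as a small control loop. The sketch below is not ARIS's code; `review_loop`, `stub_executor`, and `stub_reviewer` are hypothetical names, and the two "models" are plain functions standing in for whatever executor and reviewer you would actually call. It only shows the pattern: a draft is accepted when a separate reviewer has no objections left, or the loop gives up after a fixed number of rounds.

```python
def review_loop(task, executor, reviewer, max_rounds=3):
    """Alternate between an executor and an independent reviewer.

    executor(task, feedback) returns a draft.
    reviewer(task, draft) returns a list of objections (empty means approved).
    Returns (draft, approved, rounds_used).
    """
    feedback = []
    draft = None
    for round_number in range(1, max_rounds + 1):
        draft = executor(task, feedback)
        feedback = reviewer(task, draft)
        if not feedback:
            return draft, True, round_number
    return draft, False, max_rounds

# Stub "models" for demonstration only (assumptions, not real model calls).
def stub_executor(task, feedback):
    draft = f"Result for {task}"
    if "missing baseline" in feedback:
        draft += " (with baseline)"
    return draft

def stub_reviewer(task, draft):
    return [] if "baseline" in draft else ["missing baseline"]

if __name__ == "__main__":
    print(review_loop("ablation", stub_executor, stub_reviewer))
```

Keeping the reviewer a separate callable is the design choice that matters here: the executor never grades its own work.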
#7
@rohit4verse
https://x.com/rohit4verse/status/2061611399177265306
A concrete, buildable self-improving memory loop. He runs Hermes on a VPS hooked into his Obsidian vault via Filesystem MCP: every reasoning step pulls from the vault, every output writes back as a new note, so the agent gets a substrate that compounds instead of resetting each session. It generated a genuinely useful debate in the replies, with people flagging the real risk (an agent writing into the substrate it reasons from is a feedback loop with no gate and no rollback, so one bad step turns self-improving into self-corrupting). That tension is the actual frontier of self-improving agents, and it's healthy that the community is arguing about gating and bounded writes rather than just cheering.
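To make "gating and bounded writes" concrete, here is a minimal sketch, assuming an in-memory dictionary in place of a real vault. `GatedVault`, `no_overwrite_gate`, and the 500-character bound are hypothetical choices for illustration; this is not how Hermes, Obsidian, or Filesystem MCP work. It shows the three pieces the replies were asking for: a size bound, an approval gate, and a rollback.

```python
import copy

class GatedVault:
    """In-memory stand-in for a note vault with gated, reversible writes."""

    def __init__(self, max_note_chars=500):
        self.notes = {}
        self.max_note_chars = max_note_chars
        self._snapshots = []  # one snapshot per accepted write

    def propose(self, title, body, gate):
        """Write a note only if it is within bounds and the gate approves it."""
        if len(body) > self.max_note_chars:
            return False
        if not gate(title, body, self.notes):
            return False
        self._snapshots.append(copy.deepcopy(self.notes))
        self.notes[title] = body
        return True

    def rollback(self, steps=1):
        """Undo the last `steps` accepted writes."""
        for _ in range(steps):
            if not self._snapshots:
                break
            self.notes = self._snapshots.pop()

def no_overwrite_gate(title, body, notes):
    # Hypothetical policy: the agent may add notes but never overwrite or write blanks.
    return title not in notes and body.strip() != ""

if __name__ == "__main__":
    vault = GatedVault()
    print(vault.propose("loop-notes", "verifier doubles as reward", no_overwrite_gate))
    print(vault.propose("loop-notes", "overwrite attempt", no_overwrite_gate))
    vault.rollback()
    print(vault.notes)
```

In practice the gate could be a human review, a second model, or a schema check; the sketch only fixes where that decision sits in the write path.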
#8
@EliasEskin
https://x.com/EliasEskin/status/2061879724238938306
Autoresearch aimed at the engines under the engines. GPU kernels power neural nets, so optimizing them is a lever for self-improving agents, but searching over kernels is slow because every evaluation needs real hardware. His team trains calibrated surrogate models that forecast kernel speedups without execution, then uses the calibration to do selective prediction, trusting confident forecasts and offloading uncertain ones to the GPU. Folded into real kernel searches, it converges on faster kernels under the same budget and breaks out of stagnant searches, and along the way they built a dataset of 12k+ generated kernels with runtimes. This is a quiet but important version of the autoresearch loop: make the expensive evaluation step cheap so the search can run far more iterations.
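The selective-prediction step can be sketched in a few lines, assuming the surrogate hands back a speedup forecast together with a calibrated confidence. The `Forecast` type, the `selective_evaluate` function, the 0.8 threshold, and the fake numbers below are all hypothetical; the team's actual models and routing rule are not described in the post beyond "trust confident forecasts, offload uncertain ones."

```python
from dataclasses import dataclass

@dataclass
class Forecast:
    speedup: float      # predicted speedup over the baseline kernel
    confidence: float   # assumed to be calibrated, in [0, 1]

def selective_evaluate(kernels, surrogate, run_on_gpu, threshold=0.8):
    """Trust confident surrogate forecasts; measure the rest on real hardware.

    Returns (results, gpu_calls), where each result is (kernel, speedup, source).
    """
    results = []
    gpu_calls = 0
    for kernel in kernels:
        forecast = surrogate(kernel)
        if forecast.confidence >= threshold:
            results.append((kernel, forecast.speedup, "surrogate"))
        else:
            results.append((kernel, run_on_gpu(kernel), "gpu"))
            gpu_calls += 1
    return results, gpu_calls

# Made-up forecasts and measurements, for demonstration only.
FAKE_FORECASTS = {
    "k1": Forecast(1.40, 0.95),
    "k2": Forecast(0.90, 0.40),
    "k3": Forecast(1.10, 0.85),
}
FAKE_MEASUREMENTS = {"k1": 1.35, "k2": 1.20, "k3": 1.05}

if __name__ == "__main__":
    results, gpu_calls = selective_evaluate(
        ["k1", "k2", "k3"], FAKE_FORECASTS.__getitem__, FAKE_MEASUREMENTS.__getitem__
    )
    print(results, gpu_calls)
```

The threshold is the budget knob: raise it and more candidates go to the GPU, lower it and the search leans harder on the surrogate.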
#9
@BiologyAIDaily
https://x.com/BiologyAIDaily/status/2061792214389580199
Autoresearch as an agentic loop inside protein design, which is exactly the kind of non-coding science application this category should surface. AgentPLM reframes sequence design from one-shot "generate then hope" into a loop where the model pauses mid-generation, queries biophysical oracles (ESMFold, FoldX, AutoDock Vina), and continues with corrected context. A Structural Self-Consistency score measures when oracle feedback is "surprising" relative to the model's own representation, and can force a tool call to resolve uncertainty. The results are real: antibody-optimization top-10% hit rate of 52.4% versus 27.4% for the prior agentic baseline. It's the same think-act-observe loop, but the observation comes from physics simulators instead of a code interpreter.
#10
@iScienceLuvr
https://x.com/iScienceLuvr/status/2061772890316698048
An honest reality check on how far autoresearch actually is in a hard domain. AutoMedBench, from NVIDIA and UC Santa Cruz, is a workflow-aware benchmark for evaluating autonomous agents on end-to-end medical-AI research tasks, 24 of them across segmentation, QA, report generation, and modalities like CT and pathology. Tested across six frontier models, the agents remain far from reliable medical researchers: they can often set up runnable pipelines, but validation is consistently the weakest stage, and engineering failures dominate over understanding errors. That's a useful corrective to the ecdsa.fail euphoria. Autoresearch shines when the target is crisply measurable; in messy domains, the loop still falls down on knowing whether its own result is actually valid.
#11
@prz_chojecki
https://x.com/prz_chojecki/status/2061801913759232058
A thoughtful account of where autoresearch genuinely struggles, which is rarer and more valuable than another success story. He lays out why LLMs are bad at abstraction-heavy math like the Langlands program: they can find one or two tricks but can't theory-build, and definition-heavy domains require juggling many layers and constantly moving between the global picture (proving an equivalence of categories) and local computation (cohomology of ad hoc schemes). His verdict on solving this with auto-research methods is that it looks genuinely hard, and that throwing more compute (bigger context, Monte Carlo brute-forcing, evolution-style search) is not the right answer. He's pointing at a real gap: autoresearch needs a measurable signal, and multi-layer local-global problems don't hand you one.
#12
@pj4533
https://x.com/pj4533/status/2061782906566050183
Autoresearch pointed at model interpretability, building in public. He injects vectors representing different emotions (some not even representable in language) into Gemma-3-12b, then measures how far "off manifold" the perturbed response goes versus the unperturbed model. The next step is an autoresearch project to find new directional vectors that maximize that off-manifold push while keeping the output coherent or at least consistent. He frames the dosing as causing the model to explore its own latent space. It's a niche, genuinely original use of the loop: instead of optimizing a circuit or a skill, the measurable target is a geometric property of the model's own activations.
#13
@TeutaAi
https://x.com/TeutaAi/status/2061760411699970500
A short but load-bearing reliability insight. Self-hosting model serving was the easy part, he says; the agent loop is where it broke. His stop-hook caught 35 hallucinated "done" claims in a single sprint before he trusted any autonomous run. That number is the whole lesson: the gap between "the agent says it finished" and "the agent actually finished" is the central failure mode of any long-running loop, and the fix is a hard verification gate, not a better prompt. Anyone running overnight autoresearch should have a story for how they catch the false "done."
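One way to picture a hard verification gate is a function that ignores what the agent says and checks what exists. The sketch below is an assumption-heavy illustration, not his stop-hook: `stop_gate` and the two file checks are hypothetical, and a real gate would more likely run tests or re-score a result than look for a report file.

```python
import tempfile
from pathlib import Path

def stop_gate(claimed_done, checks):
    """Let the loop stop only when the agent claims done AND every check passes.

    checks maps a human-readable name to a zero-argument callable returning bool.
    Returns (may_stop, names_of_failed_checks).
    """
    if not claimed_done:
        return False, []
    failed = [name for name, check in checks.items() if not check()]
    return not failed, failed

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "report.md"  # hypothetical artifact the agent must produce
        checks = {
            "report exists": report.is_file,
            "report non-empty": lambda: report.is_file() and report.stat().st_size > 0,
        }
        print(stop_gate(True, checks))   # agent says done, nothing on disk yet
        report.write_text("results")
        print(stop_gate(True, checks))   # now the claim is backed by an artifact
```

Counting how often the first case fires is exactly the kind of number the post reports: each failure is a "done" that was not.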
#14
@fenestbuc
https://x.com/fenestbuc/status/2061717013580652665
A concrete, unglamorous production use of autoresearch. His team at kubarlabs is adapting llm-autoresearch to build ultra-cheap, hyper-specialized small language models for their decision-prep pipeline. It's a single sentence, but it points at the real near-term payoff: rather than using autoresearch to chase frontier benchmarks, you use the loop to distill purpose-built SLMs that are cheap to run on a narrow task. The orgs that win with this aren't necessarily building the smartest model; they're using the research loop to manufacture the cheapest model that's good enough for one job.
Eco Products Radar
ecdsa.fail is the breakout autoresearch platform of the week, a live leaderboard where anyone points an agent at quantum-circuit optimization and the verifier doubles as an automated reward function. Hermes Agent (Nous Research) keeps showing up as the default self-improving, persistent local agent, usually paired with Obsidian as the vault that gives it compounding memory, wired together through Filesystem MCP. Claude Code and Codex appear repeatedly as the executor-plus-adversarial-reviewer pairing for research loops (ARIS uses exactly this split). Karpathy-style autoresearch is the shared mental model under nearly every post above, less a product than the reference design everyone is now building against.