Loop Daily: June 8, 2026
The autoresearch crowd spent the day proving that the interesting unit of work is no longer a single answer but a loop that runs for hours. The sharpest examples weren't demos; they were people pointing token-hungry agent swarms at real problems: training thousands of trading models in a market simulator, reproducing and beating fresh CVPR papers overnight, retro-hunting a four-year-old cryptography bug. Underneath the hype, a real debate is forming about where autoresearch actually works (anything the loop can self-validate) and where it falls apart (judgment-heavy, open-ended work). And a recurring theme: the harness, not the model, is doing most of the lifting, and most of the spending.
#1
@sterlingcrispin
https://x.com/sterlingcrispin/status/2063312130271797569
Maybe the cleanest "money loop" of the day. He's near a flywheel where $1 of tokens yields more than $1 of profit in algorithmic trading, and crucially, the LLMs don't make trades. They run agentic autoresearch swarms that train models and run evals inside a market simulator over terabytes of data. The bottleneck he names is honest and specific: compute to train thousands of time-series models, and the tokens to run the researcher agents. It's a concrete picture of autoresearch as an industrial process, not a chatbot trick: burn tokens to discover and validate strategies, and let the simulator be the judge.
#2
@AutoSOTA11
https://x.com/AutoSOTA11/status/2063351470108352683
This is autoresearch eating live academic work. Following a fresh CVPR paper on open-world 3D reasoning segmentation, AutoSOTA reproduced it and pushed mIoU to 77.86%, a +7.1% improvement, using hybrid SAM boundary refinement with dilated and eroded mask constraints. The account is doing this repeatedly against new papers: an autonomous loop that reads a result, reimplements it, and then searches for a concrete improvement. Whether or not every claim holds up to outside replication, the workflow itself is the signal: the gap between "paper published" and "paper extended" is collapsing to a single overnight run.
#3
@cv_usk
https://x.com/cv_usk/status/2063126990933172569
A genuinely useful benchmark for this whole category. AUTOLAB tests whether frontier models can sustain iterative optimization for 2 to 12 hours: 36 tasks across 17 models and 3 trials, 1,152 runs totaling 2,544 wall-clock hours and 8.6 billion tokens. The headline finding is that long-horizon optimization is a different skill from one-shot coding: success is driven by persistence in iterating, not initial solution quality. claude-opus-4.6 dominated with a 0.93 win rate and, on one Flash Attention task, hit a 42.4x speedup through 44 feedback-driven iterations. The kicker for anyone building loops: harness choice alone swung the same model's score by up to 0.43.
#4
@AntFleetDev
https://x.com/AntFleetDev/status/2063170129593262239
An honest receipt for agentic security review. After Taylor Hornby disclosed a four-year-old counterfeiting bug in Zcash Orchard (caught using Opus 4.8 plus a custom audit harness), AntFleet re-ran its own pipeline blind against the 2021 commit that introduced it. Their generalist gate (two frontier models, surfacing only findings both agree on) missed the exact defect but flagged adjacent soundness issues. Then, with a thin 50-line domain-context block prepended, GPT-5 hit the defect class in ~140 seconds for under a dollar. Their takeaway is the most useful part: domain priors compound, and a unanimous AND-gate is right for PR-time noise control but wrong for deep targeted audits.
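The gate trade-off is easy to see in a toy sketch. The finding labels and the `and_gate` / `or_gate` helpers below are hypothetical illustrations, not AntFleet's pipeline; they only show why requiring agreement trims noise but can drop a true finding that one reviewer missed.

```python
from typing import Set

def and_gate(findings_a: Set[str], findings_b: Set[str]) -> Set[str]:
    """Surface only findings both reviewers report (low noise, lower recall)."""
    return findings_a & findings_b

def or_gate(findings_a: Set[str], findings_b: Set[str]) -> Set[str]:
    """Surface anything either reviewer reports (noisier, higher recall)."""
    return findings_a | findings_b

# Hypothetical outputs from two independent model reviewers.
model_a = {"unchecked-range", "missing-constraint"}
model_b = {"unchecked-range", "weak-randomness"}

pr_time = sorted(and_gate(model_a, model_b))     # what a PR-time gate would show
deep_audit = sorted(or_gate(model_a, model_b))   # what a targeted audit would triage

print(pr_time)
print(deep_audit)
```

If the real defect were the one only `model_a` flagged, the AND-gate would hide it, which matches the failure described above.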
#5
@topher_gabriel
https://x.com/topher_gabriel/status/2063376028064714863
A clear picture of weekend autoresearch on owned hardware. He runs long research experiments (model training plus autoresearch loops) over the weekend on an NVIDIA Thor, working on AGI substrates. His frustration is instructive: he switched from Claude to Codex because, paying $200/month, he wanted a workhorse that keeps grinding rather than one that says "let's pick this up in the morning." For genuinely long-horizon autonomous runs, the personality that just keeps iterating matters as much as raw quality. A real researcher's view of what "agent that doesn't quit" actually requires.
#6
@SinaShahandeh
https://x.com/SinaShahandeh/status/2063218279548617177
A grounding counterpoint from someone who shipped autoresearch in a real medical-device company. At Radicait they built an ML system for cancer diagnostics, and his point is that in the practical sciences the real bottlenecks go beyond what a clean autoresearch hill-climb can address. Regulatory constraints, messy data, and validation against physical reality don't reduce to a metric you can optimize overnight. It's the most useful kind of skepticism: not "autoresearch is hype" but "here's where the clean loop meets the wall." Worth reading against the day's flood of frictionless overnight-miracle threads.
#7
@heisCo_ok
https://x.com/heisCo_ok/status/2063235195839348799
A dense weekly digest of the research actually powering self-improving agents. Four papers stand out: OPUS scores training data by its usefulness in the optimizer's update space and hit stronger results with 30B tokens than some 200B-token runs; SkillOpt treats an agent's skill document as trainable memory, accepting only edits that improve validation and delivering +20 points on GPT-5.5; ECHO has terminal agents predict environment observations to roughly double pass@1 on TerminalBench-2.0; and CPT lets parallel reasoning branches share discoveries instead of duplicating work. Together they sketch where the field is heading: data-efficient training and agents that improve without costly retraining.
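The "accept only edits that improve validation" rule is the part most worth copying. Below is a minimal sketch of that acceptance rule, assuming a greedy loop over candidate edits; `optimize_skill` and `toy_validate` are hypothetical names, and the toy validator stands in for a real validation suite. It is not SkillOpt's actual implementation.

```python
from typing import Callable, List

def optimize_skill(
    skill: str,
    candidate_edits: List[str],
    validate: Callable[[str], float],
) -> str:
    """Greedy loop: keep an edit only if the validation score strictly improves."""
    best_score = validate(skill)
    for edit in candidate_edits:
        candidate = skill + "\n" + edit
        score = validate(candidate)
        if score > best_score:  # neutral or harmful edits are discarded
            skill, best_score = candidate, score
    return skill

def toy_validate(doc: str) -> float:
    """Toy validator (an assumption): rewards mentioning tests, penalizes length."""
    return doc.count("tests") * 10 - len(doc) * 0.01

final_skill = optimize_skill(
    "Run the build.",
    ["Always run tests first.", "Remember to be nice.", "Re-run tests after edits."],
    toy_validate,
)
print(final_skill)
```

The middle edit adds length without improving the score, so it never enters the skill document; the document only grows when the metric does.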
#8
@curonianai
https://x.com/curonianai/status/2063354044597289396
A sharp breakdown of MLEvolve out of Shanghai AI Lab, which fixes the three dumbest parts of today's "self-improving" agents. First, the agents leave notes for each other, so one hitting a wall doesn't make the rest waste a turn on it. Second, they get a memory of past wins and flops instead of starting cold every run. Third, the job is split (one plans, one codes), and it chooses between a tiny patch and a full rewrite instead of nuking the file by reflex. It reportedly hit top benchmark scores in half the usual time and beat DeepMind's AlphaEvolve on math it wasn't built for. The author's honest caveat: the lab is grading its own homework, so wait for outside replication.
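The first fix, agents leaving notes for each other, can be sketched in a few lines. This is a toy under stated assumptions: `SharedNotes` and `run_agent` are hypothetical, the agents run one after another rather than in parallel, and nothing here reflects MLEvolve's real code.

```python
from typing import List, Set

class SharedNotes:
    """Hypothetical shared board where agents record dead ends for their peers."""

    def __init__(self) -> None:
        self.dead_ends: Set[str] = set()

    def mark_dead_end(self, idea: str) -> None:
        self.dead_ends.add(idea)

    def is_dead_end(self, idea: str) -> bool:
        return idea in self.dead_ends

def run_agent(ideas: List[str], notes: SharedNotes, failing: Set[str]) -> List[str]:
    """Try ideas in order, skipping any a peer already flagged; return what was attempted."""
    attempted = []
    for idea in ideas:
        if notes.is_dead_end(idea):
            continue  # a peer already hit this wall, so don't spend a turn on it
        attempted.append(idea)
        if idea in failing:
            notes.mark_dead_end(idea)
    return attempted

notes = SharedNotes()
failing = {"bigger-lr"}  # ideas that fail in this toy world
agent_a = run_agent(["bigger-lr", "augment"], notes, failing)
agent_b = run_agent(["bigger-lr", "ensemble"], notes, failing)
print(agent_a, agent_b)
```

The second agent never retries the idea the first one already marked as a dead end.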
#9
@cv_usk
https://x.com/cv_usk/status/2063409603690250543
The day's best engineering discipline post: don't hand your whole business workflow to an LLM. Fix the skeleton in code as a DAG or state machine, then make each node swappable between deterministic code, a single LLM call, or a mini-agent, injecting probabilistic flexibility only where it's genuinely needed. When an agent autonomously drives an entire flow, failures compound: steps get skipped, schema mismatches cascade, and the model decides to "investigate further" in an infinite loop. His rule of thumb: use an agentic loop only when the LLM needs to discover the steps itself; otherwise the backbone stays testable and auditable. Think Airflow or Temporal step design applied to agents.
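A minimal sketch of that skeleton, assuming a linear chain (the simplest DAG) where every node shares one signature so it can be swapped. All function names are hypothetical, and the LLM step is stubbed so the example runs offline.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]  # every step takes and returns a state dict

def parse_order(state: State) -> State:
    """Deterministic node: plain code, fully unit-testable."""
    return {**state, "items": str(state["raw"]).split(",")}

def summarize_stub(state: State) -> State:
    """Stand-in for a single LLM call; swap in a real client here if needed."""
    items = state["items"]
    return {**state, "summary": f"{len(items)} items ordered"}

def check_schema(state: State) -> State:
    """Deterministic guard so a bad node output fails loudly, not three steps later."""
    for key in ("items", "summary"):
        if key not in state:
            raise ValueError(f"missing key: {key}")
    return state

# The backbone is fixed in code; only individual nodes are probabilistic.
PIPELINE: List[Node] = [parse_order, summarize_stub, check_schema]

def run(pipeline: List[Node], state: State) -> State:
    for node in pipeline:
        state = node(state)
    return state

result = run(PIPELINE, {"raw": "apple,pear,plum"})
print(result["summary"])
```

Because the order of steps lives in `PIPELINE` rather than in a prompt, a step cannot be skipped, and replacing `summarize_stub` with a real model call does not change how the flow is tested.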
#10
@kirako0o
https://x.com/kirako0o/status/2063331030199832945
A clean explanation of what actually makes a loop "self-improving." After a task, Hermes writes a skill file from what it learned, so the next session already knows how to do that job better. The architecture is three-tier memory (session context, cross-session patterns, permanent knowledge), which he notes is the loop big tech built whole teams around in the 90s. GEPA optimization has the agent critique its own output, grade the approach, and revise before handing you the result, so you never get the first draft. The same config runs across parallel agents pulling from one shared memory layer. The framing that lands: this runs like a company org chart with zero payroll.
#11
@GoldRayson
https://x.com/GoldRayson/status/2063051333087695001
The shadow side of agent loops, and a real one. They found a $6,800/month runaway agent loop that nobody knew about: a company running 12 agents, one stuck in an infinite loop, unnoticed for weeks. It's the natural failure mode of "set it and forget it" autonomy: a loop with no exit condition and no cost ceiling will happily burn money forever. The lesson for anyone shipping autoresearch or 24/7 agents is that budgets, kill-switches, and observability are not optional features; they're the difference between a flywheel and a leak.
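A cost ceiling and a step cap take only a few lines. The sketch below assumes each loop step reports whether it is done and what it cost; `run_with_guards`, `BudgetExceeded`, and the per-step cost are hypothetical, not taken from the post.

```python
from typing import Callable, Tuple

class BudgetExceeded(RuntimeError):
    """Raised when a loop hits its cost ceiling or step cap."""

def run_with_guards(
    step: Callable[[], Tuple[bool, float]],
    max_steps: int,
    max_cost_usd: float,
) -> Tuple[int, float]:
    """Run step() until it reports done; abort at the cost ceiling or step cap."""
    spent = 0.0
    for i in range(max_steps):
        done, cost = step()
        spent += cost
        if spent > max_cost_usd:
            raise BudgetExceeded(f"cost ceiling hit after {i + 1} steps (${spent:.2f})")
        if done:
            return i + 1, spent
    raise BudgetExceeded(f"step cap of {max_steps} reached (${spent:.2f})")

def stuck_step() -> Tuple[bool, float]:
    """Simulates an agent that never finishes and costs $0.50 per turn."""
    return False, 0.5

try:
    run_with_guards(stuck_step, max_steps=100, max_cost_usd=2.0)
    stopped_reason = None
except BudgetExceeded as exc:
    stopped_reason = str(exc)

print(stopped_reason)
```

In production the exception would also page someone or write to a dashboard, which covers the observability part of the lesson.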
#12
@jakkbtc
https://x.com/jakkbtc/status/2063096722939605057
A working teacher/student self-improving stack, built outside a big lab. A teacher agent generates rank-appropriate challenges and scores responses 0-100; a student starts with minimal knowledge and only accumulates lessons it earns (90+ to advance); mastery-based progression reverts it if it's promoted too fast. Five agents run in parallel across disciplines, all autonomous and logged. The point they're making is that the idea isn't novel (adversarial agent training and teacher/student loops are what the big labs tout), but they built it on open APIs and commodity hardware for an offensive-security operation. Self-improving loops don't require a $10B compute budget to be useful.
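The progression rule can be sketched as a small state update. The 90+ pass mark comes from the post; the demotion threshold of 50 and the `progress` function are assumptions made for illustration, since the post does not say how "promoted too fast" is detected.

```python
from typing import List, Tuple

PASS_SCORE = 90    # from the post: 90+ to advance
DEMOTE_SCORE = 50  # assumption: the post gives no demotion threshold

def progress(scores: List[int], start_rank: int = 0) -> Tuple[int, List[Tuple[int, int]]]:
    """Walk teacher scores in order; return the final rank and the lessons earned."""
    rank = start_rank
    lessons: List[Tuple[int, int]] = []  # (rank at which it was earned, score)
    for score in scores:
        if score >= PASS_SCORE:
            lessons.append((rank, score))  # only earned lessons are kept
            rank += 1
        elif score < DEMOTE_SCORE and rank > start_rank:
            rank -= 1  # promoted too fast: step back one rank
    return rank, lessons

final_rank, earned = progress([95, 92, 40, 91])
print(final_rank, earned)
```

Here the student advances twice, fails badly at the new rank, drops back one level, and then re-earns the promotion.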
#13
@kreoxi
https://x.com/kreoxi/status/2063179739968205053
A fully local self-improving agent worth watching. Caitlin runs entirely on his own PC (local LLMs on an RTX 3080, nothing leaves the machine) with vector plus Obsidian-vault memory, screen and image vision, and the ability to learn AI trends from his X feed. The self-improving part is gated correctly: she proposes her own upgrades and he approves them. It's a grounded version of the autonomy everyone's chasing (local, private, human-in-the-loop on the upgrades) rather than a cloud agent with full permissions and no brakes.
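The approval gate is a simple pattern: the agent may propose anything, but only approved proposals are applied. A minimal sketch follows, where `apply_upgrades` is a hypothetical name and the lambda stands in for an interactive human prompt; it is not how Caitlin is actually built.

```python
from typing import Callable, List, Tuple

def apply_upgrades(
    proposals: List[str],
    approve: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split proposals into (applied, rejected) based on the human approver."""
    applied: List[str] = []
    rejected: List[str] = []
    for proposal in proposals:
        if approve(proposal):
            applied.append(proposal)   # only approved changes take effect
        else:
            rejected.append(proposal)  # kept for the audit trail
    return applied, rejected

# Stand-in for a human decision; a real gate would prompt the user.
human = lambda proposal: "shell" not in proposal

applied, rejected = apply_upgrades(
    ["add RSS reader", "grant shell access"],
    human,
)
print(applied, rejected)
```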
#14
@alokbishoyi97
https://x.com/alokbishoyi97/status/2063286594438918342
A concrete description of how a modern autoresearch loop is actually wired. The agent being optimized is a Qwen3.6 setup with a Hermes scaffold; the driver of the EVO loop is Codex running GPT-5.5. You point EVO at the task you want the agent to get better at, and it sets up a baseline and datasets, then does autoresearch (improving skills or fine-tuning the model) to push the metrics up. It's a clean separation of roles: one model drives the search, another agent is the thing being improved. A useful mental model for anyone trying to turn "make my agent better" into an actual measurable loop.
Eco Products Radar
Hermes Agent: Nous Research's open-source self-improving agent; writes skill files from experience, three-tier memory, runs 24/7. The most-referenced autoresearch substrate of the day, now with a native desktop release.
EVO / Evolver: autoresearch loop framework that sets up baselines and datasets, then improves an agent by tuning skills or the model itself; people are running it in parallel to optimize Hermes/Qwen setups cheaply.
AutoResearch (Karpathy): the autonomous research framework that pulled 23k+ stars in days; repeatedly cited as the reference point for the whole self-improvement push.
AUTOLAB / Harbor harness: open benchmark and harness for long-horizon (2-12hr) auto-research and engineering tasks; the emerging yardstick for whether a model can actually sustain iteration.
AutoSOTA: autonomous pipeline reproducing and extending fresh CVPR papers to new SOTA; a live demo of research automation against the literature.
Claude Code /goal + /loop: the loop primitives most builders are using to turn one-shot prompts into self-validating, persistent runs; pairs with Codex's /goal for the same pattern.