Loop Daily: 2026-06-09
If yesterday was about agents writing code, today was about agents doing science. The thread running through every good post: stop steering the model, hand it a metric and a budget, and let it run experiments on itself until the number moves. An open-source engine quietly beat a 120B model on a Chinese law benchmark, and the winning solution it found had no LLM in it at all. A protein agent got harder problems for four rounds and grew its data 10x while its code barely changed. And a new crop of self-improving skill systems is turning every session into training data for the next one. Here is what people actually ran.
#1
@TheGodfath13541
https://x.com/TheGodfath13541/status/2063586909977207110
The clearest autoresearch result of the day: the open-source engine evo was handed LawBench (Chinese criminal law, 191 possible charges) with no instructions beyond a system, a definition of 'better', and a budget. A well-funded startup had trained a 120B model and scored 0.701; evo came back at 0.7766. The kicker is that evo first tried the expensive path (multiple LoRA runs on the 120B), saw the gains weren't there, and pruned it. The winning solution it shipped has no LLM at all: it is a lean classical classifier that runs on a laptop. Occam's razor found by search instead of assumed up front, with every experiment public.
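The "metric plus budget, prune what doesn't pay" idea can be sketched in a few lines. This is a toy illustration, not evo's actual algorithm or API: the `search` function, the `patience` rule, the branch names, and all scores below are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each branch maps a name to (cost per experiment, function returning a score).
Branch = Tuple[float, Callable[[], float]]


def search(branches: Dict[str, Branch], budget: float, patience: int = 2):
    """Round-robin over branches; prune any branch that stops improving."""
    best: Tuple[Optional[str], float] = (None, float("-inf"))
    best_in_branch = {name: float("-inf") for name in branches}
    stale = {name: 0 for name in branches}
    active = list(branches)
    log: List[Tuple[str, float]] = []  # every experiment, kept for inspection

    while active:
        for name in list(active):
            cost, run = branches[name]
            if cost > budget:
                active.remove(name)  # cannot afford another run on this branch
                continue
            budget -= cost
            score = run()
            log.append((name, score))
            if score > best_in_branch[name]:
                best_in_branch[name], stale[name] = score, 0
            else:
                stale[name] += 1
            if score > best[1]:
                best = (name, score)
            if stale[name] >= patience:
                active.remove(name)  # prune: no gain for `patience` runs in a row
    return best, log


def replay(scores: List[float]) -> Callable[[], float]:
    """Stand-in for a real experiment: replays fixed toy scores, then repeats the last."""
    it = iter(scores)
    return lambda: next(it, scores[-1])


# Toy numbers only; they are not the LawBench results.
branches = {
    "lora_on_big_model": (5.0, replay([0.50, 0.50, 0.50])),
    "classical_classifier": (1.0, replay([0.40, 0.55, 0.60])),
}
best, log = search(branches, budget=20.0)
print(best)  # ('classical_classifier', 0.6)
```

In this toy run the expensive branch is pruned after three flat results, and the remaining budget goes to the cheap branch. The full `log` is returned so that every experiment stays inspectable.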
#2
@alokbishoyi97
https://x.com/alokbishoyi97/status/2063579673704144973
evo crossed 1,000 GitHub stars, and the builder restated the goal plainly: make evo the easiest and best way to run autoresearch on any codebase you already have. The framing matters: this is autoresearch aimed not at frontier labs but at anyone with a repo and a metric they want to move. The fast adoption (and the LawBench receipts) suggests the 'point it at your own code' pitch is landing.
#3
@AutoSOTA11
https://x.com/AutoSOTA11/status/2063626072453976184
A concrete autoresearch optimization run on top of a fresh CVPR paper (ChordEdit). Using AutoSOTA, they pushed two orthogonal directions: on the algorithm side, cleanup blending plus prompt-similarity auto-tuning; on the systems side, TF32 Tensor Core, FlashAttention-2, and inference_mode. Result: PSNR up from 23.02 to 25.11 (+9.1%), latency down 32.4% to 0.25s/image, with image quality essentially preserved. A clean demonstration that an agent can take a just-published method and immediately squeeze it further.
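For readers checking the numbers, here is a small sketch of the standard PSNR formula plus the arithmetic behind the reported deltas. The helper names are made up for illustration, and the implied baseline latency assumes the 32.4% cut is measured against the original per-image time, which the post does not spell out.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], output: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB over two equal-length pixel sequences."""
    if len(reference) != len(output) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Reported PSNR: 23.02 dB -> 25.11 dB
print(round(relative_change(23.02, 25.11), 1))  # 9.1

# Implied baseline latency if 0.25 s/image is a 32.4% reduction (assumption)
print(round(0.25 / (1 - 0.324), 2))  # 0.37
```

Because PSNR is logarithmic, the +9.1% figure is a relative change on a dB scale. The 2.09 dB gain itself corresponds to roughly 38% lower mean squared error.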
#4
@HeyZohaib
https://x.com/HeyZohaib/status/2063758198658695182
A three-layer self-improving research agent architecture worth copying: layer 1 updates the codebase, MCP, and services; layer 2 is a research orchestrator that spawns and unblocks agents using QA-oriented skill rules; layer 3 does the actual research and pushes results through an in-house MCP. The skills self-improve in batches, but he deliberately keeps a human in the loop for non-obvious behavior changes so the skill library doesn't degenerate into a dump of edge cases.
#5
@omarsar0
https://x.com/omarsar0/status/2063668567447597273
A sharp paper summary on what self-improvement should even optimize for. It distinguishes retrieval vs search vs discovery, and uses category theory to test whether an agent produced genuinely new concepts. Their Builder/Breaker agent studying protein mechanics took on harder proteins over four rounds, growing its data roughly 10x while the model code grew only 1.3x. The argument: compressing more of the world into less code is a better success signal than accuracy alone, because optimizing for accuracy just makes an agent settle on easy benchmarks and stop.
#6
@rohanpaul_ai
https://x.com/rohanpaul_ai/status/2063698758517366884
A reality-check benchmark: the Meta-Agent Challenge (MAC) tests whether AI agents can autonomously build better AI agents across math, science, competitive programming, bug fixing, and terminal tasks. The finding is sobering: current agents usually fail to beat strong human-made agent setups, and the few good results come from frontier models like Claude. The conclusion worth holding onto amid the hype: agents are powerful executors, but not yet self-improving engineers.
#7
@Trace_Cohen
https://x.com/Trace_Cohen/status/2063435099392114879
A small but beautifully closed loop: a self-improving SEO/AEO agent where each prompt-improver run reads the previous run's improvements.md, so it can't repeat fixes and is forced to find new signal. It has already run twice and made seven targeted improvements, all traced to Google Search Console numbers. The plan is to check later whether an FAQPage schema change actually moved CTR, confirming or revising the hypothesis. This is autoresearch applied to marketing, with a real metric and a memory of what it already tried.
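The "read what was already tried, then only keep new fixes" step could look something like the sketch below. The post does not show the real format of improvements.md, so the one-bullet-per-fix layout, the function names, and the example fixes are all assumptions.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def load_tried(log_path: Path) -> Set[str]:
    """Collect previously logged fixes (assumed format: one '- ' bullet per fix)."""
    if not log_path.exists():
        return set()
    tried = set()
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if line.startswith("- "):
            tried.add(line[2:].strip().lower())
    return tried


def record_new(log_path: Path, proposals: Iterable[str]) -> List[str]:
    """Drop proposals already in the log, append the rest, and return them."""
    tried = load_tried(log_path)
    fresh = []
    for proposal in proposals:
        key = proposal.strip().lower()
        if key and key not in tried:
            tried.add(key)
            fresh.append(proposal.strip())
    if fresh:
        with log_path.open("a", encoding="utf-8") as fh:
            for proposal in fresh:
                fh.write(f"- {proposal}\n")
    return fresh


# Hypothetical fixes, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "improvements.md"
    print(record_new(log, ["Add FAQ schema", "Shorten title tags"]))
    print(record_new(log, ["Shorten title tags", "Fix meta descriptions"]))
```

The second call returns only the fix that is not already logged, which is the property that forces each run to look for new signal.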
#8
@yungbose
https://x.com/yungbose/status/2063648136267202910
Shared 'upskill', a file-based system for recursive self-improving agent skills, inspired by Microsoft's SkillOpt paper and Garry Tan's gstack. It reads each run, runs a retro, and folds the improvements back into markdown files in a git-native way, including improving the meta-skill itself. It auto-loads context, stores and evolves prompts and workflows, and works as a skill in any agent harness including Codex. The whole point is making self-improvement repeatable and low-cognitive-load instead of a one-off.
#9
@gauthampai
https://x.com/gauthampai/status/2063579656712823155
A deep technical workflow: a prompt-to-DAG planner and executor that turns a prompt into a declarative plan with deterministic stages (skipping the LLM entirely) and stochastic stages (using it), all typed and persisted so it survives reboots. You can step through stages like a debugger, rerun them, or rewrite the plan on the fly, and it handles fan-out, fan-in, loop-until-done, and approval gates. He applied it to Karpathy's autoresearch project by pointing the agent at the program.md, and it generated the whole plan on the first try.
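To make the "typed, persisted stages" idea concrete, here is a minimal sketch of a plan executor that runs stages in dependency order and saves results after each one, so a restart resumes where it stopped. It is not the author's planner: the `Stage` type, the JSON state file, and the stubbed "stochastic" stage (which would call an LLM in a real system) are assumptions.

```python
import json
import tempfile
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from pathlib import Path
from typing import Any, Callable, Dict, List


@dataclass
class Stage:
    name: str
    kind: str  # label only: "deterministic" (no LLM) or "stochastic" (LLM call)
    run: Callable[[Dict[str, Any]], Any]  # receives the results of its deps
    deps: List[str] = field(default_factory=list)


def execute(stages: List[Stage], state_path: Path) -> Dict[str, Any]:
    """Run stages in topological order, persisting results after every stage."""
    results = json.loads(state_path.read_text()) if state_path.exists() else {}
    by_name = {s.name: s for s in stages}
    order = TopologicalSorter({s.name: set(s.deps) for s in stages}).static_order()
    for name in order:
        if name in results:
            continue  # finished in an earlier run; skip on resume
        stage = by_name[name]
        inputs = {dep: results[dep] for dep in stage.deps}
        results[name] = stage.run(inputs)  # must be JSON-serializable
        state_path.write_text(json.dumps(results))
    return results


plan = [
    Stage("load", "deterministic", lambda _: [3, 1, 2]),
    Stage("sort", "deterministic", lambda r: sorted(r["load"]), deps=["load"]),
    # Stub standing in for an LLM-backed stage.
    Stage("summarize", "stochastic", lambda r: f"{len(r['sort'])} items", deps=["sort"]),
]

with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "plan_state.json"
    print(execute(plan, state))
    print(execute(plan, state))  # second call reads the saved state, reruns nothing
```

Features the post mentions, such as fan-out, loop-until-done, approval gates, and rewriting the plan mid-run, are out of scope for this sketch.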
#10
@ViceSol
https://x.com/ViceSol/status/2063576473416405147
A walkthrough of someone's 'JARVIS' pipeline that turns a 3am idea into a shipped project overnight. Six stages, only one of them human: capture a raw note, classify it (project/task/idea/reference), route it five ways, then auto-research it (WebSearch x4, WebFetch x2, findings and sources logged), pause at a single human approval gate, and finally execute: a PM agent spawns research/build/test/deploy/review workers on different models with echo-chamber prevention. The whole thing runs while he sleeps; the trick is the layer between the idea and the execution that doesn't wait on you.
#11
@DimitrisPapail
https://x.com/DimitrisPapail/status/2063646403562213532
A power user's feature request that doubles as a usage report: he uses Codex a lot for autoresearch, but says the model is bad at the last mile: telling the story of how the final solution was reached. He wants a companion writer model that pulls the whole experimental trajectory together into a coherent narrative. It's a real gap: when the agent runs hundreds of experiments, the human still needs to understand why the winner won.
#12
@cv_usk
https://x.com/cv_usk/status/2063771991404933140
A detailed pattern doc for the autonomous agent loop, centered on budgets. Build a ReAct-style observe-think-act loop with three budget dimensions (step count, token cost, and wall-clock time), and terminate on completion, budget exhaustion, or stuck detection, with enforcement in code rather than the LLM's self-report. Key moves: inject remaining budget into the system prompt so the model can decide to summarize when it's nearly out, never return empty on exhaustion (give a partial-result fallback), and compress history to fight context bloat. The unglamorous engineering that makes long loops safe.
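A minimal sketch of that pattern, assuming a caller-supplied `step` function that wraps the model call. The step signature, the repeated-observation stuck rule, and the truncation-based history compression are illustrative choices, not the pattern doc's exact design.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    max_seconds: float


# Hypothetical step signature: (budget note, recent history) ->
# (observation, tokens used, final answer or None if not done yet).
StepFn = Callable[[str, List[str]], Tuple[str, int, Optional[str]]]


def compress(observations: List[str], keep_last: int) -> List[str]:
    """Keep only the most recent observations, with a marker for the rest."""
    if len(observations) <= keep_last:
        return list(observations)
    dropped = len(observations) - keep_last
    return [f"[{dropped} earlier steps omitted]"] + observations[-keep_last:]


def partial_result(observations: List[str]) -> str:
    """Never return empty: fall back to the latest observation."""
    return observations[-1] if observations else "No progress before the budget ran out."


def run_loop(step: StepFn, budget: Budget, stuck_after: int = 3,
             keep_last: int = 5) -> Tuple[str, str]:
    start = time.monotonic()
    observations: List[str] = []
    tokens = 0
    streak = 0  # consecutive identical observations
    for n in range(budget.max_steps):
        elapsed = time.monotonic() - start
        # Budgets are enforced here, in code, not by asking the model.
        if tokens >= budget.max_tokens or elapsed >= budget.max_seconds:
            break
        note = (f"Remaining: {budget.max_steps - n} steps, "
                f"{budget.max_tokens - tokens} tokens, "
                f"{budget.max_seconds - elapsed:.0f}s.")
        obs, used, answer = step(note, compress(observations, keep_last))
        tokens += used
        if answer is not None:
            return "completed", answer
        streak = streak + 1 if observations and obs == observations[-1] else 1
        observations.append(obs)
        if streak >= stuck_after:
            return "stuck", partial_result(observations)
    return "budget_exhausted", partial_result(observations)


# A step that never makes progress is caught by stuck detection.
print(run_loop(lambda note, hist: ("same page again", 10, None),
               Budget(max_steps=10, max_tokens=1000, max_seconds=60)))
# ('stuck', 'same page again')
```

Because the checks run before each step, a single step can still overshoot the token or time budget; a stricter version would also cap tokens per call.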
#13
@SolJuvan
https://x.com/SolJuvan/status/2063753798711931109
A self-improving 'AI brain': an autonomous Hermes agent running 24/7 on a VPS, permanently wired into a personal Obsidian vault via Filesystem MCP. Before it reasons it pulls context from the vault; every output it produces gets written back as new notes. That closed feedback loop means the more it's used, the smarter and more personalized it gets, with permanent memory living in plain files. The cheap, durable version of a personal model that keeps learning you.
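The read-before-reasoning, write-after-answering loop can be sketched with plain files. This is a stand-in only: it uses direct file access rather than the Filesystem MCP from the post, the word-overlap recall is a deliberately naive placeholder, and `think` is a hypothetical callback where the agent's model call would go.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def recall(vault: Path, query: str, limit: int = 3) -> List[str]:
    """Return up to `limit` notes that share words with the query (naive match)."""
    if not vault.is_dir():
        return []
    words = set(query.lower().split())
    scored = []
    for note in sorted(vault.glob("*.md")):
        text = note.read_text(encoding="utf-8")
        overlap = len(words & set(text.lower().split()))
        if overlap:
            scored.append((overlap, note.name, text))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:limit]]


def remember(vault: Path, body: str) -> Path:
    """Write a new note into the vault under a sequential file name."""
    vault.mkdir(parents=True, exist_ok=True)
    index = len(list(vault.glob("*.md"))) + 1
    path = vault / f"note-{index:04d}.md"
    path.write_text(body, encoding="utf-8")
    return path


def answer_with_memory(vault: Path, question: str,
                       think: Callable[[str, List[str]], str]) -> str:
    """Pull context from the vault, produce an output, write it back as a note."""
    context = recall(vault, question)
    output = think(question, context)
    remember(vault, f"Q: {question}\nA: {output}\n")
    return output


# Stub "model" that only reports how much context it was given.
def stub(question: str, context: List[str]) -> str:
    return f"answered with {len(context)} prior notes"


with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp) / "vault"
    print(answer_with_memory(vault, "what is evo", stub))  # answered with 0 prior notes
    print(answer_with_memory(vault, "what is evo", stub))  # answered with 1 prior notes
```

The second call sees the note written by the first, which is the whole feedback loop in miniature.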
#14
@nateberkopec
https://x.com/nateberkopec/status/2063731591650979971
A clarifying take on what 'loop' even means with LLMs: stop babysitting the model and build a non-interactive AI application instead. Loops can be simple (ralph, autoresearch) or complex, but the assignment is always the same: 'build the thing that builds the thing.' It's the cleanest one-line framing of why agentic loops matter: the goal isn't a better chat, it's removing yourself from the inner loop entirely.
Eco Products Radar
Tools and projects mentioned three or more times across today's loop posts.
evo (evo-hq) - the open-source autoresearch engine behind the LawBench result; point it at a codebase and a metric.
upskill - file-based recursive self-improving skill system, git-native, harness-agnostic.
Hermes - the agent runtime people leave running 24/7 for self-improving, memory-backed loops.
Codex - the executor of choice for autoresearch runs, paired with autoresearch orchestrators.
Obsidian + MCP - the persistent-memory substrate for self-improving agents.
AutoSOTA - agentic framework for squeezing further gains out of just-published papers.
LangGraph - still the go-to for building stateful, self-correcting agent loops.