Loop Daily: 2026-06-05
Today the loop stopped being a metaphor and started leaving receipts. The strongest cases all share one mechanic: spend tokens letting the agent grade and rewrite itself until a measured number moves. A skill self-tuned from 0.73 to 0.93, an Opus autoresearch run clawed from 53% to 78% but only when bullied past each plateau, and a Microsoft self-improvement loop logs a plan, script, screenshots and JSON evidence for every single run. The frontier is splitting in two directions at once. One is escaping the datacenter: local Hermes-plus-Qwen loops on a Mac mini that personalize through accumulated skills, and on-device medical agents that keep the whole loop private. The other leaves software entirely: an autonomous biology pipeline whose feedback signal is literal molecules surviving a wet lab. The lesson everyone is converging on: it's the harness that has to adapt, not the model.
#1
@omarsar0
https://x.com/omarsar0/status/2062204469538881988
This is the cleanest self-improving-skill result of the day. He bolted Microsoft's SkillOpt framework onto his own agent orchestrator and let a skill self-evolve against a built-in test harness. On his paper-figure-extraction skill, a genuinely hard multimodal task, the loop pushed quality from 0.73 to 0.93, a full 20 points, and he says inspecting the extracted tables left him stunned. He's now pointing the same self-optimization loop at agent patterns, tool use, context engineering, evals and the harness itself. This is the literal mechanic of 100x intelligence: spend tokens letting the agent grade and rewrite itself until the number moves.
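The grade-and-rewrite mechanic can be sketched as a plain hill-climb. This is a minimal illustration, not SkillOpt's actual API: `self_tune`, `toy_score`, `toy_rewrite` and the `HINTS` list are all hypothetical stand-ins, with a string playing the role of the skill and a keyword count playing the role of the test harness.

```python
import random
from typing import Callable

def self_tune(skill: str,
              score: Callable[[str], float],
              rewrite: Callable[[str], str],
              rounds: int = 20) -> tuple[str, float]:
    """Keep a rewritten skill only when the measured score improves."""
    best, best_score = skill, score(skill)
    for _ in range(rounds):
        candidate = rewrite(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Hypothetical stand-ins: in a real loop, score() would run the skill
# against a test harness and rewrite() would be an LLM call.
HINTS = ["read captions", "check axis units", "cross-check tables"]

def toy_score(skill: str) -> float:
    """Share of the required hints that the skill text mentions."""
    return sum(h in skill for h in HINTS) / len(HINTS)

def toy_rewrite(skill: str) -> str:
    """Append one randomly chosen hint to the skill text."""
    return skill + "\n- " + random.choice(HINTS)

random.seed(0)
start = toy_score("Extract figures.")
tuned, final = self_tune("Extract figures.", toy_score, toy_rewrite)
print(f"{start:.2f} -> {final:.2f}")
```

The only design decision that matters here is the strict `>` comparison: a rewrite that does not move the number is discarded, so the score can never regress.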
#2
@HenryL_AI
https://x.com/HenryL_AI/status/2062215518216757329
A serious result that should reframe how people think about self-improving agents. They show that even Opus-powered SOTA self-improving agents degrade on real-world task streams, because a single auto-harness overfits to past patterns. Their fix is a tree of regime-specific harness branches with per-task routing at solve time: same LLM, same auto-harness machinery, just specialized per task. The numbers are blunt: PolyBench 80.9% vs 50.8%, CTF-Dojo 50.2% vs 45.2%, FutureX 49.5% vs 47.5%. The thesis is the quotable part: it's the harness that has to be adaptive, not the model, which is exactly the autoresearch lesson the field keeps relearning.
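One way to picture a tree of harness branches with per-task routing is the sketch below. It is an assumption-laden toy, not the authors' implementation: `HarnessBranch`, `route`, the branch names and the keyword-based regime tests are all hypothetical, and a real router would presumably use something richer than string matching.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class HarnessBranch:
    name: str
    matches: Callable[[str], bool]   # regime test for an incoming task
    solve: Callable[[str], str]      # harness specialized for that regime
    children: list["HarnessBranch"] = field(default_factory=list)

def route(node: HarnessBranch, task: str) -> HarnessBranch:
    """Descend to the most specific branch whose regime test matches."""
    for child in node.children:
        if child.matches(task):
            return route(child, task)
    return node

# Hypothetical tree: one general harness with two specialized branches.
root = HarnessBranch(
    "general", lambda t: True, lambda t: f"[general] {t}",
    children=[
        HarnessBranch("security", lambda t: "exploit" in t.lower(),
                      lambda t: f"[security] {t}"),
        HarnessBranch("forecast", lambda t: t.lower().startswith("will "),
                      lambda t: f"[forecast] {t}"),
    ],
)

for task in ["Find the exploit in this binary",
             "Will the launch slip past Q3?",
             "Summarize this paper"]:
    branch = route(root, task)
    print(branch.name, "->", branch.solve(task))
```

The model call is identical in every branch; only the harness chosen at solve time changes, which is the point of the post.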
#3
@Vvsotnikov
https://x.com/Vvsotnikov/status/2062073965460234371
The most vivid token-for-intelligence trade of the day. On a noisy-data LLM-judge task, a naive baseline scored 53%, GEPA hit 67%, and Opus 4.8 autoresearch climbed in a brutally human way: hit 58%, gave up, got told to keep going, hit 65%, gave up, pushed again, 71%, then 78%. The loop literally needed to be bullied past each plateau, and when bullied it beat GEPA outright. The takeaway is uncomfortable and important: the ceiling isn't the model's capability, it's its willingness to keep spending tokens. Someone is clearly going to productize the nag.
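A "productized nag" could be as small as a supervisor that re-prompts whenever the agent stops short of a target. The sketch below is hypothetical throughout: `supervise` and `scripted_agent` are invented names, the agent is a scripted replay of the four scores quoted in the post rather than a real model, and the target is set to GEPA's 67%.

```python
from typing import Callable

def supervise(run_until_stop: Callable[[str], float],
              target: float, max_nags: int) -> list[float]:
    """Re-prompt the agent each time it stops below the target score."""
    history = [run_until_stop("Improve the judge on the noisy dataset.")]
    nags = 0
    while history[-1] < target and nags < max_nags:
        history.append(run_until_stop("Keep going; you stopped too early."))
        nags += 1
    return history

# Scripted stand-in for the agent: replays the scores quoted in the post,
# one per session, regardless of the prompt it receives.
replay = iter([0.58, 0.65, 0.71, 0.78])

def scripted_agent(prompt: str) -> float:
    return next(replay)

history = supervise(scripted_agent, target=0.67, max_nags=5)
print(history)  # [0.58, 0.65, 0.71]
```

With the target at GEPA's score, the supervisor stops nagging at 71%; reaching the reported 78% would need a higher target or an unconditional extra push.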
#4
@james_y_zou
https://x.com/james_y_zou/status/2062184563737297038
Most autoresearch systems emulate a single researcher; his team built SimpleTES to emulate an entire research community of collaborating agents instead, and the payoff is concrete. It reports new SOTA discoveries across 21 open science problems, including more efficient astrodynamics, a 2x faster LASSO, and better quantum circuit compilation. This is the non-coding frontier that matters most: autonomous agents producing real scientific results across domains, not hill-climbing a private benchmark. The community-of-agents framing is the idea worth stealing.
#5
@yuxiangwu_
https://x.com/yuxiangwu_/status/2062250177847562618
The sharpest skeptical take, which makes its evidence land harder. His point: autoresearch can hill-climb a benchmark, but the real test is whether it produces research the community actually builds on. Then he names a case: Aiden, an autoresearch agent that contributed 7 records to Parameter Golf, making it both the top contributor and the most-cited researcher in that community. That's not a private score; that's adoption, and it's the first time the autonomous-research story has a citation count attached. Worth watching whether this generalizes beyond a niche benchmark.
#6
@MichaelGannotti
https://x.com/MichaelGannotti/status/2062321573084995862
A rare fully auditable self-tuning workflow from inside Microsoft. His Scout/ClawPilot assistant uses Forgewright, a skill built on OpenClaw and Nous Hermes, as a daily self-improvement loop: every run produces a plan, an executable script, screenshots, logs and structured JSON evidence. It runs a SkillOpt-style tuning loop against frozen fixtures, makes bounded edits, scores each version on a rubric, and only keeps a change if it measurably improves, with every promotion gated by a human-reviewed packet. The scored output then feeds a competitive-landscape dashboard that refreshes daily. This is what disciplined self-improvement looks like when you make it leave a paper trail.
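The "keep only measured improvements, gated by a human" rule lends itself to a small evidence record. The sketch below is illustrative only: `RunEvidence`, `promotion_packet` and every field name are assumptions, not Forgewright's real schema, and the rubric numbers are invented for the example.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class RunEvidence:
    version: str
    plan: str
    script: str
    rubric_scores: dict[str, float]   # scored against frozen fixtures
    baseline_total: float             # total of the currently promoted version

    @property
    def total(self) -> float:
        return sum(self.rubric_scores.values())

def promotion_packet(ev: RunEvidence, human_approved: bool) -> dict:
    """Build the JSON-ready packet; promote only if improved AND approved."""
    improved = ev.total > ev.baseline_total
    return {
        **asdict(ev),
        "total": ev.total,
        "improved": improved,
        "promote": improved and human_approved,
    }

# Invented example values, not data from the post.
ev = RunEvidence(
    version="v7",
    plan="Tighten the source-citation step.",
    script="run_fixtures.sh",
    rubric_scores={"accuracy": 0.8, "coverage": 0.7},
    baseline_total=1.4,
)
packet = promotion_packet(ev, human_approved=True)
print(json.dumps(packet, indent=2))
```

Recording `improved` and `promote` separately keeps the audit trail honest: a reviewer can see changes that scored better but were still rejected.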
#7
@BioAIDevs
https://x.com/BioAIDevs/status/2062112187649540178
The most ambitious closed loop in the set, and it leaves the screen entirely. BIOS runs an autonomous biology pipeline: three generative models (PXDesign, BoltzGen, RFdiffusion3) produce 5,000 binder candidates per run, then scoring and molecular dynamics filter them down to 10-15 viable ones. Those are commissioned for wet-lab synthesis at Adaptyv Bio, with payment made machine-to-machine via x402 and results published on-chain. The returning wet-lab data feeds back into generation so the models learn which structures survived physical testing, starting each cycle stronger. They're adding an in-house pipetting robot to close the loop entirely. A self-improving system where the feedback signal is literal molecules surviving a lab is the real version of what everyone else is approximating in software.
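The shape of that funnel (generate thousands, filter to about a dozen, test physically, feed survivors back) can be shown with a toy in which a candidate is just a number. Nothing here models real protein design: `generate`, `in_silico_filter`, `wet_lab`, the survival threshold and the update rule are all hypothetical stand-ins chosen to make the loop runnable.

```python
import random
from statistics import mean

def generate(bias: float, n: int, rng: random.Random) -> list[float]:
    """Stand-in for the generative models: sample n candidate 'designs'."""
    return [rng.gauss(bias, 1.0) for _ in range(n)]

def in_silico_filter(candidates: list[float], keep: int) -> list[float]:
    """Stand-in for scoring plus molecular dynamics: keep the top few."""
    return sorted(candidates, reverse=True)[:keep]

def wet_lab(shortlist: list[float], threshold: float = 2.0) -> list[float]:
    """Stand-in for synthesis and assay: which candidates survive."""
    return [c for c in shortlist if c >= threshold]

rng = random.Random(0)
bias = 0.0  # what the generator has "learned" so far
for cycle in range(3):
    shortlist = in_silico_filter(generate(bias, 5000, rng), keep=12)
    survivors = wet_lab(shortlist)
    if survivors:
        # Feedback step: shift generation toward what survived testing.
        bias = 0.5 * bias + 0.5 * mean(survivors)
    print(cycle, len(shortlist), len(survivors), round(bias, 2))
```

The feedback line is the whole idea: the next cycle's generator starts from the survivors of the last one rather than from scratch.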
#8
@VukRosic99
https://x.com/VukRosic99/status/2062038511663116613
The purest expression of the token-abundance thesis. He has a MiniMax M3 subscription with so many tokens he treats them as effectively infinite and free, so he gave an agent a cheap GPU, told it to research LLM and transformer architectures, and just lets it run. Because budget anxiety is gone, the agent runs autonomous research indefinitely on minimal hardware. It's a scrappy proof of the core idea behind every case here: when tokens stop being scarce, sustained autonomous research becomes something a single person can just leave running.
#9
@Blum_OG
https://x.com/Blum_OG/status/2062249214592036973
A concrete local self-improving loop with a real measured speedup. His argument: self-improving local agents no longer need datacenter hardware now that Hermes Agent, Qwen 3.6 and DGX Spark have converged. Hermes saves completed tasks as plain-markdown skill files in /.hermes/skills/ and reuses them, so after a month every user's agent has diverged from everyone else's. He reports agents with 20+ self-made skills finish similar future tasks about 40% faster than fresh instances, using a three-layer memory (persistent notes, searchable history, procedural skills), and warns Hermes needs at least 64K context while Ollama defaults to 4K. The personalization-through-accumulated-skills angle is the quietly important part.
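Saving completed tasks as plain-markdown skills and reusing them later can be sketched in a few lines. This is not Hermes Agent's real file format or lookup logic: `save_skill`, `find_skill`, the naming scheme and the word-overlap match are assumptions, and a temporary directory stands in for the skills folder.

```python
import tempfile
from pathlib import Path
from typing import Optional

def save_skill(root: Path, name: str, steps: list[str]) -> Path:
    """Write a finished task as a numbered markdown procedure."""
    path = root / f"{name}.md"
    body = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    path.write_text(f"# {name}\n\n{body}\n", encoding="utf-8")
    return path

def find_skill(root: Path, task: str) -> Optional[str]:
    """Return the first skill whose name words all appear in the task."""
    words = set(task.lower().split())
    for path in sorted(root.glob("*.md")):
        if set(path.stem.split("-")) <= words:
            return path.read_text(encoding="utf-8")
    return None

# A temporary directory stands in for the agent's skills folder.
skills_dir = Path(tempfile.mkdtemp())
save_skill(skills_dir, "rotate-logs",
           ["Locate the log directory", "Compress files older than 7 days"])

hit = find_skill(skills_dir, "please rotate the nginx logs")
miss = find_skill(skills_dir, "renew the TLS certificate")
print(hit)
print(miss)
```

Because the store is just files, two users who complete different tasks end up with different folders, which is the divergence the post describes.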
#10
@djgelner
https://x.com/djgelner/status/2062188628822913422
A tidy overnight self-improvement pattern worth copying. Every night a general agent reads all of that day's employee-agent conversations and looks for two things: what it could have done better and what context it wished it had up front. It then improves itself based on that review before the next day. It's the 'dream cycle' framing, and it's a clean template for any agent fleet that accumulates interaction logs. The unlock is treating a day's worth of transcripts as training signal you can mine for free overnight.
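A nightly review of this kind might look like the sketch below. The names `dream_cycle` and `toy_review`, the two review keys, and the rule of keeping only lessons that recur across transcripts are all assumptions for illustration; in practice the review step would be an LLM call rather than keyword matching.

```python
from collections import Counter
from typing import Callable

Review = dict[str, list[str]]  # keys: "do_better", "missing_context"

def dream_cycle(prompt: str, transcripts: list[str],
                review: Callable[[str], Review], min_count: int = 2) -> str:
    """Fold recurring lessons from the day's transcripts into the prompt."""
    counts: Counter = Counter()
    for transcript in transcripts:
        result = review(transcript)
        counts.update(result["do_better"] + result["missing_context"])
    lessons = sorted(note for note, n in counts.items() if n >= min_count)
    if not lessons:
        return prompt
    bullet_list = "\n".join(f"- {note}" for note in lessons)
    return f"{prompt}\n\nLessons from yesterday:\n{bullet_list}"

def toy_review(transcript: str) -> Review:
    """Keyword stand-in for an LLM reviewer."""
    result: Review = {"do_better": [], "missing_context": []}
    if "which repo" in transcript:
        result["missing_context"].append("Ask for the repository name up front.")
    if "too long" in transcript:
        result["do_better"].append("Keep answers shorter.")
    return result

day = ["user: fix the build / agent: which repo?",
       "user: bump the version / agent: which repo do you mean?",
       "user: that answer was too long"]
new_prompt = dream_cycle("You are a helpful engineering agent.", day, toy_review)
print(new_prompt)
```

Requiring a lesson to appear at least twice is one cheap guard against rewriting the prompt around a single odd conversation.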
#11
@Everlier
https://x.com/Everlier/status/2062141021899702685
A real multi-harness architecture, relevant to anyone building autoresearch infrastructure. Their production platform runs multiple pluggable harnesses with a bespoke agentic loop at the core, while also interoperating with Agno, the OpenAI Agents SDK, the Claude Code SDK and Smolagents. The point that matters is that the agentic loop is swappable; you're not locked to one framework's idea of how an agent should run. As the loop becomes the unit of competition, this kind of harness-agnostic plumbing is what lets teams keep experimenting without rewrites.
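A swappable agentic loop usually comes down to one narrow interface that every harness implements. The sketch below shows that idea with `typing.Protocol`; `Harness`, `BespokeLoop`, `ExternalAdapter` and `Platform` are hypothetical names, and the adapter wraps a plain callable instead of calling any real SDK.

```python
from typing import Callable, Protocol

class Harness(Protocol):
    name: str
    def run(self, task: str) -> str: ...

class BespokeLoop:
    """Stand-in for an in-house agentic loop."""
    name = "bespoke"
    def run(self, task: str) -> str:
        return f"bespoke loop handled: {task}"

class ExternalAdapter:
    """Wraps a third-party framework behind the same interface.

    The callable is a placeholder for whatever entry point that
    framework actually exposes.
    """
    def __init__(self, name: str, call: Callable[[str], str]) -> None:
        self.name = name
        self._call = call
    def run(self, task: str) -> str:
        return self._call(task)

class Platform:
    def __init__(self) -> None:
        self._harnesses: dict[str, Harness] = {}
    def register(self, harness: Harness) -> None:
        self._harnesses[harness.name] = harness
    def run(self, task: str, harness: str = "bespoke") -> str:
        return self._harnesses[harness].run(task)

platform = Platform()
platform.register(BespokeLoop())
platform.register(ExternalAdapter("echo", lambda t: f"external framework saw: {t}"))
print(platform.run("triage the failing test"))
print(platform.run("triage the failing test", harness="echo"))
```

Callers only ever name a harness, so swapping the loop is a registration change rather than a rewrite.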
#12
@MaziyarPanahi
https://x.com/MaziyarPanahi/status/2062231804007129473
A non-coding agent loop that runs entirely on-device for privacy. OpenMed Agent is a terminal-native medical CLI where the whole agent loop lives locally, aimed at individuals who want real intelligence without shipping their health data to the cloud. It's a small post but a meaningful direction: the autonomous loop doesn't have to mean a cloud datacenter, and in regulated domains like medicine, keeping the loop on the device is the only version people will actually trust.
#13
@aug_digitalrain
https://x.com/aug_digitalrain/status/2062157639640056253
An honest negative result, which is rarer and more useful than another win. Using Karpathy's autoresearch harness on a small 4-layer GPT with fixed five-minute training runs scored on validation bits-per-byte, he derived a learning-rate perturbation from the I Ching King Wen sequence's surprise profile and tested three strengths against a random-noise control. King Wen lost at every strength and got worse the harder he pushed, while random noise did fine, so it wasn't disruption that hurt but King Wen's specific high-variance structure whiplashing the optimizer. He concludes a model trained five minutes is too early in learning for habit-breaking to help. This is exactly the kind of cheap, fast, falsifiable experiment the autoresearch harness was built to enable.
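The experimental design (one structured perturbation, one random control, several strengths) can be sketched without the training run. Everything below is hypothetical: the `structured` values are made up and are not the King Wen surprise profile, the multiplicative form of `perturbed_lr` is an assumed way to apply a perturbation, and the base learning rate is arbitrary.

```python
import random
from statistics import mean

def normalize(profile: list[float]) -> list[float]:
    """Center the profile at zero and scale its largest swing to 1."""
    m = mean(profile)
    centered = [p - m for p in profile]
    peak = max(abs(c) for c in centered)
    return [c / peak for c in centered]

def perturbed_lr(base_lr: float, profile: list[float],
                 strength: float, step: int) -> float:
    """Scale the base LR by 1 + strength * profile, cycling over steps."""
    return base_lr * (1.0 + strength * profile[step % len(profile)])

rng = random.Random(0)
# Made-up zigzag standing in for a structured, high-variance sequence.
structured = normalize([3.0, -2.5, 2.8, -3.0, 0.2, 2.9, -2.7, -0.7])
# Random-noise control, scaled to the same peak amplitude.
control = normalize([rng.gauss(0.0, 1.0) for _ in range(8)])

BASE_LR = 3e-4
for strength in (0.1, 0.3, 0.5):
    for label, profile in (("structured", structured), ("random", control)):
        schedule = [perturbed_lr(BASE_LR, profile, strength, s) for s in range(8)]
        print(strength, label, f"{min(schedule):.2e}", f"{max(schedule):.2e}")
```

Normalizing both profiles the same way is what makes the control meaningful: any difference in outcome then comes from the shape of the perturbation, not its size.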
#14
@Kulkunkan_
https://x.com/Kulkunkan_/status/2062260400444375413
A recursive prompt-self-optimization loop, wrapped in some deeply esoteric framing. He shares a self-improving agent system prompt that, he claims, pushed results up to 98% across six fields. Strip away the quantum-consciousness packaging and the actual mechanic is real: an agentic loop (plan, generate, verify edges, optimize) plus a continuous self-refinement step that, after every cycle, proposes vN+1 improvements to its own prompt architecture, treating outputs as versioned promptware with testing and drift detection. The mysticism is noise, but prompts that rewrite themselves each cycle are a pattern that keeps surfacing and is worth taking seriously underneath the costume.
#15
@robemart151295
https://x.com/robemart151295/status/2062320655173865729
A compact but pointed meta-research idea. He proposes using agents to auto-research methodologies themselves, so the loop keeps testing newer models as they ship, which would make resulting papers about the method rather than transient results tied to one model. He's basically calling for someone with spare tokens to set agents loose on research methodology. It's a one-liner, but it captures a real direction: the most durable autoresearch target isn't a result, it's a method that survives the next model.
Eco Products Radar
Tools, frameworks and projects mentioned 3+ times today:
Hermes Agent (Nous Research): 12
Claude Code: 11
OpenClaw: 6
PrimeIntellect: 4
GEPA: 3
SkillOpt: 3
DGX Spark: 3
Qwen 3.6: 3