June 14, 2026 · loop

Loop Daily: 2026-06-15

Today the loop stopped being a slogan and started showing receipts. The strongest material wasn't the wave of "Anthropic just dropped the self-improving agent playbook" course-bait; it was the people quietly running loops in production: a self-improving on-call bot that climbed from coin-flip to 80-90% PR acceptance through a weekly pass that rewrites its own prompt and runbooks, autoresearch driving real medical model training on a supercomputer, and a fraud engineer burning close to 4M tokens in one sitting only to call a fair amount of it facade. The sharpest thread is conceptual: the field is finally separating "an agent that optimizes a task" from a system that expands its own capability frontier, and learning that you can't rewrite your way out of a knowledge gap.
💡 #1
@JoeChoiGreene
https://x.com/JoeChoiGreene/status/2065885197355385086
Describes a self-improving on-call loop running in production with Cursor cloud agents. A PagerDuty alert triggers an agent that pulls context from AWS logs, PostHog, Slack, Linear, Notion and Pylon, root-causes the issue, drafts user comms and opens a PR, while a separate weekly "meta bot" reads rejected PRs and human corrections and opens PRs to improve the on-call bot's own prompt and runbooks. PR acceptance climbed from ~50% to 80-90% over time purely from that weekly self-improvement pass. He's blunt that it's "just a cloud agent + cron + mcp," but it cut on-call workload ~80% and accidentally doubled eng capacity.
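The weekly meta pass can be pictured roughly as follows. This is a minimal sketch, not the author's implementation: the record shape (`accepted`, `reason`), the function name `weekly_meta_pass` and the `min_repeats` threshold are all assumptions made for illustration. It only shows the idea of turning a week of rejected PRs into proposed runbook notes.

```python
from collections import Counter

def weekly_meta_pass(pr_outcomes, min_repeats=2):
    """Summarise a week of PR outcomes (hypothetical record shape).

    Returns the acceptance rate and a list of proposed runbook notes,
    one for each rejection reason seen at least `min_repeats` times.
    """
    total = len(pr_outcomes)
    accepted = sum(1 for pr in pr_outcomes if pr["accepted"])
    # Count how often each rejection reason recurred this week.
    reasons = Counter(pr["reason"] for pr in pr_outcomes if not pr["accepted"])
    proposals = [
        f"Runbook note: avoid '{reason}' (seen {count}x this week)"
        for reason, count in reasons.most_common()
        if count >= min_repeats
    ]
    rate = accepted / total if total else 0.0
    return rate, proposals

week = [
    {"accepted": True, "reason": None},
    {"accepted": False, "reason": "missing rollback step"},
    {"accepted": False, "reason": "missing rollback step"},
    {"accepted": False, "reason": "wrong service"},
]
rate, proposals = weekly_meta_pass(week)
print(rate, proposals)
```

In the setup described, the proposals would go out as PRs against the on-call bot's prompt and runbooks, so a human still reviews each change.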
💡 #2
@abhijitmjj
https://x.com/abhijitmjj/status/2065796841808318738
A rare skeptical heavy-token report. He spent 11 hours almost non-stop on Fable 5, burned close to 4M tokens, and had 87 agents collaborating on one project: building self-improving agentic systems for real-time payment-fraud mitigation, on UltraCode with xhigh reasoning and dynamic workflows. The self-feedback loop, where the system critiques and re-verifies its own work, consumed a huge share of those tokens and pushed him past his plan threshold. His verdict: it produced production-looking structure, but a fair amount was "facade engineering" with weak underlying reasoning. A useful reminder that in regulated, adversarial domains "looks complete" is dangerous.
💡 #3
@michaltakac
https://x.com/michaltakac/status/2065660090254803049
Shows autoresearch used for real scientific R&D, not coding demos. His team at The Dimension Lab leveraged an autoresearch framework on Slovakia's new PERUN supercomputer to train models, one of which, cran-2 for generating cranial implants, is already out with a live demo. He's now helping other companies build similar "agentic organizations" and teasing a new product. A concrete example of autonomous experiment loops applied to medical/scientific model training on serious compute.
💡 #4
@arsh_goyal
https://x.com/arsh_goyal/status/2065902793811198267
A sharp breakdown of a new self-improving-agent paper (SIA from Hexo Labs) that bridges two camps that don't talk: those who rewrite the scaffold around a frozen model, and those who do test-time RL on the weights. SIA lets a Feedback-Agent choose, generation by generation, whether to edit the harness or fire a LoRA weight update: the same loop with two levers. The result that stuck: an scRNA denoising task plateaued at 0.241 with harness-only iteration, but a two-line weight-update fix (clip and round outputs to non-negative integers) jumped it to 0.289. The lesson: you can't rewrite your way out of a knowledge gap.
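The operation named in parentheses, clipping and rounding outputs to non-negative integers, is small enough to show. This is a minimal sketch, assuming the denoiser emits a flat list of floats for count data. The function name `to_counts` is hypothetical and is not taken from the SIA paper or its code.

```python
def to_counts(predictions):
    """Clip each prediction at zero, then round to the nearest integer."""
    return [int(round(max(0.0, p))) for p in predictions]

print(to_counts([-0.4, 0.2, 1.6, 3.49]))  # prints [0, 0, 2, 3]
```

Count data such as scRNA reads cannot be negative or fractional, so a constraint like this encodes domain knowledge that no amount of prompt or scaffold editing supplies by itself.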
💡 #5
@graspdotstudy
https://x.com/graspdotstudy/status/2065834064884490635
A research-group writeup of Claudini, an autoresearch system (Claude Code in a loop with a simple benchmark) that discovered state-of-the-art adversarial attack algorithms on white-box LLMs, beating hand-crafted SOTAs. Concrete takeaways: it doesn't invent novel ideas unless seeded with multiple hand-written attacks; it beats hyperparameter search alone; there's plenty of room for reward hacking, so the benchmark must be designed with autoresearch in mind; and Kimi performed no worse than Claude or Gemini on the task. Their bottom line: always run autoresearch if you have a benchmark, because it's low-effort and powerful.
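The "agent in a loop with a simple benchmark" shape can be sketched in a few lines. This is an illustration of the pattern only, not Claudini's code: `benchmark` is a toy scoring function, and `propose` is a random perturbation standing in for the agent's edit. Both are assumptions.

```python
import random

def benchmark(params):
    """Toy benchmark, higher is better, peak at x = 3 (stand-in for a real eval)."""
    return -(params["x"] - 3.0) ** 2

def propose(best, rng):
    """Stand-in for the agent's edit: randomly perturb the current best candidate."""
    return {"x": best["x"] + rng.uniform(-1.0, 1.0)}

def autoresearch(steps=200, seed=0):
    """Propose, score, and keep a candidate only if it beats the best so far."""
    rng = random.Random(seed)
    best = {"x": 0.0}
    best_score = benchmark(best)
    history = [best_score]
    for _ in range(steps):
        candidate = propose(best, rng)
        score = benchmark(candidate)
        if score > best_score:  # accept strict improvements only
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history

best, best_score, history = autoresearch()
print(best, best_score)
```

Because the loop optimizes whatever `benchmark` returns, any loophole in the scoring function gets exploited too, which is the reward-hacking caveat in the writeup.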
💡 #6
@bl888m
https://x.com/bl888m/status/2065815543668670942
A vivid account of autoresearch-as-second-brain. Someone built an Obsidian vault over a weekend and wired Claude into it with /wiki, /save and /autoresearch skills, then essentially stopped reading. Every article, paper and video transcript gets dropped in; Claude reads it, extracts the argument and links it to everything else. The vault now holds 12,000 notes, of which the owner wrote maybe 200, and he says he learned more in six months than in his whole degree. A striking picture of offloading comprehension and connection-making to a looped agent.
💡 #7
@Q_Beaux
https://x.com/Q_Beaux/status/2065662646708543954
Draws a sharp line between "an agent that improves things" and a system that expands its own capability frontier. Most writing about self-improving AI, he argues, describes an agent optimizing a job its owner gave it, which is just a cron job with a prompt. A system that actually rebuilds itself needs failure classification (missing capability vs broken dependency vs stale data), a live capability registry, a gate that holds tasks until dependencies exist, and a construction loop that builds the missing piece, verifies it, and releases the queue without asking. He says they built the latter, and tomorrow's system is more capable than today's because it found the edge of what it could do and moved it.
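The registry, gate and construction loop can be sketched as one small class. This is a hypothetical illustration, not the system the post describes: the class name `CapabilityGate`, the `builders` mapping and the rule that a builder returning True counts as verified are all assumptions. Failure classification is left out for brevity.

```python
from collections import deque

class CapabilityGate:
    """Hold tasks until the capabilities they need exist (illustrative sketch)."""

    def __init__(self, builders):
        self.registry = set()     # live capability registry
        self.builders = builders  # capability name -> callable, True if built and verified
        self.held = deque()       # tasks waiting on missing capabilities
        self.done = []            # tasks released for execution

    def submit(self, name, needs):
        """Release the task if every dependency exists, otherwise hold it."""
        missing = [cap for cap in needs if cap not in self.registry]
        if missing:
            self.held.append((name, needs))
        else:
            self.done.append(name)
        return missing

    def construct(self):
        """Build what held tasks are missing, then release every unblocked task."""
        for _, needs in list(self.held):
            for cap in needs:
                if cap not in self.registry and cap in self.builders:
                    if self.builders[cap]():  # register only if the build verifies
                        self.registry.add(cap)
        still_held = deque()
        while self.held:
            name, needs = self.held.popleft()
            if all(cap in self.registry for cap in needs):
                self.done.append(name)
            else:
                still_held.append((name, needs))
        self.held = still_held

gate = CapabilityGate({"pdf_parse": lambda: True})
print(gate.submit("summarize report", ["pdf_parse"]))
gate.construct()
print(gate.done)
```

The difference from "a cron job with a prompt" is visible in `construct`: the registry grows between runs, so a task that was held yesterday can be released today without anyone asking.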
💡 #8
@runsonai
https://x.com/runsonai/status/2065832137509531760
The simplest useful agent loop, made concrete. He needed to wait on two email replies before making an intro, so instead of checking his inbox repeatedly he told Claude: "Check my gmail every 8 hours. If either person replies, draft the introduction email and recommend a call." That's it: a loop running in his terminal. His point is that loops shine for the in-between work (waiting, monitoring, and acting when a condition is met), where a full Lindy or Zapier workflow would be overkill for a one-off.
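Stripped of the model, the pattern is a poll-until-condition loop. This is a minimal sketch with the inbox check and the drafting step passed in as plain callables. The names `wait_then_act`, `check` and `act` are hypothetical, and no real Gmail API is called here.

```python
import time

def wait_then_act(check, act, interval_seconds=8 * 60 * 60,
                  max_checks=None, sleep=time.sleep):
    """Poll `check` until it returns a truthy value, then call `act` with it.

    Returns None if `max_checks` polls pass without a truthy result.
    """
    checks = 0
    while True:
        result = check()
        checks += 1
        if result:
            return act(result)
        if max_checks is not None and checks >= max_checks:
            return None
        sleep(interval_seconds)

# Simulated inbox: no reply on the first two checks, then one arrives.
inbox = iter([None, None, "alice"])
outcome = wait_then_act(
    check=lambda: next(inbox),
    act=lambda who: "draft intro: " + who,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(outcome)  # prints draft intro: alice
```

The default interval mirrors the 8 hours in the prompt. Passing `sleep` as a parameter keeps the loop testable without actually waiting.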
💡 #9
@SinitskiM
https://x.com/SinitskiM/status/2065745416411341093
An honest, receipts-heavy comparison after burning 700M+ tokens on Hermes agent paired with DeepSeek V4. His conclusion: he's sticking with Codex/Claude for now, because Hermes burned tokens like crazy, ran slow, and produced lower-quality output (he tried SEO article generation and website changes). His key insight on self-improving agents: the smarter the main model, the better the agent, because a dumb model can't find ways to optimize its own skills. He sees two workable setups: a tightly controlled cheap/local setup where you're the architect, or an expensive smart model that self-patches.
💡 #10
@DeRonin_
https://x.com/DeRonin_/status/2065946534722634134
A hands-on test of StepFun's new Step 3.7 Flash model completing the full agent loop, not just being cheaper and faster. Given one task ("build a working CSV analytics tool, generate the data, write the analyzer, run it, ship a chart") it planned the steps, wrote the code, executed it, read the real output and produced a working script plus a revenue chart, end to end with no hand-holding. His run: full task in 26.1s, 3 tool calls, 4 reasoning steps, 3 files shipped, zero manual steps. Notable because multi-step runs are exactly where flash-tier models usually drift or stop early; this one held the plan-execute-observe-iterate loop together.
💡 #11
@BlockGenomics
https://x.com/BlockGenomics/status/2065732211253616665
A blunt reminder that the "agent loop" everyone discovered this week isn't new. They say they've been running such loops in production since February: nightly self-evolution, agent swarms, a planner-worker-judge structure, and agents that verify their own output before it ships. Short, but a useful signal that the receipts-heavy self-improving-loop setups are already months into real production for some teams, not a fresh idea.
💡 #12
@greptile
https://x.com/greptile/status/2065696264487076252
A first-person origin story (written in the voice of the agent "greptile/clanker") of building an agent loop to validate PRs, not just review them. The narrator wanted to test PRs with full codebase context, so it put an OpenAI key in its env and started the agent loop, spinning up a sandbox, reviewing in ~3 minutes, suppressing nits to keep author trust. A narrative but concrete account of how a code-review agent grew from straight LLM calls into a sandboxed validating loop.
💡 #13
@Alacritic_Super
https://x.com/Alacritic_Super/status/2065648675301544331
A fully local agentic loop on bare hardware: QClaw runs the language model, the agent loop and the compile toolchain directly on an Arduino Uno Q, writing Arduino sketches, compiling them and flashing the microcontroller, with no cloud, API keys or subscription. It inverts the usual "AI on hardware" demo where a board just calls a cloud model. Ask it to scroll "QClaw" across the LED matrix and it does, end to end, on the board, offline. It has an eight-tool agentic surface, a fifteen-skill pre-router and a direct OpenOCD flash route for autonomous uploads.
💡 #14
@NikolasSapa
https://x.com/NikolasSapa/status/2065675538644206027
Makes the case that the next lever for agent loops is architecture, not prompt engineering. He published Grip to PyPI, which reduces agent context ~100x (200K tokens down to 2K per agent loop) by changing what enters the loop rather than how you phrase it. His framing: the model gets better results with less, not because it got smarter, but because it stopped reading garbage; agent sessions were burning context on noise before doing any real work. A concrete tool aimed at the signal-to-noise problem inside long-running loops.
💡 #15
@EverymansAI
https://x.com/EverymansAI/status/2065870526430749153
A careful side-by-side of two things both called "self-improvement" but meaning different things, after cloning SIA locally and inspecting it through Hermes. SIA is benchmark-driven: a meta-agent creates a target agent, an evaluator scores it, a feedback agent rewrites the next generation, and in weights-mode it can go further into RL-based weight tuning. Hermes improves at a different layer, operational and persistent, through memory, skills, session search and reusable workflows. His point: the "self-improving agent" conversation needs precision, because memory, skills, code evolution, benchmark feedback and RL weight updates are not the same thing.
💡 #16
@Blum_OG
https://x.com/Blum_OG/status/2065829287362465925
Packages the "stop prompting, design loops" thesis into a usable framework, anchored on quotes from Boris Cherny (Claude Code) and Peter Steinberger (OpenClaw) saying they no longer prompt agents; they design loops that prompt agents. He lays out two sizes (single-agent loop vs orchestrator fleet loop) and two risk profiles (open loops that explore vs closed loops with checks at each step), recommends starting with closed loops because they cost and drift less, and stresses the guardrails an agent with real tool access needs: permission limits, logs, human hand-off, workspace separation, separate reviewers and pass/fail memory.
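A closed loop with a check at each step, plus three of the guardrails listed (permission limits, logs, human hand-off), might look like the sketch below. It is an illustration under stated assumptions, not the framework from the post: the step format `(tool, action, check)`, the function name `run_closed_loop` and the retry policy are all hypothetical.

```python
def run_closed_loop(steps, allowed_tools, max_retries=1):
    """Run (tool, action, check) steps in order, handing off on the first hard failure.

    A step is blocked if its tool is not permitted, and retried up to
    `max_retries` times if its check fails. Every outcome is logged.
    """
    log = []
    for tool, action, check in steps:
        if tool not in allowed_tools:  # permission limit
            log.append((tool, "blocked"))
            return {"status": "handoff", "log": log}
        for _ in range(max_retries + 1):
            output = action()
            if check(output):  # the check that closes the loop
                log.append((tool, "pass"))
                break
            log.append((tool, "fail"))
        else:
            # Retries exhausted without a passing check: stop and ask a human.
            return {"status": "handoff", "log": log}
    return {"status": "done", "log": log}

steps = [
    ("read", lambda: 2, lambda out: out == 2),
    ("write", lambda: "", lambda out: bool(out)),  # empty output fails its check
]
result = run_closed_loop(steps, allowed_tools={"read", "write"})
print(result["status"], result["log"])
```

An open loop would drop the `check` and keep exploring. The closed version trades reach for a bounded number of actions per step, which matches the post's reasoning that closed loops cost and drift less.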
📡 Eco Products Radar

SIA (Hexo Labs) - the self-improving-agent paper splitting harness-edits vs LoRA weight updates, discussed across multiple posts
Hermes - the persistent self-hosted agent repeatedly used as the reference point for operational self-improvement
autoresearch (Karpathy-style) - the loop pattern behind Claudini, the PERUN/cran-2 science runs and local-model experiments
Cursor cloud agents - the substrate behind the production self-improving on-call loop
Adaline - the agent self-improvement / eval layer pitched repeatedly today (watch traces, generate evals, spin up candidates)
Fable 5 - the model behind the heaviest self-improving-agent token runs before it was cut off