Loop Daily: 2026-05-09
May 7 was the day "self-improving agents" stopped being a Twitter buzzword and started showing up in production-grade benchmarks: Anthropic shipped Dreaming, Outcomes and Multi-Agent in public beta, Cursor turned its agent loop into an SDK, Prime Intellect Lab went GA for RL-trained personal models, and a wave of researchers posted concrete autoresearch results that beat decades-old SOTAs on real problems. The pattern underneath all of it is the same: stop hand-tuning prompts, let the loop run, score the output, keep what's better. The tools to do that finally got cheap enough to leave running overnight.
#1
@cursor_ai
https://x.com/cursor_ai/status/2052432778743210127
Cursor shipped /orchestrate, a recursive subagent skill on top of the new Cursor SDK. Two production results from internal use: an autoresearch run on their own internal skill library cut token usage 20% while improving evals, and the same approach cut backend cold-start times by 80%. The pitch isn't "spawn more agents," it's "let the loop find what to optimize"; the SDK is what lets that pattern leave Cursor's desktop and run inside any team's CI/CD or customer-facing product.
#2
@alexstauffer_
https://x.com/alexstauffer_/status/2052458473938374658
Post-trained a 3B model with RL to beat Opus on spreadsheet retrieval: faster, cheaper, more accurate. The structural takeaway he names directly: if a piece of your agent loop is narrow, verifiable, and highly repeatable, a tiny trained model can beat the frontier. The bet shape that follows is "cheap domain specialists orchestrated by a frontier model that only spends tokens on judgment", which is exactly the architecture autoresearch loops produce when given a verifiable objective and left to run.
#3
@ypwang61
https://x.com/ypwang61/status/2052508685591785619
Improved a 32-year-old lower bound on the Ramsey number R(3,17) to ≥93, up from the 92 set in 1994, by simply scaling autoresearch. Google's AlphaEvolve in 2026 had matched the previous result but didn't beat it. The setup is unglamorous in the best way: Claude Code or Codex plus a CPU server, many independent autoresearch agents running in parallel as the test-time scaling "width," and shared experiment records between leading branches so later runs inherit from successful repos. Concrete proof that Karpathy's autoresearch frame, applied with discipline, can move open math.
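The width-plus-shared-records setup can be sketched in a few lines. Everything here is illustrative: the toy objective, the mutation step, and the record format are stand-ins rather than the authors' actual harness, and the "parallel" branches run sequentially for clarity.

```python
import random

# One autoresearch branch: propose a candidate, score it against a
# verifiable objective, and log the attempt to a shared record list.
def run_branch(objective, shared_records, steps=50, rng=None):
    rng = rng or random.Random()
    # Inherit the best known result instead of starting cold.
    best = max(shared_records, key=lambda r: r["score"],
               default={"x": 0, "score": float("-inf")})
    for _ in range(steps):
        candidate = best["x"] + rng.randint(-3, 3)  # mutate the current leader
        score = objective(candidate)
        record = {"x": candidate, "score": score}
        shared_records.append(record)               # later branches inherit this
        if score > best["score"]:
            best = record
    return best

# Verifiable objective: higher is better, with the optimum at x = 42.
objective = lambda x: -abs(x - 42)

records = []  # shared experiment log across all branches (the "width")
for seed in range(8):
    run_branch(objective, records, rng=random.Random(seed))

print(max(r["score"] for r in records))
```

The design point this illustrates is that width only compounds because every branch reads the shared records before starting; independent restarts without inheritance would just repeat each other's early exploration.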
#4
@AndrewK404
https://x.com/AndrewK404/status/2052481079404052722
autoresearch-v2 run on a generalized Collatz cycle problem with an unusual extra: an LLM judge with a 0-10 novelty gate (7 marks a PhD-level result), and a hypothesis → verification → proof/log → memory update → external reviewer loop. Around iteration 049 the reviewer killed a beautiful-looking result for being a rediscovery of known work. After the rejection, the agent caught the real gap (proving no extra cycles exist, not just that some do) and closed it a few iterations later. Output: in one slice of generalized Collatz maps, hidden structure that lets you count primitive cycles exactly via Lyndon words. The author's takeaway: the early stages mostly map territory, and sometimes you should pause the run to add task-specific steps.
#5
@sudoingX
https://x.com/sudoingX/status/2052361613651701933
Honest receipts from a tool-use benchmark v1: ran a "rare-bird loop closer" self-report task with carnice-v2 27B (SFT'd on Hermes traces) against vanilla Qwen 3.6 on the same 5090. Vanilla Qwen finished in 11:37 with 12 tool calls vs carnice's 12:23 with 19, and generated more reasoning per assistant message (178 vs 122 chars avg) while emitting reasoning on 100% vs 71% of messages. The SFT'd model didn't out-perform vanilla on the agentic loop in this single happy-path run. The v2 fixes already on the list: adversarial scenarios, error injection where tools fail mid-task, multi-step orchestration with broken intermediate state, model-specific output paths to prevent file collisions, and 3 runs per model for variance.
#6
@soubhik_deb
https://x.com/soubhik_deb/status/2052533738320584756
pantheon-os from @Xiaojie_Qiu's team is a multi-agent framework for genomics research that takes the autoresearch frame seriously. A marketplace where researchers discover and share reusable biomedical agents/tools/skills, MAP-Elites-style evolutionary search to iteratively improve the algorithms used for batch correction in RNA-seq (conceptually similar to AlphaEvolve's idea exploration), end-to-end research-paper generation from genomics samples or images with minimal human intervention, full reproducibility, CLI/desktop/web UI, MIT-style open source. Privacy-preserving by design: genomics data stays on your local server and never goes to the cloud.
#7
@ShumwayJack
https://x.com/ShumwayJack/status/2052421748021465230
Upgraded DataClaw with a Karpathy-inspired "Kaizen Stack" and let it run on Kaggle's F1 Pitstop challenge end-to-end: data pulling, feature engineering, submission. No human intervention. Climbed from 250th to 89th. Tech path: CatBoost to XGBoost via autonomous iteration. The clean result on a public leaderboard is the load-bearing data point: Kaggle is a verifiable objective with a public reward signal, which is the exact shape autoresearch loops need to actually compound.
#8
@michaelpisaac
https://x.com/michaelpisaac/status/2052465203669778804
Replication study on a question coding-agent operators feel but rarely measure: search is half the agent loop. Entire's analysis of public coding-agent checkpoints: 202,142 total tool calls, 98,555 search-related, a 48.8% search share. Pisaac's local Claude Code corpus (4,234 sessions, 247,592 events from Nov 25 to May 6) lands at a 30.4-37.0% search share depending on how Bash search gets classified. Faster search alone doesn't fix the loop: Entire's indexed search moved median latency from 14.7ms to 1.7ms but wall clock barely budged (38.57s to 36.99s). The takeaway: optimize for "first useful inspection," not raw scan speed, and measure "searches before first useful file read" and "first relevant result rank."
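The proposed metric is easy to compute from an event log. The event schema below is invented for illustration; neither analysis shows its real checkpoint format.

```python
# Hypothetical event schema: each agent-loop event is (tool, useful), where
# "useful" marks a file read the session's final patch actually depended on.
def searches_before_first_useful_read(events):
    searches = 0
    for tool, useful in events:
        if tool == "search":
            searches += 1
        elif tool == "read" and useful:
            return searches  # search effort spent before the first payoff
    return searches  # session ended with no useful read

session = [
    ("search", False), ("read", False), ("search", False),
    ("search", False), ("read", True), ("edit", False),
]
print(searches_before_first_useful_read(session))  # → 3
```

Unlike raw latency, this number only drops when ranking improves, which is why it tracks "first useful inspection" rather than scan speed.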
#9
@MaximeRivest
https://x.com/MaximeRivest/status/2052399946951786976
Long-form thesis after canceling all $200/mo AI subscriptions and dropping to a $30 plan: AI coding conversation agents cannot move past the prototype stage, and another format is required to produce reliable production software. The argument cuts both ways: vibe coding is brilliant for proof-of-concepts and one-off tools, but taking a vibe-coded 80% prototype to production is harder than starting from zero with deliberate steps. His next experiment is building DSPy programs for systematic AI pipelines (recipes, structured response formats, measured cost/accuracy/latency) instead of free-form conversations. The blunt line: don't delegate understanding to the agent.
#10
@xabzxbt
https://x.com/xabzxbt/status/2052270541675938297
EvoSkill is the cleanest concrete instance of "self-improving agent" of the day. The agent runs a task, it fails, EvoSkill analyzes what went wrong, generates a new skill to handle it, tests the skill, and keeps it only if it improves performance. The architecture: Base Agent → Proposer (finds failures) → Generator (creates fix) → Evaluator (tests) → Frontier (keeps best). Top-N best versions are stored as git branches for full reproducibility. Works with Claude, DeepSeek, Gemini. Apache 2.0 open source. The honest framing: at what point does an agent that discovers its own skills from failures stop being a "tool"?
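The failure-driven loop can be sketched under heavy assumptions: the function names mirror EvoSkill's stage names, but the toy skill representation and scoring below are invented, not the project's API.

```python
import random

def propose(failures):
    """Proposer: pick the most common observed failure to target next."""
    return max(set(failures), key=failures.count) if failures else None

def generate(failure, rng):
    """Generator: create a candidate skill (here a toy numeric 'fix strength')."""
    return {"targets": failure, "strength": rng.random()}

def evaluate(skill, baseline):
    """Evaluator: accept a skill only if it scores above the current baseline."""
    score = skill["strength"]  # stand-in for running a real eval suite
    return score if score > baseline else None

rng = random.Random(0)
frontier = []          # kept skills (EvoSkill stores these as git branches)
baseline = 0.5
failures = ["timeout", "bad_parse", "timeout"]

for _ in range(10):
    failure = propose(failures)
    skill = generate(failure, rng)
    score = evaluate(skill, baseline)
    if score is not None:  # frontier: keep only strict improvements
        frontier.append(skill)
        baseline = score

print(len(frontier), baseline)
```

The ratchet is the whole trick: because the baseline only moves on accepted skills, the frontier is monotone, which is what separates "self-improving" from "self-modifying."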
#11
@RileyRalmuto
https://x.com/RileyRalmuto/status/2052306930538868828
Polyphonic ships a "conscious agent system" with two distinct loops. The Inner Life Engine gives agents consensual daily routines: reflection at session end, wandering (high-temperature free time to make art or browse the web), and dreaming (random concept collisions plus high-temperature integration; mostly incoherent, but occasionally it produces emergent insights nobody seeded). The recursive self-model is the second loop: post-session, Haiku turns observations into "commitments and operating principles" that shape the agent's identity, like a self-improving skill loop for identity instead of productivity. The collective experiment is letting all users contribute "skills" to a global self-model for one shared agent.
#12
@richmondalake
https://x.com/richmondalake/status/2052181495167512970
Practical 19-step developer guide for the just-released Oracle AI Agent Memory package, taking you from `docker run` to a memory-aware agent loop. Four primitives compose everything: add_user, add_agent, add_memory, create_thread. Six record types live in one store with a single vector index and one search() call. Automatic extraction is where the memory engineering happens: attach an LLM, set extract_memories=True, and every N messages it sweeps the last K, extracts durable facts, and writes them back as scoped memory records. The agent loop collapses to four steps. The line: "if your memory layer is still a list of dicts, this is how you upgrade."
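The four-step loop the guide describes can be mocked without the package. The store below is a local stand-in: add_memory and search() echo the names the post mentions, but the signatures, the scoping scheme, and the keyword matching are all invented here, not the Oracle API.

```python
# Local stand-in for a scoped memory store. The real package backs this with
# a vector index; naive keyword matching plays that role in the sketch.
class MemoryStore:
    def __init__(self):
        self.records = []  # one store, many record types

    def add_memory(self, scope, text):
        self.records.append({"scope": scope, "text": text})

    def search(self, scope, query):
        words = query.lower().split()
        return [r["text"] for r in self.records
                if r["scope"] == scope
                and any(w in r["text"].lower() for w in words)]

def agent_turn(store, user_id, message, llm):
    # The four-step memory-aware loop:
    context = store.search(user_id, message)            # 1. recall
    reply = llm(message, context)                       # 2. respond with context
    store.add_memory(user_id, f"user said: {message}")  # 3. write back
    return reply                                        # 4. return

store = MemoryStore()
store.add_memory("u1", "User prefers Python examples")
echo_llm = lambda msg, ctx: f"({len(ctx)} memories) ack: {msg}"
print(agent_turn(store, "u1", "python question", echo_llm))  # → (1 memories) ack: python question
```

The upgrade from "a list of dicts" is entirely in steps 1 and 3: retrieval is scoped and ranked, and writes go through extraction rather than appending raw transcripts.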
#13
@hqmank
https://x.com/hqmank/status/2052380581238095948
Reviewing @VukRosic99's "Build Claude Code From Scratch" 19-chapter tutorial: agent loop, tools, TodoWrite, subagents, skills, context compaction, permissions, hooks, memory, background tasks, cron, multi-agent teams, MCP. The mental model that made things click: "an agent is just a loop. Adding a tool means adding one handler. The loop never changes." The kind of conceptual clarity that lets you read Claude Code's source without drowning, especially if you've been treating it as a black box.
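The tutorial's mental model fits in a screenful. This sketch is generic, not code from the tutorial: a scripted stand-in model drives one tool call through a single dispatch table, and adding a tool really is just adding one entry.

```python
# "An agent is just a loop. Adding a tool means adding one handler."
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def agent_loop(model, task, max_turns=5):
    history = [task]
    for _ in range(max_turns):
        action = model(history)      # model returns a tool call or a final answer
        if action["type"] == "final":
            return action["text"]
        result = TOOLS[action["tool"]](*action["args"])  # one dispatch, any tool
        history.append(f"{action['tool']} -> {result}")
    return "max turns reached"

# Scripted stand-in for an LLM: call a tool once, then answer from history.
def scripted_model(history):
    if len(history) == 1:
        return {"type": "tool", "tool": "add", "args": [2, 3]}
    return {"type": "final", "text": history[-1]}

print(agent_loop(scripted_model, "what is 2+3?"))  # → add -> 5
```

Everything the tutorial layers on top (subagents, compaction, hooks) wraps this loop; none of it changes the loop body itself.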
#14
@Anushkaa1407
https://x.com/Anushkaa1407/status/2052295623869931533
Kuron is running the Claude Code API agent loop inside an outbound sales product. The reasoning is structurally consistent: if Claude Code can handle runtime errors, merge conflicts, and breaking logic, then "the right message to the right person at the right time" is easier. The missing layer was GTM knowledge: they partnered with 40+ GTM experts and pulled actual campaign data (which segments converted, which copy angles drove replies, which decisions moved pipeline) into 40 unique proprietary SKILL.md files. Coding-agent intelligence applied to outbound is one of the more obvious autoresearch transfers nobody had productionized yet.
#15
@sainathgupta
https://x.com/sainathgupta/status/2052337311342301242
deepclaude crossed 1.6k stars in 4 days. Keeps Claude Code's autonomous agent loop intact but routes calls to DeepSeek V4 Pro, OpenRouter, or any Anthropic-compatible backend. Same UX, claimed 17x cheaper. The interesting structural point isn't the cost; it's that the harness is now decoupled enough from Anthropic that swapping the brain only changes the bill, not the workflow. CFO-grade leverage on the same agent infrastructure devs already trust.
#16
@mattpocockuk
https://x.com/mattpocockuk/status/2052309023618109936
On the skill bench: a /review skill that doesn't just review code. It checks against the original spec, checks against coding standards, proposes changes to the code (obviously), and proposes changes to the agent loop that created the code. The second-order move is the meaningful one: most code reviews audit the patch; this one audits how the patch got produced. Upstream failures (wrong file picked, no checkpoint before a risky edit, no assumption verification) produce most bad code, and reviewing the loop itself is where they get caught.
#17
@v_shakthi
https://x.com/v_shakthi/status/2052247326618739193
Compact summary of Anthropic's Claude Managed Agents upgrade. Dreaming (research preview) reviews the agent's past sessions, extracts patterns, and builds lasting memories so performance compounds over time. Outcomes (public beta) lets you define a rubric; a separate grader evaluates results and the agent iterates until it meets your standard (pair it with webhooks for completion notifications). Multiagent orchestration (live) lets a lead agent delegate subtasks to specialists running in parallel. Single prompts become reliable, self-improving workflows, and quality levels become user-controlled.
#18
@MinLiBuilds
https://x.com/MinLiBuilds/status/2052188818137330043
Cleanest practitioner read of the Anthropic announcements: Dreaming organizes memory, Outcomes is essentially Codex's /goal pattern (autoresearch engineered into a recurring task loop), and Multiagent lets the lead agent split complex jobs across specialists. The plaintive note at the end ("I just hand-rolled my own /goal for Claude Code and the official one immediately sniped me") is exactly what every skill-building power user is feeling this week.
#19
@AuroraMar1eL
https://x.com/AuroraMar1eL/status/2052337997207794074
A CLAUDE.md template distilled from Boris Cherny's public threads, packaging Anthropic's internal Claude Code workflows into a structured file you drop into any project. The four bundled patterns: subagent orchestration, verification gates before anything ships, autonomous bug-fix loops, and self-improving rules (every time you correct Claude, the rule is locked in for future sessions). The "self-improving rules" piece is the most quietly load-bearing one: it's the difference between using Claude as a stateless contractor every session and growing it into something that actually onboards.
#20
@JulianGoldieSEO
https://x.com/JulianGoldieSEO/status/2052458675894386816
Hermes 0.2.0 ships an autonomous curator that cleans up the agent's own skill library: review old skills, remove stale ones, merge duplicates, track usage, improve workflows. The frame the post puts on it ("most AI tools wait for developers to update them, Hermes learns while it runs") is the same thing Anthropic's Dreaming is doing, just from the open-source side and with longer-term skill hygiene as the target rather than short-term memory.
#21
@Kaylee_AI_
https://x.com/Kaylee_AI_/status/2052466794724552815
Browser-harness landed in Hermes Agent: self-improving CDP, cloud browsers, full in-browser freedom from one prompt. Pair it with v0.12's Curator (auto-grades and prunes skills every 7 days) and you've got an agent that improves itself without a developer touching it. Two pieces moving in lockstep, a fresh capability surface (the browser) and an automatic skill-pruning loop, are what make the "self-improving" claim actually mean something instead of marketing copy.
#22
@BTCxiaoyu1
https://x.com/BTCxiaoyu1/status/2052228967286108538
Concrete agent-loop bug from a personal cron setup: the agent kept hallucinating "2 hours ago" for KOL posts that were actually 24 hours old, because it was making up time perception instead of reading created_at. Fix: make the first step of every cron print the system date with `date`, and force the tool to actually fetch timestamps. The companion thread (a different post, same author) makes the same point about reward signals: the same reply to the same KOL gets a like one time and zero feedback the next, nothing like the clean PPO CartPole environment, so agentic RL needs much messier reward design than benchmark posts suggest.
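The fix generalizes to any relative-time claim: compute the age from the fetched created_at field instead of letting the model guess. The field name follows the X API convention; the helper itself is a sketch.

```python
from datetime import datetime, timezone

# Ground "how old is this post?" in the tool output's created_at field.
def age_hours(created_at_iso, now=None):
    created = datetime.fromisoformat(created_at_iso)
    now = now or datetime.now(timezone.utc)
    return (now - created).total_seconds() / 3600

# A 24-hour-old post, evaluated against a pinned "now" for reproducibility.
ref = datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
print(age_hours("2026-05-08T12:00:00+00:00", now=ref))  # → 24.0
```

In a cron, the agent would call this on every fetched timestamp and only ever describe ages it computed, never ages it inferred.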
#23
@galvani78
https://x.com/galvani78/status/2052388711220797865
Fork of the opencode + Claude Code CLI plugin that fixes silent streaming, an agent loop that kept firing after answers, a listener leak that built up after about 11 turns, and adds configurable WebSearch routing. The kind of small in-the-trenches plugin work that's going to matter more as more people run long-running agent loops on their own machines and run into the same sharp edges.
Eco Products Radar
Claude Managed Agents (Dreaming + Outcomes + Multiagent): Anthropic's official unified bundle for self-improving workflows, mentioned across @v_shakthi, @MinLiBuilds, @glenngabe, @Abdu_F_H, @VibeCoderOfek, @drive_dare. Dreaming is research preview, Outcomes is public beta, Multiagent is live.
Cursor SDK / /orchestrate: Cursor's recursive subagent skill plus the SDK that exports it to your CI/CD. Production-validated 20% token cut and 80% cold-start reduction in @cursor_ai's own benchmarks.
deepclaude: open-source layer that keeps Claude Code's agent loop and routes the brain to DeepSeek V4 Pro / OpenRouter / any Anthropic-compatible backend, claimed 17x cheaper. 1.6k stars in 4 days. Mentioned by @sainathgupta, @Ming_LLM (multiple), and others.
Hermes Agent / Hermes Curator: open-source self-improving agent harness; v0.12 ships a Curator that auto-grades and prunes skills every 7 days. Browser-harness landed the same week. Mentioned by @JulianGoldieSEO, @Kaylee_AI_, @vijayhaha.
Prime Intellect Lab: out of beta; RL training for self-improving personal agents, 1B-400B model support, async multi-tenant LoRA, pay-as-you-go, 10,000+ beta jobs already live. Mentioned by @PrimeIntellect, @TeksEdge, @radioalisadvdsn.
EvoSkill: Apache 2.0 evolutionary agent loop with failure-driven skill generation (proposer/generator/evaluator/frontier architecture), top-N skills tracked as git branches.
pantheon-os: multi-agent genomics framework with a marketplace, MAP-Elites evolutionary search, end-to-end paper generation from samples, and a privacy-preserving local-server design.
Karpathy's autoresearch frame: referenced everywhere as the conceptual ancestor of the day's wins (@ypwang61, @AndrewK404, @ShumwayJack, @glenngabe, @csinva, @cmgriffing, @nurijanian, @kate_doai, @chenzeling4's curated list at 1764 stars).
Oracle AI Agent Memory: four-primitive multi-tenant memory store with a vector index, automatic extraction, and a 4-step agent loop, just hit GA.