May 5, 2026

Loop Daily: 2026-05-06

The single sharpest signal in today's loop discourse is that "auto-research is real" stopped being a slogan and started being a reproducible workflow. Three different builders posted the same shape today: set up a benchmark, an agent loop, and a reward function; walk away; come back to a measurable improvement on a stable codebase. Karpathy's autoresearch primitive is now showing up wired into Solana DeFi strategy discovery, vLLM kernel parameters, code refactor optimizers, and litigation evidence search. The other half of today's traffic is the meta-question behind Anthropic cofounder Jack Clark's ~60% odds on recursive self-improvement by the end of 2028: when the agent gets to write the next agent, what entity is actually doing the self-improving? Below is what users actually built and shipped this cycle.
💡#1
@aijoey
https://x.com/aijoey/status/2051243477606801900
The cleanest "leave it running overnight" autoresearch case of the day. The user pointed Claude Code at vLLM auto-tuning for Qwen3.6-35B-A3B on a DGX Spark, fed it a benchmark script, and said "loop forever." Seven runs in: composite score +18%. The biggest single win was counterintuitive: turning NUM_SPECULATIVE_TOKENS down from 15 to 1, because on Spark's SM_121 with FP4 falling back to Marlin FP8, drafter compute isn't worth the overhead. Per loop: edit config, commit, restart container, benchmark, keep or revert, ~7 min/cycle. Single-stream throughput went from 19.9 to 29.7 tok/s, and batched throughput at concurrency 16 from 74.9 to 201.3 tok/s.
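The per-cycle mechanics reduce to a keep-or-revert hill climb. A minimal sketch under stated assumptions: `benchmark` and `propose` below are toy stand-ins (the real loop runs the vLLM benchmark script and lets the agent choose the edit), and the scorer's shape is invented for illustration.

```python
import random

def benchmark(config):
    # Hypothetical stand-in for the real vLLM benchmark script.
    # This toy scorer happens to peak when speculative decoding is
    # effectively disabled, mirroring the post's surprise result.
    return 100 - abs(config["num_speculative_tokens"] - 1) * 3

def propose(config):
    # Hypothetical mutation step: try a different parameter value.
    new = dict(config)
    new["num_speculative_tokens"] = random.choice([1, 3, 7, 15])
    return new

def tune(config, cycles=7):
    best = benchmark(config)
    for _ in range(cycles):
        candidate = propose(config)
        score = benchmark(candidate)
        if score > best:              # keep: commit the config change
            config, best = candidate, score
        # else revert: drop the candidate and try something else
    return config, best
```

In the real run each kept change is also committed and the serving container restarted, which is why ~7 minutes per cycle was the floor.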
💡#2
@CoralOS_ai
https://x.com/CoralOS_ai/status/2051315917162975589
A real autoresearch loop applied to DeFi: a discovery agent on CoralOS running experiments over historical pool data on Meteora's DLMM, writing a strategy function optimized for volatility capture. The point isn't the demo; it's that the autoresearch loop pattern now scales horizontally to any domain with measurable returns and editable strategy code. Once it stabilizes, the agent gets swapped in for the vanilla strategy agent in their LP app.
💡#3
@dair_ai
https://x.com/dair_ai/status/2051311905353142328
Meta FAIR shipped Autodata, an agentic data scientist that builds high-quality training and evaluation data autonomously. Headline result: on a CS research QA task, an Agentic Self-Instruct loop produces a 34-point gap between weak and strong solvers (43.7% vs 77.8%); standard CoT Self-Instruct on the same setup produces a 1.9-point gap. The agent generates questions that actually discriminate between models. The system also meta-optimizes itself: an outer loop tunes the agent's instructions based on which harness changes lift validation pass rate. Over 126 accepted iterations, validation pass rate climbed from 12.8% to 42.4%.
💡#4
@aakashgupta
https://x.com/aakashgupta/status/2051330692567777777
A self-improving PRD reviewer in production: takes a PRD, runs the PM's actual checklist (urgency, differentiation from ChatGPT wrappers, AI failure modes, attribution risks), drops comments inside the doc. The compounding part is the second agent that runs every 30 minutes, reads the human's edits to the AI's comments, and writes them to a learner.md. When the same correction shows up five days running, it emails a proposed checklist update. Approve once, the next review is permanently better. Most reviewers are static; this one compounds without anyone editing the prompt.
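The compounding mechanism is easy to state: a correction only graduates into the checklist after it recurs for several consecutive days. A sketch of that thresholding logic, with hypothetical names throughout (the post only names learner.md; the class and fields here are invented for illustration):

```python
from collections import defaultdict

class Learner:
    """Track consecutive days each human correction recurs; signal a
    proposed checklist update once a streak reaches the threshold."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.streaks = defaultdict(int)

    def record(self, correction, seen_today):
        if seen_today:
            self.streaks[correction] += 1
        else:
            self.streaks[correction] = 0  # streak broken, start over
        return self.streaks[correction] >= self.threshold

learner = Learner()
# the same correction shows up five days running
results = [learner.record("flag attribution risk earlier", True)
           for _ in range(5)]
```

The one-time human approval then bakes the correction into the checklist, so every later review inherits it.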
💡#5
@aakashgupta
https://x.com/aakashgupta/status/2051346262889554035
Hermes is the first agent loop that closes the procedural feedback loop in code. Every 15 tool calls it reads what worked in the session and rewrites the local skill file. The poster's competitive briefing went from 20 min in week 1 to 8 min by week 6 without any manual edits; the agent rewrote its own procedure four times. The framing that lands: every other AI tool you own is frozen at the version of you who set it up. Custom GPTs and Claude Projects inherit nothing from the sessions you ran. The model is rented; skill files are owned.
💡#6
@JackWoth98
https://x.com/JackWoth98/status/2051378691876237550
Gemini CLI now combs through past session data and auto-suggests new agent skills based on patterns in what you do frequently. Enable Auto Memory in /settings to try it. This is the "self-improving" loop made native: instead of a third-party plugin, the harness vendor itself ships the procedure-extraction step. The race between Anthropic's Skill Creator, Gemini Auto Memory, and Hermes' self-rewriting skill files is now fully on.
💡#7
@cgc1010
https://x.com/cgc1010/status/2051278186533528035
A grounded self-improving agent case: USER.md (Hermes' memory profile of you) was at 80% capacity, with security rules, project rules, and "use lots of emojis" all colliding. The user separated the layers: USER.md for personality and interaction style only, security rules to CORE MEMORY, system rules to POLICY, project knowledge to the LLM Wiki. Memory usage dropped from 80% to 43%, replies got sharper, and prioritization was fixed. Worth reading because it shows the texture of running a self-improving agent for a month: the maintenance work is actually the work.
💡#8
@samuel_ferrero
https://x.com/samuel_ferrero/status/2051340585072574867
The clearest one-line definition of the new autoresearch workflow, translated from Spanish: "Configure the agent and go to sleep 8 hours. Wake up to 100 experiments completed, logs of each one, and the model is already better than when you closed the laptop. The agent even documented the reasoning of each change." This is what "the computer works while you sleep" actually looks like, and the people running it are no longer just lab researchers.
💡#9
@danielblignaut
https://x.com/danielblignaut/status/2051401166429343790
Shipped a harness using OpenAI Agent SDK 2.0 + Karpathy's auto-research idea. You clone your repo, give a hypothesis or narrow goal, define a reward function with quantitative and qualitative metrics, and let Codex self-iterate until the goal is met or max-failed-iterations is reached. The pattern that's emerging: reward function + sandbox + restart loop = generic optimizer for any code-shaped problem.
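That "reward function + sandbox + restart loop" shape is small enough to sketch. The outline below is a hedged illustration, not the poster's harness: `step` is a toy stand-in for the Codex iteration, `reward` for the user-defined metrics, and `max_failed` for the max-failed-iterations cap (all names invented here).

```python
def autoresearch(step, reward, goal, max_failed=3):
    """Generic optimizer sketch: let an agent mutate state, keep
    mutations that raise the reward, and stop at the goal or after
    max_failed consecutive non-improving iterations."""
    state = {"x": 0}
    best = reward(state)
    failed = 0
    while best < goal and failed < max_failed:
        candidate = step(state)
        score = reward(candidate)
        if score > best:
            state, best, failed = candidate, score, 0
        else:
            failed += 1  # revert: discard the candidate
    return state, best

# toy demo: the "agent" increments x, the reward is x itself
state, best = autoresearch(lambda s: {"x": s["x"] + 1},
                           lambda s: s["x"], goal=5)
```

Swap the toy `step` for "run the coding agent against the cloned repo in a sandbox" and this is the whole pattern.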
💡#10
@srinitude
https://x.com/srinitude/status/2051384361023398095
pi-until-done shipped to npm, the user's first Pi extension. /until-done <intent> turns Pi into its own judge in a Ralph loop, riffing on /goal from Hermes. The interesting bit is the architectural pattern: the agent re-injects the user's prompt at the end of every turn until the agent itself emits an "I'm done here" tool call. Termination by self-declared completion, not by token budget.
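The re-inject-until-self-declared-done pattern fits in a few lines. A sketch under assumptions: `agent` is a toy callable standing in for a real Pi turn, and its boolean return stands in for the "I'm done here" tool call (the extension's actual internals are not shown in the post).

```python
def until_done(agent, intent, max_turns=50):
    """Re-inject the original intent after every turn until the agent
    declares completion; max_turns is only a safety valve."""
    transcript = []
    for _ in range(max_turns):
        reply, done = agent(intent, transcript)
        transcript.append(reply)
        if done:
            break  # termination by self-declared completion
    return transcript

# toy agent that declares completion on its third turn
def toy_agent(intent, transcript):
    turn = len(transcript) + 1
    return f"turn {turn}", turn >= 3

log = until_done(toy_agent, "refactor the parser")
```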
💡#11
@relizarov
https://x.com/relizarov/status/2051200915621794225
A real production optimization with autoresearch loops: CaseDash redraw time pushed from 20ms to 2ms and executable size from over 2 MB to under 1 MB, both via auto-research-style loops. The user sets the goals and constrains the agent from deviating. This is autoresearch applied to its most natural domain: code optimization, where the metric is unambiguous and the keep/revert decision is automatic.
💡#12
@miroburn
https://x.com/miroburn/status/2051394995655971218
Goal mode in Codex and Ralph Loop in Claude Code, both running long: tuning Lab Club's matching algorithm to 85%+ acceptance, expected to take several days. He also notes Goal and Ralph Loop are great at finding bugs because the agent goes deeper than any human audit. The hard new problem he names: managing parallel optimization across active business systems. "Agent says go pause Meta ads while I optimize," but you can't pause without losing data. Hundreds of agents running 24/7, and the human's job becomes traffic control.
💡#13
@MAXIMISEART
https://x.com/MAXIMISEART/status/2051404362501484859
A practical multi-agent orchestrator pattern: Ralph runs at the end of an idea → research → prototype → PRD → Kanban pipeline. It scans GitHub issues with a `ready-for-agent` label, spawns 1-4 subagents in parallel git worktrees, each running a Ralph loop with red-green-refactor cycles. Order is enforced by the Kanban graph. This is what the "AFK execution" pattern looks like once it's been productionized: the issue tracker becomes the queue.
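The dispatch step of this pattern is mostly queue logic. A sketch under assumed issue fields (`blocked_by` here encodes the Kanban ordering; the field names are hypothetical, and the real orchestrator would additionally create a git worktree per selected issue):

```python
def dispatch(issues, max_agents=4):
    """Pick open issues labeled ready-for-agent whose Kanban
    dependencies are done, capped at max_agents parallel slots."""
    done = {i["id"] for i in issues if i["state"] == "done"}
    ready = [i for i in issues
             if "ready-for-agent" in i["labels"]
             and i["state"] == "open"
             and all(dep in done for dep in i.get("blocked_by", []))]
    return [i["id"] for i in ready[:max_agents]]

# toy board: 2 is unblocked, 3 waits on an unfinished issue,
# 4 was never labeled for agents
issues = [
    {"id": 1, "labels": ["ready-for-agent"], "state": "done"},
    {"id": 2, "labels": ["ready-for-agent"], "state": "open",
     "blocked_by": [1]},
    {"id": 3, "labels": ["ready-for-agent"], "state": "open",
     "blocked_by": [9]},
    {"id": 4, "labels": [], "state": "open"},
]
batch = dispatch(issues)
```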
💡#14
@PreyWebthree
https://x.com/PreyWebthree/status/2051372081112289501
Sentient released EvoSkill V1, an open-source toolkit that takes a benchmark plus a coding agent and within minutes evolves it into a specialist. Reported deltas with Anthropic's Claude Code: OfficeQA 60.6% → 68.1%, SealQA 26.6% → 38.7%. A skill evolved on SealQA transferred zero-shot to BrowseComp with additional gains. The pattern: evaluate, analyze failures, generate new prompts and skills from the failure traces, iterate until convergence. EvoSkill builds new skills from scratch rather than just refining existing ones.
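The evaluate-analyze-regenerate loop generalizes beyond these benchmarks. A minimal sketch with a toy harness; in the real toolkit `evaluate` would run the coding agent against the benchmark and `rewrite` would prompt a model with the failure traces (both are invented stand-ins here):

```python
def evolve_skill(evaluate, rewrite, skill, rounds=5):
    """Score the skill, feed its failure traces to a rewriter, keep
    the rewrite only if it scores higher, and stop at convergence."""
    best, failures = evaluate(skill)
    for _ in range(rounds):
        candidate = rewrite(skill, failures)
        score, new_failures = evaluate(candidate)
        if score <= best:
            break  # converged: rewriting stopped helping
        skill, best, failures = candidate, score, new_failures
        if not failures:
            break  # nothing left to fix
    return skill, best

# toy harness: a "skill" passes a case if it mentions it,
# and each rewrite fixes the first recorded failure
def evaluate(skill):
    failures = [c for c in "abc" if c not in skill]
    return 3 - len(failures), failures

skill, score = evolve_skill(evaluate, lambda s, f: s + f[0], "")
```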
💡#15
@kamathhrishi
https://x.com/kamathhrishi/status/2051127491365122085
The minimalist counter-take. The user deleted 50k lines of code, the entire vector DB, and millions of embeddings from his most-starred GitHub repo (RAG on public market filings). Turned out the agent worked better with documents in a directory, plus grep and ls. Two reasons: small cheap models can drive a terminal well now, and even small models understand SEC filing structure out of the box. The "agent harness + plain files" wins the simplicity battle for retrieval.
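The "plain files plus grep" retrieval path is short enough to show end to end. A sketch with a two-file toy corpus standing in for a directory of filings (the real repo points the agent's terminal at SEC documents; nothing here is that repo's code):

```python
import pathlib
import subprocess
import tempfile

def grep_retrieve(pattern, root):
    """Retrieval without embeddings: shell out to grep.
    -r recurse, -i case-insensitive, -l list matching file paths."""
    out = subprocess.run(["grep", "-ril", pattern, str(root)],
                         capture_output=True, text=True)
    return sorted(out.stdout.splitlines())

# toy corpus standing in for a directory of filings
root = pathlib.Path(tempfile.mkdtemp())
(root / "10k.txt").write_text("Revenue grew 12% year over year.")
(root / "8k.txt").write_text("The board approved a dividend.")
hits = grep_retrieve("revenue", root)
```

The agent then reads the matching files directly, which is the whole "harness + plain files" retrieval stack.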
💡#16
@PaulinaStern_ via @SentientEco
https://x.com/SentientEco/status/2051285718664986879
A different shape of the same "self-improving" idea: skip the multi-agent complexity entirely. Build a single self-improvement loop that generates a highly structured prompt defining how the LLM should behave, and match frontier-lab accuracy on a single agent while staying cost-efficient. This is the contrarian thread: the loop doesn't need to be multi-agent to be self-improving.
💡#17
@hsu_steve
https://x.com/hsu_steve/status/2051282979297632635
Mario Zechner (Pi agent harness creator) interview signal: Kimi 2.6 is "almost as good as Claude 4.6-7 for coding and agentic flows." For his workflows he no longer needs frontier-only intelligence; open-weights have caught up to the point where he sees regressions in some verticals on the larger closed models. This is now the second cycle of "open weights can drive serious agentic loops" claims with named users behind them.
💡#18
@MrAhmadAwais
https://x.com/MrAhmadAwais/status/2051377695389589935
The deepest harness-engineering thread of the week. Got Kimi K2.6 and DeepSeek V4 Pro running inside a Claude Code-style harness to within 5/10 and 6/10 of Opus 4.7 on internal evals. Four fixes did the work, none touching the model: prefix-cache pinning via session-id forwarding (TTFT 6-8s → <1s), canonical model IDs at the request layer, capability-flag negotiation per upstream, and disabling thinking on a single provider that misapplied R1's reasoning-stripping logic to V4. The harness stopped throwing away the model's work between turns. Auto-loops live or die on this kind of plumbing.
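The prefix-cache fix amounts to session affinity: every turn of a conversation must land on the same upstream so its KV prefix cache stays warm. A deliberately simplified stand-in for the idea (the post forwards a session id to the provider; the hash-to-backend routing and field names below are assumptions for illustration):

```python
import hashlib

def route(request, backends):
    """Pin a session to one backend by hashing its id, so repeated
    turns reuse the same server's warm prefix cache instead of
    paying full TTFT on a cold one."""
    sid = request["session_id"]
    idx = int(hashlib.sha256(sid.encode()).hexdigest(), 16) % len(backends)
    return backends[idx]

backends = ["gpu-0", "gpu-1", "gpu-2"]
first = route({"session_id": "sess-42"}, backends)
again = route({"session_id": "sess-42"}, backends)  # same backend
```

The design point is determinism: any stable mapping works, but losing it between turns is exactly "throwing away the model's work."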
💡#19
@AINativeLang
https://x.com/AINativeLang/status/2051127789181382765
$870 in total AI spend versus $3,000+ for the same output on traditional agent loops. AINativeLang compiles the orchestration layer: the model reasons, the graph executes, and coordination cost goes to zero. 7 weeks, 138 posts, 8 production jobs, 71% cheaper. The relevant idea here isn't "graph executor" specifically; it's that compiled orchestration changes the loop economics. Loop-running stops being a luxury when the coordination tax disappears.
💡#20
@gorkulus
https://x.com/gorkulus/status/2051225000607387715
Hermes wired into indx (a local media manager) for creative research loops. The agent uses indx through CLI/API/skills/MCP to organize files, annotate, run experiments, store embeddings, and turn a media library into a lab. ComfyUI outputs come back into indx with workflow metadata; ratings/tags written in indx flow to webhooks the agent watches; embeddings drive latent-space exploration of 586 found-sound clips chopped into 10,192 searchable slices. The pattern: local file substrate + agent-operable interface = a reusable creative loop.
💡#21
@techedgedaily
https://x.com/techedgedaily/status/2051270840503963792
LangChain pulled their coding agent from outside the Top 30 to the Top 5 on benchmarks without changing the model once, a 13.7-point jump from scaffolding alone. The argument: models commoditize, harnesses compound. Every harness fix is permanent and applies to every future run with every future model. Model releases reset the playing field; harness investment never resets. Claude Code's leaked 513,000 TypeScript lines were almost entirely harness, not model invocation.
💡#22
@sanlsrni
https://x.com/sanlsrni/status/2051413280933949887
The best meta-take on autoresearch this week: "autoresearch has legible reward functions; SDK/harnesses don't, especially because a lot of the pain in harness engineering is in catching edge cases." His proposed core loop is reverse-shaped: an external proposer model analyzes task failures, modifies the harness in a sandbox with strict controls on reasoning-trace leakage to prevent overfitting. This is the version of "auto-research applied to harness design itself" that hasn't been built yet, and it's the obvious next direction.
💡#23
@warpdotdev via @sarahzorah
https://x.com/sarahzorah/status/2051391333349437636
Warp's livestream with the Anthropic Applied AI team demoing how Warp builds self-improving agents on Claude. The relevant signal here is that "self-improving agent" has gone from a research term to a vendor demo category in under a quarter. Harness vendors are now competing on which loop primitive they ship β€” Skill Creator, Auto Memory, Ralph, Goal, Hermes self-rewrite.
💡#24
@0xSammy
https://x.com/0xSammy/status/2051366938631164253
Anthropic cofounder Jack Clark: ~60% odds on fully recursive self-improving AI systems by the end of 2028. Roughly a two-and-a-half-year horizon to recursion. Whatever the actual number, the bet is now publicly posted by someone with internal visibility, and that changes the planning conversation across labs. The frame this clarifies: today's autoresearch loops are the open-air rehearsal of what recursive self-improvement looks like at the system level.
💡#25
@Skoorbkaz
https://x.com/Skoorbkaz/status/2051319020633158054
The identity layer is a quiet but important part of the self-improvement question. RSI isn't just a coding problem; it's a question of what kind of entity is doing the self-improving. Anthropic, in his read, is the only lab taking the identity part seriously. Worth flagging because almost no public discussion of autoresearch loops touches this; everyone benchmarks the metric, no one names the agent.
📡 Eco Products Radar

Claude Code – still the default scaffold for autoresearch and agentic loops on closed models, especially with /goal-style modes and Ralph Loop running.

Hermes Agent (Nous Research) – dominant in the self-improving consumer agent niche this cycle. Self-rewriting skill files, USER.md, overnight Mnemosyne consolidation, Telegram/Discord 24/7.

Pi (Mario Zechner / @badlogicgames) – the open-weights-friendly harness for agentic loops. /goal mode, /until-done, Ralph loop with an exit strategy. Cited specifically as where Kimi 2.6 catches up to Claude 4.6-7.

Codex / GPT-5.5 – paired with Goal mode for hours-long autonomous runs. Several builders moved their primary daily driver to it this cycle.

DeepSeek V4 Pro – the cost-curve crusher. DeepClaude points Claude Code's loop at DeepSeek for ~17x cheaper agent loops; cache-hit pricing makes loop primitives effectively free.

Karpathy autoresearch – now the canonical reference primitive for "agent runs experiments, edits config, restarts, benchmarks." Showing up wired into vLLM tuning, DeFi strategy discovery, and code optimization.

EvoSkill V1 (Sentient) – open-source self-improvement loop that evolves skill files from failure traces; reported real benchmark gains on Claude Code-driven OfficeQA and SealQA.

Gemini CLI Auto Memory – Google's first explicit "skills from past sessions" feature; the auto-memory equivalent of what Hermes does manually.

Warp + Claude – the "self-improving agent" vendor demo category is now live; this is going to be a sales bullet point on every harness vendor's deck inside 90 days.

Agent harness (as category) – the meta-product. LangChain proved a 13.7-point benchmark jump on the same model just by changing the harness. Whoever ships the open-source harness that wins on prefix caching, cost, and self-rewriting skill files is going to define the next 18 months.