Loop Daily: 2026-05-08
May 8 was the day "loop" stopped being a Twitter buzzword and became actual production architecture. On the autoresearch side, two crypto-native teams shipped on-chain markets where miners run autoresearch loops in TEEs and earn rewards for verified benchmark improvements. On the agentic-loop side, Anthropic shipped Dreaming + Outcomes + Multi-Agent + Webhooks as first-party features, which is exactly the stack that power users have been wiring by hand for months. Underneath, the more interesting signal is structural: people are reporting back from real loops with hard numbers, including 11% training speedups in 2 days, agent runtimes that canonicalize model identity at the boundary, context compaction layered between turns, and autoresearch agents that improve their own landing pages while you sleep. Below are the cases worth copying.
#1
@dair_ai
https://x.com/dair_ai/status/2052125514266190286
Microsoft Research's Agentic-imodels paper is the cleanest autoresearch demo of the day. A coding agent (Claude Code or Codex) iteratively evolves scikit-learn-compatible regressors that are simultaneously accurate and readable by other LLMs. Interpretability is measured by whether a small LLM can simulate the model's behavior just from its `__str__` output. Across 65 tabular datasets the discovered models push past every classical interpretable baseline (decision trees, GAMs, sparse linear) and improve four downstream agentic data-science systems on BLADE by 8%-73%. This is autoresearch turned into a tool-design methodology rather than a model search.
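The search target can be pictured with a hypothetical sketch (not the paper's actual code): a scikit-learn-style regressor whose `__str__` output is the whole interpretability interface, so a small LLM could simulate its predictions from the printed rules alone.

```python
# Hypothetical sketch of the kind of model Agentic-imodels searches for:
# a scikit-learn-style regressor whose __str__ output is enough for
# another LLM (or a human) to simulate its predictions.
class ThresholdRuleRegressor:
    """Piecewise-constant rules on one feature; readable by construction."""

    def __init__(self, feature=0, n_bins=4):
        self.feature = feature
        self.n_bins = n_bins

    def fit(self, X, y):
        col = sorted((row[self.feature], target) for row, target in zip(X, y))
        size = max(1, len(col) // self.n_bins)
        self.rules_ = []  # list of (upper_threshold, mean_prediction)
        for i in range(0, len(col), size):
            chunk = col[i:i + size]
            self.rules_.append((chunk[-1][0],
                                sum(t for _, t in chunk) / len(chunk)))
        return self

    def predict(self, X):
        preds = []
        for row in X:
            x = row[self.feature]
            # first rule whose threshold covers x; last rule is the fallback
            preds.append(next((m for u, m in self.rules_ if x <= u),
                              self.rules_[-1][1]))
        return preds

    def __str__(self):
        return "\n".join(f"if x[{self.feature}] <= {u:.3g}: predict {m:.3g}"
                         for u, m in self.rules_)
```

The real paper evolves much richer model families; the point of the sketch is only that "interpretable" is operationalized as "simulable from the string form."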
#2
@sheetalojha4
https://x.com/sheetalojha4/status/2051990094295552305
The "AutoResearch on a single machine found an 11% training speedup in 2 days" line is the one to remember. The team built infrastructure to run that loop across thousands of untrusted nodes with cryptographic guarantees, which is the actual contribution. Open Research turns any GitHub project into an AutoResearch benchmark with on-chain rewards, miners compete to improve it using coding agents, TEEs verify code + results before the on-chain reward settles. Closed-loop scientific discovery without humans in the loop.
#3
@techtusharojha
https://x.com/techtusharojha/status/2052012521280979183
Same Open Research project from another co-founder with sharper framing: AI agents race to beat your benchmarks on-chain, miners run AutoResearch loops in Docker sandboxes, winning commits get re-executed inside Intel TDX or AMD SEV TEEs, attestation triggers on-chain reward settlement, no human in the loop. This is "AutoResearch as a sport" with cryptographic verification: the missing piece that makes Karpathy's loop economically actionable.
#4
@alokbishoyi97
https://x.com/alokbishoyi97/status/2051939567125803075
Evo, an open-source autoresearch project from the same builder community: parallel and tree/graph search, configurable node selection (GEPA, eps-greedy, and others), support for remote containers (Modal, e2b, Daytona, AWS, Azure, or your own box), and it runs on Claude Code or Codex. The point of evo is the same as the Open Research bet: autoresearch is becoming a productized search-strategy harness, not a one-prompt template. If you're trying autoresearch and getting noise, your node-selection strategy is probably the bug.
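Evo's internals aren't shown in the thread, but a minimal sketch makes "configurable node selection" concrete: an eps-greedy selector usually expands the best-scoring search node and occasionally a random one so the search keeps exploring.

```python
import random

def eps_greedy_select(nodes, scores, eps=0.2, rng=random):
    """Pick the next search node to expand: with probability eps explore a
    random node, otherwise exploit the best-scoring one."""
    if rng.random() < eps:
        return rng.choice(nodes)
    return max(nodes, key=lambda n: scores[n])
```

Swapping this function for a different policy (UCB, GEPA-style, pure greedy) is exactly the kind of knob a productized harness exposes.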
#5
@Raz_Ciuca
https://x.com/Raz_Ciuca/status/2051996077167894813
The lesson nobody's saying out loud: AlphaEvolve and autoresearch don't work because "you tried enough things bro". If that were true, dumb zeroth-order optimisers like CEM would dominate. The real lesson is that you need incredibly strong priors for your search space, and LLMs give exactly that. This is the most useful autoresearch take of the day because it tells you what to optimize: prior quality, not iteration count.
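For readers who haven't met CEM: the cross-entropy method samples a population, keeps the elites, refits the sampling distribution, and repeats. A toy 1-D version shows how prior-free it is; the argument above is that this style of search does not dominate rich program spaces, which is why LLM priors matter.

```python
import random, statistics

def cem_minimize(f, mu=0.0, sigma=5.0, pop=50, elite=10, iters=30, seed=0):
    """Cross-entropy method: sample a population, keep the elite fraction,
    refit the Gaussian to the elites, repeat. No gradients, no priors."""
    rng = random.Random(seed)
    for _ in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(pop)]
        elites = sorted(xs, key=f)[:elite]
        mu = statistics.mean(elites)
        sigma = statistics.stdev(elites) + 1e-9  # keep a little exploration
    return mu

# On a smooth 1-D bowl it converges easily; on a space of programs, the
# "sampling distribution" has no comparable structure to refit.
best = cem_minimize(lambda x: (x - 3.0) ** 2)
```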
#6
@kwuwon
https://x.com/kwuwon/status/2051991714915860615
A real overnight-loop case: kwuwon is using `gpt-image-2` inside Codex with autoresearch to tune prompts while he sleeps. This is the smallest possible useful autoresearch loop: image-gen prompt search overnight, eval the next morning, keep what worked. The asset is the eval, not the model.
#7
@rokbenko
https://x.com/rokbenko/status/2052088827066364396
Open-sourced an autoresearch agent for landing-page CRO. Loop: read the LP code → generate a hypothesis → rewrite the UI → push to GitHub → wait for real visitors → measure impact via PostHog or Plausible → keep the change if it worked or revert. SCOPE.md is the system prompt that tells the LLM what it can and cannot touch (the default is restrictive: copy, button styles, and above-the-fold reordering are allowed; navigation, pricing, auth, state, and the API are off-limits). This is "Karpathy autoresearch but for CRO": same loop, with conversion rate as the eval.
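The keep-or-revert step is where noisy real-visitor data bites. A minimal sketch of one defensible rule (hypothetical; the repo's actual decision logic may differ) is a one-sided two-proportion z-test on conversion counts:

```python
from math import sqrt

def keep_change(conv_a, n_a, conv_b, n_b, z_threshold=1.64):
    """Keep variant b over baseline a only if its conversion lift is
    statistically significant (one-sided two-proportion z-test).
    Hypothetical rule, not the open-source agent's actual code."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled standard error
    if se == 0:
        return False
    return (p_b - p_a) / se > z_threshold  # keep only on a significant lift
```

Without a gate like this, an overnight loop happily "keeps" random fluctuations.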
#8
@thanford7
https://x.com/thanford7/status/2052116203758612635
The clearest articulation of why autoresearch generalizes: "Allowing agents to run in probabilistic loops with an eval harness and an explicit improvement goal is one of the highest-leverage uses for LLMs." Most people treat evals as regression tests, which only catches bugs. Once people realize evals enable auto-improvement, evals stop being annoying overhead and become the moat. This is the mindset shift the next 12 months of agent product design will turn on.
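The quoted pattern fits in a few lines. A toy sketch, with `propose` standing in for a stochastic LLM edit and `evaluate` for the eval harness:

```python
import random

def improvement_loop(artifact, evaluate, propose, iters=100, rng=None):
    """A probabilistic loop with an eval harness and an explicit improvement
    goal: propose a candidate, score it, keep it only if it beats the best."""
    rng = rng or random.Random(0)
    best, best_score = artifact, evaluate(artifact)
    for _ in range(iters):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:            # keep only verified improvements
            best, best_score = candidate, score
    return best, best_score

# Toy instantiation: "improve" a number toward a target.
best, score = improvement_loop(
    0.0,
    evaluate=lambda x: -abs(x - 7.0),
    propose=lambda x, rng: x + rng.gauss(0, 1),
)
```

Everything in this digest is an instance of this skeleton with a different `artifact`, `propose`, and `evaluate`.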
#9
@_shubhankar
https://x.com/_shubhankar/status/2052122661883670580
Autobrowse: autoresearch applied to browsing. They've been exploring recursive self-improvement specifically for browser-using agents and call it "the Mythos moment for browsing". The framing (recursive self-improvement of browsing strategies via autoresearch) is exactly the missing piece for browser agents that currently fall apart on long sessions.
#10
@LukeParkerDev
https://x.com/LukeParkerDev/status/2051859365477650877
"who wants autoresearch in opencode desktop?" 214 likes and 8.6K impressions in a few hours. The signal: autoresearch is now a feature that desktop coding-agent users actively request, not an experimental research artifact. The opencode community is one of the most engaged in the coding-agent ecosystem, and it's pulling autoresearch into the IDE.
#11
@browser_use
https://x.com/browser_use/status/2051826281914978801
Hermes agent gained a browser-harness skill: self-improving browser tools, parallel stealth cloud browsers, full freedom inside the user's browser, all with one prompt. At 1,863 likes and 102K impressions, it's the loudest single autoresearch-adjacent product post of the day. The "self-improving" framing here means the agent's browser tool use evolves with usage, which is the exact pattern Anthropic just shipped as Dreaming on the model side.
#12
@runzhuotao
https://x.com/runzhuotao/status/2052107034699669878
A working self-improving Blender agent on GPT-5.5 making steady progress in procedural mesh modeling. He says results are getting more coherent and reliable with each iteration and he's pushing toward a reusable pipeline. This is the rarest kind of post: a domain-specific self-improving loop in a non-coding, non-text domain (3D mesh generation), where the eval signal is the user's "is this mesh acceptable" judgment.
#13
@hirefortuna
https://x.com/hirefortuna/status/2052137835075940816
The first commercial product to publicly route requests across Anthropic + OpenAI + Google + SpaceXAI + Meta and use Anthropic's new Dreaming + Outcomes + Multi-Agent + Webhooks stack for ecommerce customer service. They call it "the structural unlock for self-improving autonomous agents" and they're already in production. Memory consolidation between sessions, rubric-based outcome grading, orchestrated subagents in parallel: all the pieces autoresearch users have been wiring by hand are now first-party.
#14
@brentdsummers
https://x.com/brentdsummers/status/2052100049077985334
On Anthropic Dreaming: "the first time a major model is shipping persistent, self-improving agents out of the box. Builders no longer have to hack together memory layers or endless prompting loops." The thread complements @hirefortuna by stating the same shift in plain product terms: the autoresearch loop has graduated from a custom builder pattern to a default product capability.
#15
@mudirshin
https://x.com/mudirshin/status/2052060400435249530
Sharper take on the same news: "Self-improving AI agents that learn while they sleep is not a small update… If dreaming works the way it sounds, the gap between Claude and the rest just got wider overnight." Worth flagging because the gap mudirshin describes is exactly the autoresearch overhang: Anthropic productizing what power users were doing manually.
#16
@MagicalTux
https://x.com/MagicalTux/status/2051971851354878441
A specific multi-agent autoresearch pattern in production: specs are written by 4 agents with separated roles and contexts, an overviewing agent enforces rules and documents each agent's work, then yet another agent reads and implements the final specs. This is what "multi-agent orchestration" actually looks like at the implementation level: not "spawn 100 agents" but "4-role spec team + reviewer + implementer with strict context isolation between phases."
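What strict context isolation means in code: each spec writer sees only the task, the reviewer sees all four specs, and the implementer sees only the reviewed output. A hypothetical skeleton, with `call_llm` and the role names standing in for whatever the team actually uses:

```python
def run_spec_pipeline(task, call_llm):
    """Hypothetical skeleton of the pattern above: four spec writers with
    separate contexts, a reviewer across all specs, and an implementer
    that starts from a fresh context containing only the reviewed spec."""
    roles = ["api designer", "data modeler", "ux writer", "test planner"]
    # Each spec agent gets ONLY the task, never the other agents' output.
    specs = {role: call_llm(f"As the {role}, write your part of the spec "
                            f"for: {task}")
             for role in roles}
    # The overviewing agent enforces rules and documents each agent's work.
    review = call_llm("Review these specs for consistency and document "
                      "each agent's work:\n" + "\n\n".join(specs.values()))
    # The implementer reads only the reviewed spec: fresh context.
    return call_llm("Implement exactly this spec:\n" + review)
```

The design choice to copy is the data flow, not the role names: no agent ever inherits another agent's raw context.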
#17
@MrAhmadAwais
https://x.com/MrAhmadAwais/status/2052063719702855883
The deepest agent-runtime engineering deep-dive of the day. Command Code AI hit a multi-day rabbit hole when supporting mid-conversation model swaps in the agent loop. Key lessons: every "obvious" constant in an agent runtime is a future bug (their 200K context constant worked for 8 months until it didn't); reconcile usage state on switch instead of just re-rendering; only compact on a shrink, not on equal-or-wider; lock the reconcile path against double-clicks. The bug that bit them: model identity was a string-equality check, but each gateway uses a different slug convention, so context-limit lookup was missing for ~3 gateways and silently auto-compacting at 100K instead of 500K.
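The fix the postmortem implies can be sketched as canonicalizing slugs at the boundary and failing loudly on a lookup miss instead of falling back to a default limit. The slugs and limits below are invented for illustration; only the shape of the fix is the point:

```python
# Hypothetical slugs and limits; the pattern is: canonicalize model
# identity at the boundary, and treat a missing context-limit entry as
# an error instead of silently auto-compacting at some default.
ALIASES = {
    "claude-opus-4": "claude-opus-4",
    "anthropic/claude-opus-4": "claude-opus-4",       # OpenRouter-style slug
    "anthropic.claude-opus-4-v1:0": "claude-opus-4",  # Bedrock-style slug
}
CONTEXT_LIMITS = {"claude-opus-4": 500_000}

def context_limit(model_slug):
    canonical = ALIASES.get(model_slug)
    if canonical is None or canonical not in CONTEXT_LIMITS:
        # Loud failure beats the silent 100K fallback that bit them.
        raise KeyError(f"no context limit registered for {model_slug!r}")
    return CONTEXT_LIMITS[canonical]
```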
#18
@AnishDabhane
https://x.com/AnishDabhane/status/2051919721537441852
Hermes-agent's two-layer context compaction algorithm, written up cleanly. Layer 1 (Gateway) fires at 85% context outside the agent loop; it's the safety net for Telegram/Discord where messages pile up between turns. Layer 2 (Agent Compressor) fires at 50% inside the loop using exact token counts from the previous API response, with a structured 4-step compression: delete old tool outputs, mark head + recent tail to keep, summarize the middle with an aux LLM, rebuild head + summary + tail. The summary format has fixed slots (Goal, Constraints, Progress, Decisions, Files, Next Steps, Critical Context), and on the next compression the old summary is updated rather than rewritten, so context quality stays high across long sessions.
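The four-step compression can be sketched in miniature. `summarize` stands in for the aux-LLM call, and the thresholds, counts, and message shapes are simplified assumptions, not Hermes-agent's actual code:

```python
def compact(messages, summarize, keep_head=2, keep_tail=6):
    """Sketch of the structured compression described above: drop old tool
    outputs, keep the head and recent tail verbatim, summarize the middle."""
    # Step 1: delete old tool outputs outside the protected tail.
    trimmed = [m for i, m in enumerate(messages)
               if m["role"] != "tool" or i >= len(messages) - keep_tail]
    if len(trimmed) <= keep_head + keep_tail:
        return trimmed
    # Steps 2-4: mark head + tail to keep, summarize the middle with the
    # aux model, rebuild as head + summary + tail.
    head, tail = trimmed[:keep_head], trimmed[-keep_tail:]
    middle = trimmed[keep_head:-keep_tail]
    summary = {"role": "system",
               "content": summarize("\n".join(m["content"] for m in middle))}
    return head + [summary] + tail
```

The fixed-slot summary format lives inside `summarize`'s prompt; updating the old summary in place just means the previous summary message is part of the middle being re-summarized.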
#19
@Jeyxbt
https://x.com/Jeyxbt/status/2052040517832081659
400+ hours of Claude Code distilled into a "stop hitting the wall" setup. Claude Code is a harness, not a model: the harness (file editing, skills, agentic flow, terminal UI) is what users love; the API call underneath is swappable. Set up a proxy that intercepts the Anthropic API and point it at DeepSeek V4 ($2-5 top-up, full tool calling so all skills keep working), or rotate through OpenRouter's free pool on the free tier. Then run three terminals in parallel: Claude Opus/Sonnet, DeepSeek V4, free OpenRouter rotation. Each sits on a different cost tier, sharing the workspace. The mental model: Claude is the design king (UI, copy, taste), DeepSeek crushes the dirty work (refactors, tests, async edge cases), Codex is the review pass.
#20
@so_sthbryan
https://x.com/so_sthbryan/status/2051824012188135773
DeepClaude lands on the HN frontpage (464 points): an open-source project that runs Claude Code's agent loop on DeepSeek V4 Pro at 17x cheaper per task. Same Claude Code interface, DeepSeek's API pricing. This is the same playbook as @Jeyxbt above: the agent loop and tool integration are the value; the underlying model is fungible.
#21
@kocer_eth
https://x.com/kocer_eth/status/2052138613769474434
Sharper analysis of why DeepClaude matters: "if the claim holds, this is about keeping Claude Code's agent loop but routing it through cheaper Anthropic-compatible backends. Best for: long autonomous runs, experiments, cost-sensitive workflows." The autoresearch implication is the one to underline: people running overnight autoresearch loops have a real incentive to swap the model under the harness.
#22
@cubafran
https://x.com/cubafran/status/2052030326046683155
Tiny but useful loop fix: an AI agent fills a signup form, hits "Check your email," then dies. Fix: Claude Code + an MCP email server so the agent can read the OTP and continue. Most agentic loops break exactly here, at the email-OTP boundary. Adding MCP email turns a one-shot signup into a fully automated signup loop.
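The fragile step is pulling the code out of the message body once the email tool has fetched it. A minimal extraction sketch (the MCP server itself is assumed, not shown; the keyword list is an assumption):

```python
import re

def extract_otp(email_body, length=6):
    """Pull a one-time code out of an email body. A real MCP email tool
    would fetch the message first; this covers only the extraction step."""
    # Prefer digits near a keyword, then fall back to any standalone code.
    keyword = re.search(
        rf"(?:code|OTP|verification)\D{{0,20}}(\d{{{length}}})",
        email_body, re.IGNORECASE)
    if keyword:
        return keyword.group(1)
    standalone = re.search(rf"\b(\d{{{length}}})\b", email_body)
    return standalone.group(1) if standalone else None
```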
#23
@AmMrAnonymous
https://x.com/AmMrAnonymous/status/2051944816389333380
Tightening the loop with deterministic feedback: Claude Code writes `<button className="bg-[#1a5276] text-white">`, Deslint MCP responds "3.2:1 contrast, off-token, no dark variant," and Claude Code fixes it inside the same agent loop before the code reaches the user. Determinism in, better code out. This is the autoresearch pattern but with a deterministic eval (lint rules) instead of a probabilistic one: exactly what Karpathy's "any editable file + measurable metric = automated experiment loop" formulation promises.
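A contrast number in a report like that comes from the standard WCAG 2.x formula, which is small enough to inline (whether Deslint computes exactly this, and against which resolved background color, is an assumption):

```python
def relative_luminance(hex_color):
    """WCAG 2.x relative luminance of an sRGB hex color like '#1a5276'."""
    channels = []
    for i in (0, 2, 4):
        c = int(hex_color.lstrip("#")[i:i + 2], 16) / 255
        # Linearize each sRGB channel per the WCAG definition.
        channels.append(c / 12.92 if c <= 0.03928
                        else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = channels
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """Contrast ratio per WCAG: (L_lighter + 0.05) / (L_darker + 0.05)."""
    lighter, darker = sorted((relative_luminance(fg),
                              relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)
```

A deterministic check like this is the cheapest possible eval to bolt into an agent loop: no model call, no flakiness, instant feedback.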
#24
@sqs
https://x.com/sqs/status/2052129216007971230
Sourcegraph's Amp CLI now runs the agent loop on the server, sending ~95% less data to and from the user's machine. The headline feature is mobility (it works on airplane wifi), but the architectural shift is what matters: once the loop runs server-side, true remote/headless agent execution becomes the default, and your laptop going to sleep doesn't break overnight runs.
#25
@aiwithjainam
https://x.com/aiwithjainam/status/2052003742959259732
DeepSeek-TUI ships with everything Claude Code has minus the subscription: file editing tools with diff previews, shell execution inside the agent loop, web browsing for live docs and references, Git operations native to the agent, session resume so context survives quits. One npm install, login once, MIT-licensed. Another instance of the harness-vs-model split that defined this week.
#26
@JulianGoldieSEO
https://x.com/JulianGoldieSEO/status/2051929960269611484
Ruflo's role-decomposition loop: an Architect agent plans, a Coder agent builds, a Tester agent checks, a Reviewer agent improves, with shared memory keeping them aligned. Generic by itself but useful as a baseline contrast against the @MagicalTux 4-spec-agents pattern: same idea, different role split, same reliance on isolated contexts plus shared state.
#27
@therobertta_
https://x.com/therobertta_/status/2051950321501630699
The four mistakes most teams make in agent harnesses: bundling orchestration and execution in one process, single tool timeout freezing the entire loop, retry logic inside the LLM call instead of outside, no isolation means no independent scaling. Demos work, 100 concurrent users kill it. He's seen this kill 3 agent startups in 6 months. Architecture is the bottleneck more often than the model.
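Two of the four fixes, a per-tool timeout and retry logic living outside the call, can be sketched together (the retryable exception set and backoff schedule are assumptions):

```python
import asyncio

async def call_tool(tool, args, timeout_s=10.0, retries=2):
    """Failure isolation for one tool call: a per-tool timeout so a hung
    tool can't freeze the whole loop, with retries OUTSIDE the call
    rather than buried inside the LLM/tool layer."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(tool(**args), timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == retries:
                raise  # surface the failure to the orchestrator
            await asyncio.sleep(0.1 * 2 ** attempt)  # short backoff for the sketch
```

Keeping this wrapper at the orchestration layer is what lets the other two fixes (separate orchestration from execution, scale them independently) fall out naturally.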
#28
@meta_alchemist
https://x.com/meta_alchemist/status/2051974293328896277
Spark: a recursive self-improving personal agent OS with an agentic tools ecosystem. Self-described as "the recursive self-improvement loop, productized." The pitch is at OS-level not skill-level β a substrate for agents to evolve in over time, not a single self-improving agent.
#29
@JackyisThinking
https://x.com/JackyisThinking/status/2051984289827631222
The memory-architecture map of the day. Three approaches to AI long-term memory have emerged, and each handles a different constraint: OpenClaw + Hermes (loop-driven session memory), self-evolving graph memory in the Garry Tan / gbrain style, and Mem Palace's precision-retrieval style. The author is integrating all three to give AI human-like memory that grows with users. Memory is the autoresearch eval surface: without persistent memory, every loop starts from scratch.
#30
@token_forge007
https://x.com/token_forge007/status/2051842577956217131
A production self-improving writing agent finally working after months of failed attempts. He calls it Meridian Agent and notes it's "actively generating revenue for 160 users." The interesting bit isn't the product; it's the offhand "after months of failed attempts" admission. Self-improving writing agents are not a free trick. The eval signal is the hard part.
#31
@avaxnaut
https://x.com/avaxnaut/status/2052085841715999040
Reply to Boris Cherny describing a self-improving knowledge graph built in Claude Code: a streaming language and data structure storing relationships between data factums with V&V (verification & validation) weights, designed to self-improve the graph, the language itself, the depth, the reach, and the autonomy of the system. Not a finished product, but the framing (autoresearch over a knowledge graph that includes the language definition itself) is the most ambitious loop architecture in this batch.
#32
@DataChaz
https://x.com/DataChaz/status/2052078189367947674
Claude Code Routines dropped only days ago and Multica Autopilot has already cloned it. The point: routines (scheduled-loop primitives) can now run entirely locally with whatever agent you want, whether Opencode, Codex, Hermes, or OpenClaw. The autoresearch implication: cron-based loops are no longer locked to one vendor's harness, which materially changes who can run overnight experiments.
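A toy version of a vendor-neutral routine runner: shell out to whichever local agent CLI you prefer on a schedule. The routine format and the `echo` stand-in command are placeholders, not Multica Autopilot's actual configuration.

```python
import datetime
import subprocess
import time

# Hypothetical routine definition: which agent CLI to invoke, and how often.
# Swap the command for your harness of choice (opencode, codex, etc.).
ROUTINES = [
    {"name": "nightly-refactor", "cmd": ["echo", "agent run --task refactor"],
     "every_s": 3600},
]

def run_routine(routine):
    """Shell out to a local agent harness and log the result."""
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(routine["cmd"], capture_output=True, text=True)
    return {"name": routine["name"], "started": started,
            "ok": result.returncode == 0, "output": result.stdout.strip()}

def loop(routines, ticks=1, sleep_s=0):
    """Toy scheduler: fire every routine each tick.
    A real one would track per-routine next-run times against every_s."""
    logs = []
    for _ in range(ticks):
        for r in routines:
            logs.append(run_routine(r))
        time.sleep(sleep_s)
    return logs

print(loop(ROUTINES))
```

The design point is the same as the tweet's: the scheduler is trivial, so nothing about routines needs to live inside one vendor's harness.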
#33
@curiosity_41
https://x.com/curiosity_41/status/2052163544217694683
`era` is a new Rust v0 prototype for cheap snapshots, workspace cursors, and `era watch` auto-snapshots with agent/task/model provenance. Useful for auditing parallel coding-agent runs. The autoresearch link: when you run dozens of agent variants in parallel, you need provenance to tell which trace produced which improvement. Era is one of the first tools explicitly designed for that audit need.
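The provenance idea can be sketched in a few lines: tag each workspace snapshot with agent/task/model metadata, then diff content digests to find the runs that actually changed anything. The record shape and function names are assumptions for illustration, not era's real format.

```python
import hashlib
import json
import time

def snapshot(files: dict[str, str], *, agent: str, task: str, model: str) -> dict:
    """Hash workspace contents and tag the snapshot with run provenance,
    roughly the per-snapshot metadata a tool like era records."""
    digest = hashlib.sha256(
        json.dumps(files, sort_keys=True).encode()
    ).hexdigest()
    return {"digest": digest, "agent": agent, "task": task,
            "model": model, "ts": time.time()}

def which_run_changed(snaps: list[dict]) -> list[dict]:
    """Keep only snapshots whose content differs from the previous one:
    the traces worth auditing for 'which variant produced the improvement'."""
    out, prev = [], None
    for s in snaps:
        if s["digest"] != prev:
            out.append(s)
        prev = s["digest"]
    return out

a = snapshot({"main.py": "v1"}, agent="variant-A", task="speedup", model="model-x")
b = snapshot({"main.py": "v1"}, agent="variant-B", task="speedup", model="model-y")
c = snapshot({"main.py": "v2"}, agent="variant-C", task="speedup", model="model-x")
print([s["agent"] for s in which_run_changed([a, b, c])])
```

Here variant-B is filtered out because it produced no content change; with dozens of parallel variants, that filter is what makes the audit tractable.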
#34
@hosseeb
https://x.com/hosseeb/status/2051841113657643397
"now your agent can fix itself" β Raindrop Triage, an agent for finding and investigating agent issues. Self-fixing agent diagnostics is the meta-loop: the eval target is the agent's own failures. This pattern is going to multiply because every team running long autoresearch runs is sitting on a graveyard of broken-mid-run agent traces and someone has to triage them.
#35
@rise_raise_ai
https://x.com/rise_raise_ai/status/2052117937067036980
On Cursor's announcement: "Self-improving loop unlocked: previous Composer models auto-setting up RL dev environments for the next gen. Pure bootstrapping elegance – each version focuses purely on harder problems. This is how frontier labs accelerate." The framing matters: Cursor isn't shipping a self-improving feature, they're shipping evidence that the loop already runs internally at frontier labs.
💡 Eco Products Radar
DeepClaude – open-source Claude Code agent loop running on DeepSeek V4 Pro at 17x cheaper per task; HN frontpage (#464). Same harness, swappable model.
DeepSeek-TUI – DeepSeek's own coding-agent harness with file editing, shell, web, Git, session resume; MIT, one npm install.
Open Research / AutoResearch (on-chain) – turns any GitHub repo + benchmark into a TEE-verified mining game; Karpathy's autoresearch loop with cryptographic settlement.
Anthropic Claude Managed Agents (Dreaming + Outcomes + Multi-Agent + Webhooks) – first-party version of patterns autoresearch power users wired by hand.
Hermes Agent – multi-skill agent harness; new browser-harness skill ships self-improving browser tools and parallel stealth cloud browsers; cleanest two-layer context compaction in the wild.
Multica Autopilot – open-source clone of Claude Code Routines that runs locally with Opencode / Codex / Hermes / OpenClaw, breaking the routines feature out of one harness.
Raindrop Triage – agent for finding and investigating other agents' issues; the first widely-shared product targeting agent self-diagnostics specifically.
Spark – recursive self-improving personal agent OS; OS-level rather than skill-level self-improvement.