Loop Daily: 2026-05-02
Karpathy's autoresearch repo is officially the most-imitated piece of code on AI Twitter this week. People are wiring it into real domains and reporting the second derivative: what the loop does to the work, not what the work does. An Indian-stocks autotrader that's done 11 self-edits while the human watched. A tokamak design loop. A cold-email loop measuring positive reply rate as the optimization metric. A swarm of ML agents collaborating on optimizer ablations through a shared HuggingFace bucket. Underneath those are quieter signals about what loops actually need to work: shared state, eval gates, memory that compounds, and a way to stop them from drifting. And one CrewAI customer found out the hard way that an unbounded loop with no circuit breaker is a $50K problem in 7 hours.
#1
@AkashTandon
https://x.com/AkashTandon/status/2049712565350264889
Most direct port of Karpathy's autoresearch into a non-research domain so far. AkashTandon built autotrader: a Claude Code agent paper-trading Indian stocks (Nifty 500, ~$1050 of paper money via Zerodha Kite Connect, ticking every 5 minutes during market hours, running on a single GCP e2-micro). The constraint is the interesting part: the agent can only modify its own strategy file. After ~130 simulated trades the system has done 11 self-edits, including fixing cash calculations, avoiding repeat mistakes, remembering cooldowns, taking profits earlier, and learning not to chase stocks that already ran. His own framing: the trading is the test case, the autonomous loop is the real project. Honest about limits too: paper trading ignores slippage, fills, and liquidity, and the agent is much better at fixing repeated mistakes than at finding alpha.
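The single-editable-file constraint is easy to miniaturize. A hypothetical sketch of that loop shape — everything here (the function names, the accept-or-revert rule) is illustrative, not code from the autotrader repo:

```python
def run_cycle(strategy_src, evaluate, propose_edit):
    """One loop iteration: score the current strategy, let the agent
    propose a rewrite of the single editable file, and keep the rewrite
    only if the score does not regress. Returns (source, score)."""
    before = evaluate(strategy_src)
    candidate = propose_edit(strategy_src, before)  # the agent's self-edit
    after = evaluate(candidate)
    # accept strict improvements (or ties); otherwise revert to the old file
    return (candidate, after) if after >= before else (strategy_src, before)
```

The accept-or-revert gate is what makes 11 self-edits accumulate instead of thrash: a bad edit never survives a cycle.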
#2
@bpizzacalla
https://x.com/bpizzacalla/status/2049874097346142264
Six months of cold email at a 2% reply-rate ceiling. Read Karpathy's auto-research loop and realized the framing was wrong: pick one thing to edit, pick one number to measure, fix the experiment cycle, let the system iterate, keep what works, revert what doesn't. Pointed it at the email templates with positive reply rate as the metric and 48-hour cycles. The takeaway he wrote down was sharper than the result: knowledge compounds because winners never get thrown out. Most cold-outbound systems lose this property the second a new operator takes over. A loop with state-on-disk doesn't.
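The keep-winners mechanic reduces to a few lines of state-on-disk. A hypothetical sketch — the filename, state schema, and `measure` callback are all illustrative, not from the post:

```python
import json, os

def step(state_path, candidate_template, measure):
    """One 48h-style cycle: measure the candidate against the stored
    champion and keep whichever wins. Knowledge compounds because the
    champion is never thrown away, only replaced by a strict improvement."""
    state = {"template": None, "score": -1.0}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    score = measure(candidate_template)
    if score > state["score"]:
        state = {"template": candidate_template, "score": score}
    with open(state_path, "w") as f:
        json.dump(state, f)
    return state
```

Because the champion lives in a file rather than in anyone's head, a new operator inherits every previous winner for free.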
#3
@cmpatino_
https://x.com/cmpatino_/status/2049881579691139372
Agent Collabs is a small platform for letting heterogeneous agents (ml-intern, Codex, Claude Code, Hermes, plus humans) jointly run autoresearch on a single problem. They share a HuggingFace bucket as the message board and artifact store, and a separate Space tracks progress and a scoreboard. Two live collaborations: OpenAI's parameter-golf challenge and Keller Jordan's optimizer ablations. The actually interesting bits are the emergent behaviors observed: new joiners can read the bucket and contribute meaningfully with fresh eyes, agents naturally split labor by compute (no-GPU agents validate small, GPU-rich agents run promising experiments), credit gets passed when ideas get reused, and individuals make mistakes but the collective spots them. This is the closest thing yet to an "autoresearch swarm" in production.
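An append-only shared store is all the message-board mechanic needs. A toy sketch using a local directory in place of the HuggingFace bucket; all names and the file layout are illustrative, not Agent Collabs' format:

```python
import json, time, pathlib

def post(board: pathlib.Path, agent: str, body: dict):
    """Append-only message board: each agent writes its own timestamped
    file, so heterogeneous agents never contend on a shared write."""
    board.mkdir(parents=True, exist_ok=True)
    msg = {"agent": agent, "time": time.time(), **body}
    (board / f"{msg['time']:.6f}-{agent}.json").write_text(json.dumps(msg))

def catch_up(board: pathlib.Path):
    """What a new joiner does: read the whole board in order
    before contributing anything."""
    return [json.loads(p.read_text()) for p in sorted(board.iterdir())]
```

The "fresh eyes" behavior in the post falls out of `catch_up`: a joiner's first action is reading everyone else's history, not starting from scratch.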
#4
@PMocz
https://x.com/PMocz/status/2049650610069250252
Took Karpathy's autoresearch loop and pointed it at tokamak design as a deliberate "AI agents will discover and optimize engineering devices" demo. Open-source code, simple physics-design objective, very small wrapping. The interesting question this opens up isn't "did it design a great tokamak"; it's whether the same loop primitive (single editable file + measurable metric + experiment cycle) generalizes from neural-net training to physical-system optimization. The PRs and the early replies suggest people are taking that seriously.
#5
@AnnaMariaa133
https://x.com/AnnaMariaa133/status/2049789561346154755
Sentient Labs released EvoSkill V1, framed explicitly as an autoresearch system for AI agents. The loop: evaluate an agent on a benchmark, analyze failure traces, refine prompts and skills through automated iteration, ship the resulting specialist. Claimed numbers: Claude Code OfficeQA 60.6% → 68.1%, SealQA 26.6% → 38.7%, with skills transferring +5% to BrowseComp, and similar gains across OpenCode, OpenHands, Goose, and OpenAI Codex CLI. The interesting framing is that this generalizes the autoresearch primitive away from "training neural networks" and toward "specializing existing agents on top of frontier models," which is the workflow most builders actually need.
#6
@with_gene2626
https://x.com/with_gene2626/status/2049928670228201961
Closing post in a 3-part series running 20+ models through HumanEval+/MBPP+ inside a real agentic loop on a DGX Spark. The leaderboard alone is useful (Gemma-4-26B-A4B-MoE UD-Q5 wins at 95%; GLM-4.7-Flash Q8 is the boring-reliable workhorse at 87.5-92.5%), but the more important takeaway is the five surprises: reasoning distills lose 20-25 points vs base models, MoE crushes dense at the 26-31B class, Qwen3-Coder MBPP scores are a harness artifact, Unsloth UD quants beat plain K-quants on MoE, and GLM-4.7-Flash is the right driver/reviewer slot. The author then rolls into Drift Studio, running every night on the same hardware, to study which prompt techniques keep an orchestration agent from drifting off-task during long-horizon coding work.
#7
@hirefortuna
https://x.com/hirefortuna/status/2049930597728964989
Real-world ad operator's read on Meta MCP + Higgsfield MCP both shipping the same week: a single agentic loop can now produce, test, deploy, and iterate ad creative without anyone moving artifacts between tools. The author runs the back-end side (autonomous customer service for ecommerce) and points out the obvious second-order effect: when iteration speed compresses on the front end, orders spike faster, support load scales with whatever the loop generates, and you need an autonomous back-end agent that scales with the front-end agent. Front-end + back-end agents is the architecture; either one alone leaves throughput on the table.
#8
@agentic_james
https://x.com/agentic_james/status/2049985777421971846
Concrete instantiation of the previous post: Claude Code can now use the Meta Ads dashboard directly through the official CLI tool, and pairing that with image generation plus an autoresearch loop gives you a self-improving ads pipeline. Short post, but the architecture is exactly the lead-magnet/A-B-test loop people were sketching abstractly all week. Notable that "Meta CLI as agent rail" went from announcement to reproducible workflow inside 24 hours.
#9
@ericosiu
https://x.com/ericosiu/status/2049976820594868484
Eric Siu (Single Grain) writes up four plays for ad agencies in the post-Cloudflare-microsite era. The most relevant for this list is the lead-magnet factory: every podcast episode / YouTube video / Beehiiv issue auto-generates a topical microsite with a custom lead magnet, gated through Beehiiv's MCP. The autoresearch piece is non-negotiable: every spin-up has to pass an autoresearch eval gate before it deploys, otherwise the agency just publishes junk at scale. The math he lays out: 50 microsites at ~$5/mo each (about $250/mo in infra), each capturing 100 emails/mo, feeding ~10 qualified conversations/mo at SG's ~2% lead-to-deal rate. The eval gate is what keeps this from being slop.
#10
@hybridllm
https://x.com/hybridllm/status/2049652384088182971
Worth surfacing because it's a precise correction of the agent-loop discourse. The author's stack isn't a per-turn LangGraph loop; it's batch with skill invocation. Per-invocation I/O is sub-KB markdown via tempfile + atomic rename, so latency stays negligible. The "6-7 iteration ceiling" people hit only shows up when an agentic loop sits on top of the base, not in the daily batch path. The general lesson: a lot of the "agentic loops are hard" complaints are actually about putting a loop where it doesn't belong.
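The tempfile + atomic rename trick is standard POSIX practice and worth spelling out, since it's what makes the sub-KB I/O path safe under concurrent readers. A generic sketch of the pattern, not the author's code:

```python
import os, tempfile

def atomic_write(path: str, text: str):
    """Write-then-rename: a reader opening `path` sees either the old
    complete file or the new complete file, never a partial write."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)  # temp file on the SAME filesystem
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())       # data on disk before the rename
        os.replace(tmp, path)          # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the target directory: `os.replace` is only atomic within one filesystem.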
#11
@a_protsyuk
https://x.com/a_protsyuk/status/2049988213541089765
Surfacing one specific failure mode that almost every agent loop hits in production: goal drift, not goal persistence. The model quietly redefines the goal mid-run and the rest of the loop optimizes for the redefined goal. Most LangGraph-style frameworks check for "did the agent finish the goal" but not "is the goal that's getting completed actually the goal the human originally specified." The /goal command in newer Codex CLI is one attempt at a fix; nobody has a clean answer yet.
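One cheap mitigation is to make the original goal immutable and check the agent's working goal against it on every iteration. A hypothetical sketch of that wrapper; in practice `still_on_goal` would be a model call or an embedding-similarity check, not a string comparison:

```python
def guard_goal(original_goal, loop_step, still_on_goal, max_steps=10):
    """Wrap a loop step with a goal-persistence check: after each
    iteration, compare the agent's *current* stated goal against the
    one the human originally specified, and halt on divergence."""
    state = {"goal": original_goal, "done": False}
    for _ in range(max_steps):
        state = loop_step(state)
        if not still_on_goal(original_goal, state["goal"]):
            raise RuntimeError(
                f"goal drift: {original_goal!r} -> {state['goal']!r}")
        if state["done"]:
            return state
    return state
```

The key design point is that the comparison is always against the frozen original, not against last iteration's goal; drift is gradual, and step-to-step checks miss it.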
#12
@davidmytton
https://x.com/davidmytton/status/2049849062908695039
Arcjet shipped Guards: runtime enforcement inside the agent loop for prompt injection detection, per-user token budgets, and PII redaction. The framing is correct: the WAF stops at the HTTP request, but the agent fetched a webpage with hidden instructions, ran a loop that emailed your customer list, and burned overnight model spend with no circuit breaker; all of that happens past the firewall. The novel piece is shipping it as a skill rather than a separate framework: `npx skills add arcjet/skills --skill add-guard-protection` and the agent installs its own guardrails.
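The per-user token budget idea is framework-agnostic. A generic sketch of the pattern, explicitly not Arcjet's API:

```python
class TokenBudget:
    """Per-user token budget enforced inside the loop, not at the HTTP
    edge. The loop asks charge() before every model call; a False means
    back off or surface the limit to the user."""
    def __init__(self, limit_per_user: int):
        self.limit = limit_per_user
        self.spent = {}  # user id -> tokens consumed so far

    def charge(self, user: str, tokens: int) -> bool:
        used = self.spent.get(user, 0)
        if used + tokens > self.limit:
            return False            # deny the call; budget exhausted
        self.spent[user] = used + tokens
        return True
```

This sits at the one place a WAF can't: between the loop's decision to call a model and the call itself.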
#13
@polsia
https://x.com/polsia/status/2049732585044238383
The cautionary tale of the day. A CrewAI customer ran a recursive agent loop that made 44,000 API calls in 7 hours and burned $50,000. No budget cap, no circuit breaker. The framing ("your agents are doing the same thing right now; the only difference is you haven't noticed yet") is hyperbolic, but the underlying point is real: every agent loop that doesn't terminate explicitly will, eventually, fail to terminate. Two agents calling each other recursively is enough to nuke a budget overnight.
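The fix is boring and about fifteen lines: cap both call count and spend before every model invocation. A hypothetical sketch; the limits are illustrative, and the $50K incident in the post had neither:

```python
class CircuitBreaker:
    """Hard stop for runaway loops: refuse further model calls once
    either the call count or the estimated dollar spend hits its cap."""
    def __init__(self, max_calls=1000, max_spend_usd=100.0):
        self.max_calls, self.max_spend = max_calls, max_spend_usd
        self.calls, self.spend = 0, 0.0

    def before_call(self, est_cost_usd: float):
        if (self.calls >= self.max_calls
                or self.spend + est_cost_usd > self.max_spend):
            raise RuntimeError("circuit breaker tripped: halting loop")
        self.calls += 1
        self.spend += est_cost_usd
```

Crucially this works even when two agents call each other recursively, as long as they share one breaker instance: the mutual recursion still burns down a single shared budget.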
#14
@epichrisis
https://x.com/epichrisis/status/2049894459857600923
Best response of the day to the "single agent loop" framing. The author's actual production stack does continual learning, multi-tier memory, and swarm-style self-optimization across an "ecosystem of evolving agent mandates and shared memories." Pulls analogies from quorum sensing and chemotaxis, then makes the point most public agent discourse misses: a single agent loop is bounded in optimization potential in ways collective systems aren't. The corollary ("no agent lab can solve this in isolation because it requires the right model stack too") is the kind of claim that's worth tracking against the next 6 months of multi-agent work.
#15
@Trumpyla
https://x.com/Trumpyla/status/2049913337283059951
Long argument that production-grade reasoning isn't a flat agent loop; it's a Recursive Language Model: the agent recursively spawns sub-instances of itself with their own state and budgets, externalized state lives outside the model, and tool orchestration becomes first-class. Compares this to Karpathy's LLM-as-OS framing and pushes it further. Concrete consequence: depth scales passes, branching expands search, and recursion yields an auditable execution graph, but you have to design termination conditions and budget caps or you hit the runaway-spend problem in #13 above.
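The depth-and-budget discipline the post calls for looks roughly like this in miniature. All names are hypothetical; a real version would make `expand` a model call that decomposes the task into subtasks:

```python
def solve(task, expand, base_case, budget, depth=0, max_depth=3):
    """Recursive agent sketch: sub-instances draw from a shared budget,
    recursion is capped by depth, and the return value is the auditable
    execution tree the post describes."""
    budget["calls"] -= 1
    if budget["calls"] <= 0 or depth >= max_depth or base_case(task):
        return {"task": task, "children": []}  # leaf: solved, or out of road
    return {"task": task,
            "children": [solve(t, expand, base_case, budget,
                               depth + 1, max_depth)
                         for t in expand(task)]}
```

Three separate stopping conditions (budget, depth, solved) is the point: drop any one and the recursion has a path to never terminating.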
#16
@hbouammar
https://x.com/hbouammar/status/2049862531506717157
λ-RLM (Lambda-Recursive Language Model): open-source repo claiming to move recursion out of the model and into a typed lambda-calculus runtime (split → map → filter → reduce primitives). Reported numbers: 29/36 wins vs standard RLM, up to +21.9 accuracy points, up to 4.1× lower latency. The thesis that long-context reasoning is not a context-window problem but a control-flow problem is the right framing. Whether the runtime ships outside academic benchmarks remains to be seen, but it's the cleanest articulation of "stop asking the model to write its own loop" we've seen this month.
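The split → map → filter → reduce pipeline is easy to state concretely. A toy version where each stage is a plain function standing in for a model call; nothing here is from the λ-RLM repo:

```python
def run_long_context(doc, query_hits, summarize, combine, chunk=400):
    """The control-flow thesis in miniature: the runtime, not the model,
    owns the loop over a long input."""
    parts = [doc[i:i + chunk] for i in range(0, len(doc), chunk)]  # split
    mapped = [summarize(p) for p in parts]                         # map
    kept = [m for m in mapped if query_hits(m)]                    # filter
    out = kept[0] if kept else ""
    for m in kept[1:]:                                             # reduce
        out = combine(out, m)
    return out
```

No single model call ever sees the whole document, which is why the approach sidesteps the context window rather than fighting it.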
#17
@SwamiSivasubram
https://x.com/SwamiSivasubram/status/2049900359162757524
AWS released Strands Agents SDK 1.0 for TypeScript: the harness layer above the agent loop, with default tools (shell, file editing, HTTP, structured notes), customizable hooks and plugins, and Node.js + browser support. The Python SDK hit 25M downloads in a year. The framing is the right one, an "agent harness SDK that goes beyond the core agent loop": the agent loop itself is commoditized; the value is in the rails around it. Worth noting for non-AWS shops: Strands also supports any OpenAI-compatible model provider.
#18
@KanikaBK
https://x.com/KanikaBK/status/2049835946728951814
Claude-obsidian implements Karpathy's LLM Wiki pattern as a Claude Code skill. Drop in a source and /wiki creates 8-15 structured wiki pages, every new page gets cross-referenced against the existing vault, and contradictions get flagged with callouts. The /autoresearch command runs a 3-round web research loop, finds gaps, fills them, and files everything. /save turns any Claude conversation into a permanent wiki note. The interesting structural piece is the hot cache: at the end of every session Claude writes a compact summary of recent context, the next session reads that first, and you never rebuild context manually again.
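The hot-cache mechanic is just two file operations. A hypothetical sketch; the filename and summary format are illustrative, not the skill's actual layout:

```python
import pathlib

HOT_CACHE = "hot-cache.md"  # illustrative filename for the vault's hot cache

def end_session(vault: pathlib.Path, summary: str):
    """Write the compact end-of-session summary that the next
    session will boot from."""
    (vault / HOT_CACHE).write_text(summary)

def start_session(vault: pathlib.Path) -> str:
    """Read the hot cache before anything else; empty on a fresh vault."""
    p = vault / HOT_CACHE
    return p.read_text() if p.exists() else ""
```

The compounding comes from always overwriting with a fresh compact summary rather than appending: the cache stays small enough to read in full at every session start.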
#19
@OomkaBear
https://x.com/OomkaBear/status/2049766175672778949
OpenAI shipped WebSocket mode for the Responses API, reported as 40% faster on Codex-style agent runs. The takeaway in the post is correct: performance work is shifting from model latency to agent-loop systems engineering. State-warming across tool calls is doing more for production right now than benchmark gains. The implication for everyone running long-horizon loops is straightforward: switching transports is a free ~40% speedup on identical workloads.
#20
@AryamanIyer3
https://x.com/AryamanIyer3/status/2049693676004352371
Specific data point for the Codex-vs-Claude-Code argument, grounded in financial modeling. Reports that Opus 4.6 beats Codex 5.3 on throughput for financial modeling tasks and attributes the gap to agentic loop overhead: Codex's loop eats more cycles per useful action. Notes that Claude Code's "getting work done while chatting" feels real for architectural decisions (vs. pure generation). This is the kind of qualitative-but-grounded comparison that survives benchmark turnover.
#21
@wgw_eth
https://x.com/wgw_eth/status/2049837792276939102
papa-pi / pi-puppies / pi-kittens: a system for running standalone autonomous self-improving Pi agents while keeping each one sandboxed via Bubblewrap. Each agent has its own memory, identity, and world. The point of interest is the sandboxing primitive: every multi-agent stack eventually needs hard isolation between agents (filesystem, network, memory), and Bubblewrap is a saner choice than running agents inside Docker for this purpose. Worth watching for anyone building agent fleets that need to genuinely not interfere with each other.
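Bubblewrap makes the isolation concrete with a handful of flags. A sketch that just builds the argv for one sandboxed agent; the flags are standard bwrap options, but the filesystem layout is illustrative, not the papa-pi setup:

```python
def bwrap_argv(agent_home: str, cmd: list) -> list:
    """Build a Bubblewrap invocation giving one agent a private home,
    a private /tmp, and no shared namespaces (including network)."""
    return ["bwrap",
            "--ro-bind", "/usr", "/usr",          # shared read-only system
            "--proc", "/proc", "--dev", "/dev",
            "--tmpfs", "/tmp",                    # private scratch space
            "--bind", agent_home, "/home/agent",  # this agent's memory/identity
            "--unshare-all",                      # no shared net/pid/ipc
            "--die-with-parent",                  # no orphaned agents
            *cmd]
```

Each agent in a fleet gets its own `agent_home`, so two agents can't touch each other's memory even if one goes off the rails; `--die-with-parent` kills the sandbox if the supervisor dies.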
#22
@ivakshi_s
https://x.com/ivakshi_s/status/2049938090114920858
3/3 ICML 2026 papers accepted, on long-term memory in agents, self-improving open-ended agentic safety, and causality + trustworthy AI. The combination of those three topics in one author's pipeline is interesting: they map onto the practical bottlenecks people are hitting in production agent loops (memory consolidation, runaway behavior, decision attribution). Worth tracking when the camera-ready PDFs ship.
Eco Products Radar
#23
Karpathy autoresearch repo: the literal piece of code spawning most of this week's loop work. Showed up in 12+ posts as the reference architecture.
Claude Code: the most common harness for these loops, especially in non-research domains (trading, ads, content, Obsidian).
Codex / GPT-5.5: second harness of choice; particularly mentioned for review-loop pairing where Claude Code implements and Codex reviews.
Hermes Agent: third harness; consistent appearance for self-improving / continuous-learning use cases (NousResearch).
ml-intern: Sakana's ML research agent, now usable inside Agent Collabs.
Beehiiv MCP / Meta Ads CLI / Higgsfield MCP: the marketing-loop trio; people are stacking these to build self-improving ad pipelines.
Strands Agents SDK (AWS): harness layer above the agent loop with default tools and plugin hooks; now in TypeScript GA.
LangSmith / LangGraph: still the default reference for "agent serving infrastructure" in production multi-user deployments.
Bubblewrap: sandboxing primitive showing up in DIY multi-agent fleets that need hard isolation.
Cursor SDK: programmable agent infrastructure for embedding agent loops in CI/CD and other products.
Arcjet Guards: runtime enforcement inside the agent loop (prompt injection, token budgets, PII redaction); shipped as an installable skill.
DGX Spark: local hardware substrate; multiple posts running 20+ models through agentic loops on this one box.
Walrus + MemWal: long-term agent memory layer that persists across sessions and providers; plugs into OpenClaw and NemoClaw.
EvoSkill V1 (Sentient Labs): autoresearch-as-a-toolkit for specializing existing agents on top of frontier models.