May 9, 2026 · loop

Loop Daily: 2026-05-10

May 8 was the day autoresearch graduated from a Karpathy demo to a production economic argument. Cursor's SDK shipped /orchestrate, where agents recursively spawn sub-agents that work in parallel, and Cursor itself reported a 20% token reduction in its internal auto-research pipeline. OpenAI's Codex got hooks support: extensibility points where you inject your own scripts into the agentic loop, the same pattern Claude Code shipped first. A Kaggle competitor told the world that GPT-5.5 reasoning sometimes "loses its composure and starts soft-raging at bad experiments" while the auto-research agent climbed within reach of gold. Razorpay's principal architect wrote a thousand words explaining that the agentic loop is the easy part; the hard part is making it production-grade under real load. And Andrej Karpathy's autoresearch repo crossed 79K stars while the community kept porting it sideways into trading, condensed-matter physics, evolutionary economics, and now landing-page copy.
💡#1
@4DRp0iHGeKdYH0T
https://x.com/4DRp0iHGeKdYH0T/status/2052990115769979308
Codex /goal autoresearch session, full receipts: 15 hours of runtime, $500+ in API spend, 90+ commits produced. He reports that CI failures absolutely nuked his inbox. This is what the hardware shape of "AI overlords" actually looks like in 2026: overnight and overday autonomous compute spent on a single ticket, with the human cost being inbox triage, not coding.
💡#2
@kibubble_de
https://x.com/kibubble_de/status/2053027538620813626
Cursor SDK gets /orchestrate. Agents recursively spawn sub-agents that work in parallel and feed results back. Cursor's own internal auto-research pipeline saw a 20% token reduction and an 80% cold-start reduction on backend tasks. The framing is blunt: single-agent loops are now legacy. The vendor that ships the loop architecture also runs that architecture on itself first.
💡#3
@moshuishapaozi
https://x.com/moshuishapaozi/status/2053038149107056883
Building an auto-research framework for US-stock investment research. Multiple agents in adversarial roles: one researches sectors and routes work, one runs evals and challenges results, and one agent per stock runs the user's saved stock-analysis Skill. Hard rules: the sector eval loops until it passes, all candidates get individual research, every stock gets 30+ sources, every report gets a light review, and a fail means a re-run. He says the value of basic mental work is dropping fast, but human cognitive load is rising, because the universe of opportunities the loop can analyze just exploded.
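The hard rules above reduce to a small control loop. A minimal sketch, with deterministic stubs standing in for the real LLM agents (every name here is hypothetical, not the author's framework):

```python
MIN_SOURCES = 30   # hard rule: every stock gets 30+ sources
MAX_RETRIES = 3    # fail = re-run, but bounded

def research_stock(ticker: str, attempt: int) -> dict:
    # Stub: pretend deeper attempts surface more sources.
    return {"ticker": ticker, "sources": 20 + 10 * attempt, "draft": f"report on {ticker}"}

def light_review(report: dict) -> bool:
    # Light review gate: enough sources and a non-empty draft.
    return report["sources"] >= MIN_SOURCES and bool(report["draft"])

def run_pipeline(tickers: list) -> list:
    approved = []
    for ticker in tickers:                      # all candidates get individual research
        for attempt in range(1, MAX_RETRIES + 1):
            report = research_stock(ticker, attempt)
            if light_review(report):            # fail = re-run with the next attempt
                approved.append(report)
                break
    return approved

reports = run_pipeline(["AAPL", "MSFT"])
```

The adversarial part lives in `light_review`: in the real system that gate is itself an agent, which is what makes the loop self-challenging rather than self-congratulating.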
💡#4
@ar0cket1
https://x.com/ar0cket1/status/2052979876546887726
Codex /goal is the best feature in the product. Long-horizon tasks, 10-hour autonomous runs, and crucially "/goal fixes the auto research issue codex had." This is the missing piece: until /goal landed, Codex stopped after a few turns and required manual queueing. This receipt is the user-side confirmation that loop-termination logic is now stable enough to bet a 10-hour task on.
💡#5
@flock_io
https://x.com/flock_io/status/2053023203233271913
Logan Kang from Dable presented Auto Research at a Korean AI session: agentic AI that helps teams turn research into repeatable real-world tests faster. Korean enterprise productizing the autoresearch idea is the second-derivative signal: not just a Twitter trend, but a corporate program at a real company.
💡#6
@AINativeF
https://x.com/AINativeF/status/2052900413301776562
Paper drop: "Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes" (Ning, Li, Zeng, Kang, Xiong). An empirical loop in which specialist agents create trials with code edits and evaluations, iterating over an auditable trajectory. Receipts: significant improvements on Parameter Golf validation, NanoChat-D12 CORE, and CIFAR-10 Airbench96 wallclock, with no human proposal or intervention. The academic version of pi-autoresearch is now publishing.
💡#7
@JeremyNguyenPhD
https://x.com/JeremyNguyenPhD/status/2053082260132573517
"I left 3 AI agents alone with a research problem overnight. They came back with 72 peer-reviewed papers." Quote from Prof Jie Ding (University of Minnesota) opening the WorldSeed autoresearch composition framework. The receipt is in the unit: 72 actual peer-reviewed papers found and triaged by an autonomous loop, not a single chat output, not a synthetic experiment count.
💡#8
@arpit_bhayani
https://x.com/arpit_bhayani/status/2053091711698768357
Razorpay's principal architect on production agentic systems. The agentic loop is the easy part. What scales is system design — microservices, message queues, consistency guarantees, load balancing, work distribution, state management, rate limiting, throttling, fallbacks, service-to-service communication, QoS. The difference between prototype and production code is 15 components and 1000 commits. This is the closest thing to an institutional voice saying out loud that the chat→agent transition is a distributed-systems problem, not an AI problem.
💡#9
@8teAPi
https://x.com/8teAPi/status/2053025212653076602
Running Claude Code Opus 4.7 for planning and review, plus GPT-5.5 high in Codex for execution, as a full-scale agentic loop. Reports it as "incredible" once project structure and scaffolding are right. The dual-model architecture is now a settled pattern: one model picks the moves, another model executes.
💡#10
@kylejeong
https://x.com/kylejeong/status/2052873208668524917
OpenClaw + Autobrowse iteratively builds a Skill for any browser workflow. Craigslist extraction example: 5 iterations yielded a 68% speed-up and 91% cost savings. Halfway through, the agent discovered an exposed endpoint and used it to skip page navigation entirely. This is the most concrete demonstration yet of "Skill compilation as autoresearch": the loop doesn't just optimize, it discovers strategies a human would miss.
💡#11
@testingcatalog
https://x.com/testingcatalog/status/2052882191940534531
Hooks support is coming to the Codex app. Hooks are an extensibility framework that lets you inject your own scripts into the agentic loop. The strategic point: Claude Code shipped hooks first, and the IDE-agent differentiation now lives in the hooks layer, not the model swap. Codex catching up on this dimension is more important than the model spec battle, because hooks are where teams actually customize behavior.
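For a concrete shape: under Claude Code's hook contract, a PreToolUse hook receives a JSON event on stdin, and exit code 2 denies the tool call while routing stderr back to the model; Codex's version may differ. A hedged Python sketch of a hook that blocks risky shell commands:

```python
import json
import sys

BLOCKED = ("rm -rf", "git push --force")   # patterns a human must approve

def decide(event: dict) -> int:
    """Return the hook exit code: 0 allows the tool call, 2 denies it."""
    command = event.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED:
        if pattern in command:
            # stderr is fed back to the model as the reason for the denial
            print(f"Blocked: '{pattern}' requires human approval.", file=sys.stderr)
            return 2
    return 0

# Wiring when installed as a PreToolUse hook (event arrives on stdin):
#   sys.exit(decide(json.load(sys.stdin)))
```

The event field names above follow Claude Code's documented shape; check the host agent's hook reference before reusing them verbatim.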
💡#12
@MinLiBuilds
https://x.com/MinLiBuilds/status/2052188818137330043
Anthropic's beta-tier features are real now that the SpaceX compute deal has landed. Three of them: Dreaming (a memory-consolidation function), Outcomes (Anthropic's productized version of Codex's /goal: autoresearch wrapped as a task-completion guarantee), and Multiagent (a primary agent that spawns multiple agents on demand for complex tasks). The user is half-laughing at himself for hand-rolling /goal as a CC plugin right before the official version dropped.
💡#13
@aiwithmayank
https://x.com/aiwithmayank/status/2046914454353510893
Catalog of every Karpathy autoresearch fork in one place: a macOS Apple Silicon port, a Windows RTX consumer-NVIDIA port, a WebGPU browser port, multi-GPU with crash recovery, and a free Colab/Kaggle T4 port. Then the sideways applications: a trading agent optimizing prompts against rolling Sharpe ratio instead of model loss, a genealogy researcher iteratively expanding a family history, a Spring Boot service that grew from 119 lines to 950 in 5 autonomous cycles. The original idea — give an AI a metric, let it self-improve until it wins — works on almost anything with a measurable target.
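The core of that idea fits in a dozen lines. A toy sketch of the shape (propose a variant, score it against the metric, keep strict improvements), with a seeded random tweak standing in for the LLM proposer:

```python
import random

def autoresearch_loop(candidate, evaluate, propose, budget=50):
    """Keep proposing variants of the best candidate; keep whatever scores higher.
    `evaluate` is the measurable target (model loss, Sharpe ratio, build time)."""
    best, best_score = candidate, evaluate(candidate)
    for _ in range(budget):
        variant = propose(best)
        score = evaluate(variant)
        if score > best_score:        # accept strict improvements only
            best, best_score = variant, score
    return best, best_score

# Toy target: maximize -(x - 3)^2, so the optimum sits at x = 3.
rng = random.Random(0)
x, score = autoresearch_loop(
    0.0,
    evaluate=lambda x: -(x - 3) ** 2,
    propose=lambda x: x + rng.uniform(-1, 1),
)
```

Everything interesting in the real forks lives in `evaluate` (the Sharpe ratio, the CI wallclock) and `propose` (the model); the loop itself is the commodity.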
💡#14
@samhogan
https://x.com/samhogan/status/2049619541727302040
HALO (Hierarchical Agent Loop Optimizer) open-sourced by Jambo. RLM-based recursive self-improvement that analyzes execution traces and suggests harness changes. Result on the AppWorld benchmark with Sonnet 4.6: 73.7 → 89.5, +15.8 points. The feedback surface includes hallucinated tool calls, redundant arguments, refusal loops, and semantic correctness — each issue maps cleanly to a prompt update. They then fed those findings into Cursor (Opus 4.6) and looped on harness updates until the score plateaued. This is the meta-loop: an AI improving an AI's harness, using an AI to write the patches.
💡#15
@ShenHuang
https://x.com/ShenHuang/status/2043469166418735204
Spent hundreds of millions of tokens debugging a race condition. Failed. Then borrowed from Karpathy's auto-research and added one rule: "write all hypotheses and evidence to DEBUG.md." The AI listed 5 hypotheses; the third had no contradictory evidence. A 3-line experiment confirmed the root cause, fixed in 5 minutes. The brute-force token spend was 1000x larger than the actual fix. Four debug rules: hypothesize before changing code, max 5 lines per experiment, write all evidence to a file (so context compaction doesn't lose the chain), and 2 failures in the same direction = a forced hypothesis switch.
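Those four rules mechanize cleanly. A minimal sketch of the hypothesis ledger (my framing of the rules, not the author's code):

```python
from dataclasses import dataclass, field

@dataclass
class HypothesisLedger:
    """Append-only evidence log, plus rule 4: two failures in the same
    direction force a hypothesis switch."""
    entries: list = field(default_factory=list)
    failures_in_a_row: int = 0

    def record(self, hypothesis: str, evidence: str, supported: bool) -> str:
        verdict = "supports" if supported else "contradicts"
        self.entries.append(f"- {hypothesis}: {evidence} ({verdict})")
        if supported:
            self.failures_in_a_row = 0
            return "continue"
        self.failures_in_a_row += 1
        if self.failures_in_a_row >= 2:
            self.failures_in_a_row = 0
            return "switch-hypothesis"      # rule 4: forced switch
        return "continue"

    def dump(self) -> str:
        # In the workflow above this goes to DEBUG.md, so context
        # compaction can't lose the evidence chain.
        return "\n".join(self.entries)
```

The point of the file is the same as the point of the class: the chain of evidence outlives any single context window.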
💡#16
@ShopifyEng
https://x.com/ShopifyEng/status/2044477537200550383
Since open-sourcing pi-autoresearch, Shopify teams have been running it on everything. Receipts: unit tests 300x faster. React component mounting 20% faster. CI build time 65% reduction. pnpm faster. The frame: autoresearch never stops trying things you'd never have time to try. This is one of the few hard production-economic numbers from a real company on the value of autoresearch loops.
💡#17
@sudoingX
https://x.com/sudoingX/status/2052361613651701933
Honest tester verdict on tool-use benchmark v1: a single happy-path task didn't differentiate two competent agentic styles. Vanilla qwen 3.6 ran the task in 12 tool calls vs carnice-v2's 19, finished in 11:37 vs 12:23, but generated more reasoning per message and emitted reasoning on 100% of messages vs 71%. v2 of the bench is going harder: adversarial scenarios, error injection mid-task, multi-step orchestration with broken intermediate state, 3 runs per model for variance, harder tasks. Real benchmark hygiene from someone running their own agent-loop trials.
💡#18
@grapeot
https://x.com/grapeot/status/2051734189054255164
The biggest change in AI tooling in 2 years isn't prompt complexity — it's that scaffolding is being commoditized. Prompt-engineering tricks get absorbed by models. Agent loops, file/shell access, test feedback, and context compression are now Claude Code / Codex / Cursor / OpenCode runtime features. What's left worth maintaining yourself: domain context, evals, permission boundaries, quality standards, judgment frameworks. The work is migrating from execution to boundary judgment.
💡#19
@TeksCreate
https://x.com/TeksCreate/status/2053151671966986735
DeepClaude is open-source: it runs Claude Code's agent loop with DeepSeek V4 Pro instead of Anthropic, 17x cheaper, and keeps multi-step reasoning, file ops, and debugging. Already running deepseek-v4-pro? You can do this today. The loop is now portable across providers: the harness ships, and the model gets swapped at the config level.
💡#20
@sentient_agency
https://x.com/sentient_agency/status/2045065544668528870
MiniCode shipped — Claude Code's open twin with the same agent loop, tool model, and TUI architecture, built to be understood. The replicated set: model→tool→model loop, review-before-write with unified diff, dynamic MCP over stdio, local skills via SKILL.md, reject-with-guidance pushing corrective instructions back mid-loop, run_command with single-string invocations, explicit background shell tasks. TypeScript reference + Rust + Python implementations. MIT.
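The replicated loop itself is tiny. A skeletal model→tool→model sketch in Python (MiniCode ships TypeScript, Rust, and Python implementations; the stub model below is mine, purely illustrative):

```python
def agent_loop(model, tools, task, max_turns=10):
    """Core loop: the model either picks a tool or answers; tool results
    are appended to the transcript and fed back to the model."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model(messages)
        if "answer" in action:                    # model decided it is done
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("loop budget exhausted")

# Stub model: read a file once, then answer with its contents.
def stub_model(messages):
    if messages[-1]["role"] == "tool":
        return {"answer": messages[-1]["content"]}
    return {"tool": "read_file", "args": {"path": "README"}}

files = {"README": "hello"}
out = agent_loop(stub_model, {"read_file": lambda path: files[path]}, "what's in README?")
```

The pieces MiniCode layers on top (review-before-write, reject-with-guidance, MCP over stdio) are all interceptions of the two arrows in this loop.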
💡#21
@OpenAIDevs
https://x.com/OpenAIDevs/status/2044466729712304613
A harness that keeps long-running agents on track. Manages the agent loop across tools, context, and traces. The sandbox preserves working state across pauses, retries, and resumptions. Posted as production-grade infrastructure, not a toy. The pattern is now mainstream enough that OpenAI ships the harness as a first-class artifact alongside the model.
💡#22
@m13v_
https://x.com/m13v_/status/2052940134077898852
Hooks are quietly the most underrated piece of any agentic loop. Claude Code shipped them first; most of the IDE-agent differentiation now lives in the hooks layer, not the model swap. The argument matters because it reframes the AI dev tool wars as harness wars, not model wars.
💡#23
@m13v_
https://x.com/m13v_/status/2053123934435029047
The hard part of agentic loops in production isn't the loop — it's the regression tail. What happens to your eval scores when a tool API silently changes its response shape on a Tuesday? Most teams skip a real eval harness and only catch it in prod. This is the second voice in the same day arguing that production agent reliability is an eval problem, not an architecture problem.
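A cheap first line of defense against that Tuesday is pinning the response shape your prompts were tuned on and failing loudly on drift. A sketch (the helper name and schema are hypothetical, not from the thread):

```python
def check_shape(response: dict, expected: dict) -> list:
    """Compare a tool's JSON response against the field types the loop
    was built on; return a list of drift findings (empty = no drift)."""
    problems = []
    for key, typ in expected.items():
        if key not in response:
            problems.append(f"missing field: {key}")
        elif not isinstance(response[key], typ):
            problems.append(f"{key}: expected {typ.__name__}, got {type(response[key]).__name__}")
    return problems

# Tuesday's silent change: `price` comes back as a string, `ts` disappears.
drift = check_shape({"price": "4.20"}, {"price": float, "ts": int})
```

Run it on every tool response inside the eval harness, not just in prod, and the regression tail becomes a failing test instead of a quiet score decay.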
💡#24
@mylifcc
https://x.com/mylifcc/status/2053100765674365070
The agentic loop is the easy part. What bites in prod isn't wrong tool output — it's the loop retrying a tool that already succeeded, or retrieval drifting 3 turns in. These are specific failure modes that don't show up in demos but kill production deployments. Worth bookmarking, because every team building a multi-step agent will hit this exact wall.
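One standard mitigation for the retry-after-success mode is an idempotency cache keyed on the tool call itself. A sketch (assumes replaying a stored result is acceptable for the tool; this is not the thread's code):

```python
import hashlib
import json

class ToolCache:
    """Key each tool call by (name, canonicalized args); on a retry,
    replay the stored result instead of re-executing the side effect."""
    def __init__(self):
        self.results = {}
        self.executions = 0   # counts real executions, for demonstration

    def call(self, name, fn, **args):
        key = hashlib.sha256(json.dumps([name, args], sort_keys=True).encode()).hexdigest()
        if key not in self.results:       # first time only: actually run it
            self.results[key] = fn(**args)
            self.executions += 1
        return self.results[key]

cache = ToolCache()
cache.call("charge", lambda amount: f"charged {amount}", amount=100)
cache.call("charge", lambda amount: f"charged {amount}", amount=100)  # loop retry: replayed
```

The design choice is that identity lives in the arguments, not the turn number, so a loop that loses track of its own state still can't double-execute a side effect.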
💡#25
@RoundtableSpace
https://x.com/RoundtableSpace/status/2047325872986755482
An /autobrowse skill inspired by Karpathy's autoresearch harness. Give the agent any web task and it explores the page, learns from failed attempts, and iterates until it finds a reliable workflow, getting smarter every time it runs. Downstream of pi-autoresearch, but applied specifically to browser automation.
💡#26
@romovpa
https://x.com/romovpa/status/2037193952357413058
Autoresearch can discover SOTA white-box adversarial attacks on LLMs. He gave Claude 30+ existing GCG-like algorithms and access to a compute cluster; Claude quickly learned to combine them into new methods that outperform all existing ones. The application is unsettling: autoresearch loops applied to offensive security research surface novel attacks faster than defenders can patch them.
💡#27
@iuditg
https://x.com/iuditg/status/2033370760690233573
500+ stars within 3 days of releasing her Autoresearch fork. The community-built ecosystem around Karpathy's original is now a small economy of variants, each tuned to a domain or hardware constraint.
💡#28
@jingwangtalk
https://x.com/jingwangtalk/status/2053006361596710945
Atari benchmark inversion: instead of training an RL policy to maximize reward, let Codex auto-research a rule-based program that maximizes the Atari game score. The author's read: autoresearch in this shape is heuristic learning + search, the same idea that drove operations research's tabu, genetic, and particle-swarm algorithms decades ago. Karpathy's "human-out-of-loop" framing is the same bet: design a good harness plus a verifiable reward, then let the agent search.
💡#29
@AnnikaSays via @petergyang
https://x.com/AnnikaSays/status/2052779293349224932
"Almost all of my chat-shaped work is now happening in Claude Code." Why: the context that sits on her machine gives 10x more usable output even when the type of exchange is the same. The agentic loop wins not because the model is smarter, but because the surrounding state lets the same model do useful work.
💡#30
@MemoriaDA_
https://x.com/MemoriaDA_/status/2052653191863369935
Open infrastructure for agent memory persistence: agents forget everything on restart, so MemoriaDA stores agent memories on 0G storage and anchors them onchain. Agentic loops at scale need memory that survives restarts and is auditable; the alternative is amnesiac agents that re-derive context every session.
📡 Eco Products Radar
💡#31
Tools/products that surfaced 3+ times across the day's autoresearch and agentic-loop discussions:

pi-autoresearch / Karpathy autoresearch (40+) — the reference implementation everyone forks, applies, or compares to.

Claude Code (50+) — the harness most autoresearch experiments are layered on, and the agent loop most often cited as the production benchmark.

Codex / OpenAI Codex (30+) — the parallel-execution counterpart; the /goal feature explicitly closed the autoresearch gap on May 8.

Cursor (10+) — the /orchestrate SDK shipped recursive agent spawning and concrete production receipts.

DeepSeek V4 Pro (10+) — model-swap target for cheap agentic loops; DeepClaude reuses Claude Code's harness with DS V4 underneath.

OpenClaw (15+) — referenced as the open agent runtime for autoresearch experiments; the Autobrowse Skill was the day's concrete receipt.

WorldSeed (5+) — autoresearch composition framework that returned 72 peer-reviewed papers from 3 agents overnight.

HALO / HALO-RLM (5+) — recursive self-improvement framework, +15.8 AppWorld points via harness-trace analysis.

DeepClaude (3+) — Claude Code agent loop running on DeepSeek V4, with a claimed 17x cost reduction.

MiniCode (3+) — open Claude Code twin for understanding the architecture from source.

Hooks (15+) — extensibility primitive, shipped first by Claude Code, now coming to Codex.

MCP / Model Context Protocol (10+) — the integration layer beneath every harness conversation today.

Skills / SKILL.md (15+) — the unit of reusable agentic expertise, increasingly the artifact autoresearch loops produce.

Stagehand (5+) — browser-side abstraction layer that makes the agent loop less brittle on web automation.

Polymarket (5+) — the market venue most often cited as an autoresearch loop target for trading agents.

Shopify (5+) — pi-autoresearch internal use, cited with hard production numbers.