May 20, 2026 | loop

Loop Daily: 2026-05-19

This week the loop stopped being a niche obsession and became the headline. Karpathy's open-source autoresearch project got pulled into Anthropic, and the whole timeline finally clocked what the loop people have known for months: the frontier is no longer the smartest single answer. It is an agent given a script, a frozen metric and a compute budget, then left to iterate overnight. Underneath the acquisition noise, the real builders kept shipping the unglamorous parts: cost-optimized loops, compliance-baked loops, and honest benchmarks that show how far autonomous research still has to go. The same brutal lesson kept repeating: the model is half the equation and the loop is the other half.
💡#1
@ihtesham2005
https://x.com/ihtesham2005/status/2056802098822685052
He dug into the new AI for Auto-Research paper and pulled out the numbers that actually matter. The AI Scientist generated complete papers at roughly $15 each. FARS ran for 228 hours, burned 11.4 billion tokens, and produced 100 papers, one every 2.3 hours. ARIS ran more than 20 GPU experiments overnight, removed weak claims, and pushed a draft score from 5.0 to 7.5 through review-and-revision loops. His sharp take: the cost of producing a paper is collapsing, but the cost of trusting one is about to rise, so verification becomes the edge.
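The FARS throughput figures can be sanity-checked with quick arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, not taken from the paper.

```python
# Figures as quoted in the post; variable names are illustrative.
hours_total = 228        # total FARS runtime in hours
tokens_total = 11.4e9    # total tokens consumed
papers = 100             # papers produced

hours_per_paper = hours_total / papers
tokens_per_paper = tokens_total / papers

print(f"{hours_per_paper:.2f} hours per paper")
print(f"{tokens_per_paper / 1e6:.0f} million tokens per paper")
```

That works out to 2.28 hours per paper, which rounds to the quoted 2.3, and about 114 million tokens per paper.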
💡#2
@ChrisRyViss
https://x.com/ChrisRyViss/status/2056789460319392160
He laid out a full self-improving trading brain built on Karpathy-style autoresearch, one loop per sub-brain. There are four main sub-brains (backtested libraries, live market data, breaking news and test results), each feeding the next. Every night it runs EOD test reports that tweak weighted ratios and tactics, run multi-timeframe candlestick and value-gap analysis, and adjust the algorithms. The libraries feed in the knowledge while the live RSS feeds supply the variables, and the whole thing compounds toward ever-improving win-rate alerts. This is autoresearch pointed at a market instead of a training script.
💡#3
@dosco
https://x.com/dosco/status/2056551223495643418
He ran an experiment a while back: over a weekend, Codex built a zero-dependency reverse proxy in Go with HTTP/1, HTTP/2 and HTTP/3 support, which he reports as faster than both nginx and Cloudflare's Rust proxy, then improved it in an autoresearch-style loop. He gives full credit to Go's built-in testing and benchmarking support and its solid standard library, which let the model move fast and verify quickly. That is the whole trick to a working autoresearch loop: a tight, fast, objective feedback signal the agent can iterate against without a human in the chair.
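The shape of such a loop can be sketched in a few lines. This is a minimal keep-if-better sketch, not the setup from the post: `run_loop`, `toy_score` and `toy_mutate` are hypothetical names, and the toy score stands in for a real objective signal such as a benchmark run.

```python
import random

def run_loop(score, mutate, candidate, budget, seed=0):
    """Greedy loop: propose a change, measure it, keep it only if the score improves."""
    rng = random.Random(seed)
    best, best_score = candidate, score(candidate)
    for _ in range(budget):
        trial = mutate(best, rng)
        trial_score = score(trial)
        if trial_score > best_score:   # objective check, no human judgment involved
            best, best_score = trial, trial_score
    return best, best_score

# Toy stand-ins (assumptions): a real loop would score by running tests and benchmarks.
def toy_score(x):
    return -(x - 3.0) ** 2             # higher is better, peak at x = 3

def toy_mutate(x, rng):
    return x + rng.uniform(-0.5, 0.5)  # small random change to the current best

best, best_score = run_loop(toy_score, toy_mutate, candidate=0.0, budget=200)
print(best, best_score)
```

Because a trial is kept only when it scores strictly higher, the result can never be worse than the starting point. The faster and more trustworthy `score` is, the more trials fit into a fixed budget.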
💡#4
@SolRouterAI
https://x.com/SolRouterAI/status/2056755537149067582
They replaced the free-form agent loop with what they call Guided Reasoning Diagrams: Mermaid flowcharts the agent walks node by node. Action nodes call tools directly with zero LLM calls, and branches use a tiny solver costing about 100 tokens. The logic for how to research a token is encoded in the graph instead of being re-derived from scratch on every query. The result they report: 2x faster at a third of the cost versus a standard agentic loop. This is the quiet counter-trend: constraining the loop instead of letting it roam free.
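The idea of walking a fixed graph can be sketched as follows. This is an assumption-heavy illustration, not SolRouter's implementation: it uses a plain dictionary instead of a Mermaid flowchart, and `walk`, `fetch_listing`, `summarize`, `flag` and the node names are all hypothetical. The branch function stands in for the small solver.

```python
def walk(graph, start, state):
    """Walk the graph node by node and return the path taken."""
    path = []
    node = start
    while node is not None:
        path.append(node)
        spec = graph[node]
        if spec["kind"] == "action":
            spec["run"](state)            # direct tool call, no model involved
            node = spec.get("next")
        elif spec["kind"] == "branch":
            node = spec["choose"](state)  # stand-in for the small solver
        else:
            raise ValueError(f"unknown node kind: {spec['kind']}")
    return path

# Hypothetical tools; a real action node would call an actual API.
def fetch_listing(state):
    state["holders"] = 120                # hard-coded stand-in for a tool result

def summarize(state):
    state["report"] = f"{state['holders']} holders"

def flag(state):
    state["report"] = "too few holders, flagged"

graph = {
    "fetch": {"kind": "action", "run": fetch_listing, "next": "check"},
    "check": {"kind": "branch",
              "choose": lambda s: "summarize" if s["holders"] >= 100 else "flag"},
    "summarize": {"kind": "action", "run": summarize, "next": None},
    "flag": {"kind": "action", "run": flag, "next": None},
}

state = {}
path = walk(graph, "fetch", state)
print(path, state["report"])
```

Only the branch node makes a decision. Everything else is fixed by the graph, which is where the reported savings would come from.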
💡#5
@isaac_ar
https://x.com/isaac_ar/status/2056737136183833056
He built what he says is the first end-to-end AI agentic business workflow with compliance baked into the agentic loop itself, not bolted on afterward. Getting the architecture right took rebuilding the app three times over ten months. He landed his first enterprise clients for a pilot before raising any money. His closing line is the whole ethos of this corner of the ecosystem: you can just do things.
💡#6
@liliangjya5
https://x.com/liliangjya5/status/2056736692581892374
His ICML workshop paper makes a deceptively practical point: as teams deploy agents heavily, token cost becomes an infrastructure problem, and the fix is not where the tool runs but how much the agent has to think before it can act. The optimization path is script to CLI to hook. The headline numbers on a Claude Sonnet 4.6 report task, cold cache: a plain script saves only 2.2%, a lazy MCP 10.9%, a CLI 56.4%, and a hook 80.5%. Same task, wildly different agent decision volume. The lesson is to make the right tool visible at the right time with the smallest useful schema.
💡#7
@aipulseda1ly
https://x.com/aipulseda1ly/status/2056866007851880568
He ran the same Gemini 3.5 Flash model two ways and the gap is the whole story of the loop. In the AI Studio chat UI with high thinking, it one-shotted 800 lines in 10 seconds, zero errors. Through the Antigravity Agent API, the exact same model produced 1800 lines with far more detailed architecture, spending 4 minutes thinking across 4 iterations on a single pass. His point lands: this is not a model upgrade, it is what happens when you put a model in an agentic loop with real compute behind it.
💡#8
@Danny_H_W
https://x.com/Danny_H_W/status/2056831148349632688
A talk at React Native OPO previews a genuinely non-pretraining use of autoresearch, titled "Stop Profiling, Start Prompting, Infinite Auto-Research Loops with MCP." The pitch is to connect an auto-research skill to a React Native app via metro-mcp and let it autonomously improve the app's performance. This is the loop pattern escaping the ML lab: an agent iterating on real performance metrics in a production mobile codebase instead of a training script. Worth watching as autoresearch generalizes beyond model R&D.
💡#9
@hasantoxr
https://x.com/hasantoxr/status/2056808307155984570
Someone built ARGO, a local agent that runs Manus-style autonomous task execution 100% on your laptop: no cloud, no monthly fee, no data leaving the machine. You describe a task and it plans the steps, calls the tools, runs the loop and writes the report, all offline. The reason it works where most local tools fail: it ships a full multi-agent task engine (intent recognition, planning, execution, tool calling, self-reflection and self-summary), plus a human-in-the-loop step where you tweak the plan in natural language before it runs. The autonomous loop, fully self-hosted.
💡#10
@IntologyAI
https://x.com/IntologyAI/status/2056764236668493868
They released NanoGPT-Bench, an internal eval that tests agents on a real AI R&D problem with months of human progress behind it. The headline result is a cold shower: Codex, Claude Code and Autoresearch recover only 9.3% of human progress, mostly by tuning hyperparameters while ignoring actual algorithmic research. Evaluation is fully autonomous and end-to-end, no human intervention, no internet, standardized to a 5-month window of world records. It is the most honest data point on where autonomous research loops actually stand today.
💡#11
@nateberkopec
https://x.com/nateberkopec/status/2056553254763528359
He pointed autoresearch at a real-world systems problem and got a clean negative result, which is exactly what a good loop is for. Running it against MALLOC_CONF with an actual Rails app, he still sees no combination that improves resident memory without also costing throughput. His recommendation: do not set MALLOC_CONF at all. This is autoresearch as an honest experiment runner, not a hype machine. Sometimes the most valuable thing the loop tells you is that there is nothing there.
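The acceptance test behind a negative result like this can be sketched as a simple dominance check. Everything here is an assumption for illustration: the names `Result`, `strictly_better` and `winners` are hypothetical, the configuration labels are placeholders, and the numbers are made up to show the logic. They are not his measurements.

```python
from typing import Dict, List, NamedTuple

class Result(NamedTuple):
    rss_mb: float      # resident memory, lower is better
    req_per_s: float   # throughput, higher is better

def strictly_better(trial: Result, baseline: Result) -> bool:
    """True only if the trial lowers memory without lowering throughput."""
    return trial.rss_mb < baseline.rss_mb and trial.req_per_s >= baseline.req_per_s

def winners(baseline: Result, trials: Dict[str, Result]) -> List[str]:
    return [name for name, r in trials.items() if strictly_better(r, baseline)]

# Placeholder numbers, invented for illustration only.
baseline = Result(rss_mb=500.0, req_per_s=100.0)
trials = {
    "config_a": Result(rss_mb=450.0, req_per_s=92.0),   # less memory, slower
    "config_b": Result(rss_mb=510.0, req_per_s=101.0),  # faster, more memory
    "config_c": Result(rss_mb=480.0, req_per_s=97.0),   # less memory, slower
}

print(winners(baseline, trials))
```

With these placeholder inputs the list is empty: every trial trades one metric for the other. An empty list is the kind of result that supports leaving the setting alone.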
💡#12
@caspar_br
https://x.com/caspar_br/status/2056542918463394038
He calls out a sleeper feature in Claude Managed Agents: interpreters inside the agent loop. Imagine a cloud agent handed a 10,000-row CSV of support tickets. Without a code runtime it is stuck reasoning over raw rows in context, slow and lossy. With an interpreter it writes code in one turn to parse the CSV, group tickets by category, count and sort by frequency, sample three bodies per group and return a small table. A lighter-weight runtime living right inside the loop handles the sandbox-shaped jobs, no bash required. This is how loops stop choking on real data.
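The kind of code an agent might write in that one turn can be sketched like this. It is a minimal sketch under stated assumptions: the column names `category` and `body` and the function name `summarize_tickets` are hypothetical, the sample rows are invented, and it takes the first three bodies per group rather than a random sample.

```python
import csv
import io
from collections import defaultdict

def summarize_tickets(csv_text, samples_per_group=3):
    """Group tickets by category, sort by frequency, keep a few sample bodies."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["category"]].append(row["body"])
    # Most frequent category first; ties broken alphabetically.
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {"category": cat, "count": len(bodies), "samples": bodies[:samples_per_group]}
        for cat, bodies in ranked
    ]

# Invented sample data standing in for the 10,000-row file.
sample_csv = """category,body
billing,Charged twice
login,Cannot reset password
billing,Refund not received
billing,Invoice missing
billing,Wrong currency
login,Two-factor code never arrives
"""

table = summarize_tickets(sample_csv)
for row in table:
    print(row["category"], row["count"], row["samples"])
```

The model then only has to read the small table, not thousands of raw rows.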
💡#13
@koder0x
https://x.com/koder0x/status/2056793113763479747
He pushes back hard on the cult of long-running autonomy: delegating long tasks to AI agents is a bad bet, not because AI can't do it, but because you don't know step 3 is wrong until step 6 is done, and every downstream step inherits the poison. The supervision math only works if nothing goes wrong, and something always does. His prescription is tight loops, short delegation, verify, next step, which he argues are not slower, just safer at the same speed. A useful counterweight to the optimize-for-delegation-length narrative.
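The tight-loop prescription can be sketched as a runner that verifies after every step and stops at the first failure, so later steps never build on a bad result. The names `run_with_checks`, `good_step`, `bad_step` and `all_non_negative` are hypothetical, and the toy steps and check are stand-ins for real work and real verification.

```python
def run_with_checks(steps, verify, state):
    """Run steps one at a time; stop at the first step whose output fails verification."""
    for index, step in enumerate(steps, start=1):
        candidate = step(state)
        if not verify(candidate):
            return state, index      # last good state and the failing step number
        state = candidate
    return state, None               # every step passed

# Toy stand-ins: each step appends a value, and step 3 produces a bad one.
def good_step(value):
    return lambda state: state + [value]

def bad_step(state):
    return state + [-1]

def all_non_negative(state):
    return all(v >= 0 for v in state)

steps = [good_step(1), good_step(2), bad_step, good_step(4), good_step(5), good_step(6)]
final_state, failed_at = run_with_checks(steps, all_non_negative, [])
print(final_state, failed_at)
```

Here the run halts at step 3 with the state from step 2 intact, so steps 4 through 6 never run on top of the error.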
💡#14
@usr_bin_roygbiv
https://x.com/usr_bin_roygbiv/status/2056748460829761628
His agent harness progression is a one-line history of where the loop crowd actually went: claude code, then codex, then droid, then custom Python tmux Codex pipes, then pi with custom extensions for agent swarms and autoresearch, and finally the realization that omp had already done everything he was building, better and faster, at the same time. The honest arc of a power user chasing the autonomous loop and discovering the frontier had already lapped him.
📡 Eco Products Radar

Autoresearch (Karpathy) - the open-source overnight-experiment loop that Anthropic acquired this week; the gravitational center of the whole conversation.
Claude Managed Agents - Anthropic's API that splits the agent loop from the sandbox, with self-hosted execution and interpreters living inside the loop.
Google Antigravity - the standalone agent-first desktop and now an Agent API, where the same model behaves very differently once it is inside a real loop.
Codex - the harness people actually run autoresearch-style loops in, from reverse proxies to overnight builds.
NanoGPT-Bench (Intology) - the cold-shower eval showing agents recover just 9.3% of human AI R&D progress.
pi / omp - the power-user harnesses for agent swarms and autoresearch that the frontier crowd keeps migrating between.