May 14, 2026 · loop

Loop Daily: 2026-05-15

May 13 was the day Karpathy's autoresearch frame quietly went mainstream. The pattern — point an agent at a measurable objective, let it propose variants, accept the wins, roll back the rest, ratchet forward — kept showing up under five different names on the same day: autoresearch, agentic loop, agent loop, /goal mode, Ralph loop. The interesting thing isn't the rebranding. It's that the same loop is now eating ML research, business decisions, prediction markets, security auditing, and operational analytics in the same week. Below is what people actually shipped with it.
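Stripped of the branding, the loop is a greedy ratchet. A minimal sketch, assuming a measurable `evaluate()` and an LLM-backed `propose_variant()`, both hypothetical names, not anyone's published API:

```python
# The loop under all five names: propose a variant, measure it, keep wins,
# discard losses. evaluate() and propose_variant() are hypothetical callables:
# an objective you can measure, and an LLM-backed mutation of the config.
import copy

def autoresearch(config, evaluate, propose_variant, budget=100):
    best_score = evaluate(config)
    for _ in range(budget):
        candidate = propose_variant(copy.deepcopy(config))  # agent proposes a change
        score = evaluate(candidate)                         # measure the objective
        if score < best_score:                              # lower is better (e.g. val_bpb)
            config, best_score = candidate, score           # accept the win
        # else: roll back by simply discarding the candidate
    return config, best_score                               # the ratchet never regresses
```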
💡#1
@vesslai
https://x.com/vesslai/status/2054713187598307764
Ran Karpathy's autoresearch reference benchmark on VESSL Cloud Job + CLI. Beat the reference: val_bpb 0.9856 vs 0.9979 (lower is better), on the same H100, for the same $5.28 bill. Experiment time collapsed from 2 hours sequential to 40 minutes parallel. This is the data point that closes the loop on "is autoresearch real" — same hardware, same money, better result, faster.
💡#2
@MajorTimbWlf21
https://x.com/MajorTimbWlf21/status/2054440847459139721
Giving a talk at the CAISc autoresearch workshop on May 21, specifically because the standard "agents can't do science" angle is wrong. He and others have been using these systems for real neuro-inspired ML research, and he wants to walk through what actually works. He treats this as "genuinely the future" rather than a stunt.
💡#3
@lossfunk
https://x.com/lossfunk/status/2054433078861611457
Extended CAISc 2026 submissions to May 30 and scheduled three preconference workshops on the three concrete ways LLMs are automating science: agent skills for domain-specific tasks, autoresearch loops, and agentic coding tools for end-to-end research projects. Led by Rahul Sundar from Dhyuti Labs, Timb Wolf extending Karpathy's autoresearch for neuro-inspired ML, and Paras Chopra demoing autovoila for running research with CLI coding agents. Closest thing the field has to a coordinated curriculum on this stack.
💡#4
@leo_liuye
https://x.com/leo_liuye/status/2054563111181680870
Took the exact same Karpathy autoresearch loop for code and ran it on business decisions instead. Agent proposes a strategy, tests against two years of company data, iterates overnight. Woke up to the agent catching a pricing anomaly that would have cost the company six figures. He frames it as "SETI@home for operations" — a CFO-grade machine running while you sleep.
💡#5
@__marmikpandya
https://x.com/__marmikpandya/status/2054476689149927721
Building Pepper as an "IDE to engineer businesses" — multi-agent swarm plus Karpathy-style autoresearch optimization for any business metric. The concrete example: optimize activation rate. Pepper autonomously runs onboarding experiments via code changes + long-term monitoring. Same primitive, applied to the conversion funnel instead of a model architecture.
💡#6
@aeonframework
https://x.com/aeonframework/status/2054540257295548478
Aeon's 24 skills shipped into bankrbot's wallet. The two worth flagging beyond the crypto wrapping: autoresearch as a runnable skill that auto-upgrades any other skill (generates 4 variants, picks the best, never regresses), and skill security scan that audits skills before you run them (flags malware, hidden unicode, secret exfil). The autoresearch primitive went from research paper to wallet-installable skill in roughly six weeks.
💡#7
@ChrisHayduk
https://x.com/ChrisHayduk/status/2054400729708654608
Two different agent-loop archetypes side by side from one practitioner: an autoresearch-type setup running on a side project to optimize a protein structure prediction model, and a Ralph loop at work to implement PRDs and exec plans. The split lines up exactly — autoresearch when the objective is measurable and the search space is huge; Ralph loop when the spec is the constraint and the agent just needs to grind to done.
💡#8
@usr_bin_roygbiv
https://x.com/usr_bin_roygbiv/status/2054410511127597115
"Whenever I have idle compute I run autoresearch loops on evals to tune configs." The one-line version of what's about to become the default for any team that has a GPU sitting idle overnight. Idle compute as a research budget, with the agent as the night-shift researcher.
💡#9
@FormTalker
https://x.com/FormTalker/status/2054668529723703621
Quotes a concrete autoresearch outcome: Karpathy's loop ran 700 experiments in 2 days improving learning speed 11%; Shopify's CEO independently confirmed a 19% performance improvement overnight on his side. The numbers matter less than the time horizon — "improving while you sleep" is now a routine ops claim, not a marketing one.
💡#10
@JeremyNguyenPhD
https://x.com/JeremyNguyenPhD/status/2053082260132573517
Quoting Prof Jie Ding at the University of Minnesota: "I left 3 AI agents alone with a research problem overnight. They came back with 72 peer-reviewed papers." Ding open-sourced Autoresearch and WorldSeed, where you compose agents by talking. This is the academic-side proof point that overlaps with what private labs are doing in their own clusters.
💡#11
@IAmSandroSaric
https://x.com/IAmSandroSaric/status/2054672335291212214
The cleanest end-user write-up of the agentic loop autoresearch pattern. Four steps: clone a starter into your project folder, measure a number, write a program.md telling the agent what to try plus the constraints, let it rip. Worked example: cut a SaaS dashboard's JS bundle from 780 KB to under 500 KB while you go for a smoke. He calls out the failure mode by name — if you skip the constraints section the agent will "do whatever it wants." Applicable to landing pages, outreach copy, anything with a measurable outcome.
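He doesn't publish the file itself, so here is a hypothetical program.md in that shape, built from his four steps; the commands, paths, and constraints are illustrative, not the starter's actual contents:

```markdown
# program.md — hypothetical example, not the actual starter file

## Objective
Get dist/main.js below 500 KB (currently 780 KB).
Measure with: `npm run build && wc -c < dist/main.js`

## What to try
- Code-split routes; lazy-load the charting library
- Replace heavy dependencies with lighter equivalents
- Tree-shake and drop dead code

## Constraints (skip this section and the agent "does whatever it wants")
- `npm test` must pass after every change
- No visual changes on /dashboard or /settings
- Do not touch analytics or the feature-flag client
```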
💡#12
@its_brill_
https://x.com/its_brill_/status/2054636613154767324
Running an autoresearch loop on brillbet, a sports prediction product: raw data comes in, an engine decides which combinations to test, and results flow through the autoresearch loop, which decides whether a combination actually predicts game outcomes. The interesting workflow detail: he asked Codex to render the system as HTML rather than text, because "if the answer is hard to understand as text, don't ask for better text, ask for a better surface." A non-engineer prompting around the loop, not inside it.
💡#13
@palqa_
https://x.com/palqa_/status/2054626601258643842
Concrete cost numbers for anyone running long agent loops: a 30-step agent loop costs ~$24/run on Opus, ~$1.40 on Kimi 2.6, with "output quality nearly identical" for most coding tasks. His routing recommendation: Opus for architecture, Kimi for serious implementation, Haiku for cleanup, local for autocomplete. Refactor a 500-line file — Sonnet $0.12 vs Kimi 2.6 $0.04. Worth pinning if you're about to start running agent loops 24/7 against the June 15 SDK credit cap.
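The arithmetic behind those per-run numbers is worth making explicit. A sketch with assumed per-step token counts and $/M prices, chosen to roughly reproduce the quoted figures rather than taken from any price sheet:

```python
# Back-of-envelope loop cost: steps × (input tokens × input price + output
# tokens × output price). Token counts and $/M prices are assumptions chosen
# to roughly reproduce the quoted figures, not published list prices.
PRICES = {"opus": (15.00, 75.00), "kimi-2.6": (1.00, 3.00)}  # ($/M in, $/M out)

def loop_cost(model, steps=30, in_tokens=40_000, out_tokens=2_000):
    price_in, price_out = PRICES[model]
    per_step = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    return steps * per_step

for model in PRICES:
    print(f"{model}: ${loop_cost(model):.2f} per 30-step run")
# opus: $22.50 per 30-step run; kimi-2.6: $1.38 per 30-step run
```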
💡#14
@iamkunhello
https://x.com/iamkunhello/status/2054425049239879685
Counter-anchor for the same conversation. "One bad AI agent loop can burn $10k+. Real case: 49 parallel subtasks, 2.5 hours → $8k–$15k. Not a bug. Just math." The 200K context window is a "desk, not a brain" — what falls off is gone, and you pay to carry everything every turn.
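The "just math" part is that the growing context gets re-sent in full on every call, so cumulative input tokens grow roughly quadratically with turn count until the window caps them. A sketch under hypothetical per-turn sizes and an assumed input price:

```python
# Why "you pay to carry everything every turn" is the expensive part: the
# growing context is re-sent on each call, so cumulative input tokens grow
# roughly quadratically until the 200K window caps them. All sizes and the
# price are illustrative assumptions.
def cumulative_input_tokens(turns, start=10_000, growth=5_000, window=200_000):
    total, ctx = 0, start
    for _ in range(turns):
        total += ctx                      # re-sent in full, every turn
        ctx = min(ctx + growth, window)   # the "desk" eventually overflows
    return total

PRICE_PER_M = 15.00  # assumed $/M input tokens
for turns in (10, 50, 100):
    tokens = cumulative_input_tokens(turns)
    print(f"{turns} turns: {tokens:,} input tokens ≈ ${tokens * PRICE_PER_M / 1e6:.2f}")
```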
💡#15
@JustAnotherPM
https://x.com/JustAnotherPM/status/2054546468955148776
Real incident: a developer let Claude Code run against a production database with no guardrails, it deleted everything. His five-minute three-hook fix is the practical takeaway: PreToolUse hook to block any SQL containing DROP/DELETE/TRUNCATE against prod, PostToolUse hook to run tests after writes to deploy dirs, SessionStart hook to reject prompts touching files outside the project. "Hooks run outside the agent loop. They are deterministic. The model cannot override them." This is the safety surface most teams ship without.
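A minimal sketch of the first hook, assuming Claude Code's documented hook contract (the event arrives as JSON on stdin; exit code 2 blocks the tool call and feeds stderr back to the model); the regexes and the prod heuristic are illustrative, not his exact rules:

```python
#!/usr/bin/env python3
# PreToolUse hook: deterministic block of destructive SQL aimed at prod.
# Register it in .claude/settings.json under hooks -> PreToolUse with a
# "Bash" matcher. Exit code 2 blocks the call; stderr goes back to Claude.
# The patterns below are illustrative; tune them for your infrastructure.
import json
import re
import sys

event = json.load(sys.stdin)
command = event.get("tool_input", {}).get("command", "")

DESTRUCTIVE = re.compile(r"\b(DROP|DELETE|TRUNCATE)\b", re.IGNORECASE)
PROD_LIKE = re.compile(r"prod", re.IGNORECASE)  # crude target heuristic

if DESTRUCTIVE.search(command) and PROD_LIKE.search(command):
    print("blocked: destructive SQL against a prod-like target", file=sys.stderr)
    sys.exit(2)  # hooks run outside the agent loop; the model cannot override this
sys.exit(0)
```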
💡#16
@MindTheGapMTG
https://x.com/MindTheGapMTG/status/2054624155618738560
Pushback on the "orchestrate agents in a UI" framing that keeps coming out of Notion / Make / n8n. "Orchestration in a UI breaks at 3am when nobody's watching. Production agents need heartbeat cycles, scoped file permissions, and CLI recovery tools. Database views are great for demos. Terrible for debugging why your agent loop ate 40K tokens on a hallucination."
💡#17
@bettercallsalva
https://x.com/bettercallsalva/status/2054664488213688794
The honest take on /goal: it feels like the agent loop people described in papers two years ago, finally usable. The caveat is also the lever — it only works if your test/lint signals are tight, otherwise the agent thinks it's done while the build silently breaks. Tight feedback signal is the prereq, not the bonus.
💡#18
@_michaelmoreira
https://x.com/_michaelmoreira/status/2054529169166393441
Closes the agentic loop on a CI workflow: agent commits code → pipeline fails → auto-heal opens a fix PR in under 30 seconds → `floweasy status` in Claude Code surfaces the state. The deployment is unsexy but it's a real production loop with all the pieces — failure detection, autonomous fix, human-readable status — wired together in MCP.
💡#19
@simulx4
https://x.com/simulx4/status/2054659480034644301
Real economics for the long-context agent loop crowd. Codex CLI can bind to Cloudflare's Qwen3 deployment at $0.05 per million input tokens. If you need an agent to look at a million tokens inside a loop and make a call, that's now five cents. The arbitrage between premium API loops and open-weight inference loops keeps widening.
💡#20
@ozkatz100
https://x.com/ozkatz100/status/2054652225553666229
Tilde added Google Drive as a first-class source. Point Tilde at a folder and your agents can read PDFs, slides, docs, images, and videos as if on local disk — no SDKs, no auth handling inside the agent loop, no glue code. This is the unflashy piece of infrastructure that turns "my agent doesn't know about my company's actual files" into a non-issue.
💡#21
@LangChain_OSS
https://x.com/LangChain_OSS/status/2054641656222388700
LangChain shipped harness profiles for per-model tuning (with support for Kimi, Qwen, and DeepSeek), a code interpreter as a programmable runtime inside the agent loop, DeltaChannel for efficient agent checkpointing, and ContextHubBackend for skill/policy/memory storage. The interpreter-inside-the-loop piece in particular is what users have been asking for to replace 80% of the custom tool surface they otherwise build.
💡#22
@om_patel5
https://x.com/om_patel5/status/2054401992642936843
The cautionary tale on what happens when an agent loop runs without a curator. He inherited a 3-month backend the previous team had celebrated as "advanced agentic engineering": 220 route handlers (20 used), 309K lines of code under 240K lines of generated docs, and 1M+ lines of agent logs sitting in markdown. The loop ran; it just ran building features nobody asked for. He rewrote it in one week with Claude Code, keeping the same functionality plus real architecture. The lesson is taste, not orchestration: agent loops produce a lot, and most of it isn't shipping software.
💡#23
@AdeCubedinc
https://x.com/AdeCubedinc/status/2054510844436709766
The sharpest counter to the "rewrite-in-a-week" narrative. "An agent loop with no human doing aggressive cherry picking will produce 220 routes when you needed 20. That's not an orchestration problem, it's a taste problem." The rewrite worked because the first build did the discovery. The actual question for agent-loop operators isn't "more loops, fewer loops" — it's "who's deciding what gets shipped."
💡#24
@cantinasecurity
https://x.com/cantinasecurity/status/2054591347873681882
Apex, Cantina's autonomous AppSec agent, found three vulnerabilities Apple just patched in WebKit, including one 13-year-old bug. Two of the three were CSP bypasses. The pitch isn't better fuzzing; it's that an agent loop with the right tools can sweep a codebase the size of WebKit and surface real CVEs faster than any human review schedule. Security is now an autoresearch domain.
💡#25
@Ternoa_
https://x.com/Ternoa_/status/2054399741233160368
Ternoa shipped TIP Verify for Hermes Agent. Anyone running Hermes can now verify that their local install matches a known source snapshot registered on Ternoa zkEVM, with the full manifest stored on IPFS. As supply-chain attacks against AI agent ecosystems multiply (see the Shai-Hulud campaign in the Claude Code space the same day), this is the missing primitive: cryptographic proof that the agent code you're running is the agent code you think you're running.
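Strip the chain away and the primitive is plain hash-manifest verification. A generic sketch; the manifest format and paths are hypothetical, not Ternoa's actual TIP Verify scheme:

```python
# The primitive behind install verification: hash every file in the install,
# compare against a registered manifest. The manifest format ({relpath: sha256})
# is hypothetical, not TIP Verify's actual scheme.
import hashlib
import json
import pathlib

def file_digest(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_install(install_dir: str, manifest_path: str) -> bool:
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    root = pathlib.Path(install_dir)
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or file_digest(target) != expected:
            print(f"MISMATCH: {rel_path}")
            return False
    return True  # the code you're running is the code you think you're running
```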
💡#26
@0gclawforge
https://x.com/0gclawforge/status/2054515373584654800
Launched 0GClawForge, billed as the first complete OpenClaw-powered sovereign agent OS. The stack: TEE inference + permanent 0G memory + zero context loss + multi-agent system orchestration. The pitch is "mint, orchestrate, own and evolve" agent systems on chain. Whatever you think of the crypto wrapper, the technical idea — agents whose memory is verifiable and whose inference is in a trusted execution environment — is the part the agent-loop community is going to have to solve for production anyway.
💡#27
@MakeAI_CEO
https://x.com/MakeAI_CEO/status/2054701758434484488
Cites Bui et al.'s March 5, 2026 arXiv paper, which broke the agent loop's standard "plan → act → observe → adjust → repeat" into a six-stage cycle: precheck → thinking → self-critique → action → tool execution → postprocess. The interesting bit isn't the count — it's that "self-critique" became a named stage with a literature. Inserting an explicit critique step before action is what made the loop converge in places it didn't before.
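As control flow, a sketch of the six stages; every name here is a hypothetical stand-in for an LLM call or runtime step, and only the stage ordering comes from the paper. The load-bearing part is the critique gate sitting before action:

```python
# The six-stage cycle as control flow. precheck/postprocess are trivial
# stand-ins; llm and tools are hypothetical objects, not a real SDK.
def precheck(state):
    return True  # stand-in: is the task still undone and doable?

def postprocess(state, result):
    state["history"].append(result)  # fold the observation back into state
    return state

def agent_cycle(task, llm, tools, max_cycles=20):
    state = {"task": task, "history": []}
    for _ in range(max_cycles):
        if not precheck(state):               # 1. still worth acting?
            break
        plan = llm.think(state)               # 2. draft the next action
        critique = llm.critique(state, plan)  # 3. self-critique BEFORE acting
        if critique.rejects:
            state["history"].append(critique) # revise next cycle instead of acting
            continue
        action = plan.to_action()             # 4. commit to an action
        result = tools.execute(action)        # 5. tool execution
        state = postprocess(state, result)    # 6. postprocess the observation
    return state
```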
💡#28
@glitchtruth
https://x.com/glitchtruth/status/2054524505012506626
The frame nobody at Anthropic is saying out loud: "Codex shipped /goal-style autonomous loops months ago, Cursor's been doing background agents since Anthropic was still pushing single-turn tool use. The real tell is Claude Code 2.1 finally admitting the agent loop belongs in the client, not the model. Sonnet 4.6 is still the better coder per token, GPT-5.5 just has the harness lead right now." If the loop lives in the harness, the model is a swappable substrate. That's the upstream question pricing this week tried to answer.
💡#29
@hxiao
https://x.com/hxiao/status/2055052551318573552
"Terminology update: semi-supervised learning today is basically AK's autoresearch + steering. Unsupervised learning when?" The most compressed observation of the week. Autoresearch is not a new agent pattern, it's the next layer of the ML methodology stack getting rebranded for the LLM era.
💡#30
@ttorres
https://x.com/ttorres/status/2054611139623846155
Rebuilt AI-generated opportunity solution trees for Vistaly from the ground up after the demo prototype hit real usage. The actual breakthrough wasn't "fix it with more code" — it was letting the model self-correct, then wrapping that self-correction with validation tools. "An agent loop with validation tools can turn an impossible problem into something manageable." The pattern generalizes well past PM tooling.
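The shape of that wrapper, as a sketch; `llm.generate()` and `validate()` are hypothetical stand-ins for the model call and the deterministic checks:

```python
# Self-correction wrapped in deterministic validation: the model may retry,
# but only a validator decides "done". validate() returns a list of problems
# (schema violations, dangling references in the tree, broken invariants).
def generate_with_validation(llm, prompt, validate, max_retries=3):
    feedback = ""
    for _ in range(max_retries):
        output = llm.generate(prompt + feedback)
        errors = validate(output)      # deterministic checks; empty list means valid
        if not errors:
            return output              # validated, safe to surface to the user
        feedback = "\n\nFix these problems:\n" + "\n".join(errors)
    raise ValueError(f"no valid output after {max_retries} attempts")
```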
📡 Eco Products Radar

Karpathy's autoresearch — the named framework most of the loop conversation is downstream of this week.
Claude Code — still the default agent harness running these loops on the developer side; the /goal command made the loop "Ralph-style" out of the box.
Codex — explicitly cited as having shipped /goal-style autonomous loops "months ago," and the harness people are migrating to after the Anthropic SDK credit announcement.
Cursor — running background agents since before Anthropic shipped tool use; their /orchestrate skill spawns sub-agents recursively.
Hermes Agent — paired with OpenClaw across multi-agent setups, and the first agent runtime to get a public TIP Verify primitive on Ternoa.
OpenClaw — the multi-agent dispatcher most named outside the coding harness conversation; 0GClawForge is the first attempt at a "sovereign" version.
Kimi K2.6 — repeatedly cited as the "good enough for 90% of loop work" model at a fraction of Opus cost.
DeepSeek / Qwen3 / GLM — open-weight models showing up in LangChain harness profiles and Cloudflare deployments for cheap-loop economics.
Aeon framework — first to package autoresearch as a runnable skill on a consumer wallet (via bankrbot).
LangChain — shipped harness profiles, code interpreter inside the agent loop, DeltaChannel checkpointing, and ContextHubBackend this week.
VESSL Cloud — the platform where someone actually beat Karpathy's autoresearch benchmark on equal hardware.
Lossfunk / CAISc 2026 — the autoresearch academic gathering point this month.
Tilde — Google Drive as first-class source for agent reads.
WorldSeed — Prof Jie Ding's open-source companion to Autoresearch, compose agents by talking.