May 18, 2026 · loop

Loop Daily: 2026-05-18

Karpathy's "Loopy Era" framing finally paid off over the weekend. What started as a phrase on the No Priors podcast became a movement on Saturday: people are now publicly burning weeks of compute to let agents iterate on their own. The headline cases tell the story. Somebody let GPT-5.5 run autoresearch in /goal mode for 150+ hours, and it's still improving. ChrisHayduk is openly bullish on autoresearch for biology because bio is more "intelligence constrained" than LLM work. ARIS open-sourced a harness for overnight science in which executor and reviewer come from different model families, so they can't share blind spots. Meanwhile, the methodology layer is consolidating around one phrase Anthropic's Cat Wu used: "the harness is the product." The bet underneath all of these is the same: keep the harness lean, let the model run longer, and watch the loop converge to something a human couldn't produce in a quarter. Below are the most concrete loops shipped Saturday.

💡#1

@hive_echo
https://x.com/hive_echo/status/2055787667699421686
GPT-5.5 has been running in /goal mode as autoresearch for 150+ hours straight and is still improving. The author isn't even sure if it's still running. This is the single cleanest data point for the "loopy era" — a human-scale week of continuous self-directed work that hasn't hit a clear stopping condition.

💡#2

@ChrisHayduk
https://x.com/ChrisHayduk/status/2055786499090596113
Bullish on autoresearch for AI-in-biology specifically because bio is "intelligence constrained" while LLMs aren't. His argument: bio has many niche domains each needing their own datasets and inductive biases, so talent gets diluted across many problems. Where LLM researchers cluster on a few hyper-generalizable problems, bio researchers don't have that luxury. Autoresearch — agents running thousands of parallel experimental iterations — is therefore a bigger multiplier in bio than in LLM work.

💡#3

@ChrisHayduk
https://x.com/ChrisHayduk/status/2055771833400488227
A concrete commitment: he is starting a minimal reimplementation of AlphaFold 3 to run an autoresearch loop on top of it. This is what "the loopy era for biology" looks like in practice: take a frontier model architecture, reimplement it minimally, then let agents iterate on improvements overnight. It is the same pattern Ole Lehmann ran on landing-page skills (56% → 92%), now pointed at protein structure prediction.

💡#4

@ChrisHayduk
https://x.com/ChrisHayduk/status/2055758091526799404
The code and starting vision for the autoresearch run are public. What's unusual here is the transparency: most autoresearch experiments are framed as black-box demos, but Hayduk is publishing both the run target and the run state so others can fork the loop.

💡#5

@Xudong07452910
https://x.com/Xudong07452910/status/2055789648233005382
ARIS — extremely lightweight autoresearch harness that works with Claude Code / Codex / Cursor / Trae / Chinese models. Reads papers → finds weaknesses → generates ideas → designs experiments → iterates → writes full papers + rebuttal prep + slides + poster. Pure Markdown skills, no framework lock-in, swap models freely. The framing: "daytime you direct, nighttime AI explores like crazy, wake up the paper has leveled up."

💡#6

@itarutomy
https://x.com/itarutomy/status/2055501326948143127
The serious technical read of ARIS. Identifies the real risk in long-horizon agent research: it's not "AI breaks" — it's "AI plausibly lies." Numbers can be real but the supporting evidence is thin. ARIS calls this "plausible unsupported success" and treats it as the #1 risk for single-agent long-horizon work. Solution: executor and reviewer in different model families (e.g. Claude executor + GPT reviewer), so they don't share blind spots. Plus a 3-stage audit cascade — experiment integrity check, evidence-claim mapping, numerical consistency check across the paper — each handled by independent AI. One 8-hour overnight run: review score 5.0 → 7.5/10, 20+ GPU experiments executed, unsupported claims auto-removed. The paper itself was drafted with ARIS in the loop.
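To make the audit cascade concrete, here is a minimal Python sketch of the idea described above. It is not ARIS's code: the `Experiment` and `Claim` types, the `check_families` and `audit` functions, and the tolerance are all hypothetical names chosen for illustration, and simple deterministic checks stand in for the independent AI reviewers that ARIS reportedly uses at each stage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Experiment:
    exp_id: str
    metric: float   # value recorded in the run log
    has_logs: bool  # whether raw logs survived the run


@dataclass(frozen=True)
class Claim:
    text: str
    exp_id: Optional[str] = None      # experiment cited as evidence (hypothetical field)
    reported: Optional[float] = None  # number quoted in the draft (hypothetical field)


def check_families(executor_family: str, reviewer_family: str) -> None:
    """Refuse to pair an executor and a reviewer from the same model family."""
    if executor_family == reviewer_family:
        raise ValueError("executor and reviewer must come from different families")


def audit(claims, experiments, tol=1e-6):
    """Run the three audit stages and return (kept, removed) claims."""
    # Stage 1, experiment integrity: only runs that left logs count as evidence.
    trusted = {e.exp_id: e for e in experiments if e.has_logs}
    kept, removed = [], []
    for claim in claims:
        # Stage 2, evidence-claim mapping: the claim must cite a trusted run.
        exp = trusted.get(claim.exp_id)
        if exp is None:
            removed.append(claim)
            continue
        # Stage 3, numerical consistency: the quoted number must match the log.
        if claim.reported is not None and abs(claim.reported - exp.metric) > tol:
            removed.append(claim)
            continue
        kept.append(claim)
    return kept, removed
```

Calling `check_families("claude", "gpt")` passes silently, while pairing two models from the same family raises. A claim that cites no experiment, cites a run without logs, or quotes a number that disagrees with the log ends up in `removed`, which is the "unsupported claims auto-removed" behavior in miniature.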

💡#7

@itarutomy
https://x.com/itarutomy/status/2055610989521801323
A systematic paper on multi-agent RL — Dr. MAS — showing that single-agent RL methods like GRPO destabilize when applied directly to multi-agent coordination. Biggest problem: "credit diffusion" — the longer the orchestration trace, the more noise dilutes which decision actually drove the outcome. KIMI's PARL gets around this with a staged annealing reward formula. Credit attribution is broken into 8 layers (team → orchestrator → role → agent → turn → message → tool → token). One huge open gap: zero existing methods explicitly RL-train the orchestrator's "when to stop" decision. Current systems just terminate on external rules. Cites Claude Code's subagent feature and Anthropic's 16-parallel-Claude C-compiler case as industrial proof points.

💡#8

@connordavis_ai
https://x.com/connordavis_ai/status/2055575644881494389
Cat Wu's Ars Technica interview is the cleanest articulation of where Claude Code is heading. Two phrases jumped out — "lean harness" and "usage limits are a transparency problem, not a pricing problem." Wu's framing: the harness (planner, tool router, file system loop, eval loop, memory layer) is being kept deliberately thin. Not because Anthropic can't ship more — because every layer eats tokens, slows the model, and locks users into specific abstractions. The long bet is that the model improves faster than the harness can be optimized, so keep the harness minimal and let the model do more lifting. This is the opposite design philosophy from most coding agents.

💡#9

@sudoingX
https://x.com/sudoingX/status/2055548902099894480
Operator-grade tuning advice for Hermes Agent on local models. The agentic loop has three configurable knobs that matter on slow inference: bump max_turns from 30 to 50 (frontier defaults are too tight for local models), raise gateway_timeout from 600 to 1200 (at 12-17 tok/s, requests time out silently and look like crashes), and turn on auto-reset for context (sessions accumulate until you /reset, choking the loop). If you're running anything under 20 tok/s locally, these three are the difference between "broken" and "flying."
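The timeout advice is easier to judge with the arithmetic written out. The sketch below assumes gateway_timeout is measured in seconds and bounds a single generation, and it ignores prompt-processing time; none of that is confirmed by the tweet, and `max_tokens_before_timeout` is a hypothetical helper, not part of Hermes Agent.

```python
def max_tokens_before_timeout(tokens_per_second: float, timeout_seconds: float) -> int:
    """Upper bound on tokens a model can emit before the timeout fires."""
    return int(tokens_per_second * timeout_seconds)


# Compare the old and new timeout at both ends of the quoted speed range.
for tps in (12, 17):
    for timeout in (600, 1200):
        budget = max_tokens_before_timeout(tps, timeout)
        print(f"{tps} tok/s, {timeout}s timeout: up to {budget} tokens")
```

Under these assumptions, a 600-second limit caps a 12 tok/s model at 7,200 generated tokens per request, and doubling the limit raises that to 14,400. A long tool-heavy turn can plausibly exceed the smaller budget, which would be consistent with the silent timeouts described above.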

💡#10

@hu_yifei
https://x.com/hu_yifei/status/2055458233779962142
"I spend more than $2000 per month on codex. I use api key to ignore rate limits. If there's a $2000 plan that can support my daily auto research use, I would be happy to switch." This is the consumption ceiling — somebody actually wants to pay $2K/month for unbounded autoresearch. Signals an emerging tier of customer who needs the loop to never stop.

💡#11

@nanobot_project
https://x.com/nanobot_project/status/2055654391424913861
Release notes for a lean open-source agent framework: /goal for sustained objectives across turns, image generation end-to-end, WebUI in the wheel, 5 new providers + fallback_models, and "a real agent-loop refactor." 105 PRs, 33 contributors, 20 newcomers in one cycle. The agent-loop refactor line is what to watch — open-source loops are catching up to Claude Code's harness model fast.

💡#12

@BretKerr
https://x.com/BretKerr/status/2055696079874609183
Heavy-duty agentic loop implementation in production: building a verifier-grounded book about Anthropic. Pipeline: BM25 + KNN (voyage-3-large) in parallel → Reciprocal Rank Fusion → Voyage rerank-2 → Claude Sonnet agentic loop with 3 tools (fetch_neighbors, search_again, done; 4-iteration max) → Claude extracts a verbatim passage + citation. Within that 4-iteration cap, the stop condition is Claude itself deciding when it has enough, rather than a fixed depth. The verifier is a literal substring match after normalization, so if Claude hallucinates or paraphrases, the quote doesn't make it into the book. The verifier turns the corpus from "unattestable memory" into permanently citable source material. The attestation layer is the moat, not the generation layer.
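Two steps of this pipeline are small enough to sketch: Reciprocal Rank Fusion and the verbatim verifier. The Python below is an illustration, not the author's code. The constant `k=60` is the value commonly used for RRF, and the normalization rules (Unicode NFKC, case folding, whitespace collapsing) are assumptions, since the tweet does not say what "normalization" covers.

```python
import re
import unicodedata
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(quote: str, source: str) -> bool:
    """Accept a quote only if it appears literally in the source text."""
    needle = normalize(quote)
    return bool(needle) and needle in normalize(source)


bm25_hits = ["a", "b", "c"]  # hypothetical lexical ranking
knn_hits = ["b", "c", "d"]   # hypothetical embedding ranking
print(reciprocal_rank_fusion([bm25_hits, knn_hits]))
```

Documents that both retrievers rank highly rise to the top of the fused list. On the verifier side, a paraphrase such as "the harness is a product" fails against a source that says "the harness is the product", which is the property that keeps hallucinated or reworded quotes out of the book.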

💡#13

@thejayden
https://x.com/thejayden/status/2055745679599804848
One-line prompt that's the simplest entry to the self-improving loop: "Turn this workflow into a self-improving SKILL.md system that compounds after every run." Saturday's most repostable framing of the loop concept — anyone with a Claude Code session can run this against any workflow they have.

💡#14

@scion_enjoyer
https://x.com/scion_enjoyer/status/2055573970372448269
Richard Socher's Recursive Superintelligence emerged from stealth with $650M. The framing isn't "another AI startup" — it's that the funded thesis is now "systems that identify their own weaknesses and improve themselves." The race is shifting from "best chatbot" to "best self-improving research engine." Watch this as the first $650M-class commitment that's explicitly an autoresearch-loop bet.

💡#15

@Basemail_ai
https://x.com/Basemail_ai/status/2055491563145543891
Tactical roundup: Nof1 raised $15M from SUI Group to build Alpha Arena, where AI agents compete in live financial markets. Recursive Superintelligence emerged from stealth with $650M at $4.65B (NVIDIA / AMD / GV backed), building self-improving AI. Fiserv picked OpenAI to bring agent tech to financial institutions. WSPN W Agent, NEAR private USDC, and Circle Agent Stack are all shipping agent payment rails. The frame: AI agents are becoming first-class financial actors, and identity / accountability is now the gating problem.

💡#16

@TheValueist
https://x.com/TheValueist/status/2055779908098412608
"$NVDA $MU $SNDK $LITE Don't forget about the power and future development of autoresearch." Short tweet, 4400+ impressions, makes the connection most builders are missing — autoresearch is a compute-thirsty workload, which routes back through memory and optical infra. The macro thesis of the loop era.

💡#17

@Quasymodo71
https://x.com/Quasymodo71/status/2055559893923377216
PrimeIntellect Lab's launch signals that hosted autoresearch runtimes are moving from the validation phase into the competition phase. The product is strong and demand is real, but it remains a "vendor-centric island": the coordination layer is missing. The companion tweet (3/N in the thread) follows Karpathy's framing that the future is large-scale asynchronous SETI@home-style agent networks, not single agents. No single vendor can become the global coordination fabric for that.

💡#18

@rcmisk
https://x.com/rcmisk/status/2055471140970123548
The methodology bottom-line: "autoresearch. The architecture is thin harness + fat skills. The rest is implementation detail. If you've read 3 of the 6 above, you already know more than 90% of people building agents right now." Cleanest one-line framing of where the field has converged.

💡#19

@rcmisk
https://x.com/rcmisk/status/2055471136259846620
Concrete autoresearch case to copy: Ole Lehmann's landing-page skill went from 56% to 92% with zero manual work. Karpathy's autoresearch theory turned into a runnable skill that any Claude Code user can fork. This is the case study most builders should read first.

💡#20

@editxshub
https://x.com/editxshub/status/2055589245893714345
"hooks turn codex from a tool into infrastructure. validators, pre-commit checks, automated review. this is the agent loop you can trust in production. shipped quietly. way more important than the mobile app post got." Captures the real lesson from Codex hooks shipping — production-grade agentic loops require deterministic checkpoints, not just stronger models.

💡#21

@TravelerOfCode
https://x.com/TravelerOfCode/status/2055490820632203433
"Our team rebuilt every internal tool as MCP servers + agent loops and the UI became a debug artifact. Headless is just the agent IS the interface now." The headless-first design philosophy in one line. Agents stop being "an assistant in a UI" and become "the runtime that occasionally needs a UI to debug."

💡#22

@PsudoMike
https://x.com/PsudoMike/status/2055448731491700996
"This is the agent loop everyone keeps underselling. Once the tool surface is stable, /goal turns into a competent planner. Model gets the credit but the tools do most of the actual work." 4600+ impressions, captures the truth most Claude Code commentary misses — the loop's competence comes from the tool surface plus /goal, not the model upgrade.

💡#23

@stometaverse
https://x.com/stometaverse/status/2055480352312004746
"The agentic CLI space is getting crowded — Claude Code, Cursor, Codex, now Grok Build. What'll matter most is agent loop reliability. Gating behind Heavy subscription shows xAI treats this as a real revenue line." The reliability framing matters: the differentiator is no longer the model, it's how often the loop doesn't stall.

💡#24

@matt_diak
https://x.com/matt_diak/status/2055453120080248881
"Once the agent loop is solid the screen becomes a monitoring surface, not a workspace. I kicked off a few agents this morning from my phone and only sat down to review diffs. The hard part is trust calibration." This is the lived experience of the loopy era — the screen demotes itself from primary workspace to monitoring dashboard, and the human bottleneck moves to trust calibration.

💡#25

@im_comatose
https://x.com/im_comatose/status/2055720812448235583
"The Agentic Loop (The Fiverr Killer): Old way — human always approves payment → loop broken. New way — Agent A hires Agent B → escrow locks → work delivered → auto-release. Zero human intervention. This is how machine-to-machine economies actually scale." Worth saving as the cleanest articulation of why payment friction is the current ceiling on multi-agent autonomy.
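The hire → lock → deliver → auto-release flow is a small state machine. The Python sketch below is purely illustrative: the `Escrow` class, its in-memory balances, and the `accept` callback are hypothetical, and no real payment rail or escrow API is implied.

```python
from enum import Enum, auto


class State(Enum):
    OPEN = auto()
    LOCKED = auto()
    RELEASED = auto()


class Escrow:
    """Toy escrow between a hiring agent (payer) and a working agent (payee)."""

    def __init__(self, payer: str, payee: str, amount: int):
        self.payer, self.payee, self.amount = payer, payee, amount
        self.state = State.OPEN
        # For simplicity the payer starts with exactly the job amount.
        self.balances = {payer: amount, payee: 0}

    def lock(self) -> None:
        """Agent A hires Agent B: funds leave A and sit in escrow."""
        if self.state is not State.OPEN:
            raise RuntimeError("funds can only be locked once")
        self.balances[self.payer] -= self.amount
        self.state = State.LOCKED

    def deliver(self, work: str, accept) -> bool:
        """Release funds automatically if the programmatic check accepts the work."""
        if self.state is not State.LOCKED:
            raise RuntimeError("nothing is locked for this job")
        if not accept(work):
            return False  # funds stay locked; no human is asked to approve
        self.balances[self.payee] += self.amount
        self.state = State.RELEASED
        return True
```

The design choice that matters is that `accept` is code rather than a person. The loop stays closed only if delivery can be checked mechanically, which ties this item to the identity and accountability problem raised elsewhere in this issue.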

💡#26

@stevehou
https://x.com/stevehou/status/2055655476939882877
"Just like there's FOMO in the stock market for the latest hot AI stocks, I'm starting to think that there's FOMO in enterprise adoption of Anthropic's Claude, esp Claude Code." Enterprise FOMO is now a measurable Claude Code adoption driver — combined with The Verge story about Microsoft revoking internal Claude licenses, this is one of the clearest signals the loop is being noticed at the org level.

💡#27

@mildsky1215
https://x.com/mildsky1215/status/2055441667730321672
"Each post = experiment. Karpathy AutoResearch pattern applied. 24h after publish, engagement_analyzer scores by reply, bookmark, retweet, like, over views. Weekly retro reads the log, drops losing forms, weights up winners. System rewrites itself." Smallest possible example of autoresearch applied outside ML — a content-writing self-improvement loop. Anyone with a Twitter account can replicate this stack.
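The loop described in the tweet can be sketched in a few lines of Python. The signal weights, the drop threshold, the boost factor, and the post dictionary shape are all assumptions made for illustration; the tweet names only the signals (reply, bookmark, retweet, like, over views) and the weekly "drop losers, weight up winners" step.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical weights; the tweet lists the signals but not their values.
WEIGHTS = {"reply": 4.0, "bookmark": 3.0, "retweet": 2.0, "like": 1.0}


def engagement_score(post: dict) -> float:
    """Weighted engagement per view, measured 24h after publishing."""
    if post["views"] == 0:
        return 0.0
    weighted = sum(w * post.get(signal, 0) for signal, w in WEIGHTS.items())
    return weighted / post["views"]


def weekly_retro(log, form_weights, drop_below=0.01, boost=1.5):
    """Drop post forms that scored below the threshold; boost the best one."""
    by_form = defaultdict(list)
    for post in log:
        by_form[post["form"]].append(engagement_score(post))
    if not by_form:
        return dict(form_weights)  # nothing was posted, so nothing changes
    averages = {form: mean(scores) for form, scores in by_form.items()}
    best = max(averages, key=averages.get)
    updated = {}
    for form, weight in form_weights.items():
        score = averages.get(form)
        if score is not None and score < drop_below:
            continue  # losing form: removed from next week's mix
        updated[form] = weight * boost if form == best else weight
    return updated
```

Forms that were not posted during the week keep their weight, so the system does not discard options it never tested. Feeding the returned weights into next week's sampling is what closes the loop.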

💡#28

@Quasymodo71
https://x.com/Quasymodo71/status/2055559898033758225
"As Karpathy highlighted: the future isn't a single agent. It's large-scale, asynchronous, SETI@home-style agent networks. No single vendor platform can become the global coordination fabric for that." This is the macro disagreement underneath the current consolidation wave — if Karpathy is right, the next infrastructure layer to build isn't a better agent runtime, it's a coordination protocol between agents.

💡#29

@m13v_
https://x.com/m13v_/status/2055768947212124281
"MCP is the one that changes most once you actually use it. A tool can do more than fetch data, it can drive real macOS apps via accessibility APIs, so the agent loop stops ending at the terminal. We built macOS MCP exactly for that, it drives apps through the accessibility tree so the loop runs past the terminal." MCP as the bridge that lets agentic loops escape the terminal sandbox into native apps — quietly one of the biggest architectural shifts of the quarter.

💡#30

@TheWeb3Patriot
https://x.com/TheWeb3Patriot/status/2055588630849110084
DKG v10 Bounty connecting ChatGPT, Claude, OpenClaw, Hermes etc. to a 3-layer trust gradient memory (Working = raw agent scribbles, Shared = collaborative context, Verified = blockchain-anchored). The pitch: this is the open substrate for real multi-agent swarms and Karpathy-style autoresearch loops, and agent-native writes are the missing piece flagship builds should target.

💡#31

@JulianGoldieSEO
https://x.com/JulianGoldieSEO/status/2055571979420488130
The "company of agents" pattern with an org chart instead of a single agent. CEO agent, marketing agent, SEO agent, content agent, support agent. Set the mission once → build the team → drop tickets → agents wake up on schedule, pick up tasks, do the work, report back. Can plug in Claude Code, Codex, OpenClaw, Pi, or Cursor as different roles. Solo agent waits for prompts; company of agents works toward your mission while you sleep.

📡 Eco Products Radar

ARIS — autonomous research harness (3+ deep-read threads Saturday, including the Japanese technical breakdown)

/goal command — the actual primitive making overnight loops reliable (8+ standalone tweets, multiple "Claude Code /goal turned 3-hour babysitting into walk-away workflow")

MCP (Model Context Protocol) — the protocol letting agentic loops escape the terminal into native apps (multiple mentions including macOS MCP)

Hermes Agent (Nous Research) — self-improving open agent, now with Grok 4.3 + X Premium subscriptions plug-in (mentioned in nearly every "self-improving" thread)

Recursive Superintelligence (Richard Socher's stealth, $650M) — first $650M-class explicit autoresearch-loop bet

PrimeIntellect Lab — hosted autoresearch runtimes entering competition phase

Karpathy autoresearch framework — the methodology root cited across Saturday's most-shared posts (No Priors "Skill Issue: Code Agents, AutoResearch, and the Loopy Era")
