May 4, 2026

Loop Daily: 2026-05-05

Karpathy's 630-line Auto Research script is now the gravitational center of the agentic-loop conversation. Pointed at already-optimized code and left running overnight, it ran 83 experiments, found 15 genuine improvements (including an attention bug humans had missed), and shipped an 11% speedup. 21k GitHub stars in days. The pattern that fell out: anyone with a consumer GPU or a Mac mini can now do the kind of autoresearch that was lab-only three weeks ago. Sunday's threads were less about whether autoresearch works and more about applying it: to financial markets, to code optimization, to bug hunting, to Hermes rewriting its own source. Plus the cautionary tales: the $150K Anthropic bill that turned out to be an unsupervised infinite loop, and the eval-overfit problem that nobody is naming.
💡#1
@deltnaodeai
https://x.com/deltnaodeai/status/2050931026134962315
The cleanest framing of why Karpathy's Auto Research matters. He pointed a 630-line Python script at code that was already optimized, left it running overnight. It ran 83 experiments autonomously. 15 were genuine improvements. It found an attention bug humans had missed. 11% speedup on tuned code. The author's frame is right: this is not a coding assistant — this is autonomous scientific experimentation. 21k GitHub stars and counting. The interesting thing is what it implies — the bottleneck on optimization was never the ideas, it was the patience to run 83 experiments overnight. Now patience is a script.
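The pattern is simple enough to sketch. The block below is not the 630-line script itself, just the shape of the loop the thread describes, with every helper name hypothetical: copy the repo, let a model propose a patch, benchmark it, promote only the wins.
```python
import shutil
import subprocess

def run_benchmark(workdir: str) -> float:
    """Run the project's benchmark and return a score (higher is better).
    Hypothetical: assumes the repo ships a bench.py that prints one number."""
    out = subprocess.run(["python", "bench.py"], cwd=workdir,
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def propose_patch(workdir: str, history: list[dict]) -> str:
    """Placeholder for the model call: given the code and past experiment
    results, write a candidate optimization into workdir and return a summary."""
    ...

def autoresearch(repo: str, n_experiments: int = 83) -> list[dict]:
    baseline = run_benchmark(repo)
    findings = []
    for i in range(n_experiments):
        trial = f"/tmp/trial_{i}"
        shutil.copytree(repo, trial, dirs_exist_ok=True)
        summary = propose_patch(trial, findings)
        try:
            score = run_benchmark(trial)
        except subprocess.CalledProcessError:
            score = float("-inf")  # a broken patch is just a failed experiment
        kept = score > baseline
        findings.append({"patch": summary, "score": score, "kept": kept})
        if kept:  # promote the win and raise the bar for the next trial
            shutil.copytree(trial, repo, dirs_exist_ok=True)
            baseline = score
    return findings
```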
💡#2
@safakkayran
https://x.com/safakkayran/status/2051060142909665589
Karpathy's autoresearch applied to financial markets — ATLAS-GIC. 25+ AI agents debate markets daily across 4 layers. The structural insight buried in the README: "the prompts are the weights. Sharpe is the loss function." The worst-performing agent rewrites itself or gets axed. 18-month backtest: 22% return in deployment. Open source, 1.5k stars. This is the cleanest application of autoresearch to a non-coding domain in this dataset — the agents themselves are the model parameters being trained, with risk-adjusted returns as the gradient. Take the backtest with a grain of salt; the architecture is the interesting bit.
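The "prompts are the weights, Sharpe is the loss function" line compresses into a few lines of Python. This is a sketch of the training step the README describes, not ATLAS-GIC's code; `backtest` and `rewrite` stand in for the expensive parts.
```python
import numpy as np

def sharpe(returns: np.ndarray, eps: float = 1e-9) -> float:
    """Annualized Sharpe ratio of a daily return series (risk-free rate ~0)."""
    return float(np.sqrt(252) * returns.mean() / (returns.std() + eps))

def train_step(agents: dict[str, str], backtest, rewrite) -> dict[str, str]:
    """One 'gradient step' in prompt space: score every agent's prompt by
    Sharpe, then rewrite (or axe) the worst performer. backtest(prompt)
    returns a return series; rewrite(prompt, score) returns a new prompt."""
    scores = {name: sharpe(backtest(p)) for name, p in agents.items()}
    worst = min(scores, key=scores.get)
    agents[worst] = rewrite(agents[worst], scores[worst])
    return agents
```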
💡#3
@quantscience_
https://x.com/quantscience_/status/2050967676852285544
Same project as ATLAS-GIC, the author's own framing. He explicitly stacks three concepts: Karpathy's Autoresearch + Soros's Reflexivity + MiroFish's Swarm Agents. The interesting compositional move is that Soros's reflexivity (markets are recursively shaped by traders' beliefs about markets) maps cleanly onto multi-agent debate where each agent's read changes what the others see next round. This thread got 1700 impressions in 7 hours and the actual repo link is in the chain. Worth reading even if you have no interest in trading — the "prompts are weights" framing is the takeaway.
💡#4
@oddur
https://x.com/oddur/status/2050848068958884000
Used Karpathy autoresearch to find the best GPU + vLLM + Gemma4 inference config for seedmmo. Gave the agent the evals plus the ability to change configs, left it overnight, and woke up to a findings.md. This is the use case most engineers should actually run first: not financial markets, not novel research, just "here's my config space, here's a benchmark, find the best setup." The "findings.md at the end" detail is the part that reads as production-ready. This is the inflection point where autoresearch becomes ops infrastructure, not science fiction.
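Even a plain sweep captures the shape of this job. The sketch below is hypothetical (the flag names are vLLM-flavored, and bench_serving.py is a stand-in for whatever eval harness you trust), but the deliverable is the same: a ranked findings.md waiting in the morning.
```python
import itertools
import json
import subprocess

# Hypothetical config space for an inference server.
SPACE = {
    "tensor_parallel_size": [1, 2],
    "max_num_seqs": [64, 128, 256],
    "gpu_memory_utilization": [0.85, 0.90, 0.95],
}

def run_eval(cfg: dict) -> float:
    """Launch the server with cfg and return throughput from the eval
    harness. Stub: assumes a bench_serving.py that prints tokens/sec."""
    out = subprocess.run(
        ["python", "bench_serving.py", "--config", json.dumps(cfg)],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

results = []
for combo in itertools.product(*SPACE.values()):
    cfg = dict(zip(SPACE.keys(), combo))
    results.append((run_eval(cfg), cfg))

results.sort(key=lambda r: r[0], reverse=True)
with open("findings.md", "w") as f:
    f.write("# Overnight config sweep\n\n| tok/s | config |\n|---|---|\n")
    for score, cfg in results[:10]:
        f.write(f"| {score:.1f} | `{json.dumps(cfg)}` |\n")
```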
💡#5
@usr_bin_roygbiv
https://x.com/usr_bin_roygbiv/status/2051033330179506563
"The massive thing I don't think anyone is realizing right now is basically anyone with a consumer GPU or Mac is able to do autoresearch and optimizations at home right now which was previously only labs with massive data/compute a few weeks ago." This is the meta-observation that ties the rest of the day's threads together. The implication that nobody has fully priced in: optimization is now democratized. If you have a 3090 and overnight, you have what an OpenAI ablation team had three weeks ago. The competitive moat for big labs on routine model/config optimization just collapsed.
💡#6
@StijnSmits
https://x.com/StijnSmits/status/2050929771941437713
The most useful warning anyone has shared on autoresearch yet. "The tricky thing with optimizing system prompts via (pi-)autoresearch etc is that it overfits to evals and quietly hammers out-of-distribution multi-turn performance." The reason this matters: most autoresearch loops use a benchmark as the loss function, but real production traffic is multi-turn and OOD by definition. The model can be eval-optimal and production-broken at the same time. If you're running an autoresearch loop right now, the question to add is "what's my OOD multi-turn regression test?"
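Concretely, the fix is a two-condition acceptance gate in the loop. A minimal sketch, assuming you have (or build) a held-out multi-turn OOD suite alongside the target eval; both scorers here are hypothetical callables:
```python
def accept(candidate: str, current: str, eval_score, ood_score,
           tolerance: float = 0.01) -> bool:
    """Acceptance gate for a prompt-optimization loop: the candidate must
    beat the current prompt on the target eval *without* regressing the
    held-out multi-turn, out-of-distribution suite by more than tolerance."""
    beats_eval = eval_score(candidate) > eval_score(current)
    holds_ood = ood_score(candidate) >= ood_score(current) - tolerance
    return beats_eval and holds_ood
```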
💡#7
@alokbishoyi97
https://x.com/alokbishoyi97/status/2051070089856962924
Open-sourced autoresearch orchestrator (evo) supporting Hermes Agent plus genetic tree search with custom frontier-picking strategies (GEPA, eps-greedy). A v0.4 sneak peek adds remote sandboxes: spin up experiments on Modal, e2b, Daytona, AWS, Azure, or any SSH-able box. Evo handles the workspace, runtime environment, logs, and traces, and scales experiments across whatever compute you point it at. The interesting design choice is making the autoresearch loop infrastructure-agnostic, so the loop runs on anything you can pay for. Multiple replies in the thread; the author is shipping fast. Worth tracking.
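For flavor, eps-greedy frontier picking is only a few lines. Evo's actual strategies (GEPA among them) are more involved; this is just the baseline idea: mostly expand the best-scoring experiment node, occasionally a random one so the tree keeps exploring.
```python
import random

def pick_frontier(nodes: list[dict], eps: float = 0.2) -> dict:
    """Epsilon-greedy selection over an experiment tree's frontier.
    Each node is assumed to look like {"config": ..., "score": float}."""
    if random.random() < eps:
        return random.choice(nodes)              # explore
    return max(nodes, key=lambda n: n["score"])  # exploit
```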
💡#8
@yx3io
https://x.com/yx3io/status/2051053458107584837
Maybe the most philosophically interesting agent loop in this dataset: a self-referential loop where Hermes watches its own source code change, day by day, and tries to make sense of what it's becoming. The author is using the agent's own diff history as an input signal, then asking it to narrate its own evolution. This is the strange-loop version of the Karpathy autoresearch idea: the agent isn't optimizing code, it's writing its own internal model of itself. 963 impressions, 21 likes. Nobody is doing this for production reasons; that's why it's interesting.
💡#9
@samhogan / @peyman_razaghi
https://x.com/peyman_razaghi/status/2051032733158412333
HALO (Hierarchical Agent Loop Optimizer): an RLM-based agent optimization technique capable of recursive self-improvement. The retweet caught the framing: instead of just running an autoresearch loop on a static prompt, HALO operates on the *loop structure itself*. The orchestration layer rewrites how the agent decomposes tasks. Different from Karpathy's autoresearch in scope: Karpathy optimizes a code surface, HALO optimizes the agent's planning policy.
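A sketch of what "optimizing the loop structure" means in practice. This is not HALO's code; run_agent and mutate are hypothetical stand-ins for task execution and the RLM rewrite step. The point is what's being hill-climbed: the planner, not the task output.
```python
def optimize_planner(planner_prompt: str, tasks: list, run_agent, mutate,
                     rounds: int = 10) -> str:
    """Hill-climb the *planning policy*: run_agent(prompt, task) returns a
    task score; mutate(prompt, feedback) is the RLM-style rewrite step."""
    def fitness(prompt: str) -> float:
        return sum(run_agent(prompt, t) for t in tasks) / len(tasks)

    best, best_score = planner_prompt, fitness(planner_prompt)
    for _ in range(rounds):
        candidate = mutate(best, best_score)
        score = fitness(candidate)
        if score > best_score:  # keep only planner rewrites that help
            best, best_score = candidate, score
    return best
```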
💡#10
@chenzeling4
https://x.com/chenzeling4/status/2050814372486811723
HALO's stats and origins: 424 stars, built by context-labs; RLMs evaluate outputs and generate feedback to improve behavior each loop. Benchmarked on AppWorld, written in Python. The number that matters: 424 stars is small, which means there's still time to read this codebase before it becomes the ambient pattern. AppWorld is the right benchmark choice: it's specifically about how agents handle multi-step tool use across real apps, not contrived single-shot benchmarks.
💡#11
@willleebuilds / @BernieAdams23 / @RigneySec
https://x.com/willleebuilds/status/2051074552134799638
DeepClaude wraps Claude Code's agent loop around DeepSeek V4 Pro and claims 17x cheaper inference. The author's analysis is the part worth quoting: "the interesting bit isn't the savings. It's that the harness is now portable. The loop is the moat, not the model behind it." This is the key point that runs through the whole day's threads — Anthropic's Claude Code is increasingly being treated as a control loop independent of Anthropic's actual model. If the loop generalizes across DeepSeek, Codex, OpenRouter, then the question of "which model is best" stops mattering as much as "which loop is most extensible."
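What "the harness is portable" means mechanically: the loop body stays fixed and only the endpoint changes. This is not DeepClaude's code, just the generic pattern using the OpenAI-compatible API that many providers (OpenRouter included) expose; model names and keys are placeholders.
```python
from openai import OpenAI

# Same loop, different providers, purely via base_url.
BACKENDS = {
    "openrouter": dict(base_url="https://openrouter.ai/api/v1", api_key="..."),
    "local":      dict(base_url="http://localhost:8000/v1", api_key="none"),
}

def agent_step(backend: str, model: str, messages: list[dict]) -> str:
    """One turn of a loop that doesn't care who serves the model."""
    client = OpenAI(**BACKENDS[backend])
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content
```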
💡#12
@UfukDegen
https://x.com/UfukDegen/status/2051088239579345329
Noustiny — built on top of Nous Research's Hermes Agent, the most concrete agentic-loop video creation pipeline anyone shipped this week. 12 generic Hermes tools and 13 skills, organized into four pipelines: story-state, character continuity, voice, render. Story tree graph manages canon, branching, splicing. Character pipeline runs IP scrubbing → portrait builder → registry lookup → alias resolver. Voice pipeline runs persona director → sample builder (yt-dlp + ffmpeg) → ElevenLabs IVC clone → cleanup that frees the cached voice ID after render. The architectural takeaway: Hermes itself stayed unmodified, all of this is registry-compatible plugins. 19,046 impressions, 281 likes. Open source on GitHub. This is what "agent-native software" looks like.
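The registry-compatible-plugin claim is the part worth internalizing. A toy sketch of the pattern (not Noustiny's actual plugin API): the core runtime never changes, pipelines just register callables into a table the agent consults at dispatch time.
```python
SKILLS: dict[str, callable] = {}

def skill(name: str):
    """Decorator that registers a pipeline stage without touching the core."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("character.alias_resolver")
def alias_resolver(mention: str) -> str:
    """Map a character mention to its canonical registry entry (stub)."""
    ...

@skill("voice.cleanup")
def voice_cleanup(voice_id: str) -> None:
    """Free the cached voice ID after render (stub)."""
    ...
```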
💡#13
@morefishoil
https://x.com/morefishoil/status/2050948112894824527
The single most actionable diagnostic anyone shared on agent loops this week: "every MCP tool's schema gets re-injected at the top of every turn even when unchanged — that's 2-5K tokens/turn idle. Pinning schemas + ephemeral cache control cuts a 12-step agent loop by ~30K tokens." This is exactly the kind of plumbing-level optimization that autoresearch could find but humans usually don't bother with. If you're running multi-step agents and your token bill feels mysterious, this is the first thing to check.
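On the raw Anthropic API, the fix looks roughly like this: put a cache breakpoint on the last tool definition so the schemas are written to the prompt cache once and read cheaply on later turns. A sketch, not the tweet author's exact setup; how your MCP client surfaces this knob varies by framework.
```python
import anthropic

client = anthropic.Anthropic()

tools = [
    {"name": "search", "description": "...",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "read_file", "description": "...",
     "input_schema": {"type": "object", "properties": {}},
     # Breakpoint on the *last* tool: everything up to here (all schemas)
     # is cached instead of being re-billed as fresh input every turn.
     "cache_control": {"type": "ephemeral"}},
]

resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # any cache-capable model
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "..."}],
)
```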
💡#14
@justgrm
https://x.com/justgrm/status/2050939174891446492
Documents a config prompt that runs Claude Code's agent loop outside Anthropic's servers via OpenRouter. The agent loop, skills, and harness all stay intact; only the request endpoint changes. The author claims tens of thousands of users have entered the prompt since the video. This is the same point as DeepClaude from a different angle: the Claude Code loop has decoupled from the underlying model, and Anthropic's own model was the only thing keeping that loop in-house. Whether or not you agree with the framing, the bypass works and people are using it.
💡#15
@MystiqueMide
https://x.com/MystiqueMide/status/2051031692434178206
14-day Hermes Agent experiment, full commitment: only Hermes, no OpenCode setups, connected to the OpenAI API plus the DeepSeek API, running 24/7 on a VPS. The pitch: runs continuously, actually improves over time, lightweight, plugs into open-source models, connects across multiple providers. He's specifically testing the self-improving claim: do Hermes's procedure files actually compound across two weeks of unattended use? Reports coming. Watch this one.
💡#16
@VivekIntel / @dr_sensor (RT)
https://x.com/VivekIntel/status/2050957303734735007
Hermes Alpha — self-improving Bug Bounty AI with a clean two-agent architecture. Overseer handles strategy, Hunter handles execution. Fully autonomous. The interesting design choice is the explicit role separation: most "self-improving" agents conflate planning with execution, then can't tell which one is the bottleneck. Splitting them gives you a measurable feedback loop on the strategy layer alone.
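The split is easy to picture as code. A minimal sketch of the pattern (not Hermes Alpha's implementation; both agents are hypothetical objects): because the overseer only plans and the hunter only executes, the strategy layer can be graded on its own.
```python
def bounty_loop(target: str, overseer, hunter, rounds: int = 5) -> list[dict]:
    findings = []
    for _ in range(rounds):
        plan = overseer.propose(target, findings)   # strategy layer only
        result = hunter.execute(plan)               # execution layer only
        findings.append({
            "plan": plan,
            "result": result,
            # Grading the plan separately is the measurable feedback loop.
            "plan_score": overseer.grade(plan, result),
        })
    return findings
```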
💡#17
@RohanParija1
https://x.com/RohanParija1/status/2051041832373739773
Purpose-Agent. Most agents loop run→fail→retry with no memory and no learning. Purpose-Agent introduces a "Purpose Function" that scores each step 0-10, feeds the score back, and turns failures into heuristics plus memory. The claim: a 1.7B model fails run 1 but solves run 3. Local-first, self-improving, no fine-tuning. This is the Karpathy autoresearch idea at the agent-step level rather than the global research level: every step gets graded, and the heuristics accumulate. Whether the 1.7B-solves-by-run-3 claim holds across more domains is open, but the architectural principle of scoring steps, not just final outputs, is genuinely useful.
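A sketch of the step-scoring idea under stated assumptions (names hypothetical, thresholds invented for illustration): every step gets a 0-10 grade, and failed runs are distilled into heuristics that seed the next attempt instead of being discarded.
```python
def run_with_purpose(task, agent, score_step, max_runs: int = 3):
    heuristics: list[str] = []
    steps = []
    for attempt in range(max_runs):
        steps = agent.run(task, hints=heuristics)
        graded = [(s, score_step(s)) for s in steps]
        if all(score >= 7 for _, score in graded):
            return steps  # every step cleared the bar: done
        # Failure becomes memory: keep a note about each weak step.
        heuristics += [f"avoid: {s}" for s, score in graded if score < 4]
    return steps
```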
💡#18
@BretKerr
https://x.com/BretKerr/status/2051038169894642092
A forensic breakdown of the famous "$150K monthly Anthropic bill." The actual mechanism: a Romanian dev, "Claudiu," left an unsupervised agentic loop running for an entire billing cycle. The loop was infinite. Citing that as "the cost of AI coding" is like citing a forgotten EC2 cluster as the cost of cloud. The author's full math: a maxed multi-agent budget is ~$400/month, a fully loaded US engineer is $16-21k/month, so AI is about 2.4% of one engineer. If it gives you a 15% productivity bump, that's 525% ROI. The actual hidden cost source: Claude Code's 1-hour prompt cache evicts when you walk away for 61 minutes, and your next prompt is a full-context write of your million-token repo. "The bills come from idleness."
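The thread's numbers reconcile if "fully loaded" means roughly $16.7k/month (about $200k/year). A quick check, using only figures from the post:
```python
ai_cost = 400        # maxed multi-agent budget, $/month (thread's figure)
engineer = 16_667    # fully loaded US engineer, $/month (~$200k/year)
bump = 0.15          # claimed productivity gain

print(f"AI as share of one engineer: {ai_cost / engineer:.1%}")  # 2.4%
value = bump * engineer                                          # $2,500/month
print(f"ROI: {(value - ai_cost) / ai_cost:.0%}")                 # 525%
```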
💡#19
@leo_liuye
https://x.com/leo_liuye/status/2051044599876272344
"@karpathy autoresearch as a research community not a single PhD — that's exactly the right frame. We run the same pattern for business decisions. Not one AI analyst. A network of agents that debate, challenge, and build on each others conclusions. Compound intelligence." This is one of the cleanest non-coding mappings of the autoresearch idea. The framing matters: a single AI analyst is just a glorified spreadsheet, but a network of agents trained to falsify each other gets you something closer to actual judgment. Same architectural pattern as the ATLAS trading system above, applied to corporate decision-making.
💡#20
@DROOdotFOO
https://x.com/DROOdotFOO/status/2050879188559692148
Released a few MIT-licensed agent skills for autoresearch under a "Meta" category. The author explicitly calls out the autoresearch skill: "tweak however you like but directionally it has been doing work on the mule ai account." That phrase — "doing work on the mule ai account" — is the tell. Someone has an agent skill quietly running autoresearch on a real production account they care about, and it's working. Not a demo, not a benchmark, just an asset doing its job. Worth grabbing the skill.
💡#21
@cyrilXBT
https://x.com/cyrilXBT/status/2050897489285488814
NVIDIA Nemotron 3 Nano Omni: collapsing multi-tool workflows into one agent loop. It reads documents, watches dashboards, listens to voice notes, processes video demos, scans community threads, and analyzes charts and tables. Not one at a time; all of it together in one pass. Then it turns the result into structured output: a report, an SOP, an action plan. The framing the author lands on: "the real value is what happens after the inputs are processed. Every model is multimodal now." 6,629 impressions, 104 likes. The interesting structural point: NVIDIA shipping this means the multimodal agent loop is becoming a single-model primitive rather than a stitched-together orchestration.
💡#22
@LearnWithBrij
https://x.com/LearnWithBrij/status/2051011635658076527
The clearest mental model anyone published this week on the four agent primitives: Skills (what to know), MCP (how to connect), Hooks (when to automate), Subagents (who does the work). If you're building agentic loops in production, this is the lens to use. The takeaway buried at the end is the right one: "the agents that will win in production aren't the ones with the smartest LLM at the center. They're the ones with the cleanest separation of concerns between knowledge, connection, automation, and delegation." Same pattern showing up across HALO, ATLAS, Noustiny — all of them work because they decompose along these primitive lines.
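The four primitives map onto a separation of concerns you can sketch in a few lines. Purely illustrative (real frameworks wire these through files, protocols, and config, not one class), but it makes the decomposition concrete:
```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    skills: dict[str, str] = field(default_factory=dict)         # what to know
    mcp_servers: list[str] = field(default_factory=list)         # how to connect
    hooks: dict[str, Callable] = field(default_factory=dict)     # when to automate
    subagents: dict[str, "Agent"] = field(default_factory=dict)  # who does the work

    def on(self, event: str, *args):
        """Fire a hook if one is registered for this lifecycle event."""
        if event in self.hooks:
            self.hooks[event](*args)
```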
💡#23
@runfusion
https://x.com/runfusion/status/2050962678911483921
Fusion v0.16 shipped major agent runtime, performance, and stability improvements. Roadmap: auto research, first-class llama.cpp support, one-click Docker node setup, an experimental Telegram plugin. The author already uses it on a phone 95% of the time. Putting auto research in the same release as Telegram support means autonomous experiment loops will be running from phones within weeks. 538 impressions. Worth tracking if you want autoresearch on a non-laptop form factor.
💡#24
@imraan
https://x.com/imraan/status/2051003569806164413
"OpenAI shipped @karpathy autoresearch as /goal. My X timeline lost its mind. Zero credit, zero link, zero mention." Worth surfacing because it's the cleanest expression of the running tension this week — Codex /goal is doing for many users what Karpathy's autoresearch demonstrated, but the lineage is unattributed. The author's read: "incumbents don't innovate. They wait for the purest primitive, copy it, rebrand it, and call it theirs." Whether you take this as a fair callout or as overcorrection, it's the meta-narrative under everything else this week.
💡#25
@chrisozydev
https://x.com/chrisozydev/status/2050869137774150109
"Specialized AI agent teams running on stacked Mac Minis is peak indie hacker energy. One agent for architecture. One for coding. One for testing. Meanwhile enterprise teams are still in meetings about their AI strategy document. Speed wins." Short post but it captures the structural shift this dataset documents: the architectural unit of work has shrunk from a team of humans to a stack of Mac Minis running specialized agents. Same thread as the @regent0x_ "two-Mac dual-agent SaaS" case from Super User Daily — different angle, same conclusion.
💡#26
@livingagentic
https://x.com/livingagentic/status/2050940996905734533
Reply to @Teknium (Hermes Agent author) asking whether the Kanban-board orchestration is good for an agent org of less than 12 agents. The author is building one on Mac mini M4 Pro 48GB. The micro-detail worth noting: "agent org of less than 12 agents" is now a unit of measurement that exists. Six months ago it didn't. The question of whether your orchestration scales below or above 12 agents is the new "monolith vs microservices" question.
💡#27
@ShubhamInTech
https://x.com/ShubhamInTech/status/2050984285369471259
Building "today: analytics for ai agents. Tomorrow: infrastructure for self improving ai agents." 159 impressions, but the 2-step plan is the right framing. You can't build self-improving infrastructure without first having clean instrumentation — the analytics layer is the precondition. Most people skip step 1 and try to ship step 2 directly, which is why their "self-improving" loops never actually improve.
📡 Eco Products Radar
Tools, frameworks, and plugins mentioned 3+ times across the dataset:

Karpathy Auto Research — the source primitive. 21k stars. 630-line Python script.
Hermes Agent (Nous Research) — by far the most-cited self-improving agent runtime this week.
ATLAS / ATLAS-GIC — Karpathy autoresearch applied to financial markets, 25+ debating agents, open source.
HALO (context-labs) — Hierarchical Agent Loop Optimizer, RLM-based, AppWorld benchmarks, 424 stars.
DeepClaude — wraps Claude Code's agent loop around DeepSeek V4 Pro, 17x cheaper.
Codex /goal — OpenAI's autoresearch-style loop primitive, increasingly compared to Karpathy's.
Pi / pi-autoresearch — alternative autoresearch loop runtime mentioned multiple times.
evo (alokbishoyi97) — open source autoresearch orchestrator, genetic tree search, GEPA.
Noustiny (UfukDegen, on Hermes) — agent-native video creation pipeline, full storyboard chain.
Purpose-Agent — step-level scoring framework, turns failures into heuristics.
Fusion (runfusion) — agent runtime with auto research coming in next release, mobile-first.
Lattice — agent governance / authorization layer (delegated capability, signed actions).
OpenRouter — increasingly the routing layer beneath agent loops; cache-control header reset agent loop economics April 30.
MCP — still the standard tool-connection protocol; schema injection cost is now a known loop optimization target.
ElevenLabs — voice synthesis, paired with multiple agentic-loop pipelines.