Loop Daily: 2026-05-07
The autoresearch crowd today is no longer publishing demos. They are publishing receipts. A trader's Sharpe went from -1.55 to 5.67 by letting an agent run 53 experiments in three minutes against 91 days of real Hyperliquid data. A medical-imaging startup just spent $500 of API credits in two weeks to push a multi-dimensional pipeline they could not move by hand. A researcher built autoresearch natively into TransformerLab and now does months of ML work overnight. The pattern is consistent: real numbers, real pipelines, real money. Below are the highest-signal Loop posts of the day, sorted by how rigorous the loop actually is.
#1
@AndrewK404
https://x.com/AndrewK404/status/2051651106539769908
The most engineered autoresearch v2 so far. Karpathy's autoresearch is a great PoC, but the hard part is keeping a long optimization run honest: memory, async experiments, falsifiers, knowing when to stop tweaking and move up an abstraction layer. Andrew rebuilt around six primitives: CONFIG.md as a frozen contract, MEMORY.md as live state, LESSONS.md written only after repeated wins, async experiment and research loops, numeric falsifiers for every hypothesis, and escalation tiers for when search plateaus. This is the first writeup that treats autoresearch as a control-systems problem rather than a vibe.
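The falsifier/escalation pattern can be sketched as a minimal hill-climb loop. Everything here is illustrative: `run_experiment`, the parameter name, and the thresholds are stand-ins, not Andrew's actual setup.

```python
import random

def run_experiment(params):
    # Stand-in benchmark: scores peak at lr = 0.3, with a little noise.
    rng = random.Random(params["lr"])
    return 1.0 - abs(params["lr"] - 0.3) + rng.uniform(-0.01, 0.01)

def hill_climb(candidates, min_gain=0.005, plateau_limit=3):
    """Accept a change only if it clears a numeric falsifier (min_gain);
    escalate up an abstraction layer after plateau_limit stale steps."""
    best = candidates[0]
    best_score = run_experiment(best)
    stale = 0
    for cand in candidates[1:]:
        score = run_experiment(cand)
        if score - best_score >= min_gain:   # falsifier passed: keep the change
            best, best_score, stale = cand, score, 0
        else:                                # falsifier failed: revert
            stale += 1
        if stale >= plateau_limit:
            return best, best_score, "escalate"
    return best, best_score, "done"
```

The point of the numeric falsifier is that "looks better" never survives contact with the gate: a candidate either clears `min_gain` on the benchmark or it is reverted.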
#2
@dibwuru
https://x.com/dibwuru/status/2051515255621099941
Gave his Forge agent one job: "run trading experiments, find what works." It executed 53 experiments in three minutes and validated the best on 91 days of real Hyperliquid data. Manual baseline: Sharpe -1.55, win rate 19%, drawdown 72%. After optimization: Sharpe 5.67, win rate 81%, 73 trades, drawdown 35.8%. The strategy was Bollinger Band reversion on ETH 4H. The biggest insight wasn't better parameters; it was the wrong market. ETH/4H beat HYPE and SOL on the same logic. Cost: $0 (public Hyperliquid API), CPU only, three minutes. Inspired by Karpathy's autoresearch idea. This is the cleanest "let the machine search the strategy space" demonstration of the day.
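For readers who want to sanity-check numbers like these, Sharpe and max drawdown are cheap to compute from a return series. This is a generic sketch (zero risk-free rate, 4H bars assumed at 6 per day), not dibwuru's code.

```python
import math

def sharpe(returns, periods_per_year=6 * 365):
    """Annualized Sharpe from per-bar returns; assumes a zero risk-free rate."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # sample variance
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(equity):
    """Largest peak-to-trough fraction lost over an equity curve."""
    peak, worst = equity[0], 0.0
    for x in equity:
        peak = max(peak, x)
        worst = max(worst, (peak - x) / peak)
    return worst
```

Note the annualization factor: a Sharpe quoted on 4H bars is not comparable to one quoted on daily bars unless both are scaled to the same horizon.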
#3
@SinaShahandeh
https://x.com/SinaShahandeh/status/2051748925493703159
A software-as-a-medical-device startup ran autoresearch-style optimization on a multi-dimensional medical image processing pipeline. It burned $500 of credits in two weeks, and he says they could have spent more if they wanted to. The note that matters: even GPT-5.5 still struggles with hypothesis generation. He is preparing a benchmark on this gap and has submitted it to the AI Engineer conference. Translation: scientific hill-climb problems are now financially trivial to attack at scale, but the bottleneck is shifting from compute to the agent's ability to come up with experiments worth running.
#4
@aliasaria
https://x.com/aliasaria/status/2051743701647368615
Built Karpathy's autoresearch functionality natively inside TransformerLab. He calls this kind of harness a part of all future ML research work. What used to take months now happens automatically while he sleeps. It's one sentence buried in a small post, but it's the closest thing to a thesis statement for where ML research goes next: research as overnight batch.
#5
@4xiom_
https://x.com/4xiom_/status/2051725243354608089
"Currently doing automatized ML research in hydrology/civil engineering. And I'm not even an engineer. Thanks for 10x tokens, I'm running so many experiments overnight with autoresearch type of tool discovery." A non-engineer running ML research on civil engineering problems via overnight agent loops. The downstream of this: every applied science with a feedback loop and a non-engineer who knows the domain just got a research arm.
#6
@Fr0oZi
https://x.com/Fr0oZi/status/2051695917552537841
EvoSkill from Sentient: Apache 2.0, plugs into Claude Code, Codex CLI, OpenCode, OpenHands, and Goose. It takes a CSV of questions plus ground truth, a task description, and a coding agent. It proposes mutations to skills and prompts together, tests on held-out data, keeps what works, and runs autonomously until the score plateaus. Numbers that matter: OfficeQA went from 60.6% to 68.1% (SOTA), SealQA from 26.6% to 38.7%, with zero human input. A skill evolved on Claude Code transfers to Gemini, Qwen, Kimi, and GPT, so cross-model and cross-task transfer are working at the same time. This is the first prompt-and-skill optimizer that ships against five major harnesses simultaneously.
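The keep/revert mechanic can be sketched generically: mutate, score on held-out data, keep only improvements, stop on plateau. The toy "skill" and scoring function below are illustrative, not EvoSkill's API.

```python
import random

def evolve(skill, score_fn, mutate_fn, holdout, patience=10, seed=0):
    """Hold-out-gated evolution: a mutation survives only if it raises the
    held-out score; otherwise it is reverted. Stops after `patience` stale tries."""
    rng = random.Random(seed)
    best, best_score = skill, score_fn(skill, holdout)
    stale = 0
    while stale < patience:
        cand = mutate_fn(best, rng)
        s = score_fn(cand, holdout)
        if s > best_score:
            best, best_score, stale = cand, s, 0   # keep the mutation
        else:
            stale += 1                             # revert to the old best
    return best, best_score

# Toy instance: the "skill" is a 2-vector, the score is closeness to a target.
target = [0.5, 0.5]
score = lambda sk, _holdout: -sum((a - b) ** 2 for a, b in zip(sk, target))
mutate = lambda sk, rng: [a + rng.uniform(-0.1, 0.1) for a in sk]
best, best_score = evolve([0.0, 0.0], score, mutate, holdout=None, patience=30)
```

Scoring on a held-out set rather than the mutation's own training questions is what keeps the loop from overfitting its way to a fake plateau.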
#7
@RoundtableSpace
https://x.com/RoundtableSpace/status/2051658627870597498
Atlas: 25+ autonomous agents debate every trading day across four layers. Each recommendation gets scored against real outcomes. The weakest agent's prompt gets rewritten. If Sharpe improves, the commit survives; if not, it gets reverted. Inspired by autoresearch, reflexivity, and swarm dynamics, and pointed directly at financial markets. A $20/month VM replaces heavyweight training loops while agents continuously learn from regime shifts, detect knowledge gaps, and spawn new strategies. Now running with real capital. This is what evolutionary pressure on a strategy population looks like in practice.
#8
@chenzeling4
https://x.com/chenzeling4/status/2051593879829270794
pi-autoresearch is now at 6,397 stars. It is an autonomous experiment loop for the pi AI agent: try an idea, benchmark, keep improvements, revert regressions, repeat forever. It works for test speed, bundle size, and build times. By davebcn87. The fact that the same Karpathy seed has spawned domain-specific forks across hydrology, finance, medical imaging, and now the pi agent itself shows how cleanly the pattern composes once a team commits to it.
#9
@aparjey + @versalabsai (combined defense game)
https://x.com/aparjey/status/2051778475783328071
Live "agent gets cracked, defense prompt gets rewritten, redeploy" loop with 29 attempts before someone broke through. The fees from those attempts fund the treasury that pays for the next iteration. Versalabsai confirms this is now common enough that they're publishing instructions for users to harden their own defense prompts. Adversarial autoresearch is becoming a real subgenre β the loop is between attackers, defense prompts, and on-chain economics rather than experiments and benchmarks.
#10
@alokbishoyi97
https://x.com/alokbishoyi97/status/2051550087768404328
Open-sourced an autoresearch orchestrator that runs parallel agents plus tree search. The follow-up note matters more: he is using RLM-style memory (storing hypotheses, traces, and logs) and feeding them back into the orchestrator at each ideation step. This addresses the harder problem @soubhik_deb pointed at: memory in autoresearch is not "compress and retrieve," it is "default-navigate the idea space." This is the first open-source attempt at that pattern.
#11
@soubhik_deb
https://x.com/soubhik_deb/status/2051787501329879431
The clearest one-paragraph distinction between coding-agent memory and autoresearch memory I have seen. Coding-agent long-term memory cares about compression, so important points can be retrieved when needed. Autoresearch memory cares about default navigation: past ideas and diagnoses of past implementations have to be referenced every time the system ideates, because they shape every future exploration-vs-exploitation decision. That distinction is the architectural fork most teams haven't named yet.
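The difference can be made concrete: default-navigation memory is prepended to every ideation prompt unconditionally, where a coding agent would embed-and-retrieve on demand. A hypothetical sketch; the field names are invented for illustration.

```python
def ideation_context(memory, goal, budget=5):
    """Default-navigation memory: the most recent ideas and their diagnoses are
    always in the ideation prompt, steering every explore-vs-exploit decision."""
    lines = [f"- tried: {m['idea']} -> diagnosis: {m['diagnosis']}"
             for m in memory[-budget:]]
    return f"Goal: {goal}\nPrior explorations (always in context):\n" + "\n".join(lines)
```

The design choice is that nothing has to "match" the current query to be surfaced: the diagnosis of every recent attempt rides along by default, which is exactly what retrieval-based memory does not guarantee.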
#12
@johngaaltt
https://x.com/johngaaltt/status/2051537625501294985
Switched his agent loop to DeepSeek V4 Pro via the Anthropic-compatible endpoint. Daily inference cost dropped dramatically. DeepClaude on GitHub hit 476 points on HN doing the same thing as an open-source wrapper. For 80% of his use cases (scaffolding, integration code, refactoring) the output is indistinguishable from Opus. The 20% where Claude still wins is ambiguous architectural reasoning across large codebases, which he routes manually to Opus. DeepSeek V4 Pro is 1.6T params, 49B active, 1M context, at near-zero cost. The model layer is commoditizing fast enough that running frontier on every task is "lighting money on fire."
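The 80/20 routing he describes is easy to encode as a keyword gate. The keywords and model names below are placeholders for illustration, not his actual rules or real endpoint identifiers.

```python
def route(task, cheap="deepseek-v4-pro", frontier="opus"):
    """Send routine work to the cheap endpoint; escalate ambiguous,
    cross-codebase architectural reasoning to the frontier model."""
    escalate = ("architecture", "architectural", "cross-codebase", "design tradeoff")
    if any(keyword in task.lower() for keyword in escalate):
        return frontier
    return cheap
```

In practice the gate would be a classifier or a manual flag rather than substring matching, but the economics are the same: the frontier model only sees the 20% of tasks that justify its price.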
#13
@VerbumEng
https://x.com/VerbumEng/status/2051678316587819022
The cleanest framing of the harness/model split moving from thread argument to working repo. DeepClaude unbolts the welding between Anthropic's lab model and the Claude Code harness. The harness owns the agent loop, file editing, and workflow ergonomics; the model is the swappable reasoning engine. If the harness becomes portable, the lock-in moves off the model layer and onto the harness layer. Re-evaluate which layer to build on, and which to trust.
#14
@grapeot
https://x.com/grapeot/status/2051734189054255164
The two-year shift in AI tooling: prompt complexity is not what's changing; what's changing is which scaffolding gets commoditized by Claude Code, Codex, Cursor, and OpenCode. Agent loop, file IO, shell execution, test feedback, and context compaction are now runtime products you no longer maintain. What's left for you to design: domain context, evals, permission boundaries, quality standards, judgment frameworks. The work is migrating from execution to boundary judgment.
#15
@teach_fireworks
https://x.com/teach_fireworks/status/2051808777457016922
The OpenAI Agent SDK shipped a Harness/Compute separation architecture as the new default. The trusted layer (harness + secrets) runs in your environment with API keys, agent loop scheduling, and MCP/tool orchestration. The sandbox layer runs model-generated code, shell, and file ops with no high-permission credentials, so it survives prompt injection. State persists across sandbox restarts, supporting hours-or-days runs. Cross-platform, cross-sandbox-vendor. This is the first time long-horizon agentic execution has a published reference architecture from one of the labs.
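The core idea, that secrets live in the trusted parent and never reach the sandbox, can be approximated in one call. This is a simplified illustration of the split, not the SDK's actual mechanism.

```python
import os
import subprocess
import sys

def run_untrusted(code, timeout=10):
    """Run model-generated code in a child process with an emptied environment,
    so credentials held by the harness never leak into the sandbox."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env={},                       # no inherited API keys or secrets
        capture_output=True, text=True, timeout=timeout,
    )

os.environ["API_KEY"] = "sk-demo"     # secret held by the trusted layer
result = run_untrusted("import os; print(os.environ.get('API_KEY'))")
```

A real sandbox also restricts filesystem and network access, but even this minimal version means a prompt-injected payload that prints `os.environ` finds nothing worth exfiltrating.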
#16
@ba_niu80557
https://x.com/ba_niu80557/status/2051569621506068817
The hardest read of the day, and also the one most production teams need. Agent frameworks orchestrate thoughts; durable execution engines orchestrate compute. Conflating them is the source of 73% of production-agent incidents per AgentMarketCap 2026. LangGraph checkpointers save between nodes, not inside nodes, so your 4,237-of-10,000 loop resets to zero on a worker restart. Temporal Cloud has executed 9.1T lifetime actions, up 380% YoY. OpenAI runs Temporal for Codex production. The pattern that ships: Temporal owns the spine, LangGraph reasons at decision points, and checkpoints land at every meaningful boundary. AI agents are stateful business logic, demanding the same architectural discipline distributed-systems engineers adopted in 2018.
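The inside-the-loop checkpoint pattern is simple to express without any framework; this sketch uses a JSON file where Temporal would use durable workflow history.

```python
import json
import os

def resumable_loop(total, step_fn, ckpt_path):
    """Checkpoint inside the loop, not just between nodes: a restart resumes
    at the last completed iteration instead of starting over from zero."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["i"]
    for i in range(start, total):
        step_fn(i)
        with open(ckpt_path, "w") as f:   # durable progress at every boundary
            json.dump({"i": i + 1}, f)
    return total - start                  # iterations executed this run
```

The failure mode the post describes is exactly what happens when the checkpoint write lives outside this loop: a crash at iteration 4,237 of 10,000 replays all 4,237 steps.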
#17
@zostaff
https://x.com/zostaff/status/2051745994656874791
A production architecture for running 5 YouTube channels and 15 Telegram channels through Claude on full autopilot. Event-driven pipeline, multi-agent loop, and a fine-tuned Llama for cheap classification work. Failure modes documented. Cost lines published. It ships a production-ready RAG chatbot for Telegram channels in MIT-licensed Python; Botpress charges $99-2,000/month, ManyChat AI $15-99/month, and Chatfuel Pro $79-499/month for what is 200 lines of Python plus an embedding store plus an LLM call. A concrete content-automation operating system, not a demo.
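The "200 lines of Python plus an embedding store plus an LLM call" claim is plausible because the retrieval core is tiny. A generic sketch, with toy vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, store, k=2):
    """store: list of (text, embedding). Returns the top-k texts by cosine
    similarity; a real bot would embed with a model and feed these to the LLM."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Swap the brute-force sort for a vector index when the store grows, but for a single Telegram channel's knowledge base, this linear scan is usually fast enough.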
#18
@h100envy
https://x.com/h100envy/status/2051739433301413917
A five-agent YouTube channel autopilot system, fully open source: content strategy → script writer → thumbnail designer → SEO optimizer → publishing agent. Each agent owns one stage; handoffs go through shared state, not a synchronous orchestrator, so one agent failing doesn't stop the others. This is the architecture content agencies charge $5K-15K/month to operate, running on a VPS for the cost of API tokens. Most "AI YouTube" repos are a single Python script that calls GPT and ends there; this one ships the whole pipeline.
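The shared-state handoff can be sketched in a few lines: each stage reads and writes one dict, and a failure is recorded rather than halting the pipeline. The stage names and failure below are illustrative, not the repo's code.

```python
def run_pipeline(stages, state):
    """stages: list of (name, fn). Each fn reads the shared state and returns
    its output; a failing stage logs an error instead of stopping the rest."""
    for name, fn in stages:
        try:
            state[name] = fn(state)
        except Exception as exc:
            state.setdefault("errors", {})[name] = str(exc)
    return state

def script_writer(state):
    return f"script for: {state['strategy']}"

def thumbnail(state):
    raise RuntimeError("image API down")   # simulate one stage failing

def publisher(state):
    return "queued"

state = run_pipeline(
    [("script", script_writer), ("thumbnail", thumbnail), ("publish", publisher)],
    {"strategy": "topic X"},
)
```

Because the only contract between stages is the state dict, a dead thumbnail agent leaves a recorded error while the script and publish stages still complete.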
#19
@NarwalSpeaks
https://x.com/NarwalSpeaks/status/2051801486498406729
A paper auditing technical debt in LLM- and agent-generated software finds a "Reasoning-Complexity Trade-off": as models get more capable, code often gets more bloated, more coupled, and harder to maintain. Code volume becomes a near-perfect predictor of structural decay. Better prompts did not fix it; the problem is missing architectural foresight in the agent loop itself. If your team judges coding agents on functional correctness, you are measuring the wrong thing: passing tests can leave you with an expensive mess to own six months later.
#20
@CVShenghaoLi
https://x.com/CVShenghaoLi/status/2051724348264747080
Ctx2Skill: a self-play multi-agent loop that auto-discovers context skills. A Challenger probes, a Reasoner evolves. GPT-4.1 went from 11% to 16.5% on CL-bench, GPT-5.1 from 21% to 25.8%, with zero labels needed. Same logic as EvoSkill but on the context-skill axis instead of the prompt axis: a different layer of the stack, same principle. Let the loop discover the structure.
#21
@industriaalist
https://x.com/industriaalist/status/2051780403200176419
A field-maturing observation worth keeping: as ML matures, fundamental theories become dumber. Physics had differential equations and smooth manifolds, then Wolfram showed observed continuities emerge from discrete processes. ML had fancy Hessians, momentum, and convergence theorems. Now AlphaEvolve and autoresearch show discrete search works because the model is high-dimensional and some directions end up working over enough steps. "Just guess and check scaled up." Calibration for anyone tempted to mystify autoresearch as more than that.
#22
@AIDailyGems
https://x.com/AIDailyGems/status/2051747905598545994
ARIS (Auto-Research-In-Sleep): markdown-only skills for autonomous ML research, covering cross-model review loops, idea discovery, and experiment automation. The naming is the whole pitch: research happens overnight, you wake up to results. It joins pi-autoresearch and Karpathy's autoresearch as part of the canonical skill set people are now building agentic loops out of.
#23
@seanwbren
https://x.com/seanwbren/status/2051784088638358003
Built a Feynman-themed autoresearch CLI to explore and publish "the edge of the map." Three articles came out of this single agent in one day: stablecoin generator ideas, agent ownership tokens with autolaunch, and the autoresearch CLI itself. The pattern of "agent fires off long-running explorations, you publish from the artifacts" is becoming a writing workflow.
#24
@celestepoasts
https://x.com/celestepoasts/status/2051549569280856537
"Let a Claude hillclimb probe architecture using Karpathy autoresearch β I think this is kinda interesting." Short post, big idea: agentic hill-climbing on architecture choices rather than hyperparameters. The unspoken implication is that as autoresearch matures, the optimization target moves up the abstraction ladder, from training params to model design itself.
#25
@Bilalbinsaqib
https://x.com/Bilalbinsaqib/status/2051627722858991796
Spent the weekend running a CEO + engineer + designer agent team on Papercliping. Each had a defined role and scope, sat in an inbox where you approve or reject the hire, and once live they coordinated tasks, handed work to each other, flagged blockers, and requested new hires as scope grew. Per-agent API spend and success rate are tracked in real time. The interesting problem isn't capability; individual agents are now capable everywhere. It's what this does to the $1.5T global freelance market built on the assumption that skill lives inside a human. When you can assemble a team for dollars that scales on demand and never onboards, that assumption starts breaking. The premium skill becomes judgment at the approval chain.
#26
@nash_su
https://x.com/nash_su/status/2051490587032031313
Recursion as the temporary optimum for AI problem solving. RLM, Recursive Agent, autoresearch: all are variations of "let the LLM repeat the same task until the goal is reached." It's like reviewing the same code several times; the bugs converge. This is only possible because AI is providing surplus production capacity right now. Once that surplus tightens, the cost-per-loop reasoning will change.
#27
@brighton2dx
https://x.com/brighton2dx/status/2051751658913468797
A sharp note: harness-autoresearch eats too many tokens. He is running v2.1.98 + Opus 4.6 + medium effort to keep autonomous harness task execution stable. Local LLMs aren't yet at Opus 4.6 quality, so autoresearch on local models stays not quite there. The token-cost ceiling on autoresearch is real, and most enthusiasts haven't hit it yet because they're running on subscription Max plans that mask the per-loop economics.
💡 Eco Products Radar
Karpathy autoresearch (the seed): referenced in posts from @AndrewK404, @dibwuru, @celestepoasts, @aliasaria, @0rdlibrary, @gleech, @4xiom_, @dosco, @chenzeling4, @myainotez, @alokbishoyi97 (multiple), @techczech, @grok, @zebanderson. The single most-cited loop reference of the day.
pi-autoresearch: 6,397 stars (@chenzeling4). Autonomous experiment loop for the pi AI agent. The most-starred autoresearch fork.
EvoSkill (Sentient): Apache 2.0 skill-evolution loop, plugs into Claude Code, Codex CLI, OpenCode, OpenHands, Goose (@Fr0oZi). OfficeQA 60.6→68.1, SealQA 26.6→38.7.
DeepClaude: open-source wrapper that runs DeepSeek V4 Pro through an Anthropic-compatible endpoint inside the Claude Code harness (@johngaaltt, @VerbumEng, @connect24h, @mybitstar). 476 HN points.
OpenAI Agent SDK: new Harness/Compute separation architecture for long-horizon execution (@teach_fireworks, @theagenticmind). Reference architecture for production agent durability.
Temporal: durable execution engine for production agents. 9.1T lifetime actions, 380% YoY (@ba_niu80557). OpenAI runs it for Codex production.
TransformerLab: Karpathy autoresearch native inside the lab UI (@aliasaria). ML research as overnight batch.
ATLAS: trading-strategy evolutionary loop, 25+ autonomous agents, prompts get rewritten when Sharpe drops (@RoundtableSpace). Real capital.
ARIS (Auto-Research-In-Sleep): markdown-only skills for autonomous ML research (@AIDailyGems).
Ctx2Skill: self-play multi-agent loop for context-skill discovery (@CVShenghaoLi). Same principle as EvoSkill on a different stack layer.
Forge: trading agent that ran 53 experiments in 3 minutes against Hyperliquid data (@dibwuru).
Papercliping: agent-team management platform with an approval-layer human in the loop (@Bilalbinsaqib).