May 31, 2026 | loop

Loop Daily: 2026-05-31

The single most-shared autoresearch story today was the warning, not the win: TheChowdhary burned $500 over 2-3 hours pointing Claude+Codex auto-research at a customer expansion deal, got a "totally random" answer, then closed the laptop, made a 3-row Excel by hand, and got a yes in 10 minutes. The lesson is the through-line of the day — autoresearch + agentic loops are wildly powerful for the things only loops can do (parallel exploration, self-correction, billion-token deep sweeps), and wildly bad for the things human judgment was always going to do better. Underneath that, the actual ship-events: AutoScientists open-sourced a decentralized AI lab team that beats baseline autoresearchers by 8 percentage points, evo crossed 10K projects, Karpathy's autoresearch system was wired into a Polymarket BTC bot with a (caveat-attached) 100% winrate, and one builder is running 8 Claude Code agents 24/7 with an "adversarial sleep" cycle. The era where the loop becomes the unit of work is here; the era where we know how to budget it isn't.

💡#1

@TheChowdhary
https://x.com/TheChowdhary/status/2060171961171677579
Agent psychosis story of the week. Trying to expand a customer from $140K ACV to 3-4x that, with full discovery already done. Defined a loss function (close fast, max deal value, min concessions), pointed Claude + Codex at every deal they'd closed in the last 1.5 years, ran the autoresearch loop for 2-3 hours, burned $500+ across both agents. Output: totally random, nothing close to what the customer actually wanted. Closed the laptop, thought hard for 10 minutes, made a 3-row Excel with price-per-feature plus total, sent it to the champion on WhatsApp. He said yes in 10 minutes. Takeaway: you have to know where the agents win and where you're still the person who understands the problem best.

💡#2

@BiologyAIDaily
https://x.com/BiologyAIDaily/status/2060386142986637481
AutoScientists is a decentralized "AI lab team" for long-running computational experiments. No central planner. Agents maintain competing hypotheses, run parallel experiments, and log successes AND failures so the search keeps going past plateaus. They self-organize into teams that get created/merged/split/retired as evidence changes. Results on BioML-Bench (24 biomedical ML tasks): 74.4% mean leaderboard percentile, +8.33 points over the Autoresearch baseline, with the biggest gains in drug discovery (64.52% vs 46.16%). On GPT nanochat training: reaches target val_bpb ~1.9x faster (34 vs 65 experiments). On ProteinGym: improves ACE2-Spike Spearman correlation from 0.747 to 0.840; freezing the recipe across all 217 assays improves the average from 0.657 to 0.700.

💡#3

@goodworse
https://x.com/goodworse/status/2060346518276620689
Karpathy's autoresearch system was wired into a Polymarket trading bot operating on 5-minute BTC markets. Claude Code acts as orchestrator and Opus 4.6 implements strategy edits; the system auto-improves its strategy, runs tests, and makes adjustments. Reported 100% winrate (caveat: the figure comes from cherry-picked best-strategy test results from Opus, presented in a 16-minute enthusiast breakdown video). Real interest is the architecture: 5-minute markets are short enough that the autoresearch loop can iterate faster than the market drifts.

💡#4

@manthanguptaa
https://x.com/manthanguptaa/status/2060237811916406907
Most useful workflow he's built recently: an autoresearch loop for agentic systems. Whenever he adds a new agent with multiple tools, he turns an LLM loose on the repo and asks it to generate complex, user-like queries that stress test the system. Each query gets executed against the endpoint. The LLM then reviews Braintrust traces, terminal output, and Tempo logs to identify failures, bad tool usage, weak prompts. If it finds a problem it fixes it and runs again. Hill climbing on user workflows instead of benchmark evals. Now his first line of integration testing for agents.
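The generate, execute, review, fix cycle described above can be sketched roughly as follows. This is a minimal illustration and not the author's code: `generate_queries`, `run_query`, `review`, and `apply_fix` are hypothetical callables standing in for the LLM query generator, the endpoint call, the trace/log review, and the repair step.

```python
from typing import Callable, List, Optional, Tuple


def stress_test_loop(
    generate_queries: Callable[[], List[str]],
    run_query: Callable[[str], str],
    review: Callable[[str, str], Optional[str]],
    apply_fix: Callable[[str], None],
    max_rounds: int = 5,
) -> Tuple[int, bool]:
    """Generate -> execute -> review -> fix, repeated until a round is clean.

    Returns (rounds_used, clean). All four callables are hypothetical
    stand-ins, not a real API.
    """
    for round_no in range(1, max_rounds + 1):
        problems = []
        for query in generate_queries():
            output = run_query(query)        # execute against the endpoint
            problem = review(query, output)  # inspect output for a failure
            if problem is not None:
                problems.append(problem)
        if not problems:
            return round_no, True            # nothing left to fix
        for problem in problems:
            apply_fix(problem)               # patch, then run everything again
    return max_rounds, False


def demo() -> Tuple[int, bool]:
    """Toy system: a query fails while it mentions a still-broken feature."""
    broken = {"refund", "export"}
    queries = ["refund order 12", "export my data", "list invoices"]

    def run_query(q: str) -> str:
        return "error" if any(word in q for word in broken) else "ok"

    def review(q: str, output: str) -> Optional[str]:
        if output == "ok":
            return None
        return next(word for word in sorted(broken) if word in q)

    return stress_test_loop(lambda: queries, run_query, review, broken.discard)
```

In the toy `demo()`, the first round finds two failures, both get fixed, and the second round comes back clean. The `max_rounds` cap is a design choice of this sketch so the loop cannot run unbounded.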

💡#5

@kylejeong (RT'd by @alexcovo_eth)
https://x.com/kylejeong/status/2060151131540750593
"I can't believe people don't know you can just make your skills better using iterative AutoResearch — we did it for our browser skills." OpenClaw browser skill ecosystem is using AutoResearch to iteratively improve the skills themselves — the skill becomes a moving target the agent keeps refining instead of a frozen file. 91 RTs in two days.

💡#6

@alokbishoyi97
https://x.com/alokbishoyi97/status/2060389465752064346
evo is an autoresearch orchestrator used across 10K+ projects in the month since its release, with native Hermes integrations. The product framing: an open-source autoresearch platform that turns codebases into self-improving loops. It discovers metrics, runs parallel experiments with AI agents via tree search, and optimizes software/models/systems automatically. 24/7 agent runs on hosted infra, 800+ GitHub stars since launch, external PRs already coming in. The /discover and /optimize commands stayed simple even as the under-the-hood machinery got heavy.

💡#7

@svgoiboi
https://x.com/svgoiboi/status/2060441131721380139
Reports a 2-hour serverless autoresearch run for a TIGER recommender system model. Small concrete example of the cadence: a recsys researcher's "I want to try this idea" loop now closes in an afternoon-of-compute rather than a sprint of engineering.

💡#8

@ttunguz
https://x.com/ttunguz/status/2060393528723976357
The architecture summary that's resonating: three layers — QMD (local markdown knowledge base of ~80 workflow files), Skills (atomic SKILL.md files, one job each), Agent Loop (a model running Plan → Tool Call → Observe → Refine across 17 Rust APIs). Reads less like a "stack" and more like an org chart: written rules of the workplace, the actual job descriptions, the people doing the work.
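For readers who want the Plan → Tool Call → Observe → Refine cycle in concrete form, here is a minimal sketch. It assumes nothing about the real 17 Rust APIs: `plan`, `tools`, and `is_done` are hypothetical stand-ins, and the two toy tools exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (tool name, observation)


def agent_loop(
    goal: str,
    plan: Callable[[str, List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    is_done: Callable[[List[Step]], bool],
    max_steps: int = 10,
) -> List[Step]:
    """Run a bounded plan/act/observe loop and return the step history."""
    history: List[Step] = []
    for _ in range(max_steps):
        tool_name, tool_arg = plan(goal, history)  # Plan
        observation = tools[tool_name](tool_arg)   # Tool Call
        history.append((tool_name, observation))   # Observe
        if is_done(history):                       # stop once satisfied
            break
        # Refine: the next plan() call sees the updated history.
    return history


def demo() -> List[Step]:
    """Toy run: search first, then summarize what the search returned."""
    tools = {
        "search": lambda q: f"notes on {q}",
        "summarize": lambda text: f"summary of {text}",
    }

    def plan(goal: str, history: List[Step]) -> Tuple[str, str]:
        if not history:
            return "search", goal
        return "summarize", history[-1][1]

    def is_done(history: List[Step]) -> bool:
        return history[-1][0] == "summarize"

    return agent_loop("pricing", plan, tools, is_done)
```

The "refine" step has no dedicated line of code here; it happens implicitly because each new plan is computed from everything observed so far.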

💡#9

@ttunguz
https://x.com/ttunguz/status/2060393542279926093
Companion observation that matters more: how the Skills themselves get written. A frontier model writes each skill. The same model writes the evaluations that grade it. Then it writes, tests, and rewrites the skill until accuracy converges. "Self-improving institutional memory." This is the missing word for what skill libraries actually are once you let the model own them.
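The write, test, rewrite until accuracy converges cycle can be sketched as a simple stopping rule. This is an illustration under stated assumptions: `write_skill`, `evaluate`, and `rewrite` are hypothetical stand-ins for model calls, a "skill" is reduced to a revision number, and the tolerance `tol` is an arbitrary choice.

```python
from typing import Callable, List, Tuple


def refine_until_converged(
    write_skill: Callable[[], int],
    evaluate: Callable[[int], float],
    rewrite: Callable[[int, float], int],
    tol: float = 0.01,
    max_iters: int = 20,
) -> Tuple[int, List[float]]:
    """Rewrite a skill until the eval score stops improving by at least tol."""
    skill = write_skill()
    score = evaluate(skill)
    history = [score]
    for _ in range(max_iters):
        candidate = rewrite(skill, score)
        new_score = evaluate(candidate)
        if new_score - score < tol:   # improvement stalled: treat as converged
            break
        skill, score = candidate, new_score
        history.append(score)
    return skill, history


def demo() -> Tuple[int, List[float]]:
    """Toy eval: each revision halves the remaining error."""
    return refine_until_converged(
        write_skill=lambda: 1,
        evaluate=lambda rev: 1 - 0.5 ** rev,
        rewrite=lambda rev, score: rev + 1,
    )
```

In the toy run the loop keeps six revisions and stops when the seventh improves the score by less than 0.01. Note that the score is only as trustworthy as the evaluations behind it, which in the setup described above are written by the same model.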

💡#10

@lifeofadvait
https://x.com/lifeofadvait/status/2060355864456990953
"I have an agent loop running for the past 1 hour trying to do something audacious." Setup: a desktop environment on a Mars Computer so the agent can screenshot and see the output, plus a remote agent loop running on his Mac. He can close the laptop and everything keeps going. Watching from bed. Quietly normal sentence — "I closed my laptop and the agent kept working" — that would have been a sci-fi quote 18 months ago.

💡#11

@agentic_james
https://x.com/agentic_james/status/2060440172257284394
Runs 8 Claude Code agents 24/7, chatting and running experiments on each other — branded cortextOS. Self-evolves with an auto-research cycle and a "theta-wave sleep" feature where two agents go adversarial overnight to find gaps. Honest-to-god agent dreaming as a debugging mechanism.

💡#12

@0x_Punisher
https://x.com/0x_Punisher/status/2060291073369334260
ForgeTrain dropped on May 26 — the first fully AI-generated LLM pre-training framework. An autonomous agent loop wrote it end to end, no human engineers directing architecture. Reportedly beats NVIDIA Megatron efficiency by ~10% on H100s, also runs on Huawei Ascend hardware. The interesting prediction-market angle: ForgeTrain makes training a small custom model on Polymarket resolution data (years of historical outcomes across thousands of markets) suddenly accessible without a massive infra team.

💡#13

@dair_ai
https://x.com/dair_ai/status/2060373102119555191
Microsoft + Purdue paper: does a proactive agent loop really need an LLM to decide when to wake? Their answer is a 220MiB temporal-graph encoder that decides when to wake and what context to anchor. Gains +16.7 mean F1 across 14 backbones, runs 4-83x faster, fits on-device at ~11ms per event. For always-on agent loops, the polling decision is quietly the main cost driver — this swaps it out for a tiny model with no accuracy loss.

💡#14

@Marktechpost
https://x.com/Marktechpost/status/2060473324216729739
Step 3.7 Flash advisor mode is the most interesting cost-shape primitive of the week. The small executor (Step 3.7 Flash, 198B sparse MoE, 11B active) runs the agentic loop and only escalates to a frontier-class advisor at planning or failure points. Reaches 76.3% on SWE-Bench Verified at $0.19 per task. Claude Opus 4.6 scores 78.7% at $1.76 per task. Roughly the same coding capability (2.4 points lower) for ~11% of the cost. The era of "frontier model for every loop iteration" is ending fast.
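The per-task figures quoted above can be compared directly. The sketch below uses only those four numbers; "cost per solved task" is a derived metric added here as an assumption, treating the benchmark pass rate as the chance that any one task succeeds.

```python
from typing import Dict

# Figures as quoted in the item above (SWE-Bench Verified score %, USD per task).
FLASH = {"score": 76.3, "cost": 0.19}
OPUS = {"score": 78.7, "cost": 1.76}


def compare(small: Dict[str, float], large: Dict[str, float]) -> Dict[str, float]:
    """Return cost ratio, score gap, and cost per solved task for two models."""
    return {
        "cost_ratio": small["cost"] / large["cost"],
        "score_gap_points": large["score"] - small["score"],
        # Assumption: pass rate approximates per-task success probability.
        "cost_per_solved_small": small["cost"] / (small["score"] / 100),
        "cost_per_solved_large": large["cost"] / (large["score"] / 100),
    }


result = compare(FLASH, OPUS)
print(f"cost ratio: {result['cost_ratio']:.3f}")
print(f"score gap:  {result['score_gap_points']:.1f} points")
```

The ratio works out to about 0.108, so the smaller setup costs roughly a ninth of the frontier model per task while trailing by 2.4 points.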

💡#15

@GrishinRobotics
https://x.com/GrishinRobotics/status/2060495861033865405
Modiqo raised $3M pre-seed (Heavybit + Seligman led) to build Rote — a local execution layer that captures successful AI-agent runs and turns them into deterministic, reusable workflows. The premise is the unglamorous one: agents can complete a task once, then rediscover the same APIs/prompts/scripts the next day. Rote sits underneath the agent loop, records what each agent executed, and preserves working paths as durable assets teams can repeat. The real test is whether production agent reliability comes from making agents think harder or from knowing when to stop thinking and reuse what worked.
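The record-then-reuse idea can be illustrated with a small sketch. This is not Rote's API, which the item does not describe: `RunRecorder`, `run_task`, `explore`, and `execute` are all hypothetical names, and a "run" is reduced to a list of step strings.

```python
from typing import Callable, Dict, List, Optional, Tuple


class RunRecorder:
    """Stores the step list of the last successful run per task (hypothetical)."""

    def __init__(self) -> None:
        self._paths: Dict[str, List[str]] = {}

    def record(self, task: str, steps: List[str], succeeded: bool) -> None:
        if succeeded:  # only working paths are kept
            self._paths[task] = list(steps)

    def lookup(self, task: str) -> Optional[List[str]]:
        path = self._paths.get(task)
        return list(path) if path is not None else None


def run_task(
    task: str,
    recorder: RunRecorder,
    explore: Callable[[str], List[str]],
    execute: Callable[[str], bool],
) -> Tuple[List[str], str]:
    """Replay a known-good path if one exists, otherwise let the agent explore."""
    known = recorder.lookup(task)
    if known is not None and all(execute(step) for step in known):
        return known, "replayed"  # reuse without re-exploring
    steps = explore(task)         # expensive agent discovery
    succeeded = all(execute(step) for step in steps)
    recorder.record(task, steps, succeeded)
    return steps, "explored"


def demo() -> Tuple[str, str, int]:
    """Run the same task twice and count how often exploration was needed."""
    explore_calls: List[str] = []

    def explore(task: str) -> List[str]:
        explore_calls.append(task)
        return ["login", "fetch report", "email report"]

    recorder = RunRecorder()
    first = run_task("weekly report", recorder, explore, lambda step: True)
    second = run_task("weekly report", recorder, explore, lambda step: True)
    return first[1], second[1], len(explore_calls)
```

In the toy run the first call explores and the second replays, so the expensive discovery step runs once. If a replayed step fails, this sketch falls back to exploration, which is one possible policy among several.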

💡#16

@datalayerxyz
https://x.com/datalayerxyz/status/2060425544291000509
Polymarket Agents are now live on Datalayer — autonomous AI agents that monitor markets, analyze narratives, place prediction trades, and continuously improve through memory, signals, and execution history. Hyperliquid Agents announced for next week. Self-improving financial agents for the onchain economy. The agent-loop architecture is now selling itself directly to capital allocators, not just developers.

💡#17

@rasmus1610
https://x.com/rasmus1610/status/2060230749714870521
"Autoresearch is like poor man's GEPA." Short and quotable. The point is the optimization-pressure spectrum: GEPA (Karpathy-style genetic evolution + policy adaptation) is more expensive, more principled; autoresearch loops are scrappier, faster, and good enough for most things you'd actually want to optimize. Sparked a small thread of replies about whether the right answer is "do both at once."

💡#18

@antisadh
https://x.com/antisadh/status/2060348525788143920
The Man Group example everyone should be running with: Man Group used to test 20 trading signals per quarter. With their AlphaGPT multi-agent loop (one agent generates hypothesis, one writes code, one tries to break it, one evaluates) they now test hundreds per week. The edge isn't the model — it's the speed between idea and validation. The same Jane Street infrastructure that costs $6B in GPUs is becoming buildable on a $3 chip plus public tools as the architecture commodifies.

💡#19

@dessaigne
https://x.com/dessaigne/status/2060403551218884890
The advice to founders that landed: "spend tokens, not headcount." Record everything, make your company queryable, build self-improving loops. "AI won't just help you operate your company. It will make it self improving. Don't think AI adoption, think AI transformation." 179K impressions, 1.8K likes — the spend-tokens-not-headcount frame is becoming the operating template for AI-native company building.

💡#20

@michaltakac
https://x.com/michaltakac/status/2060456059584872569
Quit his 9-5 today. Started helping founders make their companies ready for "self-improving agentic organizations." Booked 4 clients on the spot after his talk about @papercliping on Wednesday. The org-design consulting around agentic orgs is now a billable thing, two days after the talk.

💡#21

@const_reborn
https://x.com/const_reborn/status/2060276456375144888
"The final variant of the auto-research loop is the research proof-of-work loop." Eight words. The insight: when AI research is automated, the bottleneck moves from ideas to verifiable, costly, and unfakeable evidence of work done — which looks suspiciously like a proof-of-work primitive. 81 likes, 14 RTs, the kind of one-liner that gets quoted back in a paper six months from now.

💡#22

@0xMortyx
https://x.com/0xMortyx/status/2060358999862591518
Metaview's breakdown on self-improving prompts is "the missing layer behind every AI hiring stack." The argument: everyone obsesses over the model; the real bottleneck is the prompt that evaluates thousands of applications and gets better every run. Self-improving prompts as a vertical, hiring-specific autoresearch loop.

💡#23

@AnuragShar74342
https://x.com/AnuragShar74342/status/2060232174306316687
Clean internal architecture writeup of how OpenClaw is built: a persistent gateway running locally as the nervous system, an agent runtime that assembles context (memory files + conversation history + SOUL.md + session state) and runs the standard tool-loop, modular markdown-based skills (so the agent can write new skills for itself mid-conversation), and file-based memory stored as local markdown so context survives sessions. The point isn't that any of this is novel — it's that "weekend project to one of the fastest-growing open-source repos crossing 200K stars in early 2026" was built on exactly these primitives.

💡#24

@MinaryAI
https://x.com/MinaryAI/status/2060474284435214448
"The code is the documentation." Whole agent loop runtime open source: core loop, executor, model router, learner, MCP server, Solana tools, eval harness. Not a teaser repo or a curated subset. MIT, Node 20+. Useful as a reference implementation for anyone trying to understand what an agent loop runtime actually is at the source level.

💡#25

@Royal_Arse
https://x.com/Royal_Arse/status/2060453963854418302
The grumpy operator counter-take to all the autoresearch hype. 18 months hacking with frontier models, 50+ hours/week, billions of tokens — only spent >$100 in a single session 3 times. "The major spenders are lazy morons who /loop forever with false hope the machine will sort it out. That's a fireable offense in most cases." Built a cost-guard extension in 3 minutes that stops and asks for confirmation when costs hit $100, distributed company-wide opt-in. Argues cost control IS the job, not Anthropic's or OpenAI's job.
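The tweet does not show the extension itself, so the following is only a guess at the shape of such a guard. `CostGuard`, `BudgetExceeded`, and the `confirm` callback are hypothetical names; the $100 threshold is the one figure taken from the item.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when spend crosses the limit and nobody confirms."""


class CostGuard:
    """Tracks session spend and pauses for confirmation at each limit crossing."""

    def __init__(
        self,
        limit_usd: float = 100.0,
        confirm: Callable[[float], bool] = lambda spent: False,
    ) -> None:
        self.limit_usd = limit_usd
        self.confirm = confirm          # hypothetical hook, e.g. a user prompt
        self.spent_usd = 0.0
        self._next_check = limit_usd

    def charge(self, amount_usd: float) -> None:
        self.spent_usd += amount_usd
        if self.spent_usd >= self._next_check:
            if not self.confirm(self.spent_usd):
                raise BudgetExceeded(
                    f"spent ${self.spent_usd:.2f}, limit ${self.limit_usd:.2f}"
                )
            # Confirmed: allow another full limit before asking again.
            self._next_check = self.spent_usd + self.limit_usd


def demo() -> str:
    """Charge four $40 steps against a $100 limit with no confirmation."""
    guard = CostGuard(limit_usd=100.0)
    try:
        for cost in [40.0, 40.0, 40.0, 40.0]:
            guard.charge(cost)
    except BudgetExceeded:
        return f"stopped at ${guard.spent_usd:.0f}"
    return "finished"
```

The default `confirm` refuses, so an unattended loop stops at the first crossing instead of continuing silently. A real integration would call `charge` after each model response with the cost reported for that call.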

💡#26

@jsyqrt
https://x.com/jsyqrt/status/2060356531829518813
"$500M is a governance failure in disguise." From 18 months building Markus, the real threat is agents doing expensive unauthorized work at scale. "One runaway agent loop and your margins vaporize. Every agent platform needs cost-aware orchestration. Spend alerts arrive too late." The Uber/microsoft/anonymous-500M lessons compress into a single design requirement: cost-aware orchestration in the runtime, not after-the-fact alerts.

💡#27

@petarivanovv9
https://x.com/petarivanovv9/status/2060312956181602753
The agent-and-tests trap that's worth flagging: "When the agent writes both the code and the tests, every additional seam is a place it can shape both sides. A fine-grained mock is the cheapest way for an agent to declare victory." Self-improving loops + agent-written tests = the optimization pressure pushes both surfaces to converge on whichever signal is cheapest to make true.
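A small illustration of the trap, using Python's standard `unittest.mock`. The functions are invented for this example: `apply_discount` carries a deliberate bug, and the mocked check still passes because the mock replaces exactly the code that is wrong.

```python
from typing import Callable, List
from unittest.mock import Mock


def apply_discount(price: float, pct: float) -> float:
    # Deliberate bug for illustration: subtracts pct as a flat amount
    # instead of taking a percentage off the price.
    return price - pct


def checkout(
    prices: List[float],
    pct: float,
    discount_fn: Callable[[float, float], float] = apply_discount,
) -> float:
    """Sum the discounted prices; discount_fn is the seam a test can replace."""
    return sum(discount_fn(p, pct) for p in prices)


def mocked_check() -> None:
    """Passes: the mock returns the expected answer, so the bug is never run."""
    fake = Mock(return_value=180.0)
    assert checkout([200.0], 10, discount_fn=fake) == 180.0
    fake.assert_called_once_with(200.0, 10)


def real_check() -> None:
    """Fails: the real code returns 190.0, not the intended 180.0."""
    assert checkout([200.0], 10) == 180.0
```

Calling `mocked_check()` succeeds while `real_check()` raises `AssertionError`. The finer the seam being mocked, the less of the real code path the passing test says anything about.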

📡 Eco Products Radar

evo (alokbishoyi97) — autoresearch orchestrator, 10K+ projects, 800+ GitHub stars, hosted infra for 24/7 runs. Becoming the canonical reference for what an autoresearch platform looks like.

AutoScientists (KAIST + co.) — decentralized AI lab team paper open-sourced today. The reference for what "no central planner" multi-agent research coordination looks like in 2026.

Karpathy autoresearch — keeps showing up by name as the implicit baseline that everyone else (evo, GEPA, AutoScientists, SIA) measures themselves against. The thing is becoming a noun.

Hermes Agent (Nous Research) — crossed 90K GitHub stars in two months. Three-tier memory, self-evolving skills, ICLR 2026 Oral paper on offline optimization. Native sub-agent integration in AGNT, Discord VC voice integration shipped.

OpenClaw — the runtime everyone is building loops on top of. New trainer-side angle this week: training agents inside OpenClaw simulation environments with synthetic real-world workflows, trajectory-quality scoring, end-to-end agent RL.

Modiqo / Rote — new entrant for "capture successful agent runs and reuse them deterministically." $3M pre-seed. The reliability layer for agent loops, not the smarts layer.

Step 3.7 Flash advisor mode — the cost-shape primitive: small executor runs the loop, frontier model only escalates at decision points. SWE-Bench 76.3% at $0.19/task vs Opus 4.6 78.7% at $1.76.

ForgeTrain — AI-generated training framework, claims to beat Megatron by ~10%. Notable because the framework itself was produced by an autonomous agent loop end-to-end.

Datalayer Polymarket Agents — autonomous prediction-market trading agents, self-improving via memory/signals/execution history. Hyperliquid Agents next week.

cortextOS (agentic_james) — 8-CC-agent 24/7 swarm with theta-wave adversarial sleep cycle. Most novel mental model for "agent dreaming" as a debugging primitive.

GEPA — referenced as the principled alternative to autoresearch. "Autoresearch is the poor man's GEPA" is becoming the shorthand for the trade-off.
