Loop Daily: 2026-04-27
Yesterday the autoresearch crowd stopped fetishizing Karpathy's repo and started shipping derivatives. One person evolved a 682-line agent that beat GEPA and Karpathy's own autoresearch on a 149-protein benchmark. Another woke up to a LoRA adapter where the auto-research loop had decided, on its own, to train on hermes-agent traces using on-policy distillation on the middle 1/3 of the MLP tensors, and traced its inspiration back to LLM Super-weight, ROME, and MEMIT. A third person shipped a skill that wraps autoresearch behind five publication-grade gates: paper-grounded hypothesis, pre-reg before compute, n≥5 seeds, multi-lens eval, falsified hypotheses logged. Meanwhile, the harness debate quietly settled: pi-mono/agent has the highest cache hit rate, lowest tokens per session, fewest bugs. The frontier moved from "can a loop run overnight" to "what does the loop do that you couldn't have specified in advance."
#1
@0xSero
https://x.com/0xSero/status/2048156544034799675
Pi has implemented the best agent loop he's read: pi-mono/agent is only a few files. He uses it for teaching the topic. Highest cache hit rate, lowest tokens per session, fewest bugs. The implication: the harness war is decided not by which CLI looks prettiest but by which one wastes the fewest tokens between calls.
#2
@hnishio0105
https://x.com/hnishio0105/status/2048162121238642694
A real tool-loop bug postmortem. Claude Opus 4.7 called the same tool with the same args 17 times in a row on a customer repo, eating half the per-task budget on duplicate work. The tell: every duplicate had no thinking text. Root cause: when the agent re-emitted a duplicate, the loop pruned the prior turn from history, and the prune nuked the entire assistant turn, including the plan text. Fix is one if-statement: prune the duplicate's tool call and tool result, keep the reasoning text. Plan stays, loop stops.
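A minimal sketch of the fix over a hypothetical message-history schema (the thread doesn't publish the loop's actual data structures): on a repeated (tool, args) call, drop only the stale call/result pair and keep the assistant's plan text.

```python
def prune_duplicate(history, duplicate_call):
    """Remove an earlier duplicate tool call and its result from history,
    but keep the assistant's reasoning text so the plan survives.

    Schema is illustrative: messages are dicts with 'role' and 'content';
    assistant turns may carry 'tool_call' = (name, args); tool results
    carry 'for_call' = (name, args).
    """
    pruned = []
    for msg in history:
        if msg.get("tool_call") == duplicate_call:
            if msg["role"] == "assistant" and msg.get("content"):
                # The one-line fix: keep the plan text, drop only the call.
                pruned.append({"role": "assistant", "content": msg["content"]})
        elif msg.get("role") == "tool" and msg.get("for_call") == duplicate_call:
            continue  # drop the stale tool result
        else:
            pruned.append(msg)
    return pruned
```

With this, the reasoning that motivated the call stays in context, so the model sees its own plan and stops re-emitting the duplicate.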
#3
@TensorSlay
https://x.com/TensorSlay/status/2048067060996116901
Overnight auto-research experiment on a 9B target model. Woke up to a LoRA adapter where the agent had decided, unprompted, to train on hermes-agent traces via on-policy distillation, on the middle 1/3 of the MLP tensors only. Reading the trace, the agent had drawn inspiration from LLM Super-weight, ROME, and MEMIT. He gave it a primer; the loop did the rest. The lesson: bootstrapping the agent toward the problem statement is more decisive than runtime length.
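The run's exact target selection isn't published, but restricting an adapter to the middle third of MLP blocks can be sketched by generating module names (Llama-style module paths assumed) to feed something like PEFT's LoraConfig(target_modules=...):

```python
def middle_third_mlp_targets(num_layers, projs=("gate_proj", "up_proj", "down_proj")):
    """Fully-qualified module names for the middle third of MLP blocks.

    Assumes Llama-style naming ('model.layers.{i}.mlp.{proj}'); other
    architectures name their MLP projections differently. The returned
    list is the kind of thing you'd pass as LoRA target_modules to
    restrict training the way the trace describes.
    """
    start, stop = num_layers // 3, 2 * num_layers // 3
    return [f"model.layers.{i}.mlp.{p}" for i in range(start, stop) for p in projs]
```

For a 36-layer 9B-class model this would target layers 12-23 only, leaving early and late blocks frozen.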
#4
@amittimalsina14
https://x.com/amittimalsina14/status/2047949736229896481
Shipped an autoresearch skill with five publication-grade gates: paper-grounded hypothesis, pre-reg before compute, n≥5 seeds with IQM, multi-lens evaluation, falsified hypotheses logged with the prior that was wrong. Two-tier loop: keep autoresearch's overnight velocity for exploration, gate confirmation. The skill turns "let it cook overnight" into something a research lab could submit a paper from.
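The n≥5-seeds-with-IQM gate is concrete enough to sketch. The interquartile mean discards the top and bottom quarter of seed scores before averaging; the simple integer trim below is one common variant (libraries like rliable use fractional weighting instead):

```python
def iqm(scores):
    """Interquartile mean: drop the bottom and top 25% of seed scores,
    average the rest. More robust to one lucky or unlucky seed than the
    plain mean, which is why the gate pairs it with n >= 5 seeds.

    Integer-quartile trim for simplicity; fractional-weight variants
    exist and differ slightly for small n.
    """
    s = sorted(scores)
    k = len(s) // 4
    trimmed = s[k:len(s) - k]
    return sum(trimmed) / len(trimmed)
```

With five seeds this drops exactly the best and worst run, so a single outlier can't push a hypothesis through the confirmation gate.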
#5
@amittimalsina14
https://x.com/amittimalsina14/status/2047936963513290769
A failure mode from earlier in the day, for autoresearch on his offline-RL stack: the agent's deep-learning literature prior wasn't enough. Missed action-leak, missed reward monotonicity, missed predictor honesty. Each cost weeks. The missing primitive he's adding: literature in the loop. Read the relevant papers in real time before each iteration, not just at the start.
#6
@gauthampai
https://x.com/gauthampai/status/2048155381533389089
Prompt to DAG. Custom planner that detects when a task is complex, builds a DAG of deterministic and stochastic subtasks, runs coordination via deterministic scripts but lets the harness orchestrate. Built it because Karpathy complained autoresearch doesn't survive in Codex past a few iterations. Says program.md should auto-convert to a DAG so this layer wouldn't be needed.
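The planner itself isn't public, but the shape of the coordination layer can be sketched: topologically execute a DAG whose deterministic nodes run as plain scripts, while stochastic nodes would be handed to the harness. The task/dependency schema and names below are illustrative.

```python
def run_dag(tasks, deps):
    """Topologically execute a task DAG.

    tasks: {name: (kind, fn)} where kind is "det" (run as a deterministic
    script) or "stoch" (would be delegated to the harness; simulated here
    by just calling fn). deps: {name: set of prerequisite names}.
    """
    done, order = set(), []
    while len(done) < len(tasks):
        ready = [t for t in tasks if t not in done and deps.get(t, set()) <= done]
        if not ready:
            raise ValueError("cycle in task graph")
        for t in sorted(ready):  # deterministic scheduling of the ready set
            kind, fn = tasks[t]
            fn()
            done.add(t)
            order.append(t)
    return order
```

The point of the split is that retries and verification live in the deterministic nodes, so only the genuinely stochastic subtasks burn model calls.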
#7
@BorthwickAndrew
https://x.com/BorthwickAndrew/status/2048163178815860822
RoboPhD just evolved a 682-line agent that scored 65.9% Fmax on Price-149, a 149-protein benchmark specifically designed to defeat homology-based prediction. GEPA scored 55.7%. Karpathy's Autoresearch scored 57.7%. The author last took biology in high school. The reason it matters: a benchmark built to break the standard "find a similar protein and copy labels" trick, beaten by an evolved program from someone with no domain background.
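For readers outside bioinformatics, Fmax is the CAFA-style protein-centric metric: sweep a confidence threshold, average precision over proteins with at least one prediction and recall over all proteins, and keep the best F1. A simplified sketch (the official CAFA evaluator also propagates terms over the GO hierarchy, omitted here):

```python
def fmax(pred_scores, truth, thresholds=None):
    """Protein-centric Fmax, simplified from the CAFA definition.

    pred_scores: {protein: {term: confidence}}; truth: {protein: set of terms}.
    Precision is averaged only over proteins with a nonempty prediction at
    the given threshold, recall over all proteins, per CAFA convention.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precs, recs = [], []
        for prot, terms in truth.items():
            pred = {g for g, s in pred_scores.get(prot, {}).items() if s >= t}
            if pred:
                precs.append(len(pred & terms) / len(pred))
            recs.append(len(pred & terms) / len(terms))
        if precs:
            p, r = sum(precs) / len(precs), sum(recs) / len(recs)
            if p + r:
                best = max(best, 2 * p * r / (p + r))
    return best
```

So 65.9% Fmax means that at the agent's best confidence cutoff, the harmonic mean of per-protein precision and recall reaches 0.659 across the 149 proteins.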
#8
@kunchenguid
https://x.com/kunchenguid/status/2047859675664904593
Pushback on the "overnight agents disrupt sleep" critique. He runs agents overnight almost every day with no anxiety. The trick is not fancier models; it's spending time crafting tools and verifiable objectives (auto-research style) so agents can go longer without supervision. The bottleneck is your eval, not the model.
#9
@manaskarra
https://x.com/manaskarra/status/2048151442712858712
hollon: an open-source stack of K2.6 + autoresearch + browser-harness on Hermes. Dirt cheap, accessible to everyone. His take: "feels like everything we need is already out there." The autoresearch + commercial-grade browser-harness combo is the configuration most people haven't tried yet because it requires assembling four separate things.
#10
@donpark
https://x.com/donpark/status/2048156332101115966
Argues "thinking hooks" should be generalized: wrap them in DSPy/GEPA or an autoresearch loop and you can programmatically evolve the instructions that guide an AI's internal reasoning. Concretely: don't optimize the code, optimize the meta-instruction the model uses to think. Self-improving agents level up at the prompt-graph layer.
#11
@Pycognito
https://x.com/Pycognito/status/2047870626384289933
Built an open-source framework that lets Claude Code or Codex automate feature engineering, connected to a graphDB so the LLM learns from previous experiments. Inspired by Karpathy's autoresearch, but the graphDB-as-memory choice is what makes it actually iterate over weeks instead of minutes.
#12
@TheGreenCedar
https://x.com/TheGreenCedar/status/2048073507901157773
Codex Autoresearch: a general software development framework for experimentation and optimization. The pattern is now obvious: pick a verifiable metric, let an agent loop modify-verify-keep until it improves. It's no longer ML-only. Pull request quality, build performance, test coverage: anything with a numeric scorer.
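The modify-verify-keep pattern reduces to a few lines of hill climbing; `mutate` and `score` are the only problem-specific parts (the names here are illustrative, not from the framework):

```python
import random

def modify_verify_keep(candidate, mutate, score, iters=50, seed=0):
    """The generic autoresearch inner loop: propose a change, measure it
    against a numeric scorer, keep it only if the metric improves.

    Works for anything scoreable: tests passing, build time, coverage,
    PR quality, a model metric. mutate(best, rng) proposes a variant;
    score(x) returns a number to maximize.
    """
    rng = random.Random(seed)
    best, best_score = candidate, score(candidate)
    for _ in range(iters):
        trial = mutate(best, rng)
        s = score(trial)
        if s > best_score:  # keep only verified improvements
            best, best_score = trial, s
    return best, best_score
```

Everything interesting in the derivatives above lives in how `score` is made verifiable and cheat-resistant; the loop itself barely changes across domains.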
#13
@tyschultz7
https://x.com/tyschultz7/status/2047836596326514907
Used auto-research style optimization for unit test discovery. Same modify-verify-keep loop, applied to the question "what unit tests does this codebase need that nobody has written?" The frame is the unlock: any boring engineering problem with a measurable signal can be put on autopilot.
#14
@bit_finance_
https://x.com/bit_finance_/status/2048036544229818764
Autoresearch to uncover new trading indicators. The classic finance use case is no longer hypothetical; people are running it on indicators with the same discipline as ML evals: pre-reg, holdouts, multi-seed.
#15
@michalbravansky
https://x.com/michalbravansky/status/2048003424067707025
Honest negative result. Set up an autoresearch loop with Claude to iterate on a piece of writing for days. The output isn't actually better, but every other LLM he pastes it into now thinks it's "Pulitzer-level." The loop optimized for the wrong reward signal. The cautionary tale every autoresearch user should read once.
#16
@Georgehwp1
https://x.com/Georgehwp1/status/2048066914233049542
Cuts through the hype. "Confident takes on both sides seem provably wrong. A lot of people are exaggerating what they're doing. But also, no one doubts that autoresearch can productively work for long periods given some metric to iterate on." 12+ hour useful runs are feasible if you babysit cheating. Honest middle-ground take in a debate dominated by the extremes.
#17
@johniosifov
https://x.com/johniosifov/status/2048125011009884541
The agentic loop math nobody wants to acknowledge. Standard chat = 1 API call per response. Agentic loop = 10-20 API calls per task. That's 5-25× the inference cost on every task. Three survival strategies: model routing (small models for summarization/classification, large only for complex reasoning; plus semantic caching cuts 30-50%); agentic batch windows (off-peak queues drop costs 40-70%); task complexity scoring (score before the LLM call, route to local vs frontier accordingly). He runs this agent itself on Sonnet: 700+ PRs, 1,900+ tweets, fully automated.
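A toy version of the third strategy, with made-up scoring heuristics and model names (real routers use learned classifiers): score the task before any LLM call, then route.

```python
def route_model(task, cheap="local-small", frontier="frontier-large"):
    """Score task complexity before the LLM call and route accordingly.

    Heuristics and model names are illustrative. task is a dict with
    'prompt' plus optional 'needs_tools', 'multi_step', 'kind' fields.
    """
    score = 0
    score += min(len(task["prompt"]) // 500, 3)            # long prompts
    score += 2 * int(task.get("needs_tools", False))        # tool use
    score += 2 * int(task.get("multi_step", False))         # agentic loop
    score += int(task.get("kind") in {"reasoning", "code"})
    return frontier if score >= 3 else cheap
```

The payoff compounds inside a loop: if 15 of 20 calls per task are summarization or classification, routing them to a small model cuts most of the 5-25× multiplier.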
#18
@BuilderGerman
https://x.com/BuilderGerman/status/2048027359366795295
Discovered Codex spend wasn't burned by reasoning; it was input tokens. Long threads, massive command output, verbose logs, unbounded rg/find/cat/git diff. Built two hooks. PreToolUse blocks bash commands likely to flood context (cat big.log, unbounded rg, raw git diff) and suggests cheaper alternatives (git diff --stat, rg -n -m 50, tail -200). PostToolUse runs after a tool returns and replaces large outputs with compact summaries before the model sees them. Execution still happens; the next expensive input wave never arrives.
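The rule set below is illustrative, in the spirit of the described PreToolUse hook; the author's actual patterns aren't published beyond the examples in the post.

```python
import re

# Commands likely to flood the context window, each paired with a
# cheaper suggestion. Illustrative rules, not the hook's real config.
RULES = [
    (re.compile(r"^cat\s+\S+\.log"), "tail -200 <file>"),
    (re.compile(r"^rg\s+(?!.*-m\s*\d)"), "rg -n -m 50 <pattern>"),
    (re.compile(r"^git diff(?!\s+--stat)"), "git diff --stat"),
]

def pre_tool_use(command):
    """Return (allow, suggestion). Blocking happens before execution,
    so the expensive output never enters the model's context at all."""
    for pattern, suggestion in RULES:
        if pattern.search(command):
            return False, suggestion
    return True, None
```

The key design choice is that this is a deny-with-suggestion, not a plain deny: the model gets a cheaper command to try, so the loop recovers in one turn instead of flailing.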
#19
@TheTuringPost
https://x.com/TheTuringPost/status/2048015703350067422
Token taxonomy for agentic systems: input, output, reasoning, speculative (generated then discarded), cached (~90% cheaper), function schema (silently adds thousands of tokens for tool definitions), system prompt, agentic loop tokens (this is where costs explode), retrieval/RAG, multimodal, structural (BOS/EOS/separator). The 11-category map is the framing every agent operator needs.
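The "agentic loop tokens" category dominates for a mechanical reason: each turn resends the system prompt, tool schemas, and the whole transcript so far, so cumulative input grows quadratically with loop length. A toy model with uniform turn sizes:

```python
def loop_input_tokens(system_and_schema, per_turn, n_turns):
    """Total input tokens billed across an agentic loop.

    Toy model: every turn resends the fixed prefix (system prompt +
    function schemas) plus all prior turns, each adding per_turn tokens.
    Caching changes the price of these tokens, not their count.
    """
    return sum(system_and_schema + i * per_turn for i in range(n_turns))
```

With a 2,000-token prefix and 500 tokens per turn, a 20-turn loop bills 135,000 input tokens against only ~10,000 generated, which is why the cached-token discount matters more than output pricing for loop operators.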
#20
@bnafOg
https://x.com/bnafOg/status/2048146235710726165
Tactical Qwen3.6 fix for multi-turn agent loops. By default Qwen3.6 only keeps the thinking trace from the latest turn. Set preserve_thinking: true in chat_template_kwargs and reasoning carries across turns. In a 10-step agent loop, turn 3's insight stays active in turn 7: fewer repeated steps, better KV cache use. The kind of one-line fix that changes whether your loop converges.
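For an OpenAI-compatible server such as vLLM, chat_template_kwargs rides in the request body; the preserve_thinking flag itself is taken from the post and not verified against Qwen's own docs, so treat this as a sketch of where the flag goes rather than confirmed API:

```python
def qwen_chat_payload(messages, preserve_thinking=True):
    """Build an OpenAI-compatible chat payload that passes flags into
    the server-side chat template.

    chat_template_kwargs is the extension point servers like vLLM expose
    for template flags; preserve_thinking is as reported in the post
    (assumption, not checked against Qwen documentation).
    """
    return {
        "model": "Qwen3.6",
        "messages": messages,
        "chat_template_kwargs": {"preserve_thinking": preserve_thinking},
    }
```

The same payload shape works for any template flag (e.g. thinking toggles), which is why loop harnesses tend to expose chat_template_kwargs as pass-through config.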
#21
@dunik_7
https://x.com/dunik_7/status/2048039970569429494
Mapped the four open-source Polymarket repos behind the claim that "90% of profit isn't human." All official, all maintained by the exchange team: Polymarket/py-clob-client (Python SDK, 10 lines for orderbook + orders), Polymarket/agents (drop-in agent framework: point any model at it, watch it research and trade), Polymarket/poly-market-maker (quote bid/ask, earn the spread on every fill), Polymarket/clob-client (TypeScript). $200 + a wallet + 50 lines of Python is the actual entry barrier.
#22
@0xvati
https://x.com/0xvati/status/2048067439087517794
Why settlement timing is the next infrastructure bottleneck for agent loops. Autonomous agents running probabilistic loops can't tolerate the post-resolution payment delay that human traders accept; the agent needs its capital back to redeploy before the next market clears. Beep solves it by ensuring payments flow immediately on resolution. The takeaway: agent loops will reshape every infrastructure layer they depend on, starting with payment rails.
#23
@falkenprotocol
https://x.com/falkenprotocol/status/2047855423324106820
First end-to-end live agent loop test on FALKEN: 3+ poker matches running simultaneously through 5 rounds with full settlement on-chain. No stalls. No manual intervention. Self-healing recovery logic deployed earlier in the week caught and fixed silent transaction drops automatically. The autonomous loop runs LLM reasons → agent commits → referee verifies → chain settles → next round. The interesting part isn't the poker; it's that "scale concurrent matches without degradation" is now achievable.
#24
@abhishek__AI
https://x.com/abhishek__AI/status/2047875382721069494
Hugging Face shipped ML Intern, an autonomous AI engineer that researches, trains, and ships models for you. Reads papers + docs, runs the CLI in headless mode, uses HF datasets and repos, runs a full agent loop for up to 300 iterations. 100% open source. The pattern of "300-iter loops as a default" makes Karpathy's autoresearch feel mass-market by comparison.
#25
@greypixel_
https://x.com/greypixel_/status/2048168671248347191
Pushback on the parallel-agent flex. "20 is ridiculous and will result in a huge mess. The amount of time you can run an agent loop usefully for is proportional to the amount of time you put in to preparing to run it." The reminder everyone running 16+ subagents needs: prep work isn't optional, it's what determines whether overnight runs produce work or chaos.
#26
@MehdiBuilds
https://x.com/MehdiBuilds/status/2048106386912129385
Production stack for a personal AI second-brain agent: TypeScript + Node 18, ESM, tsup build. Vercel AI SDK v4 with generateText/streamText running a 10-step agentic loop with provider fallback. grammY for the Telegram bot (typing indicators, editable streaming, file uploads). SQLite + FTS5 for the knowledge layer. JSONL for short/long/episodic memory. Daemon manager with PID file + watchdog crash recovery. Native macOS LaunchAgent / Linux systemd / Windows Task Scheduler integration. Concrete reference architecture for anyone building "always-on agent that lives on your phone."
Eco Products Radar
Karpathy's Autoresearch (program.md repo) – Cited as the foundation by 8+ posts today; shipped derivatives include @amittimalsina14's gated-skill version, @Pycognito's graphDB feature engineering, @TheGreenCedar's Codex variant, @gauthampai's DAG planner, and @tyschultz7's unit-test discovery loop. The original repo became the spec everyone iterates on.
Hermes Agent – Self-improving open-source agent with persistent memory. Ranks alongside autoresearch as the dominant substrate for overnight loops; @TensorSlay's LoRA experiment trained on hermes-agent traces. Migration tools and skill compatibility are the practical reasons people cite it over OpenClaw.
Pi (pi-mono/agent) – Highest cache hit rate, lowest tokens per session, fewest bugs according to @0xSero. The "minimum viable harness" reference implementation.
Codex / GPT-5.5 harness – Showed up in @BuilderGerman's hook architecture, @TheGreenCedar's autoresearch port, and @Pycognito's framework. Codex hooks (PreToolUse, PostToolUse) are now table stakes for cost control.
Polymarket open-source agent stack – py-clob-client, polymarket/agents, poly-market-maker, clob-client. Mentioned across @dunik_7, @bit_finance_, and @0xvati's settlement-rail post. The open-source recipe driving 90% of agent profit on the exchange.
ML Intern (Hugging Face) – 300-iteration autonomous agent loop, headless CLI, native HF dataset/repo integration. Open source. Cited as the "agent loop default ships at industrial scale" data point.
GEPA / DSPy – @donpark proposes wrapping thinking hooks in a DSPy/GEPA + autoresearch loop. The DSPy stack remains the canonical place to build prompt-evolution systems.
Vercel AI SDK v4 – generateText + streamText with a 10-step agentic loop and provider fallback. Cited by @MehdiBuilds as the production runtime for serious agent products.
Qwen3.6 (with preserve_thinking) – One-line config flag (preserve_thinking: true in chat_template_kwargs) preserves reasoning across turns. The fix that makes Qwen viable as an agent-loop driver instead of a single-turn assistant.