Loop Daily: April 19, 2026
Autoresearch stopped being a demo this week. The clearest tell: a chess engine built itself from expert-level to 2718 ELO — top-50 human globally — in 70 autonomous experiments with zero human touching the code after day one. Two papers dropping within days of each other (TurboQuant for inference compression, and the chess autoresearch result) land on the same constraint from opposite ends. Meanwhile pi-autoresearch opened on Monday, went public Tuesday, 5,000 stars by Thursday. Shopify is running it. Farcast is running it on GTM. DE Shaw just got outshipped by a hobbyist with a $5 VPS. Today's feed shows the pattern stabilizing: "one laptop, one overnight run, validated improvements banked into git" is the new baseline for anything with a measurable objective function.
#1
@innoscoutpro
https://x.com/innoscoutpro/status/2045066518245863707
A chess engine built itself from "expert" up to 2718 ELO — top-50 human globally — in 70 autonomous experiments. No human touched the code after day one. Side by side, Google shipped TurboQuant, an inference compression paper that makes a dense 27B model run 3x faster on the same hardware with 4.9x compression at 3-bit. Kurtosis went from 900 to 2.9 after rotation. Independent reproduction confirmed. Framed as the moment the "deployable agent" gap finally collapsed: autoresearch is the mechanism finding improvements humans missed, TurboQuant is the enabler making those improvements cheap to run at scale. Both forced into existence by the same hardware bottleneck.
#2
@mustafa01ali
https://x.com/mustafa01ali/status/2045188957579653193
Pointed autoresearch at the Shopify mobile app. Cut 5 minutes off every CI run, made unit tests 34% faster, cold launch 300ms quicker, reduced re-renders on a key screen by 95%. All agent-driven. None of it would have happened this fast manually. The pattern is more important than the numbers: attach an autoresearch loop to a CI/build pipeline with real metrics and it finds compounding wins that a senior engineer never has calendar time to chase.
#3
@davebcn87
https://x.com/davebcn87/status/2045109196887130408
pi-autoresearch opened Monday, open-sourced Tuesday, 5,000+ stars by Thursday. Shopify running it on unit tests (300x faster), React components, CI builds (65% reduction). Second post from Dave captures the exact category shift: "AI agents used to code like us, but faster. pi-autoresearch does the work we never start. Nobody plans three months to cut build time by 30% — it's valuable but boring and expensive, so it never ships. Agents don't care. They don't get bored. They run while you sleep." The unlock is not the speed, it is the willingness to do work that has positive ROI but was never going to get scheduled.
#4
@shobitfarcast
https://x.com/shobitfarcast/status/2045117573373517994
Farcast has been using autoresearch for GTM — not ML — and calls it the biggest rethink of AI-assisted work they've done. Karpathy built autoresearch to run ML experiments overnight on one GPU: describe what to explore, point it at the repo, wake up to 100+ validated experiments and a full git history. Farcast repurposed the same loop for ICP validation. Describe the ICP hypothesis, let the agent iterate against real data, keep what produces tighter output, kill what produces generic noise. Results: 80% better outputs on GTM plans — not 80% faster, 80% more specific. The difference between "post on Twitter and LinkedIn" and "here are the three Slack communities where your exact ICP asks questions every week, and the message that works in each one."
#5
@JustinPBarnett
https://x.com/JustinPBarnett/status/2045105132400951609
Ran an autoresearch loop all night — 458 rounds — with Opus 4.7 xhigh. Used 12% of his weekly quota. That is the honest economics of overnight agentic work on the current Max plan: one unsupervised all-nighter costs roughly a day's worth of weekly capacity (12% of a 7-day week is about 0.85 days). Worth noting as a ceiling reference because most of the "autoresearch at home" threads underspecify how much compute they consume.
#6
@JanKoritak
https://x.com/JanKoritak/status/2045057235512856681
Client project, broken voice agent, 48-hour deadline. Used Karpathy's Auto-Research pattern as the debugging harness — describe the failing behavior, let the agent cycle through hypotheses, validate, commit, next. "It worked." Specifically useful as a data point that autoresearch is not purely experimental — applying it under a hard deadline is the kind of adoption signal that benchmarks don't capture.
#7
@ks_kulk
https://x.com/ks_kulk/status/2044998047793594701
Frames a specific, scary application: autoresearch directed at quantum circuit optimization for breaking ECDSA. Google's published results already cite three algorithmic improvements — attack priming before the public key is shown, Litinski 2023 amortization tricks, Chevignard 2026 width optimization. A goal-specified autoresearch agent with a clear prompt ("find strategies for minimizing logical qubits and Toffoli gates for this quantum circuit, use Google's results as a starting point, beat their published numbers") is a feasible path to further engineering optimizations. Raises the ceiling question for the whole category: if autoresearch can squeeze 10% more out of cryptographic-attack circuits, the threat timeline changes.
#8
@eliautobot
https://x.com/eliautobot/status/2045233314177720799
Used Karpathy's autoresearch approach to build an autonomous movement system for an agent world he's constructing. Applied the pattern and had something working in roughly 3 hours. The usefulness here is not the game — it's that "3 hours from idea to working autonomous behavior" is the baseline now for anything that can be reduced to a measurable objective and an editable file.
#9
@ben_burtenshaw
https://x.com/ben_burtenshaw/status/2045085809800356112
Hands-on guide to setting up multi-agent autoresearch in Karpathy's pattern using open models — works with Codex, Claude, OpenCode. Five-agent configuration with scoped tools and permissions: researcher searches HF papers and proposes hypotheses, planner maintains an experiment plan and log, workers update scripts and launch HF jobs on GPU, reporter monitors jobs and pushes metrics to a Trackio dashboard. Ran it for 4 hours, 32 jobs completed, modest baseline improvement. Worth reading as the concrete template for a multi-agent autoresearch setup that actually runs rather than just diagrams on a slide.
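The scoped-tools idea in that setup can be sketched as a simple role-to-tool map checked on every call. Agent and tool names below are hypothetical stand-ins, not the actual configuration from the thread:

```python
# Hypothetical role/permission map in the spirit of the five-agent setup:
# each agent only ever sees the tools its role needs.
AGENTS = {
    "researcher": {"search_papers", "propose_hypothesis"},
    "planner":    {"read_log", "write_plan"},
    "worker_a":   {"edit_script", "launch_job"},
    "worker_b":   {"edit_script", "launch_job"},
    "reporter":   {"poll_job", "push_metrics"},
}

def allowed(agent: str, tool: str) -> bool:
    """Scoped permissions: a tool call is valid only if the role grants it."""
    return tool in AGENTS.get(agent, set())
```

The point of the map is that a misrouted call (the reporter trying to edit a training script, say) fails mechanically instead of relying on the prompt.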
#10
@bibhashroykol
https://x.com/bibhashroykol/status/2045153809048215733
Cautionary dispatch from production: four LangChain agents, two of them drifted into a recursive cycle where Analyzer kept sending clarification requests and Verifier kept responding with instructions. 11 days. $47,000 API bill. The team assumed the rising spend was user growth. Lusser's law compounds per-step reliability: at 85% accuracy per step, a 10-step workflow succeeds end to end only 0.85^10 ≈ 19.7% of the time, and a 20-step workflow just 0.85^20 ≈ 4%. The fix is three hard limits on every agent loop: max iterations, max spend, max runtime. One config line with a $50 cap would have stopped the $47K loop in minutes.
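Those three hard limits fit in a few lines wrapping whatever the loop body is. `step` here is a hypothetical callable standing in for one agent iteration, returning whether the task is done and what the iteration cost:

```python
import time

def run_capped_loop(step, max_iterations=100, max_spend_usd=50.0, max_runtime_s=3600):
    """Run an agent loop under three hard limits.

    `step(i)` is a stand-in for one agent iteration; it returns
    (done, cost_usd). The caps are checked before every iteration,
    so a runaway loop stops at the first limit it hits.
    """
    spend = 0.0
    start = time.monotonic()
    for i in range(max_iterations):
        if spend >= max_spend_usd:
            return f"stopped: spend cap hit after {i} iterations"
        if time.monotonic() - start >= max_runtime_s:
            return f"stopped: runtime cap hit after {i} iterations"
        done, cost = step(i)
        spend += cost
        if done:
            return f"finished in {i + 1} iterations (${spend:.2f})"
    return f"stopped: iteration cap hit (${spend:.2f})"
```

With a $50 spend cap, an Analyzer/Verifier ping-pong burning a dollar a round gets killed after 50 rounds instead of 11 days.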
#11
@bnafOg
https://x.com/bnafOg/status/2045049548800766052
Opus 4.7 shipped task_budget_tokens as a public beta. Claude now gets a countdown for the full agentic loop — thinking plus tool calls plus output — which lets the model self-calibrate when to stop versus keep exploring. Without it, long agentic runs silently fall apart when one planning step eats the whole context budget. Same post flags that Gemini 3.1 Pro shares its extended thinking budget across all subtasks in an agentic loop, which is why one heavy planning step can starve the entire run. Most developers have not set task_budget_tokens even though the effect on multi-step reliability is immediate.
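As a sketch of what setting the parameter might look like: the request body below uses the `task_budget_tokens` field named in the post, but the model string, field layout, and everything else are assumptions for illustration, not documented API.

```python
# Hypothetical request body. Only `task_budget_tokens` comes from the post;
# every other field and value here is an assumption, not a documented API.
payload = {
    "model": "claude-opus-4-7",
    "max_tokens": 4096,
    # Countdown for the whole agentic loop: thinking + tool calls + output.
    "task_budget_tokens": 120_000,
    "messages": [
        {"role": "user", "content": "Profile the test suite and propose one speedup."}
    ],
}
```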
#12
@ybkim95_ai
https://x.com/ybkim95_ai/status/2044962799559073934
CoDaS — AI Co-Data-Scientist for biomarker discovery from wearable sensor data. Multi-agent loop with hypothesis generation from large-scale wearable datasets, statistical and ML validation, adversarial critique to reject spurious findings, literature-grounded reasoning for mechanistic plausibility, human-in-the-loop for report review. Tested across 3 cohorts (N = 9,279). Identified 66 candidate digital biomarkers passing strict validation, found consistent circadian instability signals across independent depression datasets, recovered known metabolic markers like TG/HDL and CRP. Collaboration across Google Research, DeepMind, MIT. The paper is the first public demonstration at scale of autoresearch-style agent loops doing clinically meaningful biomarker discovery.
#13
@Forsy_AI
https://x.com/Forsy_AI/status/2045080521810559373
Browserbase replaced a dozen internal bots with one generalized agent called "bb" and published the architecture. Lives in Slack, writes PRs, queries Snowflake, investigates production sessions. One agent loop with skills loaded on demand, credentials never exposed to the sandbox, 100% feature request coverage, zero human effort. This is the convergence point four different platforms — Anthropic, OpenAI, Cloudflare, Browserbase — shipped independently in the same week: single agent loop, lazy-loaded skills, scoped permissions, isolated sandbox. That convergence across both provider-side and user-side shops is the strongest signal that this architecture is the right one.
#14
@HerselmanI
https://x.com/HerselmanI/status/2045106843249172925
Short but worth the link. Built a self-improving agent loop to solve an actual business problem, not a benchmark task. Less technically impressive than Karpathy's original ML runs, but far more useful as a real-world validation that the same pattern works outside research. The thread below this one also carries useful back-and-forth on what breaks when you run the loop against messy production data.
#15
@NoDataSold
https://x.com/NoDataSold/status/2044930597424902431
Built a controlled critique loop between two distinct Hermes agents — Max (executor, verifier, enforcer) and Nova (filter, challenger, taste layer). Shared context but separate personalities and behaviors. Key moves: enforcement moved into the tool dispatch path so invalid actions get blocked mechanically; STRICT mode for strong patterns with escape-hatch and duration cap; pattern memory upgraded to track intent fingerprints, failure type, fix, frequency, last seen, success rate; asymmetric reward so ineffective patterns lose weight faster; Nova and Max each have persistent SOUL rules; multi-agent loop with structured tension between speed/taste and correctness/proof. Canary upgrades route through Nova before reaching Max. One of the more elaborate publicly documented architectures for persistent-identity multi-agent systems running in the wild.
#16
@samhogan
https://x.com/samhogan/status/2045174875921481979
Catalyst — an LLM fine-tuning engine that turns production traces into small, self-improving frontier-quality models owned by the user. Schematron, the internal model, is trained and deployed on it. Notable because it targets the specific middle of the market that's been underserved: teams with real production traces but no infra to use those traces as training signal. If the claim holds up, this is autoresearch applied to model weights themselves rather than to code.
#17
@omarsar0
https://x.com/omarsar0/status/2045241905227915498
Autogenesis — a self-evolving agent protocol where agents identify their own capability gaps, generate candidate improvements, validate them through testing, and integrate what works back into their own operational framework. No retraining, no human patching — just an ongoing loop of assessment, proposal, validation, integration. Situates it alongside Meta-Harness and the Darwin Gödel Machine line as the cleanest protocol-level take on continual self-improvement so far. Read the paper. These are the designs that will define what "static agents age quickly" actually means.
#18
@Underfox3
https://x.com/Underfox3/status/2045277944264749147
Nvidia researchers demonstrated an agentic LLM-based coding framework autonomously evolving a multi-million-line EDA tool at the full scale of the ABC logic synthesis system. Self-improving code generation applied to production-scale tooling — not toy benchmarks. If this kind of result holds up, the same agentic loop pattern works not only on greenfield projects but on 20-year-old codebases that were considered uneconomical to touch.
#19
@VictorATHER
https://x.com/VictorATHER/status/2045217042152718346
Proposed concept with concrete inspirations. A closed-loop AI system that simulates market reactions to GTM strategies, runs iterative A/B/n experiments, and outputs the optimal strategy before real-world deployment. Points at Karpathy's Autoresearch and Guo Hangjiang's Mirofish as the two reference repos. Sits at the exact intersection where GTM meets autoresearch — and if Farcast's 80% result above is replicable, this is where the next wave of autoresearch-style deployments will land.
#20
@duin_dev
https://x.com/duin_dev/status/2045037721190608992
Single-developer report: built a self-improving agent by implementing nothing more than a simple write/recall memory tool. The agent found the tool unprompted and started using it for improvements. Small anecdote, but it underlines a point the fancy papers sometimes obscure — the self-improvement pattern works with a memory primitive that can be built in an afternoon.
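A write/recall primitive of that kind really does fit in an afternoon. A minimal file-backed sketch (names and storage format are illustrative, not the author's implementation):

```python
import json
from pathlib import Path

MEMORY_PATH = Path("memory.jsonl")  # append-only log, one JSON object per line

def memory_write(key, value, path=MEMORY_PATH):
    """Append a fact to the log; later writes to the same key win on recall."""
    with path.open("a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")

def memory_recall(key, path=MEMORY_PATH):
    """Return the most recent value stored under `key`, or None if unseen."""
    if not path.exists():
        return None
    value = None
    for line in path.read_text().splitlines():
        entry = json.loads(line)
        if entry["key"] == key:
            value = entry["value"]
    return value
```

Expose those two functions as tools and the loop has durable state across sessions; the append-only log also doubles as an audit trail of what the agent chose to remember.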
📡 Eco Products Radar
pi-autoresearch: The current flagship. Open-source extension for the "pi" AI coding agent that runs in the terminal. Give it a goal like "make tests faster" and it runs an endless experiment loop — edits code, benchmarks, keeps wins, reverts losses, logs everything to autoresearch.jsonl. Shipped Monday, open-sourced Tuesday, 5K+ stars by Thursday. Shopify running it on unit tests, React components, CI builds.
Karpathy's Autoresearch (original pattern): The spiritual parent of everything in this digest. Run ML experiments overnight on one GPU: describe what to explore, point an AI agent at your repo, wake up to 100+ validated experiments with full git history. The agent only commits improvements. The pattern generalizes to anything with an editable file plus a measurable metric.
Hermes Agent (Nous Research): Self-improving AI agent, self-hostable, runs locally or on VPS, writes its own skills every ~15 tool calls, persistent memory (MEMORY.md + USER.md + SQLite). Per-model tool-call parsers make it the best harness for local models right now. Ollama 0.21 ships native Hermes support.
Trackio / HF Jobs: The monitoring layer underneath the ben_burtenshaw multi-agent autoresearch setup. Reporter agent pushes job events and metrics to a Trackio dashboard while workers launch HF jobs on GPU. Worth watching as the observability layer for autoresearch-style loops finally gets standardized.
Autogenesis / Meta-Harness / Darwin Gödel Machine: The three reference points for protocol-level continual self-improvement currently under discussion. Autogenesis (just out) is the cleanest protocol take — assess, propose, validate, integrate. Read these if you are trying to think beyond single-loop autoresearch into systems that rewrite their own loops.
task_budget_tokens: Opus 4.7 public-beta parameter that gives the model a countdown for the full agentic loop. Fewer context collapses on multi-step jobs. Under-used — most developers have not set either task_budget_tokens or the xhigh effort tier even though the effect on agentic reliability is immediate.
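The keep/revert loop that pi-autoresearch runs against a git repo can be sketched in-memory. Below, a dict of parameters stands in for the codebase and a benchmark callable for the real metric; all names are illustrative (the real tool edits code and reverts via git):

```python
import json
import random

def autoresearch_loop(state, mutate, benchmark, rounds=50,
                      log_path="autoresearch.jsonl", seed=0):
    """Keep/revert loop in the pi-autoresearch style.

    Each round: propose a change, measure it, keep it only if the metric
    improves (lower is better), and log the outcome. `state` is an
    in-memory stand-in for the codebase; discarding a losing candidate
    plays the role of `git revert`.
    """
    rng = random.Random(seed)
    best = benchmark(state)
    with open(log_path, "w") as log:
        for r in range(rounds):
            candidate = mutate(dict(state), rng)  # copy, then perturb
            score = benchmark(candidate)
            kept = score < best
            if kept:
                state, best = candidate, score
            log.write(json.dumps({"round": r, "score": score, "kept": kept}) + "\n")
    return state, best
```

Because losing candidates are never merged into `state`, the metric is monotone: the banked result can only improve round over round, which is the whole "validated improvements banked into git" property.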