Loop Daily: 2026-06-02
Karpathy joined Anthropic this week to do autoresearch at frontier scale, and the timing matters: in the same week, the discipline of running agents in production stopped being implicit and started being curriculum. Most of the field has already converged on a one-line definition: autoresearch is /goal plus a verifier you can trust, running while you sleep. What's new in this slice is that people are now reporting how long the loops run, how big they get, and how much they cost: 8-10 hour reverse-engineering loops, 24-hour 35-agent stacks improving production SOTA by 5 points, 1B-token days on Max 20x plans, and runaway sub-agent loops that ate a month of budget in 40 minutes. The closed loop is the product. The model swap is the easy part.
#1
@hu_yifei
https://x.com/hu_yifei/status/2061166665677856973
The honest current setup: daily overnight /goal mode autoresearch, parallel Codex CLI sessions during the day in fast mode, joking about how many $200 subscriptions are needed to support the habit. The reference number for how loud the loop is: he was on API keys before and was easily spending 5-10x what the Max plan would cost. This is the prime exhibit for why subscriptions exist for agent runners and why Anthropic is now metering them.
#2
@ryancarson
https://x.com/ryancarson/status/2061167249298206952
Real production migration via auto-research. Used Devin to run the full loop on a specific Untangle workflow (atomizing discovery document requests). The auto-research told them haiku-4.5 hits the required accuracy and wins on latency vs whatever they were running. So they're switching. The structural point: the loop is a model-selection tool, not a model-training tool, and that's already enough to flip production economics.
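The "loop as model-selection tool" idea reduces to a small decision rule: among the models that clear the accuracy bar, take the fastest. A minimal sketch, assuming you already have per-model eval results; the model names, numbers, and the `pick_model` helper are all hypothetical and not from the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalResult:
    model: str
    accuracy: float        # fraction of eval cases passed
    p50_latency_ms: float  # median latency on the same workflow


def pick_model(results: list, min_accuracy: float) -> Optional[EvalResult]:
    """Return the lowest-latency model that meets the accuracy bar, or None."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.p50_latency_ms)


# Made-up numbers, for illustration only.
results = [
    EvalResult("incumbent-model", 0.97, 1800.0),
    EvalResult("candidate-small", 0.95, 400.0),
    EvalResult("candidate-tiny", 0.88, 150.0),
]
choice = pick_model(results, min_accuracy=0.94)
print(choice.model)
```

The accuracy threshold is the verifier here; latency only breaks ties among models that already pass it.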
#3
@mladluka
https://x.com/mladluka/status/2061131327491944735
The biggest reported loop in this slice. 35+ parallel agents in a 24-hour-plus autoresearch run on an NLP imbalanced-class problem. Architecture: 10 research agents scraping arXiv, GitHub, Kaggle, Medium and saving findings to research.md; 10 implementation agents adapting research to the concrete problem, training models, running evals, logging to logs.md; 10 feedback agents doing full error analysis and proposing the next architecture iteration to feedback.md. The PR ended up over 1 million lines of code. Result: the existing production real-time SOTA model moved up 5 points.
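The three-role layout above can be sketched as one round of a pipeline that communicates only through the shared markdown files. This is a toy under stated assumptions: the three callables stand in for whole agent pools, the `run_round` helper is hypothetical, and nothing here reflects how the original run was actually wired.

```python
import tempfile
from pathlib import Path


def append_line(path: Path, line: str) -> None:
    """Append one line to a shared notes file, creating it if needed."""
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")


def read_or_empty(path: Path) -> str:
    return path.read_text(encoding="utf-8") if path.exists() else ""


def run_round(workdir: Path, round_no: int, research, implement, review) -> None:
    """One research -> implement -> feedback pass over the shared files."""
    research_md = workdir / "research.md"
    logs_md = workdir / "logs.md"
    feedback_md = workdir / "feedback.md"

    # Researchers see the previous round's feedback before proposing.
    finding = research(read_or_empty(feedback_md))
    append_line(research_md, f"round {round_no}: {finding}")

    # Implementers work from everything researched so far.
    result = implement(read_or_empty(research_md))
    append_line(logs_md, f"round {round_no}: {result}")

    # Reviewers analyse the logs and propose the next iteration.
    note = review(read_or_empty(logs_md))
    append_line(feedback_md, f"round {round_no}: {note}")


# Stub agents, purely illustrative.
workdir = Path(tempfile.mkdtemp())
for n in (1, 2):
    run_round(
        workdir,
        n,
        research=lambda fb: f"idea informed by {len(fb.splitlines())} feedback notes",
        implement=lambda notes: f"trained on {len(notes.splitlines())} findings",
        review=lambda logs: f"error analysis over {len(logs.splitlines())} runs",
    )
print(read_or_empty(workdir / "research.md"))
```

Keeping the files as the only channel between roles is what makes the pools independently scalable; in a real parallel run the appends would need locking or per-agent files.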
#4
@shreybirmiwal
https://x.com/shreybirmiwal/status/2061216703762293101
At Modal's autoresearch hackathon: EEG controller reading brainwaves, plus labeled actions (spacebar, mouse move), fed into an autoresearch loop to train a model that predicts browser control from brainwaves. wave1 spikes on blink, mapped to jump. Result: playing Geometry Dash with brainwaves. Worth marking as the moment autoresearch jumped past code into raw signal-to-action.
#5
@Daniel_Alami
https://x.com/Daniel_Alami/status/2061091064367214889
Long thread, but the substance is rare. Nine-week solo project starting March 28 chasing Karpathy's autoresearch idea. Released ZTARE (Zero-Trust Adversarial Reasoning Engine): one AI proposes, a second tries to break it, a third referees, deterministic gates decide. Pointed it at four hard open problems — modified gravity, neural scaling laws, consciousness-ascription governance, Navier-Stokes — and what it returned was binding-constraint diagnoses, not positive laws. A score of 98 was the warning sign that the system was retrieving academic mainstream and dressing it up. About a month in he pointed the system at itself; a miner scores every primitive and feeds it back. Cognitive-Firm released as a second repo: three layers, with a Claude Code / Codex workforce as the Research Director and ZTARE as the substrate-mutator-judge workbench. Both repos open-sourced.
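The propose / break / referee / gate shape can be illustrated with a few lines. This is not ZTARE's code: the `Round` record, the two gates, and the `decide` function are hypothetical stand-ins. The one thing the sketch tries to show is that the final decision comes from deterministic checks, so a high referee score cannot override a failed gate.

```python
from typing import NamedTuple


class Round(NamedTuple):
    claim: str
    objections: tuple    # raised by the adversary model
    unanswered: tuple    # objections the referee judged unrebutted
    referee_score: int   # 0-100, advisory only in this sketch


# Deterministic gates: plain predicates over the round record.
GATES = (
    ("was_actually_attacked", lambda r: len(r.objections) > 0),
    ("no_unanswered_objection", lambda r: len(r.unanswered) == 0),
)


def decide(r: Round, gates=GATES):
    """Accept only if every gate passes; return (accepted, failed_gate_names)."""
    failed = [name for name, check in gates if not check(r)]
    return (not failed, failed)


# A glowing score with no real attack is rejected by the gates.
suspicious = Round("restated consensus", (), (), 98)
survivor = Round("narrow claim", ("edge case A",), (), 70)
print(decide(suspicious))
print(decide(survivor))
```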
#6
@yibie
https://x.com/yibie/status/2061215797109002550
The autoresearch evidence scan now sits at 423 entries; 12 added this week. Highlights: auto-alphafold3 brings autoresearch into protein folding research; autoresearch-distillation trains an LLM via SDPO/GRPO to beat the original agent at autoresearch tasks; symphonic-autoresearch deploys the loop on OpenAI's Symphony orchestration framework; chess-autoresearcher proved chess-engine hyperparameters had reached a local optimum after 46 experiments; AutoResearch Trading Strategy as an autonomous discovery system for crypto futures; Autobrowse as a Karpathy-pattern application of browser agent memory. The breadth here is the signal.
#7
@SPXTrades
https://x.com/SPXTrades/status/2061194017912823845
Concrete max-length run: 8-10 hour auto-research loop on a reverse-engineering task targeting CRC32 and SHA256. Worth marking because it's a clean tabletop benchmark for how far the loops are reaching outside the typical web/SaaS domain.
#8
@KranenKyle
https://x.com/KranenKyle/status/2061004926617346483
Auto research applied to prototyping new scheduling algorithms inside Dynamo (NVIDIA's inference serving framework). Already used it to design some improvements and "starting to push into how far this can go." Auto-research as algorithm-design assistant in a real distributed systems context is a category jump.
#9
@alokbishoyi97
https://x.com/alokbishoyi97/status/2061107898814808552
Open-sourced evo: plug it onto any repo, it discovers and suggests metrics to optimize, runs the autoresearch loop in parallel. Comes with gates so the autoresearch agents can't introduce unintended consequences. Distributable across whatever cloud infra you have, or local. The "gates against unintended consequences" line is the part this category has been missing — most loops happily Goodhart their rubric.
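One common way to build a gate against Goodharting is to pair the target metric with guard metrics that must not regress. The sketch below is an assumption about what such a gate could look like, not evo's actual implementation; the `passes_gates` name, the metric names, and the "higher is better" convention are all illustrative.

```python
def passes_gates(baseline: dict, candidate: dict, target: str,
                 guards: list, tolerance: float = 0.0) -> bool:
    """Accept a change only if the target improves and no guard regresses.

    All metrics are assumed to be "higher is better".
    """
    if candidate[target] <= baseline[target]:
        return False
    return all(candidate[g] >= baseline[g] - tolerance for g in guards)


baseline = {"throughput": 100.0, "test_pass_rate": 1.0, "accuracy": 0.90}

# Faster, but only because it broke tests: rejected.
gamed = {"throughput": 140.0, "test_pass_rate": 0.8, "accuracy": 0.90}

# Smaller win with guards intact: accepted.
honest = {"throughput": 110.0, "test_pass_rate": 1.0, "accuracy": 0.90}

guards = ["test_pass_rate", "accuracy"]
print(passes_gates(baseline, gamed, "throughput", guards))
print(passes_gates(baseline, honest, "throughput", guards))
```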
#10
@LeeLeepenkman
https://x.com/LeeLeepenkman/status/2061171114135707660
"Working on autoresearch for stock trading: long-running tasks that use Chronos2-trained time-series forecasters or XGBoost/RL to learn trading algorithms." The frontier of autoresearch outside code is increasingly time-series finance — the rubric is unambiguous (PnL), and the search space is gigantic.
#11
@Peaky8linders
https://x.com/Peaky8linders/status/2061187290685231269
Autoresearch applied to security: deploys agent swarms to scan and detect compliance and cybersec issues, run experiments with probes, and continuously harden systems under stressors. Real defensive use case, not a content gimmick.
#12
@kollisarath
https://x.com/kollisarath/status/2061184407432753295
"Building AI models for life sciences using auto research. Self-evolution agent systems will be the future." Limited detail but the destination is clear — pharma teams are starting to wire autoresearch into model discovery the same way ML teams have.
#13
@AIImgGeneration
https://x.com/AIImgGeneration/status/2061149945458032740
Chrome CDP driven in autoresearch fashion to iteratively build content extractors for specific websites (X is the one named). After a few iterations the extractors were good enough; the net result was fewer articles to add to Obsidian by hand. Small-scale, but a clean home-cooked autoresearch use case in the personal knowledge management stack.
#14
@maxjendrall
https://x.com/maxjendrall/status/2061199125715255323
The clean technical definition: "/goal is an agent loop + verifier." Works only when the task has a binary/measurable yes-no completion check, and only with enough headroom on access and limits. On the 20x plan, with current promos, he's spending 1B+ tokens some days. The number is loud and consistent with the rest of the dataset.
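"Agent loop + verifier" fits in a dozen lines once the agent is abstracted away. A minimal sketch, assuming a `propose` callable that stands in for one agent step and a `verify` callable that returns a strict yes/no; the `run_goal` name and the toy task are hypothetical and unrelated to how any CLI implements /goal.

```python
def run_goal(propose, verify, max_steps: int):
    """Iterate the agent until the verifier says done or steps run out.

    Returns (final_state, steps_used); final_state is None on failure.
    """
    state = None
    for step in range(1, max_steps + 1):
        state = propose(state)   # one agent step: produce a new attempt
        if verify(state):        # binary completion check
            return state, step
    return None, max_steps


# Toy task: find the smallest positive integer whose square is at least 50.
propose = lambda s: 1 if s is None else s + 1
verify = lambda s: s * s >= 50

print(run_goal(propose, verify, max_steps=20))  # (8, 8)
print(run_goal(propose, verify, max_steps=3))   # (None, 3)
```

The `max_steps` bound is the "headroom on access and limits" part: without a measurable `verify`, the loop has no way to stop early, and without headroom it stops before the check can pass.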
#15
@ryancarson
https://x.com/ryancarson/status/2061050823593906659
The simplest one-line operationalization: "Karpathy's idea of auto research — the easiest way to implement it is using the /goal feature in codex (or whatever claude code calls it)." This is now the consensus framing across the higher-engagement threads.
#16
@aabyzov
https://x.com/aabyzov/status/2060900302979498271
Cautionary number for everyone running loops: "A runaway sub-agent loop once burned a month of my budget in 40 minutes before I added a hard ceiling. Caps are the new rate limit." The category needs durable execution patterns, not just better prompts.
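A hard ceiling is simple to express: every call is charged against a budget object that refuses to go over. A minimal sketch; the `Budget` class, the dollar figures, and the per-call cost are hypothetical, and a real setup would also want the cap enforced on the provider side.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the hard ceiling."""


class Budget:
    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Check before spending so the ceiling is never crossed.
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"charge of {cost_usd} would exceed limit {self.limit_usd}"
            )
        self.spent_usd += cost_usd


# A runaway loop that would otherwise never stop.
budget = Budget(limit_usd=1.0)
calls = 0
try:
    while True:
        budget.charge(0.25)  # assumed cost of one sub-agent call
        calls += 1
except BudgetExceeded:
    pass

print(calls, budget.spent_usd)  # 4 1.0
```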
#17
@danyurkin
https://x.com/danyurkin/status/2060968447962419599
Concrete agentic-loop benchmark: a trip-planning task that only completes if 7 specific tool calls fire. The model nailed 7/7 in 6.9 seconds on 4.8GB of RAM. For comparison the poster ran gpt-oss-20b on the same task and it dropped to 3/7. Multi-tool reliability is the metric people are actually starting to design loops around.
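Scoring a "did all required tools fire" benchmark comes down to comparing a call trace against a required set. A sketch under assumptions: the seven tool names and the `tool_call_score` helper are invented for illustration, since the post does not list the actual tools.

```python
def tool_call_score(required: list, trace: list):
    """Return (hits, total): how many required tools appear in the trace."""
    fired = set(trace)
    hits = sum(1 for tool in required if tool in fired)
    return hits, len(required)


# Hypothetical tool names for a trip-planning task.
required = [
    "search_flights", "search_hotels", "get_weather", "convert_currency",
    "check_calendar", "book_flight", "book_hotel",
]

full_trace = list(required)
partial_trace = ["search_flights", "get_weather", "get_weather", "book_flight"]

print(tool_call_score(required, full_trace))     # (7, 7)
print(tool_call_score(required, partial_trace))  # (3, 7)
```

Repeated calls count once, so a model cannot raise its score by spamming one tool.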
#18
@sakurayukiai
https://x.com/sakurayukiai/status/2061208118931976320
"Qwen3.6-27B hits 77.2% on SWE-bench Verified while running on a single consumer GPU. The local agent loop is getting ridiculously cheap." If true, this collapses the cost case for a lot of mid-complexity always-on loops to almost nothing — and explains why the open-weight crowd is suddenly more dangerous.
#19
@trustable_ai
https://x.com/trustable_ai/status/2061135806928925107
Stack confession that maps the current open-vs-closed split: Claude for complex coding because it works, but everything else is being pushed toward open-weight models. Hermes specifically runs on open models with a solid agent loop and many skills — "very dangerous" — so it lives on a VPS with Obsidian synced via git. Personal sovereignty over the loop is becoming a design goal.
#20
@aboutlo
https://x.com/aboutlo/status/2061145933899829508
ds4-server vs ds4-agent take: "I don't get why pushing ds4-agent when you could leverage [existing harness] and create custom extensions for ds4 without recreating the agent loop from scratch." Real-time signal of how repetitive the new-agent-runtime pitch is starting to feel to power users.
#21
@mustafaergisi
https://x.com/mustafaergisi/status/2061115493440704710
"The lock-in problem with AI tooling now isn't the model. It's the harness. My agent loop is 90% scaffolding and 10% prompt, and swapping the underlying model means rebuilding most of that scaffolding." Sharpest one-line articulation in this slice of why the harness is now the moat.
#22
@vipul_khatana_
https://x.com/vipul_khatana_/status/2061101797498900550
Implementation detail worth saving: log-writing as a hook on the agent loop. Fires on every tool call. The agent never "remembers to log" because writing is just part of acting. Structured extraction happens downstream off the critical path. Standard pattern but rarely articulated this cleanly.
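The hook pattern is easy to show with a wrapper: the harness wraps every tool once, and from then on a log record is written on each call whether or not the agent thinks about it. A minimal sketch; the `logged` helper, the in-memory `log` list, and the toy tools are assumptions, not the poster's code.

```python
import functools
import json


def logged(name: str, fn, sink: list):
    """Wrap a tool so every call appends one JSON record to the sink."""
    @functools.wraps(fn)
    def wrapper(**kwargs):
        record = {"tool": name, "args": kwargs}
        try:
            result = fn(**kwargs)
            record["ok"] = True
            return result
        except Exception as exc:
            record["ok"] = False
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and failure alike; raw record only,
            # structured extraction can happen later from the sink.
            sink.append(json.dumps(record, sort_keys=True, default=str))
    return wrapper


log = []
raw_tools = {
    "add": lambda a, b: a + b,
    "divide": lambda a, b: a / b,
}
tools = {name: logged(name, fn, log) for name, fn in raw_tools.items()}

print(tools["add"](a=1, b=2))
try:
    tools["divide"](a=1, b=0)
except ZeroDivisionError:
    pass
print(len(log))
```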
#23
@IamPranavJ
https://x.com/IamPranavJ/status/2060912940669386791
"Benchmarks never capture the latency tax. 400ms vs 20ms per tool call doesn't move MMLU, but in an agent loop it compounds into the whole runtime. We moved the small models on-device and it changed our unit economics more than any accuracy gain did." Latency is becoming the real benchmark for production loops.
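The compounding is plain arithmetic. Taking the 400ms and 20ms figures from the quote and assuming, purely for illustration, a run that makes 500 sequential tool calls:

```python
def tool_latency_seconds(calls: int, per_call_ms: float) -> float:
    """Total time spent waiting on sequential tool calls, in seconds."""
    return calls * per_call_ms / 1000


CALLS = 500  # assumed number of sequential tool calls in one run

remote = tool_latency_seconds(CALLS, 400)
on_device = tool_latency_seconds(CALLS, 20)

print(remote)              # 200.0
print(on_device)           # 10.0
print(remote / on_device)  # 20.0
```

The 20x ratio holds at any call count; what changes with run length is whether the absolute gap is seconds or hours.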
#24
@oscmansan
https://x.com/oscmansan/status/2061233855961456784
The sharpest critique of the form: Karpathy's autoresearch is Sutton's Discovery loop — vary, evaluate, keep the best — but the proposals come from the generator's own prior, and the judge is a single scalar. So is it discovering, or just hill-climbing inside what the generator could already imagine? Worth holding while the rest of the field stacks new evidence.
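The critique can be made concrete with a toy: a vary / evaluate / keep-the-best loop over integers, where the optimum sits at 10 but the generator only ever proposes values in a fixed range. Everything here (the `hill_climb` function, the score, the two generators) is a hypothetical illustration, not a model of any real autoresearch system.

```python
def hill_climb(start, propose, score, steps: int):
    """Vary, evaluate, keep the best: accept only strict improvements."""
    best = start
    for _ in range(steps):
        candidates = propose(best)
        if not candidates:
            break
        challenger = max(candidates, key=score)
        if score(challenger) > score(best):
            best = challenger
    return best


def score(x: int) -> int:
    """Single scalar judge; the true optimum is x = 10."""
    return -(x - 10) ** 2


def narrow_prior(x: int) -> list:
    """Generator that can only imagine values in [0, 5]."""
    return [c for c in (x - 1, x + 1) if 0 <= c <= 5]


def wide_prior(x: int) -> list:
    """Same generator with a wider range, [0, 20]."""
    return [c for c in (x - 1, x + 1) if 0 <= c <= 20]


print(hill_climb(0, narrow_prior, score, steps=50))  # 5
print(hill_climb(0, wide_prior, score, steps=50))    # 10
```

The loop and the judge are identical in both runs; only the generator's range differs, which is the point of the critique.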
#25
@jjcitron
https://x.com/jjcitron/status/2061218954438119461
The framing that's worth quoting whole: "Karpathy joined Anthropic this week to apply autoresearch at frontier scale. Same week the operator-tier discipline of running agents in production stopped being implicit and started being curriculum. The closed loop is the actual product. The model swap is the easy part."
#26
@YuLin807
https://x.com/YuLin807/status/2061188829567218003
Worth including the contrarian: "For the past six months everyone was talking about agent loop. For the past three months still agent loop. Now still agent automation. Eventually everyone will find out: agent loop is a false proposition. We're going back to human loop. Human-centric. I used to be obsessed with full automation too. Now in hindsight — pure waste of time. If you're still stuck on agent automation, I'd suggest you just give up." A real counter-voice in the dataset.
#27
@veyhon
https://x.com/veyhon/status/2060949394464370726
A 14-day tutorial that hand-builds a Claude Code-style Agent CLI from scratch, one harness boundary per day: CLI runtime, agent loop, tool calls, permissions, file editing, command execution, session memory, hooks, skills, subagents, worktree, MCP. Days 1-7 produce a single-agent CLI; days 8-14 upgrade to multi-agent coordination, worktree isolation, MCP client. Best educational artifact in the slice for what the harness layer actually contains.
#28
@giginet
https://x.com/giginet/status/2060972236794888395
Smaller use case worth saving for the discipline: Icon Composer's *.icon rendering wasn't available headlessly, which made it impossible to put in a loop. Found ictool inside the App Bundle, registered it as a skill, and the agentic loop is now runnable on icon work. The discipline: identify the missing CLI step before you try to loop it.
📡 Eco Products Radar
Karpathy autoresearch — the source idea; functionally synonymous with /goal in this dataset.
/goal (Codex / Claude Code) — the de-facto autoresearch primitive; verified-task plus loop plus subscription gas.
Codex CLI — paired with Claude Code in autoresearch and parallel workstreams.
Claude Code — overnight loop runtime of choice for the higher-impression authors here.
Devin — used by ryancarson to run the full auto-research workflow before model switch.
evo (alokbishoyi97) — open-source autoresearch orchestrator with parallel runs plus consequence gates.
Hermes Agent — open-weight runtime, solid agent loop, sandboxed on a VPS.
Modal — host of the autoresearch hackathon producing the EEG brainwave case.
Dynamo (NVIDIA) — substrate for the auto-research-driven scheduling improvements.
ZTARE / Cognitive-Firm — adversarial-reasoning engine and organizational kernel, both open-sourced.
Obsidian — recurring local memory store paired with these loops.
Chronos2 — time-series forecaster used in the trading autoresearch loop.
Qwen3.6-27B — the open model whose SWE-bench number is making local agent loops cheap.