Loop Daily: May 25, 2026
The autoresearch loop went from Karpathy's weekend repo to a thing people are pointing at their own problems, and on May 23 the spread was wider than the code: a finetuned embedding model that reportedly beats everything in retrieval, a Polymarket bot that rewrites its own strategy every night, firmware climbing the last few percent through compiler-config search, and a whole school of thought about running your company as a set of self-improving loops. The other half of the conversation was the bill: every loop iteration burns 10-100x the tokens of a single prompt, and the people actually running these loops overnight are the ones obsessing over caching, context depollution, and where the spend goes. Here's who's building the loop and what they learned.
#1
@alokbishoyi97
https://x.com/alokbishoyi97/status/2058054065205182577
He open-sourced evo, an autoresearch and optimization platform built on Karpathy's idea, and it's already installed on nearly 6,000 systems with 700+ GitHub stars. The pitch is to make autoresearch practical for normal people: parallel agents doing tree search, gates to prevent unwanted behavior, runnable on any infra (AWS, Azure, Modal, e2b). Users are reporting scientific SOTA results and unexpected optimizations on systems they'd already hand-tuned for a long time. In a follow-up he's experimenting with adding a reviewer every epoch when the orchestrator fans out subagents, an advisor-mode pattern he's seeing in setups like tobi's and the lossfunk team's autoresearch loops.
#2
@cryptof4ck
https://x.com/cryptof4ck/status/2058095833631863175
A trader turned a Polymarket account into a money machine with a self-improving loop, and the loop is the whole edge. The stack is small: Claude Opus 4.7 as the brain reading signals, an open-source Hermes agent body, a cheap VPS, and Telegram alerts, all focused on fast BTC 5-minute up/down markets using Markov-chain persistence analysis. The dangerous part runs nightly: the agent reviews the full trade journal, analyzes every win and loss, and automatically updates its probability thresholds, Kelly sizing, and edge requirements. The idea is that it gets sharper every cycle, which is exactly the autoresearch loop applied to live capital instead of training code.
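To make the two named ingredients concrete, here is a minimal sketch of a Markov-chain persistence estimate and Kelly sizing for a binary share that pays 1 on a win. It is not the trader's code: the function names, the `'U'`/`'D'` encoding, the add-one smoothing, and the `min_edge` threshold are all assumptions made for illustration.

```python
from collections import Counter


def transition_probs(moves: str) -> dict[str, float]:
    """Estimate P(next move is 'U' | previous move) from a string of 'U'/'D'.

    Add-one smoothing keeps unseen transitions away from exactly 0 or 1.
    """
    counts = Counter(zip(moves, moves[1:]))
    probs = {}
    for prev in "UD":
        up = counts[(prev, "U")] + 1
        down = counts[(prev, "D")] + 1
        probs[prev] = up / (up + down)
    return probs


def kelly_fraction(q: float, price: float, min_edge: float = 0.0) -> float:
    """Kelly fraction of bankroll for a share bought at `price` that pays 1.

    q is the estimated win probability. With net odds b = (1 - price) / price,
    the Kelly formula (b*q - (1 - q)) / b simplifies to (q - price) / (1 - price).
    Returns 0.0 when the edge (q - price) is non-positive or below min_edge.
    """
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    edge = q - price
    if edge <= 0 or edge < min_edge:
        return 0.0
    return edge / (1.0 - price)


if __name__ == "__main__":
    probs = transition_probs("UUDUUUDD")
    # Size a bet on "up" given that the last move was up, at a market price of 0.50.
    print(probs, kelly_fraction(probs["U"], price=0.50, min_edge=0.02))
```

In a nightly review of the kind described, `min_edge` and the probability threshold would be the knobs the agent rewrites. Full Kelly is aggressive when `q` is itself a noisy estimate, so a fractional multiplier is a common safeguard.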
#3
@SergheiLefter
https://x.com/SergheiLefter/status/2058226794965221699
Working alone and iterating with auto-research, he finetuned an embedding model specifically for long transcripts, tuned for semantic turns and positive/negative semantic keywords. He claims it beats anything available on nDCG and retrieval by a clear margin, with zero downloads because it's all internal use. This is the quiet version of the autoresearch promise: not a paper or a launch, just an individual out-optimizing the public models on a narrow problem that actually matters to him.
#4
@levidiamode
https://x.com/levidiamode/status/2058252463229071587
On day 140 of his GPU-programming year, he ran his first experiments with an autoresearch-style loop, and the immediate lesson was that monitoring becomes a critical side task. To watch the task queue, GPU utilization, bottlenecks, errors, hypotheses, and validations, he hacked together a lightweight live dashboard (like TensorBoard, but for GPU state), plus a separate HTML dashboard updated every 10 minutes with current hypotheses, past experiments, and a running FAQ of questions he's asked. It's not the most efficient setup, but it gives him real insight into what Claude and Codex are actually trying to optimize, which is the part most loop demos skip.
#5
@learnwithella
https://x.com/learnwithella/status/2058246554520289352
She's running self-improving Claude Code skills and the loop is clean: one run fires the skill 10 times with varied inputs, a separate evaluator scores every output against 3-5 binary criteria, identifies the most common failure patterns, rewrites the skill prompt, retests, and keeps the winner until the score plateaus. A hook-writer skill went from 32/50 to 47/50 overnight with no manual prompt tweaking. Her framing is the useful one: this is the same loop AI labs use to improve their own models, pointed at creative DTC workflows where a skill is great 70% of the time and unusable the other 30%, and it kills the entire "it worked once but I can't reproduce it" problem.
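The loop described above can be sketched in a few lines. This is an illustration under stated assumptions rather than her implementation: `run_skill` and `rewrite` stand in for model calls, the criteria are plain Python predicates, and `patience` is a made-up plateau rule (stop after that many rewrites in a row fail to raise the score).

```python
from collections import Counter
from typing import Callable

Criteria = dict[str, Callable[[str], bool]]


def evaluate(outputs: list[str], criteria: Criteria) -> tuple[int, Counter]:
    """Score outputs against binary criteria.

    Returns the total number of passes and a per-criterion failure tally.
    """
    passes = 0
    failures: Counter = Counter()
    for out in outputs:
        for name, check in criteria.items():
            if check(out):
                passes += 1
            else:
                failures[name] += 1
    return passes, failures


def improve_skill(
    prompt: str,
    inputs: list[str],
    run_skill: Callable[[str, str], str],  # hypothetical: (prompt, input) -> output
    rewrite: Callable[[str, str], str],    # hypothetical: (prompt, failing criterion) -> new prompt
    criteria: Criteria,
    patience: int = 2,
) -> tuple[str, int]:
    """Rewrite the prompt against its most common failure; keep only improvements."""
    best_prompt = prompt
    best_score, failures = evaluate(
        [run_skill(best_prompt, x) for x in inputs], criteria
    )
    stale = 0
    while stale < patience and failures:
        worst = failures.most_common(1)[0][0]
        candidate = rewrite(best_prompt, worst)
        score, cand_failures = evaluate(
            [run_skill(candidate, x) for x in inputs], criteria
        )
        if score > best_score:
            best_prompt, best_score, failures = candidate, score, cand_failures
            stale = 0
        else:
            stale += 1  # discard the candidate and try again
    return best_prompt, best_score
```

With 10 inputs and 5 criteria the maximum score is 50, which matches the 32/50 and 47/50 figures quoted. Binary criteria make the score comparable from one rewrite to the next.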
#6
@samrexford
https://x.com/samrexford/status/2058293501771846114
He adapted Karpathy's autoresearch into a skill called /autodev and published it on GitHub: the agent builds, evaluates, and iterates without stopping until the feature is complete, verifying correctness at every step. His honest report is the texture you want: walking away and coming back to 10 commits is nerve-wracking, but it has worked well in his stack. He front-loaded it with a startup command block that makes the AI adapt to your stack and risk tolerance and then self-destructs once done, while admitting he's not sure how that will hold up. Real loop, real uncertainty, shipped anyway.
#7
@LeeLeepenkman
https://x.com/LeeLeepenkman/status/2057979256999927954
He built a Codex auto-research fork and is pointing it at a spread of hard problems at once: beating the stock market with a stock-prediction repo, "parameter golf" on a tiny LLM, and optimizing diffusion. The interesting move is treating auto-research as a general-purpose engine rather than a single experiment: one forked harness aimed at finance, model compression, and generative models in parallel. It's an early, scrappy look at what a personal autoresearch lab looks like.
#8
@seevali
https://x.com/seevali/status/2058129411015397871
He ran an overnight agent loop that produced real commits and burned his weekly quota faster than coding it himself would have, which is the honest cost story of looping. The leak he found is precise: `claude --max-turns 1 "say hi"` consumes 68K tokens before your prompt even lands. The fix was prompt caching, which dropped cost to roughly 4%. This is the unglamorous economics that decides whether an overnight loop is genius or just an expensive way to lose money, and the post ties it to the RalphLoop pattern of letting an agent grind autonomously.
#9
@navalpodcast
https://x.com/navalpodcast/status/2058307106584072653
This executive brief on Tom Blomfield's "Burn Tokens, Not Headcount" talk is the clearest articulation of the loop as an operating model for a whole company, not just a coder. The thesis: the AI-native company is a set of recursive, self-improving loops. Sense the world, decide, use tools, pass quality gates, learn from the result, loop again. The "holy-shit moment" isn't an agent answering a question. It's a monitoring agent watching every failed query and shipping the next version of the system: finding the bug, updating the skill file, opening the PR, reviewing it, merging, and deploying, all while you sleep. Burn tokens instead of headcount, and keep humans at the edge where judgment matters.
#10
@mrru5s3ll
https://x.com/mrru5s3ll/status/2058081192671691237
Honey-Comb is one of the most carefully thought-out attacks on the real bottleneck of long agent loops: context bloat. It does CPU-only inline context depollution before anything hits the model, classifying every message entering the loop as CORE, DISTILL, COMPACT, or DROP in under 1.5ms, then using deterministic regex extractors to strip it down, with no LLM summarizer and no ad-hoc compression threshold. In a 10-turn coding-agent session it collapsed a 514-token file read to 5 tokens and a 60-line test failure to 93 tokens, taking the whole session from 4,062 tokens to 640, a 6.3x reduction with 84% of the noise gone. It's honest about the limits: it works on structured tool outputs, not free-form chat, and a mislabel can lose data. But it's running in production with real benchmarked throughput.
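A toy version of the classify-then-extract idea is sketched below. It is not Honey-Comb's code or its rule set. The regex, the 20-line threshold, the head/tail window, and the reading of each label (CORE kept verbatim, DISTILL reduced to matching lines, COMPACT truncated, DROP removed) are all assumptions made to show the shape of a deterministic, LLM-free filter.

```python
import re

# Hypothetical rule: lines that begin with a test-runner failure marker.
FAILURE_LINE = re.compile(r"^(FAILED|ERROR)\b.*$", re.MULTILINE)
LONG_OUTPUT_LINES = 20  # assumed threshold for "too long to keep whole"
KEEP_EDGE_LINES = 3     # assumed number of head/tail lines kept when compacting


def classify(role: str, text: str) -> str:
    """Label a message CORE, DISTILL, COMPACT, or DROP using fixed rules."""
    if role == "user":
        return "CORE"     # never touch what the human wrote
    if not text.strip():
        return "DROP"     # empty tool output carries nothing
    if FAILURE_LINE.search(text):
        return "DISTILL"  # keep only the failure lines
    if len(text.splitlines()) > LONG_OUTPUT_LINES:
        return "COMPACT"  # keep head and tail, elide the middle
    return "CORE"


def depollute(role: str, text: str) -> str:
    """Return the reduced text that would be passed on to the model."""
    label = classify(role, text)
    if label == "CORE":
        return text
    if label == "DROP":
        return ""
    if label == "DISTILL":
        return "\n".join(m.group(0) for m in FAILURE_LINE.finditer(text))
    lines = text.splitlines()
    omitted = len(lines) - 2 * KEEP_EDGE_LINES
    return "\n".join(
        lines[:KEEP_EDGE_LINES]
        + [f"[{omitted} lines omitted]"]
        + lines[-KEEP_EDGE_LINES:]
    )
```

The mislabel risk mentioned above is visible here: a long file read that happens to contain a line starting with `ERROR` would be distilled down to that one line, and the rest would be lost.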
#11
@akshay_krips
https://x.com/akshay_krips/status/2058286616339460251
A concrete, narrow win: Codex writes extremely good firmware and gets to about 95% of maximal performance within the first few iterations, and the remaining gap can be closed with enough autoresearch loops to find the optimal build and compiler configurations. It's a small tweet but a clean example of the loop doing the tedious final-percent search a human would never have the patience for, in a domain where the search space of configs is exactly what autoresearch is good at.
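The config search can be pictured as an exhaustive sweep over a small grid of build flags, as in the sketch below. This is an assumption about the general shape, not the setup from the tweet: the `benchmark` callable is hypothetical and would, in practice, build the firmware with the given flags and return a measured score.

```python
import itertools
from typing import Callable


def search_configs(
    options: dict[str, list[str]],
    benchmark: Callable[[tuple[str, ...]], float],  # hypothetical: flags -> score, higher is better
) -> tuple[tuple[str, ...], float]:
    """Try every combination of flag choices and return the best-scoring one.

    An empty string in an option list means "leave this flag off".
    """
    best_flags: tuple[str, ...] = ()
    best_score = float("-inf")
    for combo in itertools.product(*options.values()):
        flags = tuple(flag for flag in combo if flag)
        score = benchmark(flags)
        if score > best_score:
            best_flags, best_score = flags, score
    return best_flags, best_score
```

A grid like `{"opt": ["-O2", "-O3", "-Os"], "lto": ["", "-flto"]}` has only six combinations. Real build spaces grow multiplicatively with every added option, which is where an agent proposing and pruning candidates starts to beat a brute-force sweep.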
#12
@HanifCarroll
https://x.com/HanifCarroll/status/2058174111436706117
His notes from building with agents read like a maturing playbook for running loops in production. He no longer runs git or shell commands himself, because the agent handles all of it. For bigger refactors he runs agents autonomously in parallel and decides what "done" looks like before he starts, so there's an actual stop condition. The rest is loop hygiene: prefer the correct long-term shape over half-measures, keep files under ~1,500 lines and run refactor rounds when they creep past, and when the LLM keeps producing bad output, hand the result to a second LLM that cleans it up rather than piling on rules. His job has shifted to deciding what good looks like and keeping the system from drifting.
#13
@sitin_dev
https://x.com/sitin_dev/status/2058070673155649817
A clean explanation of why Karpathy's autoresearch matters: you give an AI agent a real, small-scale LLM training task and let it run the full research loop. It edits the training code, runs ~5-minute experiments on a single GPU, checks validation metrics, keeps the change if the metric improves, and reverts it if it doesn't. So instead of using AI only to write code, the agent is actually doing iterative research: propose, run, evaluate, decide, repeat. His framing is the right one: this is an early glimpse of agents becoming junior researchers inside controlled experimental environments.
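The propose, run, evaluate, decide cycle reduces to a keep-or-revert loop. The sketch below is a toy stand-in, not code from Karpathy's repo: a one-parameter config replaces the training script, `toy_experiment` replaces a 5-minute GPU run, and a random perturbation replaces the agent's code edit. All names are hypothetical.

```python
import random
from typing import Callable

Config = dict[str, float]


def keep_or_revert(
    config: Config,
    propose: Callable[[Config, random.Random], Config],
    run_experiment: Callable[[Config], float],  # returns validation loss, lower is better
    steps: int,
    seed: int = 0,
) -> tuple[Config, float, list[float]]:
    """Keep a proposed change only if it lowers the validation loss."""
    rng = random.Random(seed)
    best = dict(config)
    best_loss = run_experiment(best)
    history = [best_loss]
    for _ in range(steps):
        candidate = propose(best, rng)
        loss = run_experiment(candidate)
        if loss < best_loss:
            best, best_loss = candidate, loss  # keep the change
        # otherwise the candidate is discarded, which is the "revert"
        history.append(best_loss)
    return best, best_loss, history


def toy_propose(config: Config, rng: random.Random) -> Config:
    """Stand-in for the agent's edit: scale the learning rate by a random factor."""
    new = dict(config)
    new["lr"] = max(1e-4, new["lr"] * rng.uniform(0.5, 1.5))
    return new


def toy_experiment(config: Config) -> float:
    """Stand-in for a short training run: loss is lowest at lr = 0.3."""
    return (config["lr"] - 0.3) ** 2


if __name__ == "__main__":
    best, loss, history = keep_or_revert({"lr": 1.0}, toy_propose, toy_experiment, steps=50)
    print(best, loss)
```

Because a change is kept only on strict improvement, the best-so-far loss never gets worse. The price is that the loop can stall in a local optimum, which is one reason tree-search variants exist.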
Eco Products Radar
evo: open-source autoresearch/optimization platform built on Karpathy's idea, ~6,000 installs and 700+ stars, parallel tree-search agents with behavior gates, runs on AWS/Azure/Modal/e2b. The breakout autoresearch tool of the week.
Karpathy's autoresearch: the reference repo and pattern (edit code, run 5-min GPU experiments, keep-or-revert on a metric) that nearly every loop project this week forks, adapts into a skill, or cites as the starting point.
Managed Agents (Google + Anthropic): the agent loop moving server-side with token-rate billing and sandboxed harnesses, repeatedly flagged as turning the harness from a framework choice into a model feature.
Claude Code / Codex: the two harnesses people actually run their loops on, Claude Code for self-improving skills and overnight loops, Codex forked for auto-research and firmware config search.
Hermes Agent: the local-first open-source body for self-improving loops, showing up as the execution layer under the Polymarket trading loop and persistent self-improving workspaces.