May 24, 2026 · loop

Loop Daily: 2026-05-24

Two years ago "self-improving agent" was a phrase you rolled your eyes at. Yesterday it came with receipts. A memory system that researched its own retrieval policy and beat the best published baseline by double digits. A Tetris bot that rewrote itself ten times and got 56 percent better for a dollar thirty. And running under all of it, a sharpening argument about money: the cost of an agent loop doesn't scale with how much you use it, it scales with how deep the loop goes, and the metric that actually matters now is improvement-per-dollar, not improvement-per-run. Here's what people actually ran.

💡#1

@HuaxiuYaoML
https://x.com/HuaxiuYaoML/status/2057858935609319512
This is the strongest autoresearch result of the day and maybe the week. EvolveMem, shipping inside SimpleMem v0.3.0, points the loop at the agent's own memory: it treats the entire retrieval config as a structured action space and runs a closed loop of evaluate, diagnose, propose, validate, repeat. From a minimal baseline, seven autonomous rounds produced a retrieval policy that beat the strongest published baseline by 25.7 percent on LoCoMo and 18.9 percent on MemBench. The kicker is that it discovered entirely new retrieval dimensions that weren't in the original design. This is the clean version of the whole thesis: point an optimization loop at a measurable system and it finds things humans didn't think to try.
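
The evaluate, diagnose, propose, validate cycle described above can be sketched as a small hill-climbing loop. This is only an illustration under stated assumptions, not EvolveMem's implementation: the `RetrievalConfig` knobs, the `propose` and `evolve` helpers, and the `toy_score` function are all hypothetical stand-ins, and the diagnose step is collapsed into "try one single-knob edit per dimension."

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class RetrievalConfig:
    # Hypothetical knobs; the post does not list the real action space.
    top_k: int = 1
    use_rerank: bool = False
    recency_boost: int = 0


def propose(cfg: RetrievalConfig) -> List[RetrievalConfig]:
    """Propose one single-knob edit per dimension of the config."""
    return [
        replace(cfg, top_k=cfg.top_k + 1),
        replace(cfg, use_rerank=not cfg.use_rerank),
        replace(cfg, recency_boost=cfg.recency_boost + 1),
    ]


def evolve(
    cfg: RetrievalConfig,
    dev_score: Callable[[RetrievalConfig], float],
    val_score: Callable[[RetrievalConfig], float],
    rounds: int = 7,
) -> RetrievalConfig:
    """Keep a proposed edit only if it wins on dev AND on held-out validation."""
    best = cfg
    for _ in range(rounds):
        baseline = dev_score(best)                    # evaluate
        challenger = max(propose(best), key=dev_score)  # diagnose + propose
        if dev_score(challenger) <= baseline:
            break                                     # nothing left to gain
        if val_score(challenger) <= val_score(best):  # validate
            break                                     # gain did not generalize
        best = challenger                             # repeat from the new config
    return best


def toy_score(cfg: RetrievalConfig) -> float:
    # Made-up objective standing in for a real benchmark run.
    return (
        -abs(cfg.top_k - 4)
        + (2.0 if cfg.use_rerank else 0.0)
        - abs(cfg.recency_boost - 2)
    )


start = RetrievalConfig()
tuned = evolve(start, toy_score, toy_score)
print(start, toy_score(start))
print(tuned, toy_score(tuned))
```

In a real run the dev and validation scorers would be separate benchmark splits; the toy reuses one function for both only to keep the sketch short.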

💡#2

@danyurkin
https://x.com/danyurkin/status/2057708211256308128
The experiment everyone was quoting: three models (Qwen 3.7-Max, Claude Opus 4.7, GPT-5.5) each given a self-improving Tetris bot to evolve over ten iterations of read-your-own-code, benchmark, rewrite. Qwen won and was 9x cheaper than Opus, landing +56 percent improvement at $1.32 against Claude's +28 percent at $12.15. The reason this lit up the timeline isn't the Qwen win, it's the format: a long agentic loop of code-bench-rewrite is the closest thing to a real engineering workflow a model test has simulated, and it surfaces cost-per-outcome as the number that decides production use. One caveat several people raised: it's a single task across ten rounds, so treat it as a stress test, not a verdict.
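
The cost-per-outcome comparison is simple arithmetic on the two results quoted above. The sketch below uses only those reported figures; the `points_per_dollar` helper is a name introduced here for illustration, and GPT-5.5 is left out because the post gives no numbers for it.

```python
# Reported results: (improvement in percentage points, total cost in USD).
runs = {
    "Qwen 3.7-Max": (56.0, 1.32),
    "Claude Opus 4.7": (28.0, 12.15),
}


def points_per_dollar(gain_points: float, cost_usd: float) -> float:
    """Improvement bought per dollar spent on the whole ten-round loop."""
    return gain_points / cost_usd


efficiency = {name: points_per_dollar(*result) for name, result in runs.items()}

# How many times more Opus cost than Qwen for the full run.
cost_ratio = runs["Claude Opus 4.7"][1] / runs["Qwen 3.7-Max"][1]

# How many times more improvement per dollar Qwen delivered.
efficiency_ratio = efficiency["Qwen 3.7-Max"] / efficiency["Claude Opus 4.7"]

for name, value in efficiency.items():
    print(f"{name}: {value:.1f} points per dollar")
print(f"cost ratio: {cost_ratio:.1f}x, efficiency ratio: {efficiency_ratio:.1f}x")
```

The cost gap alone is about 9x; because Qwen also improved twice as much, the gap in improvement-per-dollar is roughly double that.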

💡#3

@hansel_hansl
https://x.com/hansel_hansl/status/2057841112853942592
The single most useful production insight in the Loop firehose. Running agents at scale, he found that token spend doesn't grow linearly with usage; it grows with agent loop depth: one ambiguous task that triggers three replan cycles costs more than a hundred clean ones. Budgets don't blow up on the demos, they blow up on the long tail of "the agent retried itself." This reframes the entire cost conversation: the lever isn't fewer tasks, it's killing the replan spirals. It's also the operational counterpart to all the macro hand-wringing about agent bills.
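
One way to act on this is to cap loop depth per task rather than task count. The guard below is a minimal sketch, not anything from the original post: `LoopBudget`, `BudgetExceeded`, `run_task`, and the default limits are all hypothetical, and `step` stands in for whatever executes one agent cycle and reports whether it wants to replan.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when a single task replans too often or spends too many tokens."""


@dataclass
class LoopBudget:
    max_replans: int = 2       # assumed cap, tune per workload
    max_tokens: int = 50_000   # assumed cap, tune per workload
    replans: int = 0
    tokens: int = 0

    def charge(self, tokens: int) -> None:
        self.tokens += tokens
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap hit: {self.tokens} > {self.max_tokens}")

    def note_replan(self) -> None:
        self.replans += 1
        if self.replans > self.max_replans:
            raise BudgetExceeded(f"replan cap hit: {self.replans} > {self.max_replans}")


def run_task(step: Callable[[], Tuple[int, bool]], budget: LoopBudget) -> int:
    """Run one task; `step` returns (tokens_spent, needs_replan)."""
    while True:
        tokens, needs_replan = step()
        budget.charge(tokens)
        if not needs_replan:
            return budget.tokens
        budget.note_replan()


# A clean task finishes in one cycle.
print(run_task(lambda: (1_000, False), LoopBudget()))

# A task that keeps replanning is stopped instead of spiralling.
spiral = LoopBudget()
try:
    run_task(lambda: (5_000, True), spiral)
except BudgetExceeded as exc:
    print(f"stopped: {exc} after {spiral.tokens} tokens")
```

Stopping a spiral early turns an unbounded tail cost into a bounded one; what to do with the stopped task (escalate to a human, ask for clarification) is a separate design choice.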

💡#4

@alokbishoyi97
https://x.com/alokbishoyi97/status/2057674180263555546
The clearest articulation of where autoresearch-as-a-product is heading, from the builder of EVO. He says customer conversations convinced him the real problem runs deeper than one-time optimization: people don't want a single autoresearch run, they want their systems to stay continuously tuned. So EVO is expanding to optimize anything an org runs (systems, code, agents, even models), with the long-term goal of being the platform teams use to run agents 24/7 and constantly tune everything they build. This is the shift from autoresearch as a one-shot tool to autoresearch as standing infrastructure.

💡#5

@jmschreiber91
https://x.com/jmschreiber91/status/2057847192904171751
A refreshingly honest autoresearch report. He went in skeptical that it would do more than overfit to a validation set, and indeed it did overfit his new architecture to the specific dataset, but it also discovered many generalizable things he would not have thought to try. That mix is the real picture of autoresearch right now: it'll happily exploit your eval if you let it, but it also reaches into parts of the search space you'd never manually explore. The value isn't blind trust, it's getting ideas you wouldn't have had.

💡#6

@alejadroHArt
https://x.com/alejadroHArt/status/2057839480065733119
A fascinating failure mode worth knowing before you trust the loop. He ran 40 generated ideas through a real autoresearch pipeline and watched Claude's reports drift from 2/10 to 8/10 focused on interpretability, with Gemini drifting the same way, while genuinely "alien" ideas stayed at 0/10 before and after. His read: iterative refinement pulls ideas toward a familiar attractor in concept space, so the loop quietly homogenizes toward what the model already knows. It's a sharp warning that an autoresearch loop can converge on the comfortable rather than the novel.

💡#7

@abhxy03
https://x.com/abhxy03/status/2057692112838349131
A detailed, reproducible self-improving workflow: pairing Hermes Agent with NotebookLM to build a "second brain" that researches, synthesizes, and teaches itself. The core mechanism is Hermes's learning loop: you demonstrate a workflow once or twice, it analyzes what worked, then writes a new persistent skill so the whole chain becomes a one-prompt command from then on. His worked example is a daily knowledge-ingestion pipeline that scans a YouTube feed, picks the best sources, and loads them into NotebookLM automatically. This is self-improvement at the practical end: the agent isn't rewriting its weights, it's writing its own reusable skills.

💡#8

@ds_bun_
https://x.com/ds_bun_/status/2057965731594314084
The non-coding application of the day: using autoresearch to optimize marketing campaigns under budget constraints, framed as "let the AI do the experimenting." It's a short post pointing at a writeup, but it matters because it drags autoresearch out of the kernel-and-architecture world and into a domain where the loop's evaluate-and-iterate shape maps perfectly onto A/B testing with a spend ceiling. Marketing optimization is exactly the kind of measurable, file-editable problem that autoresearch was built to eat.

💡#9

@levidiamode
https://x.com/levidiamode/status/2057847703875338329
Day 139 of 365 of GPU programming, and a nice window into someone adopting autoresearch mid-project. He'd deliberately taken a manual approach to his Qwen inference-optimization work to learn from the ground up rather than automate away his own understanding, and is now feeling out how autoresearch applies to inference by getting a feel for the repo and its program.md and scratchpad.md structure. The honesty about wanting to understand before automating is the healthy version of loop adoption, and the program.md / scratchpad.md pattern is becoming the standard scaffolding for these runs.

💡#10

@sang_wen
https://x.com/sang_wen/status/2057872262079115715
The origin-story version of an agentic loop paying off. Genspark's CTO reportedly tested the agentic loop for two years and watched every model fail, until one night one didn't, at which point they rebuilt everything around a single agent with 150-plus tools and, they claim, reached $250M ARR in 12 months. Take the numbers with the usual skepticism, but the shape is the interesting part: a single, deep, well-tooled loop crossing a reliability threshold all at once rather than improving gradually. That step-change pattern is what a lot of people are now waiting for in their own stacks.

💡#11

@rhelmerdotorg
https://x.com/rhelmerdotorg/status/2057642655555969433
A clean piece of loop infrastructure work: he ported Hermes to run on AWS Lambda using DynamoDB for chat history, S3 for skills, EventBridge for cron jobs, and Telegram webhooks, keeping the same agent loop with no always-on server needed. Notably he still prefers his Hetzner VPS for the primary instance because it's more reliable, which is a useful honest note on the serverless-versus-VPS tradeoff for long-running agents. For anyone trying to run an agent loop cheaply without a box that's always on, this is a concrete recipe.

💡#12

@DanKornas
https://x.com/DanKornas/status/2057694031199510539
Async Code Agent is a self-hostable system for running coding agents in parallel instead of babysitting one loop at a time, with a Codex-style web UI. You submit multiple tasks, run Claude Code and other agents side by side for comparison, review the outputs, and turn the successful runs into Git commits or PRs, each in its own sandboxed Docker container. It's open source under Apache 2.0. The core idea, that the bottleneck in agent work is now serial execution and you fix it by fanning out and comparing, is the same instinct showing up across the parallel-agent tools this week.

💡#13

@bryonkuchML
https://x.com/bryonkuchML/status/2057891813331959828
A small but practical contribution to the autoresearch tooling layer: he found it hard to use the prompt-optimization and autoresearch techniques he liked with his own agent stack (LangChain), so he built a GEPA Adapter package that lets LangChain agents and models work with GEPA directly. This is the unglamorous plumbing that actually grows a method's adoption: GEPA-style optimization is spreading, and the gaps are now in the adapters between it and the frameworks people already use. Expect more of these connective packages as autoresearch goes mainstream.

💡#14

@johniosifov
https://x.com/johniosifov/status/2057815509165351273
The macro framing for everything above. His argument: cheaper tokens don't reduce AI bills, they increase usage. Moving from one call per action to an agentic workflow of 10-20 calls per task flips the economics: your bill goes up, just with more output. He says 85 percent of enterprise AI budget now goes to inference, and argues the companies that win will build LLM-efficient architectures that accomplish more per token rather than spend more tokens per task. It's the same lesson @hansel_hansl found in production, stated as a strategy thesis: the second generation of AI products is the one where loop unit-economics actually has to work.
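
The "cheaper tokens, bigger bill" claim is easy to check with toy arithmetic. In the sketch below every price and token count is a made-up assumption chosen for illustration; only the 10-20 calls per task range comes from the post, and `task_cost` is a helper introduced here.

```python
def task_cost(calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Cost of one task in USD, assuming every call uses the same token count."""
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# Before: one call per action at a hypothetical $10 per million tokens.
single_call = task_cost(calls=1, tokens_per_call=2_000, usd_per_million_tokens=10.0)

# After: tokens get 5x cheaper, but the task now takes 15 calls
# (the midpoint of the 10-20 range quoted above).
agentic = task_cost(calls=15, tokens_per_call=2_000, usd_per_million_tokens=2.0)

print(f"single call: ${single_call:.2f} per task")
print(f"agentic:     ${agentic:.2f} per task")
print(f"bill multiplier: {agentic / single_call:.1f}x")
```

Under these assumed numbers a 5x price cut is outrun by a 15x increase in calls, so the per-task bill triples. The bill only falls when the price drop exceeds the growth in calls per task.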

📡 Eco Products Radar


Qwen 3.7-Max — the surprise winner of the self-improving Tetris benchmark, dominating on both improvement and cost-per-outcome in a long code-bench-rewrite loop
Hermes (Nous Research) — recurring as the runtime for self-improving workflows, from the NotebookLM second-brain to AWS Lambda ports, with its skill-learning loop the most-cited feature
EVO (@alokbishoyi97) — the autoresearch orchestrator positioning itself as the always-on platform to continuously tune systems, code, agents, and models
GEPA — the prompt/system optimization method spreading through the stack, now with community adapters for LangChain and other frameworks
SimpleMem / EvolveMem — the memory package whose self-researching retrieval loop produced the day's headline autoresearch gains
NotebookLM — paired with agents as the synthesis layer in self-improving "second brain" setups
