May 28, 2026 | loop

Loop Daily: May 29, 2026

Today the loop stopped being a Karpathy tweet and started being a thing people run. OpenAI threw an autoresearch hackathon the same week Karpathy signed with Anthropic, evo opened its autoresearch platform to anyone, and underneath the noise the case studies got heavy. A database team ran nearly two dozen optimization experiments overnight and woke up to a measured precision jump. NVIDIA Research walked through agent loops that write prize-winning CUDA kernels and optimize hardware that does not physically exist. And people started pointing the same iterate-score-loop machinery at their own daily logs, their browser skills, even at finding catastrophic bugs. The through-line is a quiet redefinition of continual learning: it is no longer weight updates, it is the harness and the context becoming trainable state that improves itself while the work is still running. The interesting question has moved from "how smart is the model" to "how tight is your loop".
💡#1
@datadogdevs
https://x.com/datadogdevs/status/2059740228957458924
The single cleanest proof of the autoresearch thesis today. Datadog's database monitoring team pointed autoresearch plus LLM Observability at a SQL optimization agent and ran 23 experiments overnight, taking precision from 0.54 to 0.86. The how is the part worth copying: they tested prompts, tools and workflows continuously, compared traces across models to see exactly where the reasoning failed, used evals to measure every single change, and finally split the agent into two passes to cut false positives. This is not a thought experiment about self-improving AI; it is a production team leaving a measurable optimization loop running while they slept and waking up to a roughly 59 percent relative jump in precision.
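To make the arithmetic concrete, here is a minimal sketch of the metric involved. Only the 0.54 and 0.86 precision figures come from the post. The true-positive and false-positive counts are hypothetical, chosen to show how a second pass that removes false positives could produce a jump of that size.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged items that were actually correct."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


def relative_gain(before: float, after: float) -> float:
    """Improvement of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before


# Precision figures reported in the post.
gain = relative_gain(0.54, 0.86)
print(f"relative precision gain: {gain:.1%}")

# Hypothetical counts (not from the post): the second pass keeps the same
# true positives but discards most of the false positives.
one_pass = precision(true_positives=54, false_positives=46)
two_pass = precision(true_positives=54, false_positives=9)
print(f"one pass: {one_pass:.2f}, two passes: {two_pass:.2f}")
```

Precision only measures how many flagged items were right, so a filter pass like this says nothing about recall; a real eval would track both.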
💡#2
@lmsysorg
https://x.com/lmsysorg/status/2059758375257489742
NVIDIA Research scientist Ligeng Zhu is walking through Humanize, an agentic flow framework that lets agents run on their own to tackle complex engineering and research problems the way a human would, and he brings three substantial long-running agent-loop case studies. KDA wrote fast CUDA kernels that ranked in the top three of the MLSys FlashInfer Kernel Contest; a virtual-hardware project optimized computation on hardware that does not physically exist yet; and JetAutoResearch used ahead-of-time compilation to cut more than 50 percent of the AutoResearch workflow cost. The framing of the talk, agent loops as the thing that turns tokens into productivity, is exactly the question this whole space is circling. These are loops measured in hours, not seconds.
💡#3
@alokbishoyi97
https://x.com/alokbishoyi97/status/2059612002595840190
Alok Bishoyi opened beta access to evo's autoresearch platform, and unlike most launch posts he published a real example with all the traces and logs attached, showing how you use autoresearch to improve your agents' skills over time. evo is the open-source engine behind it, built on Karpathy's autoresearch idea, using parallel agents, tree search and dashboards to automate, analyze and improve codebases, and it is already adopted across thousands of projects. The reason this matters is that autoresearch has mostly been a Karpathy tweet and a vague aspiration, and evo is one of the first attempts to make it a thing a normal person can actually run, on AWS, Azure, Modal or e2b.
💡#4
@alokbishoyi97
https://x.com/alokbishoyi97/status/2059643660581621831
The most fascinating use of evo Alok has seen: someone runs an autoresearch loop as a cron job on their own daily agent logs. The target is not code but their private workflows: Notion and Obsidian habit entries, email patterns, and personal benchmark tasks. The loop evolves hyper-personalized Skills that get better at that person's exact style of task management, research and writing. You walk away and come back to a noticeably sharper version of your own agent. This is the part of autoresearch that escapes the coding box: the same evolutionary loop the labs use on models, pointed at your own daily life, compounding while you are not looking.
💡#5
@kylejeong
https://x.com/kylejeong/status/2059753008297394245
A blunt reminder that the iterative autoresearch loop is not just for model training; you can point it at your own skills. His team ran iterative AutoResearch on their browser skills and produced /autobrowse, making those skills up to 90 percent faster and cheaper to run. That is the whole pitch in one sentence: take a working skill, let an automated loop hammer on it against your own metrics, and walk out with something up to an order of magnitude cheaper that does the same job. The 90 percent number is the kind of result that makes the autoresearch hype feel earned rather than aspirational.
💡#6
@beuchelt
https://x.com/beuchelt/status/2059455802939736189
A Microsoft paper called SkillOpt treats SKILL.md documents as trainable external state and optimizes them with the discipline of deep-learning optimizers, but entirely in text space, with no model fine-tuning. A separate optimizer model analyzes scored rollouts from the frozen target agent and proposes only bounded add, delete or replace edits to one skill doc. Edits are accepted only when they beat a held-out validation score, with a textual learning rate and a rejected-edit buffer for stability. Across six benchmarks, seven models and three harnesses including Codex and Claude Code, it took best or tied in all 52 combinations, lifting GPT-5.5 by 24.8 points inside the Codex loop. Crucially, the optimized skill transfers across models and environments, which is the real prize: you optimize once and the artifact keeps paying off.
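The accept-only-if-better mechanic described above can be sketched in a few lines. This is an illustration of the general idea, not the paper's algorithm. The `propose` and `validate` callables are hypothetical stand-ins for the optimizer model and the held-out evaluation, and `max_edit_chars` is an assumed, simplified reading of the "textual learning rate" as a cap on edit size.

```python
from typing import Callable, Iterable, List, Set, Tuple

# (op, line_index, text) with op in {"add", "delete", "replace"}
Edit = Tuple[str, int, str]


def apply_edit(doc: List[str], edit: Edit) -> List[str]:
    """Return a copy of the skill doc with one bounded edit applied."""
    op, i, text = edit
    new = list(doc)
    if op == "add":
        new.insert(i, text)
    elif op == "delete":
        del new[i]
    elif op == "replace":
        new[i] = text
    else:
        raise ValueError(f"unknown op: {op}")
    return new


def optimize_skill(
    doc: List[str],
    propose: Callable[[List[str]], Iterable[Edit]],  # hypothetical optimizer
    validate: Callable[[List[str]], float],          # hypothetical held-out score
    steps: int,
    max_edit_chars: int = 80,                        # assumed edit-size cap
) -> Tuple[List[str], float]:
    best, best_score = list(doc), validate(doc)
    rejected: Set[Edit] = set()  # buffer of edits that already failed
    for _ in range(steps):
        for edit in propose(best):
            if edit in rejected or len(edit[2]) > max_edit_chars:
                continue
            candidate = apply_edit(best, edit)
            score = validate(candidate)
            if score > best_score:       # accept only strict improvements
                best, best_score = candidate, score
                break                    # propose again against the new doc
            rejected.add(edit)
    return best, best_score


# Toy stand-ins so the sketch runs end to end.
REQUIRED = {"check the schema", "cite the source"}


def toy_validate(doc: List[str]) -> float:
    return float(len(REQUIRED & set(doc)))


def toy_propose(doc: List[str]) -> Iterable[Edit]:
    yield ("add", len(doc), "be verbose")
    yield ("add", len(doc), "check the schema")
    yield ("add", len(doc), "cite the source")


skill, skill_score = optimize_skill(
    ["You are a SQL reviewer."], toy_propose, toy_validate, steps=3
)
print(skill, skill_score)
```

The frozen target agent never appears in the loop; only the document changes, which is what would make the resulting artifact portable across models.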
💡#7
@daniel_mac8
https://x.com/daniel_mac8/status/2059466060697354599
The clearest articulation today of where continual learning is actually heading, and it is not weight updates. The picture is very large, long-lived context windows treated as learnable fast weights, plus harness optimization where skills, prompts, tools, evals and workflows all become trainable state. Picture an enterprise agent where the base model carries the slow weights and the context carries the fast weights: org knowledge, project history, logs, eval results, tool traces, learned skills. Then the agent loop improves that state while the work is happening, not after the task but during it. A days-long run improves its own context and workflow as it goes, and because the improvement step is part of the loop, you eventually get a real autonomous agent that learns by optimizing the world around the model. His bet is we see this before the end of 2026.
💡#8
@SHL0MS
https://x.com/SHL0MS/status/2059749890620620851
A delightfully weird application of the autoresearch idea: he developed a method that resembles gain-of-function research crossed with Karpathy's autoresearch, but aimed at finding and mutating catastrophic Unicode bugs. Instead of optimizing a model or a skill, he is using the iterate-mutate-evaluate loop as an adversarial fuzzer, evolving inputs that break things in interesting ways. It is a reminder that the autoresearch loop is a general primitive: anything with an editable artifact and a measurable signal can be turned into an evolutionary search, including security and bug-hunting, not just model and prompt tuning.
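As a rough illustration of the iterate-mutate-evaluate shape (not his actual method, which the post does not detail), here is a toy hill-climber. The alphabet, the scoring signal and the function names are all assumptions. The signal is a harmless, well-known Unicode quirk: some characters grow when upper-cased, such as "ß" becoming "SS".

```python
import random
from typing import Tuple

# Assumed toy alphabet: two plain letters, U+00DF (sharp s) and U+FB01 (fi ligature).
ALPHABET = ["a", "b", "\u00df", "\ufb01"]


def score(s: str) -> int:
    """Toy signal: how many characters the string gains when upper-cased."""
    return len(s.upper()) - len(s)


def mutate(s: str, rng: random.Random) -> str:
    """Replace one randomly chosen character (requires a non-empty string)."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]


def evolve(seed: str, generations: int, rng: random.Random) -> Tuple[str, int]:
    """Keep the best string seen so far; accept a mutant only if it scores higher."""
    best, best_score = seed, score(seed)
    for _ in range(generations):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


best, best_score = evolve("aaaa", generations=200, rng=random.Random(0))
print(repr(best), best_score)
```

Swapping `score` for any other measurable signal (a crash, a parser disagreement, a timing spike) changes the target without changing the loop.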
💡#9
@AradhyeAgarwal
https://x.com/AradhyeAgarwal/status/2059643175946576140
A small honest experiment with a finding worth more than most success stories. Trying to build a video-quality filter, he coded an agent loop that reads a video lazily frame by frame through sequential tool calls, giving the agent a 20-call budget. On a 30-second clip of nearly 900 frames, instead of intelligently sampling, the agent just walked through every hundredth frame and stopped at 800, even when explicitly told to use all 20 turns, and this happened even with frontier models like GPT-5.4. His read is that the image-heavy context grew so large that instruction-following and reasoning degraded. The takeaway: we need agentic training loops integrated aggressively with visual inputs, because the loop only works if the model can still reason once the context is full of pixels.
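For contrast, this is roughly what "use the whole budget" looks like when done mechanically. It is only a baseline sketch; the function name is hypothetical, and the fixed-stride comparison assumes the agent started at frame 0, which the post does not state.

```python
from typing import List


def sample_frames(total_frames: int, budget: int) -> List[int]:
    """Evenly spaced frame indices that spend the full tool-call budget."""
    if budget <= 0 or total_frames <= 0:
        return []
    if budget >= total_frames:
        return list(range(total_frames))
    if budget == 1:
        return [0]
    # Spread `budget` samples from the first frame to the last frame.
    step = (total_frames - 1) / (budget - 1)
    return [round(i * step) for i in range(budget)]


# 900 frames, 20 calls: the figures from the experiment.
even = sample_frames(total_frames=900, budget=20)

# A fixed stride of 100 that stops at frame 800 (assumed to start at 0).
fixed_stride = list(range(0, 801, 100))

print(len(even), even[:3], even[-1])
print(len(fixed_stride), fixed_stride[-1])
```

Uniform spacing is still not intelligent sampling, since it ignores what earlier frames showed, but it does cover the end of the clip and spend every call.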
💡#10
@wesbos
https://x.com/wesbos/status/2059625611623043435
On the podcast with Alex and Amadeus from Pierre Computer, the standout was how they used pi autoresearch for performance, which Wes called genius. The broader frame is that fast, well-done primitives, specifically trees and diffs, are now the shared substrate under Claude, Codex and OpenCode, and pi is the harness people keep reaching for when they want to run their own optimization loops on those primitives. It is a useful signal that autoresearch is quietly becoming a performance-engineering tool, not just a research curiosity, used by teams who care about shaving real latency and cost off agent systems.
💡#11
@MattWil12
https://x.com/MattWil12/status/2059555417953370605
Prudentia, a deterministic AI co-pilot built only for the European financial sector, leans on a two-tier agentic loop for its gap analysis: a junior agent scans pages while a senior agent cross-references to kill false positives, auditing a 120-page document against EU law and regulations in minutes. Two guardrails stand out: legal-hierarchy awareness, which maps directly into vector space the rule that Level 1 regulations override Level 3 soft law, and one-click verification, which hard-links every claim to the exact paragraph. These are the kind of guardrails a probabilistic word predictor needs before it touches compliance. It is a clean example of an agentic loop doing high-stakes non-coding work where being confidently wrong is a liability, not a quirk.
💡#12
@DivyanshGandhi
https://x.com/DivyanshGandhi/status/2059701390138843136
He has been running his own variant of autoresearch he calls GSL: Graph, Score, Loop. Every meeting, decision and artefact gets translated into a context graph, then scored, then looped upon. It is autoresearch applied not to code or a model but to the raw material of how an organization thinks and decides, turning the messy stream of work into a structured, continuously re-evaluated graph. The pattern keeps showing up today, with people independently reinventing the same loop: take your own exhaust, score it, feed it back, and let the structure get sharper over time without a human in the middle of every pass.
💡#13
@Rohit_Writes
https://x.com/Rohit_Writes/status/2059456302410355043
The best articulation of what is still missing for autoresearch to be usable, framed as a wishlist because no open-source all-in-one platform exists yet and Codex /goal is not enough. He wants a reward-hacking monitor, because his biggest issue is leaving a loop running for 12 hours and coming back to a degenerate solution; human-escalation notifications that ping his phone when the agent needs more compute or data; multi-goal setup tied to a Linear project with milestones; adaptive telemetry; and research taste, a platform that proposes future directions and learns from his thumbs up and down. This is the real product spec for the autoresearch tooling everyone keeps gesturing at, written by someone who has actually been burned by an unattended loop.
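One simple form a reward-hacking monitor could take is a divergence check between the score the loop optimizes and a held-out score it never sees. The sketch below is an assumption about how such a monitor might work, not a description of any existing tool; the function name, window size and sample numbers are all made up for illustration.

```python
from typing import List, Sequence


def divergence_alerts(
    optimized: Sequence[float],
    held_out: Sequence[float],
    window: int = 3,
    tolerance: float = 0.0,
) -> List[int]:
    """Iterations where the optimized score rose over the last `window`
    steps while the held-out score fell by more than `tolerance`."""
    if len(optimized) != len(held_out):
        raise ValueError("score histories must have the same length")
    alerts = []
    for t in range(window, len(optimized)):
        optimized_delta = optimized[t] - optimized[t - window]
        held_out_delta = held_out[t] - held_out[t - window]
        if optimized_delta > 0 and held_out_delta < -tolerance:
            alerts.append(t)
    return alerts


# Made-up histories: the loop's own metric keeps climbing while the
# held-out metric peaks and then degrades.
optimized_history = [0.50, 0.55, 0.60, 0.70, 0.80, 0.90]
held_out_history = [0.50, 0.54, 0.58, 0.57, 0.52, 0.45]

alerts = divergence_alerts(optimized_history, held_out_history)
print(alerts)
```

In an unattended run, the first alert is the natural place to pause the loop or trigger the phone notification he asks for, rather than discovering the degenerate solution 12 hours later.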
💡#14
@TeksCreate
https://x.com/TeksCreate/status/2059568807190892690
ByteDance dropped deer-flow, which he argues is not just another agent framework but a SuperAgent Harness that researches, codes and creates autonomously over hours-long tasks, with 69.7K stars in three weeks. The architecture is the interesting part: instead of a single agent loop it uses sandboxed execution environments per subtask, persistent memory across sessions, a message gateway for inter-agent communication, and subagents that can spawn their own subagents. He has been testing it by feeding it ArXiv papers and asking for executable implementations, and reports surprisingly coherent results on multi-hour runs. It is MIT licensed and built on LangGraph. This is the infrastructure layer the long-running autoresearch loop actually needs.
💡#15
@Ventali
https://x.com/Ventali/status/2059748779365187671
They wanted an agent loop in the browser, so they built edgent and open-sourced it: a headless browser agent SDK that is CodeMirror-native, bring-your-own-model and MIT licensed. It is a small but telling entry in today's pattern: the agent loop is escaping the terminal and the IDE and moving into the browser as a reusable primitive other people can build on. The browser is where most knowledge work actually happens, so a clean, model-agnostic loop that runs there is exactly the kind of substrate that makes non-coding autoresearch and automation practical.
📡 Eco Products Radar

evo (@EVO__HQ, by Alok Bishoyi): the most-mentioned autoresearch project of the day, an open-source engine built on Karpathy's idea using parallel agents, tree search and dashboards to optimize codebases and skills. Now in open beta and adopted across thousands of projects.

pi / pi autoresearch (Pierre Computer): the harness people reach for to run optimization loops on the trees-and-diffs primitives that underpin Claude, Codex and OpenCode. Repeatedly cited for performance engineering, with pi-mono/agent praised as a minimal teaching-grade agent loop.

Emerging long-running harnesses: deer-flow (ByteDance, MIT, 69.7K stars in three weeks) as a SuperAgent harness with per-subtask sandboxes and subagents that spawn subagents; Humanize (NVIDIA, Ligeng Zhu) for hours-long engineering and research loops; and SkillOpt (Microsoft) plus MUSE-Autoskill as the research backbone for treating skills as trainable state. The framework layer for autoresearch is filling in fast.
← Previous