May 27, 2026 · loop

Loop Daily: May 28, 2026

If one idea is taking over the agent crowd this week, it's that the loop matters more than the model. People aren't bragging about a clever prompt anymore; they're bragging about leaving an agent running for 12 hours, or 20 rounds, or 10,000 training runs, and showing you the curve. The clearest example is a trader who wired up five parallel on-chain loops that scan, monitor, reflect and retune themselves, with the human nowhere in the inner loop. And the smartest people keep landing on the same punchline: the free autoresearch loop is a commodity; the verification harness is the moat. Here's what got built.
💡#1
@noelclawfun
https://x.com/noelclawfun/status/2059191305612472492
Open-sourced an autonomous on-chain trading agent running five parallel continuous loops at once. A 5-minute scanner pulls 20 trending tokens, scores each 0-100 on dip depth, bounce, sentiment, buy-pressure and volume, runs a rug check and auto-buys. A 10-second monitor manages stop-loss, take-profit and trailing stops. A 5-minute heartbeat snapshots balances and only wakes the LLM on anomalies. A 90-minute loop lets the LLM switch trading mode, and a 4-hour reflect loop has it review closed trades and retune its own score thresholds, liquidity requirements and position sizing. Execution stays deterministic; the LLM only handles reflection, adaptation and exceptions.
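The split between fast deterministic loops and slow LLM loops can be sketched as a simple interval table. This is a minimal illustration, not the author's code: the loop names, the `due_loops` and `composite_score` helpers, and the equal weighting of the five signals are all assumptions. Only the intervals and the 0-100 scale come from the post.

```python
# Intervals in seconds, taken from the post; the loop names are illustrative.
LOOP_INTERVALS = {
    "monitor": 10,            # stop-loss / take-profit / trailing stops
    "scanner": 5 * 60,        # score trending tokens, rug check, maybe buy
    "heartbeat": 5 * 60,      # snapshot balances, wake the LLM only on anomalies
    "mode_switch": 90 * 60,   # LLM may change trading mode
    "reflect": 4 * 60 * 60,   # LLM reviews closed trades and retunes thresholds
}


def due_loops(now_s: int) -> list[str]:
    """Return the loops whose interval evenly divides the elapsed seconds."""
    return [name for name, every in LOOP_INTERVALS.items() if now_s % every == 0]


def composite_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-signal scores that are already on a 0-100 scale.

    How the real agent combines its signals is not stated in the post;
    a weighted average is an assumption made for this sketch.
    """
    total_weight = sum(weights.values())
    raw = sum(signals[name] * weight for name, weight in weights.items()) / total_weight
    return max(0.0, min(100.0, raw))  # clamp to the 0-100 range


signals = {"dip_depth": 80, "bounce": 60, "sentiment": 40, "buy_pressure": 70, "volume": 50}
equal_weights = {name: 1.0 for name in signals}
print(due_loops(300))                           # ['monitor', 'scanner', 'heartbeat']
print(composite_score(signals, equal_weights))  # 60.0
```

The point of the design is visible in the table: the 10-second monitor never waits on a model call, and the LLM is consulted only on the two slowest cadences or on anomalies.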
💡#2
@Av1dlive
https://x.com/Av1dlive/status/2059208104030671236
A detailed account of Claude Code's harness loop for long-horizon autonomous builds. He tracks how far each model can run unattended (Opus 3.7 about an hour, 4.6 around 12 hours) and names the three failure modes: context, planning, verification. The loop itself: an initializer turns a one-line prompt into persistent artifacts (a feature list as JSON, a progress file, a git repo, an init script, a completion flag), then each iteration starts in a fresh context window, reads progress, picks one unfinished feature, implements it, verifies with Puppeteer, and commits on pass. Opus 4.6 was good enough to drop the sprint-decomposition and per-session context-reset scaffolding.
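One iteration of the loop described above might look like the following sketch. It is not Claude Code's actual harness: the file names (`features.json`, `progress.md`, `COMPLETE`), the JSON shape, and the `implement`, `verify` and `commit` callables are hypothetical stand-ins for the agent call, the Puppeteer check and `git commit`.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_iteration(
    workdir: Path,
    implement: Callable[[dict], None],
    verify: Callable[[dict], bool],
    commit: Callable[[str], None],
) -> Optional[str]:
    """Run one loop iteration; return the feature attempted, or None when all are done."""
    features_path = workdir / "features.json"  # hypothetical artifact name
    progress_path = workdir / "progress.md"    # hypothetical artifact name

    # Each iteration rebuilds its picture of the project from disk, not from chat history.
    features = json.loads(features_path.read_text())
    pending = [f for f in features if not f["done"]]
    if not pending:
        (workdir / "COMPLETE").touch()         # hypothetical completion flag
        return None

    feature = pending[0]                       # exactly one feature per iteration
    implement(feature)
    if verify(feature):                        # commit only on a passing check
        feature["done"] = True
        features_path.write_text(json.dumps(features, indent=2))
        with progress_path.open("a") as fh:
            fh.write(f"- done: {feature['name']}\n")
        commit(f"Implement {feature['name']}")
    return feature["name"]
```

Because all state lives in the files, a failed verification simply leaves the feature pending, and the next fresh-context iteration picks it up again.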
💡#3
@wandb
https://x.com/wandb/status/2059384575990939783
A crisp articulation of how overnight autoresearch loops actually get run in ML teams: Claude Code and Codex running 24/7, proposing experiments, kicking off training, monitoring results and staging the next iteration, with the experiment tracker serving as the queryable memory the loop compounds on. The key insight is that the loop needs durable, queryable state to keep improving across runs, not just a long prompt.
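To make "durable, queryable state" concrete, here is a toy tracker built on Python's standard `sqlite3` module. It is a stand-in for illustration only and is not the W&B API; the `runs` table, its columns, and the helper names are assumptions.

```python
import sqlite3


def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a tiny experiment log; pass a file path to make it durable."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "id INTEGER PRIMARY KEY, hypothesis TEXT NOT NULL, "
        "lr REAL NOT NULL, val_loss REAL NOT NULL)"
    )
    return conn


def log_run(conn: sqlite3.Connection, hypothesis: str, lr: float, val_loss: float) -> None:
    """Record one finished experiment."""
    conn.execute(
        "INSERT INTO runs (hypothesis, lr, val_loss) VALUES (?, ?, ?)",
        (hypothesis, lr, val_loss),
    )
    conn.commit()


def best_runs(conn: sqlite3.Connection, k: int = 3) -> list[tuple]:
    """Return the k lowest-loss runs, which the next iteration can use as context."""
    cursor = conn.execute(
        "SELECT hypothesis, lr, val_loss FROM runs ORDER BY val_loss ASC LIMIT ?", (k,)
    )
    return cursor.fetchall()


conn = open_tracker()
log_run(conn, "baseline", 3e-4, 2.91)
log_run(conn, "warmup 500 steps", 3e-4, 2.84)
log_run(conn, "double lr", 6e-4, 3.10)
print(best_runs(conn, k=2))
```

The agent proposing the next experiment queries `best_runs` instead of rereading a long prompt, so what it learns survives across sessions.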
💡#4
@hyprbots
https://x.com/hyprbots/status/2059177762003546555
Hyperbots says its own ML research stack is run by an autonomous multi-agent system that handles the full LLM/VLM lifecycle end to end: literature review, dataset analysis, infra setup, distributed training and monitoring, evaluation, failure analysis and reporting, all with persistent experiment memory. The architecture is one orchestrator plus seven specialized research-engineering subagents, each running continuous experiment loops with minimal human input, and they claim a 10-15x throughput gain in their finance-AI work.
💡#5
@ChaseWang
https://x.com/ChaseWang/status/2059161913959788711
Ran 20 rounds of autoresearch on his own X archive to build a skill that drafts in his voice. The voice-fidelity score climbed from 8.53 to 9.97, and about 98% of that lift landed in rounds one through six. His takeaway, 'the spec is the residue, the protocol is the product,' is a clean statement of why the loop matters more than any single output. A rare non-ML, content-production application of autoresearch.
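The diminishing-returns claim is easy to check with the figures quoted above. The score after round six is not stated in the post; the value below is only what the "about 98%" figure implies.

```python
start, end = 8.53, 9.97
early_share = 0.98  # "about 98%" of the lift, per the post

total_lift = end - start                 # lift across all 20 rounds
early_lift = total_lift * early_share    # lift attributed to rounds 1-6
implied_score_after_round_6 = start + early_lift
late_lift = total_lift - early_lift      # what rounds 7-20 added

print(round(total_lift, 2), round(implied_score_after_round_6, 2), round(late_lift, 2))
# prints: 1.44 9.94 0.03
```

In other words, the last fourteen rounds together moved the score by roughly three hundredths of a point.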
💡#6
@kwindla
https://x.com/kwindla/status/2059300287689756962
Benchmarked the 1T-param Kimi K2.6 served by Cerebras at 650-1000 tokens/sec with about 150ms time-to-first-token for voice-agent use. On his 30-turn voice-agent benchmark, K2.6 with reasoning enabled ties GPT-5.1 and Haiku 4.5 while being about 200ms faster; on his primary task-agent benchmark it ranks #2 and finishes each agent-loop turn in under 500ms, while competitors are more than 3x slower. The speed even lets the model emit structured data before plain text within a single turn.
💡#7
@MEGAcodePaul
https://x.com/MEGAcodePaul/status/2059299925205373208
Describes MEGA / AgentOpt, a closed-loop optimizer for agent workflows. It reads your source code and surgically edits LLM-pipeline components (the tool at a node, a retry policy, individual prompts), logs every action, and tracks accuracy, latency and token usage at the same time, auto-reverting any candidate that pushes latency or cost past a threshold. It reports 76.55 versus a 52.67 baseline (and beats GEPA's 69.52) on a workflow-optimization benchmark aggregated across HotpotQA, IFBench, HoVer and PUPA.
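The keep-or-revert rule can be sketched as a guarded hill climb. This is an illustration under stated assumptions, not AgentOpt's implementation: the `Metrics` fields, the absolute budgets, and the `optimize` signature are hypothetical, and token count stands in for cost.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Metrics:
    accuracy: float
    latency_ms: float
    tokens: int


def optimize(
    baseline_name: str,
    baseline: Metrics,
    candidates: Iterable[str],
    evaluate: Callable[[str], Metrics],
    max_latency_ms: float,
    max_tokens: int,
) -> tuple[str, list[tuple[str, str]]]:
    """Try each candidate edit in order; keep it only if it is in budget and more accurate."""
    best_name, best = baseline_name, baseline
    log: list[tuple[str, str]] = []  # every decision is recorded, kept or not
    for name in candidates:
        m = evaluate(name)
        if m.latency_ms > max_latency_ms or m.tokens > max_tokens:
            log.append((name, "reverted: over budget"))
        elif m.accuracy <= best.accuracy:
            log.append((name, "reverted: no accuracy gain"))
        else:
            best_name, best = name, m
            log.append((name, "kept"))
    return best_name, log


# Made-up numbers for illustration only.
results = {
    "swap_tool": Metrics(accuracy=0.70, latency_ms=300, tokens=900),
    "new_retry_policy": Metrics(accuracy=0.60, latency_ms=150, tokens=1200),
    "shorter_prompt": Metrics(accuracy=0.55, latency_ms=120, tokens=800),
}
winner, decisions = optimize(
    "baseline", Metrics(0.50, 100, 1000), results, results.__getitem__,
    max_latency_ms=200, max_tokens=1500,
)
print(winner)
print(decisions)
```

Note that the most accurate candidate here is rejected because it breaks the latency budget, which is the behavior the post describes.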
💡#8
@JoseCSancho
https://x.com/JoseCSancho/status/2059368252262830295
Lays out a vertical autoresearch playbook with cited proof points: fork karpathy/autoresearch, wire it to one clean numeric metric (cold-email reply rate, landing-page conversion, ROAS, Sharpe ratio), and sell it as done-for-you, where the eval harness, not the free loop, is the moat. His proof points are the quotable ones of the week: Shopify's 53%-faster templating from 93 automated commits, '$25 plus a single GPU equals 83 ML experiments overnight,' and operators moving cold-outreach reply rates from 2-4% to 8-12% in 4-6 weeks.
💡#9
@Risanuria235755
https://x.com/Risanuria235755/status/2059138775100563680
Points to an autonomous-speedrunning archive where Claude Code and Codex race on the modded-nanogpt training speedrun: 10,000+ training runs, 600+ idea writeups, and a two-week burst of parallel autoresearch with a full record of what each agent tried, when, and whether it worked. It's a rare concrete artifact of large-scale parallel autoresearch you can actually go read, not just a claim.
💡#10
@tarush_agarwal_
https://x.com/tarush_agarwal_/status/2059280644795203883
Cekura x ElevenLabs closes the loop on voice agents: when a voice agent fails in production, Cekura reproduces the failure in simulation, finds the root cause, improves the prompt and settings, and verifies the fix. It's the production-failure to simulated-repro to verified-fix loop applied to a domain where you can't just rerun against the real customer who hung up.
💡#11
@dosco
https://x.com/dosco/status/2059338102230135198
Breaks down the research foundations behind aithy and maps each to a concrete mechanism: DSPy for declarative signatures and typed I/O with deterministic code handling parsing and routing; Recursive Language Models that inspect history with code and forward compact evidence between stages, treating context as inspectable external state; a paper on faulty memories warning to keep summaries grounded in raw transcripts and avoid repeated auto-consolidation; and 'Is Grep All You Need?' for grep-first hybrid retrieval. A genuinely technical reading list for agent-memory and autoresearch methodology.
💡#12
@SwishMoe
https://x.com/SwishMoe/status/2059422896154374419
Built a legal AI around SimpleMem/EvolveMem and applied autoresearch not just to store memory but to improve how memory gets retrieved, via an evaluate, diagnose, propose, validate, repeat loop. The motivation is concrete legal failure modes: the model retrieves the wrong clause, misses an uploaded agreement, or loses earlier context. Optimizing retrieval itself, rather than dumping more into context, is the interesting move.
💡#13
@jacob_dietle
https://x.com/jacob_dietle/status/2059422880254054810
Used a codemode MCP plus a factory pattern to wrap APIs whose off-the-shelf tooling is poor, like the HubSpot MCP. He runs an autoresearch-style loop that iterates on the prompt and the codemode together, with backpressure from a rubric, optimizing for performance first and only trimming length near the end of the loop to minimize the performance trade-off. He open-sourced the /eval-loop skill.
💡#14
@mrluiscalderon
https://x.com/mrluiscalderon/status/2059333756113096879
Describes SkillForge v6.1: agent runs emit telemetry, and when the same gap shows up repeatedly the system proposes a new skill or a revision that an operator approves, after which every future agent inherits it. It's self-improving memory with a governance gate, which is the missing piece in most 'agent that edits itself' demos that just let the agent rewrite things unchecked.
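The governance gate can be sketched in a few lines. This is a hypothetical model of the pattern, not SkillForge's code: the `SkillRegistry` class, its method names, and the three-occurrence threshold are all assumptions.

```python
from collections import Counter


class SkillRegistry:
    """Repeated telemetry gaps become proposals; only approved proposals reach agents."""

    def __init__(self, gap_threshold: int = 3) -> None:
        self.gap_threshold = gap_threshold       # assumed value; the post gives none
        self.gap_counts: Counter = Counter()
        self.pending: dict[str, str] = {}        # gap -> proposed skill, awaiting review
        self.approved: dict[str, str] = {}       # gap -> skill inherited by future agents

    def record_gap(self, gap: str) -> bool:
        """Log one observed gap; return True if this observation triggered a proposal."""
        self.gap_counts[gap] += 1
        already_handled = gap in self.pending or gap in self.approved
        if self.gap_counts[gap] >= self.gap_threshold and not already_handled:
            self.pending[gap] = f"Draft skill addressing: {gap}"
            return True
        return False

    def approve(self, gap: str) -> None:
        """Operator sign-off: move a proposal from pending to approved."""
        self.approved[gap] = self.pending.pop(gap)

    def skills_for_new_agent(self) -> list[str]:
        """Only approved skills are inherited; pending ones are invisible to agents."""
        return sorted(self.approved.values())


registry = SkillRegistry()
for _ in range(3):
    registry.record_gap("cannot parse PDF invoices")
print(registry.skills_for_new_agent())   # [] until an operator approves
registry.approve("cannot parse PDF invoices")
print(registry.skills_for_new_agent())
```

The agent can trigger a proposal but has no code path to `approved`, which is what separates this pattern from a self-rewriting agent.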
💡#15
@realbarnakiss
https://x.com/realbarnakiss/status/2059259121279418693
Tested Composer 2.5 (an RL-plus-LLM architecture) and reports a specific empirical failure: in autoresearch loops on ZK codebases, it hits an unavoidable RL regression around 40-50 iterations. Narrow, but it's exactly the kind of concrete, reproducible observation the autoresearch crowd needs more of, instead of vague claims that the loops just keep getting better forever.
📡 Eco Products Radar
Hermes Agent (Nous Research), the most-mentioned framework by far, the self-hosted loop-runner everyone benchmarks against. Claude Code and Codex, the two workhorses people leave running 24/7. karpathy/autoresearch, the MIT reference loop people keep forking. SkillOpt, the framework for treating skill files as trainable parameters. EVO, the open orchestrator that starts autoresearch on any repo in two commands. GEPA and DSPy, the optimization machinery the serious builders cite. Also surfacing: pi-autoresearch as a minimal reference implementation.
← Previous
Super User Daily: May 28, 2026
Next β†’
Ideas Radar: May 28, 2026
← Back to all articles

Comments

Loading...
>_