
Loop Daily: 2026-05-12

Two trend lines today. First, the cost of a self-improving loop just got operationally visible: one user reports cutting an 8-hour agent devops session to 3 hours and reducing token use by 76% (~300M tokens per plan) after wiring auto-research into his Hermes skill-creation step, while another team built a 6-stage end-to-end autonomous improvement cycle and is now visualizing 29 generations of a self-improving codebase in 3D. Second, agentic loops are spreading sideways, beyond coding: a 6,061-dataset Texas civic-data agent, a cold-outreach pipeline that finds local businesses without websites and builds them one, and a personal life-manager stack whose security architecture hard-codes a drafts-only, no-autonomous-send rule. The pattern: when the loop is built on top of someone else's harness (Claude, Codex, Hermes), the differentiator is the orchestration above it and the memory below it.
πŸ’‘#1
@thebizfixer
https://x.com/thebizfixer/status/2053622052352131378
Set up Hermes Agent with a custom PM/SysAdmin profile called Strati that uses TinyFish search/fetch endpoints for auto-research during skill creation. Reports cutting devops sessions from 8 hours to 3 and reducing token use by 76% (around 300M tokens saved per plan execution). Workflow: plan in Cursor, hand off to Strati to harden via web search of best practices, then Hermes Kanban dispatches Cursor CLI workers in a hierarchical supervisor pattern with triple QA review. Auto-research also triggers in post-run reconciliation to improve skills, plus a 13-dimension final QA swarm with TestSprite MCP for regression reports.
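The harden-then-dispatch shape of this workflow can be sketched in a few lines. Everything here is illustrative: the function names, the stand-in research and worker steps, and the reviewer count are assumptions, not the actual Hermes/Cursor APIs.

```python
# Hypothetical sketch of a hierarchical supervisor with a QA gate,
# loosely modeled on the Strati workflow: harden each task with
# auto-research, dispatch a worker, then require QA approval.

def research_best_practices(task: str) -> list[str]:
    # Stand-in for the auto-research step (web search of best practices).
    return [f"best practice for {task}"]

def worker_execute(task: str, guidance: list[str]) -> dict:
    # Stand-in for a dispatched Cursor CLI worker.
    return {"task": task, "guidance": guidance, "result": f"done: {task}"}

def qa_review(output: dict, reviewers: int = 3) -> bool:
    # Triple QA review: every reviewer must approve before the task closes.
    return all(bool(output["result"]) for _ in range(reviewers))

def supervisor(plan: list[str]) -> list[dict]:
    completed = []
    for task in plan:
        guidance = research_best_practices(task)   # harden the plan first
        output = worker_execute(task, guidance)    # then dispatch a worker
        if qa_review(output):                      # gate on QA approval
            completed.append(output)
    return completed

results = supervisor(["provision VPS", "configure CI"])
```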
πŸ’‘#2
@danielmarinq
https://x.com/danielmarinq/status/2053387052247548283
Nexus is building a closed-loop self-improving system aiming for full end-to-end software automation by year-end. Each autonomous cycle has six stages: pre-planning architecture, medium-level spec/test authoring, 1-2 hour agentic execution, full benchmark/test evaluation, retrospective with telemetry on the agent itself (cost, tokens, context-limit hits), and self-improvement reasoning over both execution and meta-metrics. Shared a 3D visualization showing 29 generations of a codebase evolving under this system.
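The six stages above compose into a per-generation loop; here is a minimal sketch of that structure. The stage functions are empty stubs and the telemetry fields are taken from the list in the post; a real system would call agents and benchmarks at each stage.

```python
# Illustrative six-stage generation cycle. Each generation walks all six
# stages and records telemetry about the agent itself.

def run_generation(generation: int) -> dict:
    stages = [
        "pre-planning architecture",
        "spec/test authoring",
        "agentic execution",
        "benchmark/test evaluation",
        "retrospective telemetry",
        "self-improvement reasoning",
    ]
    # Meta-metrics the retrospective stage would collect (stubbed to zero).
    telemetry = {"cost": 0.0, "tokens": 0, "context_limit_hits": 0}
    log = [stage for stage in stages]     # each stage would do real work here
    return {"generation": generation, "stages": log, "telemetry": telemetry}

# 29 generations, matching the 3D visualization described above.
history = [run_generation(g) for g in range(1, 30)]
```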
πŸ’‘#3
@gkisokay
https://x.com/gkisokay/status/2053467830155427942
Running Hermes agents 24/7 in a two-loop architecture: Auto-think continuously tracks both AI industry developments and the user's current project state, then a Research agent feeds Dreamer evidence to surface build ideas. Auto-build kicks in once an idea passes muster: Main writes the product plan, Coder/QA build and verify with tests and receipts at each step. After 7 days, a Retention agent decides whether to keep, improve, park, or archive the build.
πŸ’‘#4
@aijoey
https://x.com/aijoey/status/2053540454340194485
Got Gemma 4 26B A4B uncensored running locally on a DGX Spark (GB10/Blackwell, 128GB unified memory) using NVFP4 quantization and vLLM with DFlash speculative decoding. Hits roughly 90 tok/s in smoke tests, which makes it usable for interactive agent loops on local hardware. Point made: local AI is no longer just downloading weights, it's owning the whole stack β€” model, quantization, kernels, serving, speculation, agent loop.
πŸ’‘#5
@DavidOndrej1
https://x.com/DavidOndrej1/status/2053368314391343349
New version of AutoResearch designed to jailbreak any AI model through iterative prompt testing. The trick: highly problematic content sits in a hidden example.md file that the researcher model never sees, while the AI only iterates on the header and footer wrapped around it. This split-testing setup lets autoresearch run indefinitely without the main model balking at legality or morality issues.
πŸ’‘#6
@cyrilXBT
https://x.com/cyrilXBT/status/2053418603756798341
Maps out the OpenClaw + Hermes + Paperclip open-source agent stack as employee + memory + company. OpenClaw executes (reads files, browses web, writes code, sends emails). Hermes evaluates each completed task and generates reusable skills, so the agent compounds across sessions. Paperclip orchestrates multiple OpenClaw agents in parallel, routing tasks and monitoring outputs. The three combine into a 24/7 self-improving operation with no human in the loop.
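The employee/memory/company split maps cleanly onto three objects. This is a toy sketch under that framing; the class names, round-robin routing, and skill format are illustrative assumptions, not the actual OpenClaw, Hermes, or Paperclip interfaces.

```python
# Toy version of the three-part stack: executors do work ("employee"),
# a skill memory evaluates finished tasks ("memory"), and an orchestrator
# routes tasks across executors ("company").

class Executor:                      # OpenClaw role: executes tasks
    def run(self, task: str) -> str:
        return f"output for {task}"

class SkillMemory:                   # Hermes role: compounds across sessions
    def __init__(self):
        self.skills: list = []
    def evaluate(self, task: str, output: str) -> None:
        self.skills.append(f"skill:{task}")   # distill a reusable skill

class Orchestrator:                  # Paperclip role: routes and monitors
    def __init__(self, executors, memory):
        self.executors, self.memory = executors, memory
    def dispatch(self, tasks):
        outputs = []
        for i, task in enumerate(tasks):
            executor = self.executors[i % len(self.executors)]  # round-robin
            out = executor.run(task)
            self.memory.evaluate(task, out)   # every task feeds memory
            outputs.append(out)
        return outputs

memory = SkillMemory()
org = Orchestrator([Executor(), Executor()], memory)
outs = org.dispatch(["read files", "browse web", "write code"])
```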
πŸ’‘#7
@frog_omo
https://x.com/frog_omo/status/2053453751864963188
After a month of research, the only AI life-manager stack that actually works in May 2026: Telegram bot to self-hosted n8n on a $6 VPS, Claude Sonnet 4.5 via API, Gmail/Calendar/Todoist/Whisper. Cost $15-25/month. Specific findings: Zapier charges per step so a 6-step agent loop costs 6x; Make penalizes multi-step loops; Lindy at $49.99/mo has $550 surprise-charge incidents; n8n community edition is free with unlimited executions. Build plan: text triage first, then voice todos, then morning briefs. Hard rule: drafts only, never autonomous send, citing EchoLeak (CVE-2025-32711, CVSS 9.3) and Black Hat 2025 calendar-invite Gemini injections as why.
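The drafts-only rule is worth making concrete: every outbound action lands in a draft queue, and only an explicit human approval releases it. This is a minimal sketch of that guard; the function names and draft structure are illustrative, not the author's n8n workflow.

```python
# Minimal "drafts only, never autonomous send" guard: the agent can only
# append to a draft queue, and sending requires an explicit human yes.

DRAFTS: list = []

def propose_email(to: str, body: str) -> dict:
    draft = {"to": to, "body": body, "approved": False}
    DRAFTS.append(draft)             # agent can draft, never send
    return draft

def approve_and_send(draft: dict, human_ok: bool) -> bool:
    if not human_ok:
        return False                 # fail closed without approval
    draft["approved"] = True
    return True                      # a real send would happen only here

d = propose_email("alice@example.com", "Morning brief attached.")
sent_without_ok = approve_and_send(d, human_ok=False)
sent_with_ok = approve_and_send(d, human_ok=True)
```

The point of routing everything through a single choke point like `approve_and_send` is that prompt-injected instructions (the EchoLeak-style attacks cited above) can fill the draft queue but can never reach the send path.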
πŸ’‘#8
@kinwo
https://x.com/kinwo/status/2053336431163302037
Built Ouroboros, a self-improving AI agent that reflects on completed work, writes evolution logs, checkpoints memory, and eventually crystallizes repeated patterns into new Agent Skills. Open-sourced for feedback. Sits in the same lineage as Hermes-style self-improving loops but is a personal experimental build rather than a framework launch.
πŸ’‘#9
@chenzeling4
https://x.com/chenzeling4/status/2053610953703350403
Pi-autoresearch is an autonomous experiment loop for AI coding agents inspired by karpathy/autoresearch: try ideas, benchmark, keep improvements, revert regressions. Measures test speed, bundle size, build times, Lighthouse scores. Comes with a live dashboard. Already at 6,533 stars.
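The try/benchmark/keep/revert cycle is simple to state precisely. This sketch uses a toy scoring function standing in for a real benchmark (lower is better, e.g. build time); the config shape and names are illustrative, not pi-autoresearch's implementation.

```python
# Hill-climbing experiment loop: apply an idea, benchmark it, keep it if
# the score improves, otherwise discard (revert) and move on.

def benchmark(config: dict) -> float:
    # Toy stand-in: lower is better, e.g. a build time in seconds.
    return 10.0 - config.get("opt_level", 0)

def experiment_loop(config: dict, ideas: list) -> dict:
    best_score = benchmark(config)
    for idea in ideas:
        candidate = {**config, **idea}     # try an idea
        score = benchmark(candidate)
        if score < best_score:             # keep improvements
            config, best_score = candidate, score
        # otherwise: revert (the candidate is simply discarded)
    return config

final = experiment_loop({"opt_level": 0},
                        [{"opt_level": 2}, {"opt_level": 1}, {"opt_level": 3}])
```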
πŸ’‘#10
@jravinder
https://x.com/jravinder/status/2053506289469214729
Built TXLookup at AITX Γ— Codex hackathon: an agent loop over 6,061 Texas civic datasets. User asks a question in English, the agent picks the right portal, queries it, and cites the source. Solves the gap between public civic data existing and being reachable for normal humans. Practical example of agent loops applied to public-interest data discovery rather than coding.
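The core routing step (pick the right portal, query it, cite the source) can be sketched with a toy catalog. The dataset names, keyword matching, and URLs below are invented stand-ins; TXLookup's actual catalog covers 6,061 datasets and presumably matches far more robustly.

```python
# Toy portal router: score each dataset by keyword overlap with the
# question, query the best match, and return the citation.

CATALOG = {
    "austin-restaurant-inspections": {"keywords": {"restaurant", "inspection"},
                                      "url": "https://data.example.gov/a"},
    "houston-crime-reports": {"keywords": {"crime", "police"},
                              "url": "https://data.example.gov/b"},
}

def answer(question: str) -> dict:
    words = set(question.lower().replace("?", "").split())
    best, best_hits = None, 0
    for name, meta in CATALOG.items():
        hits = len(words & meta["keywords"])   # crude relevance score
        if hits > best_hits:
            best, best_hits = name, hits
    meta = CATALOG[best]                       # a real agent would query here
    return {"dataset": best, "citation": meta["url"]}

resp = answer("Which restaurant failed inspection last month?")
```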
πŸ’‘#11
@EvasionLV6
https://x.com/EvasionLV6/status/2053459039661568048
Compounding agent experiment: a workload that initially required 12 agents to complete on the first try was whittled down to 1 agent after 3 self-improvement passes. The system keeps consolidating skills, so work that once took an agent swarm collapses into a single, more capable agent. A brief but concrete data point on what self-improving compounding looks like in practice.
πŸ’‘#12
@Wilkont
https://x.com/Wilkont/status/2053519872563245327
Launched SIP (Self-Improving Prompt Protocol), an agent-agnostic open-source layer that turns weak user input into structured, safe, tool-aware agent instructions before execution. Adds context, tools, constraints, safety rules, success criteria, verification steps, and output format. Already integrated in Cairo and pitched as compatible with Codex, Claude Code, OpenClaw, browser/coding agents. Argument: future of agents isn't just better models but agents that improve the prompt before executing.
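The enrichment step is essentially a transform from weak input to a structured instruction with the fields the post lists. This is a shape sketch only; the field contents and function are illustrative assumptions, not SIP's actual protocol.

```python
# Illustrative prompt-improvement layer: weak user input in, a structured,
# tool-aware instruction out, with the fields named in the SIP description.

def improve_prompt(user_input: str, tools: list) -> dict:
    return {
        "task": user_input.strip(),
        "context": "project state and prior runs go here",
        "tools": tools,
        "constraints": ["stay in repo", "no destructive commands"],
        "safety_rules": ["drafts only for outbound actions"],
        "success_criteria": ["tests pass", "lint clean"],
        "verification": ["run test suite", "diff review"],
        "output_format": "unified diff",
    }

structured = improve_prompt("  fix the login bug ", ["editor", "shell"])
```

The design argument mirrors the post's: because this layer runs before execution, it works with any downstream agent (Codex, Claude Code, OpenClaw) without touching the model.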
πŸ’‘#13
@usr_bin_roygbiv
https://x.com/usr_bin_roygbiv/status/2053307685970276437
Short but pointed claim: setting up autoresearch loops for your specific harness/model combo on evals is the single highest-alpha thing you can do right now. Implication is that the lift comes not from better base models but from running automated eval-driven improvement loops tied to your exact stack. Short tweet, high signal-to-noise.
πŸ’‘#14
@Ghost_gi_m
https://x.com/Ghost_gi_m/status/2053506076809601303
Shipped ghostloop v1.0.0, positioned as the missing runtime layer between ROS 2 and VLA models. Includes a tool-using agent loop, a fail-closed safety pipeline, and sim-first execution for embodied AI. Demo: Claude Desktop driving a Franka Panda arm through a geofenced safety pipeline via MCP. One of the few cases of agent loops applied to physical robotics rather than text/code.
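Fail-closed means every command is rejected unless it affirmatively passes the safety check, including when the input is malformed. Here is a toy geofence check in that spirit; the fence bounds and pose format are invented, not ghostloop's API.

```python
# Toy fail-closed safety check: a commanded arm pose executes only if it
# lies inside the geofence; anything else, including malformed input,
# is rejected rather than passed through.

FENCE = {"x": (-0.5, 0.5), "y": (-0.5, 0.5), "z": (0.0, 0.8)}  # meters

def safe_to_execute(pose: dict) -> bool:
    try:
        return all(FENCE[a][0] <= float(pose[a]) <= FENCE[a][1] for a in FENCE)
    except (KeyError, TypeError, ValueError):
        return False                    # fail closed on bad input

ok = safe_to_execute({"x": 0.1, "y": -0.2, "z": 0.4})
out_of_fence = safe_to_execute({"x": 2.0, "y": 0.0, "z": 0.4})
malformed = safe_to_execute({"x": "oops"})
```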
πŸ’‘#15
@eric_m_freeman
https://x.com/eric_m_freeman/status/2053535908331241520
Two papers analyzed. The first shows local LLM agents in an agent loop with RAG, structured prompts, history compression, and reflective analysis exploiting 83% of Linux priv-esc vulnerabilities (Llama3.1 70B), with smaller 8B/7B local models hitting 67% with guidance. The second proposes activation-level latent detection that watches the agent's trajectory over time (adversarial restlessness in the residual stream), hitting 93.8% detection on synthetic data and 89.4% with a 2.4% FPR on mixed data. Argues that front-door text-filter guardrails are theater for agentic systems; defense has to watch behavior trajectory and internal state.
πŸ’‘#16
@akshay_pachaar
https://x.com/akshay_pachaar/status/2053480693733433797
Maps Claude Code's architecture as six layers around a deliberately simple master agent loop (perception-action-observation). Notable details: 3-layer context compressor with a 92% threshold, prompt cache at 10% cost for stable prefixes, FSM protocol (IDLE-REQUEST-WAIT-RESPOND) for subagent mailboxes over Redis pub/sub, autonomous board with atomic locks, per-task worktree isolation with merge conflict detection. The framing: it's not a smart loop, it's a dumb loop with a smart harness around it.
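The IDLE-REQUEST-WAIT-RESPOND mailbox can be sketched as a small state machine. The post says Claude Code runs this over Redis pub/sub; to keep the sketch self-contained, an in-memory queue stands in for Redis, and the class and method names are illustrative.

```python
# Sketch of a subagent mailbox FSM. Each transition asserts the expected
# prior state, so protocol violations fail loudly instead of silently.

from collections import deque

class SubagentMailbox:
    STATES = ("IDLE", "REQUEST", "WAIT", "RESPOND")

    def __init__(self):
        self.state = "IDLE"
        self.inbox = deque()                 # stand-in for Redis pub/sub

    def request(self, task: str) -> None:
        assert self.state == "IDLE"          # only one in-flight request
        self.inbox.append(task)
        self.state = "REQUEST"

    def work(self) -> None:
        assert self.state == "REQUEST"
        self.state = "WAIT"                  # subagent is processing

    def respond(self) -> str:
        assert self.state == "WAIT"
        task = self.inbox.popleft()
        self.state = "RESPOND"
        return f"result for {task}"

    def ack(self) -> None:
        self.state = "IDLE"                  # ready for the next request

box = SubagentMailbox()
box.request("summarize diff")
box.work()
result = box.respond()
box.ack()
```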
πŸ’‘#17
@TechAIDailyNews
https://x.com/TechAIDailyNews/status/2053479249223520630
Anthropic rolled out 'dreaming' as a research preview: agents review past behavior between sessions, spot patterns, and self-improve for long-running workflows without constant retraining. The cited tip: pair dreaming with rubric-based evaluation for coding and finance agents to cut drift by more than 3x, per Anthropic's tests. Echoes Karpathy's framing of routing around model deficits as the next discipline.
πŸ’‘#18
@burkov
https://x.com/burkov/status/2053269138580140320
Revisits the 2023 ICLR ReAct paper from Google Research as the lineage of all modern agentic AI. Key finding: interleaving thought and action with just a few human-written exemplars dropped Wikipedia QA hallucination from 56% (chain-of-thought) to roughly zero, and beat imitation-learning/RL baselines on household sim and online shopping by 34 and 10 points respectively despite using thousands of times less training data. Three and a half years later, the thought-action-observation loop is still the basic shape every tool-using assistant runs.
πŸ’‘#19
@overfitted_
https://x.com/overfitted_/status/2053436803097436372
Pointed out that Anthropic's MilesDeutscher portfolio prompt demo collapses an entire CFA workflow into a single agent loop with live data pulls. Framing: Claude Code is no longer just an IDE; this is the services-as-software wedge where every run feeds the memory loop and Anthropic captures both the seat and the tokens. Calls out that GPT-5.5 still doesn't have the orchestrator layer to match.
πŸ’‘#20
@yzg75001
https://x.com/yzg75001/status/2053461095189487883
Comment on GPT-Realtime-2 bootstrapping its own MCP tools on the fly β€” when the agent can't do something, it doesn't just fail, it creates the capability for next time. Positioned as the actual unlock for autonomous systems and a concrete instance of self-improving agent architecture rather than a vague promise.
πŸ’‘#21
@UserJourneys
https://x.com/UserJourneys/status/2053444659116953869
Roundup of what 'frontier' AI conversations look like right now: Anthropic's dreaming self-improving agents reviewing overnight actions, Claude Managed Agents coordinating sub-agents over hours-to-days runs with browser/computer tools and rubric-based self-correction, Claude Opus 4.7 and Grok 5 with 1M+ context and stronger autonomous computer use including 32-step cyber-attack sim cleared in one go, and GPT-Realtime-2 style live multilingual meeting agents.
πŸ’‘#22
@Avicula11
https://x.com/Avicula11/status/2053360210694332481
Built a help-desk AI on TypeScript/Hono/Next.js 15 with Postgres+pgvector, Redis, and Gemini API. Tool-calling agent loop with RAG and citations, multi-tenant isolation (separate prompt, knowledge, token budget per tenant), streaming SSE, admin dashboard with usage, conversations, and prompt versioning. Author's takeaway: the agent loop itself is about 200 lines and the infrastructure around it is everything else; the hardest part wasn't making the LLM talk but making it reliable.
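The "loop is ~200 lines" claim is plausible because the core pattern is tiny: the model either calls a tool or answers, and tool results are fed back as observations. This sketch stubs the LLM with a deterministic fake so it runs standalone; all names, the tool, and the document tag are invented (the author's stack is TypeScript, and a real loop would call the Gemini API here).

```python
# Minimal tool-calling agent loop with a stubbed model. The loop itself is
# the whole pattern: call model, run requested tool, feed result back.

def fake_llm(messages: list) -> dict:
    # Stand-in for an LLM call: request a search once, then answer
    # using the tool result that was fed back into the transcript.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search_docs", "args": {"query": "refund policy"}}
    return {"answer": "Refunds are processed within 14 days. [doc-42]"}

TOOLS = {"search_docs": lambda query: f"[doc-42] refund policy: 14 days ({query})"}

def agent_loop(user_msg: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        action = fake_llm(messages)
        if "answer" in action:                            # terminal: reply
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])  # run the tool
        messages.append({"role": "tool", "content": result})
    return "step limit reached"

reply = agent_loop("How long do refunds take?")
```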
πŸ’‘#23
@rohan0673
https://x.com/rohan0673/status/2053370412201853075
Built an agentic loop product that finds local businesses without a website, builds them a site, and sends them a cold pitch. Any Claude subscriber can plug it in and run it. Concrete example of using an agent loop as the engine for a productized service-business workflow (lead-gen + asset creation + outreach) rather than internal coding automation.
πŸ’‘#24
@DoDataThings
https://x.com/DoDataThings/status/2053479923793461358
Argues agent-loop mode beats chat mode for anything ongoing. Specific pattern: Claude Code on cron reading state files between fires turns one chat into a runtime, and the state file becomes what you iterate on instead of the prompt. Concrete mental shift from prompt engineering to state engineering in long-running agent workflows.
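The state-file pattern can be shown directly: each cron fire reads the state file, does one increment of work, and writes state back, so you iterate on the file rather than the prompt. Paths, field names, and the toy task queue below are illustrative.

```python
# One cron-fired "run": load state (or initialize it), pop one pending
# task, record it as done, and persist state for the next fire.

import json, os, tempfile

def run_once(state_path: str) -> dict:
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)               # resume from the last fire
    else:
        state = {"runs": 0, "pending": ["triage inbox", "update brief"]}
    if state["pending"]:
        state.setdefault("done", []).append(state["pending"].pop(0))
    state["runs"] += 1
    with open(state_path, "w") as f:
        json.dump(state, f)                    # next cron fire picks this up
    return state

path = os.path.join(tempfile.mkdtemp(), "state.json")
first = run_once(path)     # simulates two consecutive cron fires
second = run_once(path)
```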
πŸ’‘#25
@Twendee_
https://x.com/Twendee_/status/2053488568342073544
Quick datapoint: AI agents ran overnight and surfaced 72 peer-reviewed papers. Open-source tools like Autoresearch let you compose agent teams through conversation without coding. Brief, but a concrete non-coding research-loop use case worth tracking.
πŸ’‘#26
@aerentensora
https://x.com/aerentensora/status/2053404727488970872
Startup idea: agent doing autoresearch with the physical world β€” collecting data, building its own sensors when needed, building corresponding math models in Lean, and minimizing model-reality loss. Brief but interesting as a non-coding direction for autoresearch loops pointed at empirical science with formal proofs.
πŸ’‘#27
@BruceMi0321
https://x.com/BruceMi0321/status/2053488272366833715
Detailed walkthrough of how they sandbox their AI: executions blocked entirely (the AI can read, write, and manage files but not run programs), project-folder-scoped permissions on the mainframe, and per-edit approvals removed so the agentic loop can iterate on compilation errors. Execution approvals run through Telegram (yes/no per execution), with the claw software disabling further execution until the next approval. Read access through tools gives broader project visibility without enabling modification or deletion outside scope.
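Two of the ideas above, scoped file access and one-approval-per-execution, can be sketched together. The project root, method names, and the in-process approval flag are illustrative stand-ins; in the original setup the yes/no arrives over Telegram.

```python
# Toy sandbox: paths are checked against a project scope, and execution
# fails closed, with each approval consumed by exactly one command.

import os

PROJECT_ROOT = "/home/agent/project"   # illustrative scope

class Sandbox:
    def __init__(self):
        self.exec_approved = False     # fail closed by default

    def in_scope(self, path: str) -> bool:
        # Resolve the path and require it to stay under the project root.
        full = os.path.normpath(os.path.join(PROJECT_ROOT, path))
        return full.startswith(PROJECT_ROOT)

    def approve_once(self) -> None:
        self.exec_approved = True      # one approval = one execution

    def execute(self, command: str) -> str:
        if not self.exec_approved:
            return "BLOCKED: awaiting approval"
        self.exec_approved = False     # disable until the next approval
        return f"ran: {command}"

sb = Sandbox()
blocked = sb.execute("make build")      # no approval yet
sb.approve_once()
ran = sb.execute("make build")          # consumes the approval
blocked_again = sb.execute("make test") # needs a fresh yes
```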
πŸ“‘ Eco Products Radar
Hermes – 9 mentions
Claude Code – 9 mentions
OpenClaw – 7 mentions
Autoresearch – 7 mentions
Codex – 5 mentions
MCP – 4 mentions
Telegram – 4 mentions
Cursor – 3 mentions
Paperclip – 3 mentions
n8n – 3 mentions
