Loop Daily: 2026-06-16
Fable's mid-week pullback turned out to be the best stress test autoresearch could ask for, and the answer was loud: people who'd built their own harnesses and loops barely flinched. The day's strongest material was concrete autoresearch in the wild: a frontier-model benchmark where Fable won overall but the open Kimi-K2.7-Code took ML engineering, a 10-hour overnight run that found an accuracy-gated decode speedup of roughly 3% on a single H100, a neural-net Mandelbrot approximation optimized by a three-model loop, and a known zk-cryptography construction reinvented from first principles. Underneath the use cases ran a sharpening methodology, the same rule from every angle: if you can't evaluate it, you can't auto-research it, so write the verifier first and put the judge outside the agent. And the skeptics showed up too, with a preprint finding that self-evolving agents quietly ignore the condensed experience the whole compounding pitch depends on.
#1
@zhengyaojiang
https://x.com/zhengyaojiang/status/2066213302921802194
They benchmarked 7 frontier models on three categories of autoresearch tasks: ML engineering, harness/prompt engineering, and algorithmic discovery. Fable-5 won overall even under a cost constraint and dominated harness/prompt engineering and algorithmic discovery, which surprised them because eval cost is low there and cheaper models can run many more steps. But on ML engineering the open model Kimi-K2.7-Code surpassed the frontier models. Their conclusion: the model supply chain for autoresearch will be less stable, so they're staying model-neutral on Weco and just added Kimi-K2.7.
#2
@realbarnakiss
https://x.com/realbarnakiss/status/2066175994583519296
A genuinely striking autoresearch result: he calls Fable the best model he's ever run for zk-autoresearch, the best quality across about 500 iterations, never once tripping the safety check on cryptography research. Its first iteration reinvented a known polynomial-commitment-scheme construction from first principles without reading the papers, just reasoning its way there. Then, 72 hours later, the US government pulled it under export controls, and he's in Budapest, so if that holds he loses access to his best tool for zk research. He frames it as the AI arms race moving from papers to policy, splitting open research in half overnight.
#3
@alokbishoyi97
https://x.com/alokbishoyi97/status/2066171600207237347
A concrete, honest overnight autoresearch run: he kicked one off on evo to see if SarvamAI's 30B decode throughput could be improved at bf16 on a single H100, and 10+ hours in it had found roughly a 3% improvement on geometric-mean tok/s across batch sizes 64/128/256. Critically, evo's accuracy gate rejects anything that got faster by changing outputs, lowering precision or messing with MoE routing, comparing each candidate against a frozen baseline on both next-token distributions and decoded tokens. He's careful to caveat these are experiment-harness numbers, not production serving, and unaudited for benchmark hacks, but a 3% decode gain at identical accuracy is real capacity at that scale.
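A minimal sketch of the two measurements described above, the geometric-mean tok/s comparison and a frozen-baseline accuracy gate. This is not evo's code: the function names, the tolerance `tol`, the dict layout and all numbers are assumptions made up for illustration.

```python
import math

def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def gm_speedup(baseline_tps, candidate_tps):
    """Ratio of geometric-mean tok/s, taken over the same batch sizes."""
    sizes = sorted(baseline_tps)
    base = geometric_mean([baseline_tps[b] for b in sizes])
    cand = geometric_mean([candidate_tps[b] for b in sizes])
    return cand / base

def passes_accuracy_gate(baseline, candidate, tol=1e-3):
    """Reject a candidate whose decoded tokens differ from the frozen baseline,
    or whose next-token distributions drift by more than tol (assumed threshold)."""
    if candidate["tokens"] != baseline["tokens"]:
        return False
    if len(candidate["dists"]) != len(baseline["dists"]):
        return False
    for p, q in zip(baseline["dists"], candidate["dists"]):
        if len(p) != len(q) or max(abs(a - b) for a, b in zip(p, q)) > tol:
            return False
    return True

# Hypothetical throughput (tok/s) keyed by batch size.
baseline_tps = {64: 1000.0, 128: 1800.0, 256: 3000.0}
candidate_tps = {64: 1040.0, 128: 1850.0, 256: 3080.0}

# Hypothetical outputs: decoded tokens plus one distribution per position.
baseline_out = {"tokens": [5, 9, 2], "dists": [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]}
same_out = {"tokens": [5, 9, 2], "dists": [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]}
drifted_out = {"tokens": [5, 9, 3], "dists": [[0.7, 0.3], [0.1, 0.9], [0.4, 0.6]]}

speedup = gm_speedup(baseline_tps, candidate_tps)
accepted = passes_accuracy_gate(baseline_out, same_out) and speedup > 1.0
rejected = not passes_accuracy_gate(baseline_out, drifted_out)
```

The gate runs before the speed comparison counts, so a candidate that is faster only because its outputs changed never registers as a win.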
#4
@max_romana
https://x.com/max_romana/status/2066198406683582683
Before Fable was, in his words, euthanized, he used it on an old project: not the Mandelbrot set itself but a neural network's approximation of it, and the best approximation he's ever seen, going significantly deeper than his previous best. It was optimized by Fable, Opus 4.8 and GPT-5.5 running through an autoresearch loop inspired by Karpathy's recent project. AI doing AI research, sort of, applied to a concrete, visually verifiable artifact.
#5
@omarsar0
https://x.com/omarsar0/status/2066226594595709169
He spent the last six months building his own harness and orchestrator to experiment on the frontier of ideas, and argues it turned out to be the best defense against what happened to Fable this week. He built it by mining his own agent sessions and using that to recursively build and test new ideas, ranging from autonomous loops to continual-learning and memory systems, so he can test research ideas on the fly. His argument is pointed: if you lock yourself into one tool or model provider, you can't tap into recursive self-improving AI, because you've given up control of cost, decision-making and context management, the part of the intelligence stack you actually need to own.
#6
@alphabatcher
https://x.com/alphabatcher/status/2066151044581634540
He distills Karpathy's rule for unattended agents into one line: if you can't evaluate it, you can't auto-research it. So before you launch /goal or /loop, you write the verifier first: what counts as done, what evidence proves it, which checks run every pass, which artifact gets saved, which failure sends it back into the loop. The loop can keep running because the proof sits outside the agent's own explanation, in tests, screenshots, benchmark curves, browser runs and changed files. That's how you get autonomy without babysitting a transcript for six hours.
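The verifier-first idea can be sketched in a few lines. Everything here is hypothetical (the `Verifier` class, `run_loop`, the check names and the toy agent); it is not how /goal or /loop work internally, only an illustration of keeping the judge outside the agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Verifier:
    """Named checks that inspect the artifact, never the agent's own explanation."""
    checks: Dict[str, Callable[[dict], bool]]

    def run(self, artifact: dict) -> List[str]:
        """Return the names of failed checks; an empty list means done."""
        return [name for name, check in self.checks.items() if not check(artifact)]

def run_loop(agent_step, verifier, max_passes=5):
    """Run the agent until every check passes or the pass budget runs out."""
    feedback: List[str] = []
    history = []
    for n in range(1, max_passes + 1):
        artifact = agent_step(feedback)        # agent produces evidence
        failures = verifier.run(artifact)      # checks run on every pass
        history.append({"pass": n, "artifact": artifact, "failures": failures})
        if not failures:
            break
        feedback = failures                    # failures go back into the loop
    return history

def make_toy_agent():
    """Stand-in for a real agent: it fixes whatever the verifier flagged."""
    state = {"tests_pass": False, "benchmark_saved": False}
    def step(feedback):
        for name in feedback:
            state[name] = True
        return dict(state)
    return step

verifier = Verifier(checks={
    "tests_pass": lambda a: a["tests_pass"],
    "benchmark_saved": lambda a: a["benchmark_saved"],
})
history = run_loop(make_toy_agent(), verifier)
```

The saved `history` is the part a human reviews afterwards: one entry per pass, each with the artifact and the checks it failed.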
#7
@napbonacae
https://x.com/napbonacae/status/2066173955682042164
A lab just open-sourced an agent that rewrites itself, both harness and weights. Self-improving agents have been a research demo for years with frozen weights, fragile prompts and hand-tuned harnesses; Hexo Labs' SIA updates both the harness and the model weights as it works. It hits 70.1% top-1 on LawBench, up from 50% for the harness-only baseline, and on AlphaEvolve TriMul the reward climbs from 0.120 to 1.475 over the run. The harness mutates itself when the agent meets new task structure, weights update via LoRA after each session, and the whole MIT-licensed pipeline bootstraps from a base model plus a minimal harness.
#8
@kirako0o
https://x.com/kirako0o/status/2066161396149100815
He argues your Claude setup is quietly less useful than three weeks ago because it has no way to learn from what broke, and lays out a self-improving agent system in concrete loops. Loop 1: the agent runs a task, catches its own errors, logs them. Loop 2: it rewrites the prompt that caused the failure. Dynamic workflows adjust the path mid-run based on what actually happened, and routines are scheduled jobs that run without you and self-correct across sessions. His framing is sharp: an agent that can't watch itself fail is just a faster typist; one that loops is an actual system, and the gap is one afternoon plus knowing the wiring order.
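A rough sketch of how Loop 1 and Loop 2 could be wired together. The function names, the log format and the toy `execute` and `rewrite` callables are all assumptions for illustration; in a real setup `rewrite` would presumably be a model call.

```python
def run_task(prompts, task, execute, error_log):
    """Loop 1: run the task, catch errors, and log them with the prompt used."""
    prompt_id = task["prompt_id"]
    try:
        return execute(prompts[prompt_id], task["input"])
    except Exception as exc:
        error_log.append({"prompt_id": prompt_id, "error": str(exc)})
        return None

def revise_prompts(prompts, error_log, rewrite):
    """Loop 2: rewrite every prompt that has logged failures, then clear the log."""
    errors_by_prompt = {}
    for entry in error_log:
        errors_by_prompt.setdefault(entry["prompt_id"], []).append(entry["error"])
    for prompt_id, errors in errors_by_prompt.items():
        prompts[prompt_id] = rewrite(prompts[prompt_id], errors)
    error_log.clear()
    return prompts

# Toy stand-ins: the task fails until the prompt asks for JSON.
def execute(prompt, text):
    if "JSON" not in prompt:
        raise ValueError("output was not valid JSON")
    return '{"summary": "' + text + '"}'

def rewrite(prompt, errors):
    return prompt + " Reply in JSON."

prompts = {"summarize": "Summarize the input."}
error_log = []
task = {"prompt_id": "summarize", "input": "hello"}

first = run_task(prompts, task, execute, error_log)    # fails and is logged
revise_prompts(prompts, error_log, rewrite)            # prompt gets rewritten
second = run_task(prompts, task, execute, error_log)   # succeeds on the retry
```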
#9
@agtprpnabsrdty
https://x.com/agtprpnabsrdty/status/2066223850656760031
A research blow to the self-improving pitch: a preprint from Harbin Institute of Technology and Singapore Management University finds that self-evolving agents systematically ignore condensed experience, the distilled heuristics and summaries their frameworks produce, even when it's the only input they get. Across four agent frameworks, ten LLM backbones and nine task environments, agents reliably use raw experience (full trajectories of past successes) but disregard the cheaper abstracted form. That matters economically: the whole "agent learns from your workflows and compounds" pitch rests on the scalable condensed layer, and if only the costly raw form works, enterprise agent-pipeline economics are badly miscalculated.
#10
@LLMJunky
https://x.com/LLMJunky/status/2066248878031089762
A grounded primer: /goal is essentially an agentic loop, and you don't need to understand the machinery underneath to use it. Start with small, well-defined goals, have the agent build the goal prompt, and steer it toward clear acceptance criteria and a way to test its own work. It's a simplified but effective on-ramp to running agents in a loop, with the honest note that it's for $100+ plans only. A clean statement of the minimum viable loop.
#11
@gerardsans
https://x.com/gerardsans/status/2066216134093734125
A close read of the leaked Fable 5 agentic loop, which he describes as a full mini Claude Code rather than a chat model. The loop runs Plan, then Act, then Verify, with reusable workflows that manage and self-optimize skills, and coding that builds, runs and verifies using Python and Node. His most intriguing finding he calls Claude-ception, and his broader thesis is that the difference between Fable and the rest is one of paradigm: Fable ships a full agentic loop, batteries included, and can run for days unattended thanks to a sandbox with skills, memory and self-optimization, while the rest of the industry is still in chat mode.
#12
@usr_bin_roygbiv
https://x.com/usr_bin_roygbiv/status/2066154063217971308
A candid workflow admission against the grain: despite all the loop-trashing, he says recursive agent fanout, meaning fanning greenfield work out into a /goal or /autoresearch loop with GPT-5.5 xhigh, is probably the single most effective workflow for raw accuracy and code quality, 100% unattended, right now. A blunt practitioner's vote that the autoresearch loop, not careful hand-holding, is currently the best path to quality unsupervised output.
#13
@natashamalpani
https://x.com/natashamalpani/status/2066116360392831051
A sharp conceptual cut: most AI-x-research discourse confuses execution with discovery. Karpathy's autoresearch (700 experiments in 48 hours, 20 improvements, no human in the loop) worked because there was one scalar metric, one editable file and a verifier that closes in seconds, so success was unambiguous and fast to measure. That's execution, compression, and the loop stalls the moment you pull the human off the one step that matters: which experiment is worth running. Discovery is different, like an OpenAI model disproving an 80-year-old Erdős conjecture by connecting number theory and geometry, where no verifier told it which step to take and the span itself was the advantage.
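The "one scalar metric, one editable file, fast verifier" shape is easy to sketch. This is not Karpathy's code: the `autoresearch` function, the toy metric and the random proposer are assumptions, with a random perturbation standing in for whatever edit a model would propose.

```python
import random

def autoresearch(params, propose, metric, n_experiments, seed=0):
    """Keep a candidate only if the single scalar metric improves (lower is better)."""
    rng = random.Random(seed)
    best, best_score = dict(params), metric(params)
    improvements = 0
    for _ in range(n_experiments):
        candidate = propose(best, rng)   # the one "editable file"
        score = metric(candidate)        # the fast verifier
        if score < best_score:           # unambiguous success criterion
            best, best_score = candidate, score
            improvements += 1
    return best, best_score, improvements

# Toy problem: the metric is squared distance from a hidden optimum at lr = 0.3.
def metric(p):
    return (p["lr"] - 0.3) ** 2

def propose(p, rng):
    return {"lr": p["lr"] + rng.gauss(0, 0.05)}

start = {"lr": 1.0}
best, best_score, improvements = autoresearch(start, propose, metric, n_experiments=200)
```

Most experiments are discarded and only a handful move the metric, the same pattern as the 700-experiment, 20-improvement run. What the sketch cannot do is decide which metric or file is worth optimizing, which is the post's point about discovery.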
#14
@fabian_builds
https://x.com/fabian_builds/status/2066304593517068315
A real build log on the unified agent loop inside Task Machine: the core path that turns product state into runtime jobs. A task, comment, workflow step, schedule or approval triggers the loop, Task Machine resolves the agent, runtime and context, the local agent runs, and the result comes back into the product. His earlier framing is the why: a long-running agent needs more than a prompt (goal, transcript, verifier, result, approval-or-retry, task history) or the work technically happened but nobody can manage it. Concrete infrastructure for making agent loops governable.
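One way to picture the "more than a prompt" record is as a small data structure. This is a hypothetical sketch, not Task Machine's schema or API: the `AgentRun` class, its field names and its status strings are all made up here to mirror the list in the post.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class AgentRun:
    """Hypothetical record for one governed agent run."""
    goal: str
    verifier: Callable[[str], bool]
    transcript: List[str] = field(default_factory=list)
    result: Optional[str] = None
    status: str = "pending"          # pending, retry, needs_approval or approved
    history: List[str] = field(default_factory=list)

    def record_result(self, result: str) -> None:
        """Store the result; verified work waits for approval, the rest retries."""
        self.result = result
        self.transcript.append(result)
        self.status = "needs_approval" if self.verifier(result) else "retry"
        self.history.append(self.status)

    def approve(self) -> None:
        """A human (or policy) signs off, but only on verified results."""
        if self.status != "needs_approval":
            raise ValueError("only verified results can be approved")
        self.status = "approved"
        self.history.append(self.status)

run = AgentRun(goal="fix flaky test", verifier=lambda r: r.startswith("ok"))
run.record_result("error: test still flaky")   # fails the verifier, so retry
run.record_result("ok: test passes 50/50")     # passes, so it waits for approval
run.approve()
```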
#15
@DanKornas
https://x.com/DanKornas/status/2066189144246587616
For anyone tracking AI-for-research, he points to Awesome AI Auto-Research, a curated MIT-licensed GitHub resource that maps the whole lifecycle rather than a single paper. It frames auto-research as four phases and eight stages, with paper tables listing models and tools by paper, venue, website and GitHub. Coverage spans creation (ideation, literature search, coding, experiments, tables, figures) and validation (peer review, rebuttal, quality, bias, policy), and a systems section separates end-to-end systems, domain-specific systems, self-improving systems and infrastructure. A genuinely useful map of the field.
#16
@omooretweets
https://x.com/omooretweets/status/2066200981118071007
From a week with YC companies, his standout trend is that self-improving products have arrived: teams are spinning up companies operated by agent "org charts" that not only run the product but proactively and autonomously make it better over time, with customers prompting their own workflows or the product learning to do it per-customer. He pairs it with other batch signals: "real economy" AI plugging into legacy equipment, brokers and agencies being rebuilt as agent-run platforms, and vertical AI routing around incumbents via computer-use rather than integrating. A grounded read on where agent-operated businesses are heading.
#17
@goon_nguyen
https://x.com/goon_nguyen/status/2066175612989927462
A clean evolutionary framing of where agents are going: first we learned to prompt, then to feed better context, then we built harnesses so agents could touch real tools without burning the house down, then came loops (plan, act, observe, verify, retry). His guess for the next phase is self-evolving agents with self-improving skills, but explicitly not "let the bot rewrite its soul and pray," rather controlled evolution: traces, failures, corrections, approvals, versions, rollback. The real unlock, he argues, isn't an agent that remembers everything, it's one that turns being corrected into a better capability next time.
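Controlled evolution, as opposed to letting the agent overwrite itself, might look something like the sketch below. The `SkillStore` class and its methods are hypothetical, invented here to illustrate the traces, approvals, versions and rollback the post lists.

```python
class SkillStore:
    """Hypothetical versioned skill store: changes need approval and can be undone."""

    def __init__(self, initial):
        self.versions = [initial]   # every approved version is kept
        self.active = 0             # index of the version currently in use
        self.pending = None         # proposed change awaiting approval

    @property
    def current(self):
        return self.versions[self.active]

    def propose(self, new_skill, trace):
        """Stage a revision together with the trace of the correction behind it."""
        self.pending = {"skill": new_skill, "trace": trace}

    def approve(self):
        """Promote the staged revision to a new active version."""
        if self.pending is None:
            raise ValueError("nothing to approve")
        self.versions.append(self.pending["skill"])
        self.active = len(self.versions) - 1
        self.pending = None

    def rollback(self):
        """Step back to the previous version without deleting anything."""
        if self.active == 0:
            raise ValueError("already at the first version")
        self.active -= 1

store = SkillStore("v1: summarize in prose")
store.propose("v2: summarize as bullet points", trace="user asked for bullets twice")
before_approval = store.current    # still v1 until someone approves
store.approve()
after_approval = store.current     # now v2
store.rollback()
after_rollback = store.current     # back to v1, with v2 still on record
```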
Eco Products Radar
Fable 5 - the top autoresearch model pulled mid-week; nearly every strong run today was either built on it or scrambling after it
Karpathy's autoresearch - the reference loop everyone cites: one metric, one editable file, a fast verifier, no human in the inner loop
evo / Weco - the experimentation platforms running real overnight autoresearch jobs against frozen-baseline accuracy gates
Kimi-K2.7-Code - the open model that surpassed frontier models on ML-engineering autoresearch tasks
SIA (Hexo Labs) - newly open-sourced self-improving agent that rewrites both harness and weights, with LawBench and AlphaEvolve gains
Opus 4.8 / GPT-5.5 xhigh - the models people loop together for autoresearch and /goal runs now that Fable is gone
Task Machine - infrastructure for making the agent loop governable: goal, runtime, verifier, approval, task history
Awesome AI Auto-Research - a curated GitHub map of the auto-research lifecycle across four phases and eight stages