Loop Daily: 2026-04-02
The autoresearch ecosystem had a productive day on March 31st. The biggest theme was expansion beyond pure code optimization. People are pointing autoresearch loops at KV cache compression, firewall rulesets, Apple Neural Engine inference, RL fine-tuning, and even philosophical divergence testing across frontier models. Meanwhile the Claude Code source leak dominated the agent loop conversation, revealing surprisingly simple core architecture wrapped in layers of context management and permission logic. The real signal: the barrier to running recursive improvement loops keeps dropping, and the use cases keep getting weirder.
#1
@shannholmberg
https://x.com/shannholmberg/status/2038866414057161145
AutoReason extends autoresearch to domains where there is no numeric metric to optimize. Instead of hill-climbing a score, it runs a loop of four agents: one writes, one critiques without fixing, a third rewrites based on that critique, and a fourth merges the best of both versions. A blind judge panel picks the winner, and the loop continues until nothing beats the current version. Every agent gets fresh context to prevent confirmation bias from building up. In blind panel testing, AutoReason scored 35 out of 35. The next best method scored 21. This is probably the most interesting conceptual extension of autoresearch this week because it cracks the "but my problem doesn't have a number" objection wide open.
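The four-agent loop above can be sketched in a few lines. Everything here is illustrative (the function names, the toy convergence rule); each agent is passed in as a plain callback standing in for an LLM call made with fresh context:

```python
import random

def autoreason_round(draft, critic, rewriter, merger, judge):
    """One round of the four-agent loop: critique, rewrite, merge, judge."""
    critique = critic(draft)              # critiques only, never fixes
    rewrite = rewriter(draft, critique)   # rewrites based on the critique
    merged = merger(draft, rewrite)       # merges the best of both versions
    # Blind judging: candidates are shuffled so the judge sees no labels.
    candidates = [draft, merged]
    random.shuffle(candidates)
    return max(candidates, key=judge)

def autoreason(seed_prompt, writer, critic, rewriter, merger, judge, max_rounds=10):
    """Loop until no round beats the current version (or rounds run out)."""
    current = writer(seed_prompt)
    for _ in range(max_rounds):
        nxt = autoreason_round(current, critic, rewriter, merger, judge)
        if nxt == current:   # nothing beat the incumbent: converged
            break
        current = nxt
    return current
```

In the real system the judge is a panel and each callback runs with a clean context window; the sketch only captures the control flow.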
#2
@Hevalon
https://x.com/Hevalon/status/2038977575372951930
Built autoresearch-rl and pointed it at GRPO fine-tuning on Basilic AI A100s. One command, 15 iterations, zero human intervention, 100% infrastructure success rate. GSM8K pass@1 went from 26% baseline to 36%. The hard part, according to Hevalon, was not the search algorithm but the infrastructure. Each RL iteration needs a clean CUDA environment and 5-10 minutes of A100 time. A bad hyperparameter wastes an hour. This is the first serious attempt at making autoresearch work for RL fine-tuning, and the infrastructure challenge is real. You cannot just run this on a workstation.
#3
@dhawalc
https://x.com/dhawalc/status/2038889125814903204
Ran autoresearch overnight on TurboQuant KV cache compression and hit 5.1x compression with FP16-matching generation quality. No quality loss. The winning stack: 3-bit keys, 1-bit residual signs, 2-bit values, plus an FP16 window for recent tokens. Started at 0/5 accuracy (completely broken). Six rounds later it matched FP16 on every question. The breakthrough came from finding a one-line protocol bug in get_mask_sizes that had invalidated every previous test. Once fixed, the algorithm evolution actually worked. The TurboQuant paper promised 5x. Autoresearch delivered 5.1x.
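The winning bit layout lends itself to simple compression accounting. The function below is my own sketch of that arithmetic, not the actual quantizer: it assumes one key element and one value element per token and charges the recent-token window at full FP16:

```python
def kv_compression_ratio(seq_len, window, key_bits=3, residual_bits=1,
                         value_bits=2, fp16_bits=16):
    """Effective KV-cache compression for the stack described above:
    quantized keys/values for older tokens, full FP16 for the most
    recent `window` tokens. Pure bit accounting, not a real quantizer.
    """
    old = max(seq_len - window, 0)
    # Per old token: a 3-bit key + 1-bit residual sign, plus a 2-bit value.
    quant_bits = old * ((key_bits + residual_bits) + value_bits)
    # Recent tokens keep one FP16 key element and one FP16 value element.
    window_bits = min(window, seq_len) * 2 * fp16_bits
    baseline_bits = seq_len * 2 * fp16_bits
    return baseline_bits / (quant_bits + window_bits)
```

With no FP16 window the ceiling is 32/6 ≈ 5.3x; a modest window brings the effective ratio down toward the reported 5.1x, depending on sequence length.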
#4
@advait_jayant
https://x.com/advait_jayant/status/2039022806243979696
Divergence Explorer takes autoresearch in a completely different direction. Instead of optimizing a metric, it investigates whether frontier AI models actually think differently or have all converged. An autonomous researcher on OpenGradient generated 320 hard questions across ethics, consciousness, philosophy, and geopolitics, then sent each to GPT-5.2, Claude Opus 4.6, Gemini 2.5 Flash, and Grok 4 in parallel. Every response was cryptographically sealed in a TEE enclave. The finding: open-ended questions produced 95% consensus. But when the system learned on its own to force binary choices, consensus dropped to 33%. No human switched modes. The researcher figured out autonomously that models only agree when you let them hedge.
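The consensus metric is easy to state precisely. A minimal sketch, assuming answers are normalized strings and "consensus" on a question means every model answered identically:

```python
def consensus_rate(answers_per_question):
    """Fraction of questions where all models gave the same answer.

    `answers_per_question` is a list of lists, one inner list per
    question holding each model's normalized answer. Forced binary
    choices make this test much stricter than open-ended prompts,
    where hedged responses tend to blur together into agreement.
    """
    agree = sum(1 for answers in answers_per_question
                if len(set(answers)) == 1)
    return agree / len(answers_per_question)
```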
#5
@ryanmart3n
https://x.com/ryanmart3n/status/2039016764982669435
Lays out the missing piece for full autonomous AI development. Autoresearch handles agent-building-models. Meta-Harness handles agents-building-agents. But we still need Meta-Measure or autobenchmark: agents-building-evaluations. Ryan's team has started automating the task discrimination part of benchmark development for Terminal-Bench 3.0, but says the real unlock is cracking task generation. The taxonomy is clean: once you can autonomously generate evals, the entire build-test-improve cycle closes without human intervention.
#6
@duanebester
https://x.com/duanebester/status/2038825090792480800
Built an agent in Zig that autonomously optimizes GPU kernels, dispatch strategies, and pipeline architecture for Apple Silicon. It edits Metal shaders, benchmarks, and commits its own improvements. MNIST throughput: 407k images per second versus 82k for MLX (5x slower) and 12k for PyTorch MPS (34x slower). All three hit the same 98% accuracy. The key optimizations the agent discovered: batched command buffer encoding at 64 batches per command, validation frequency reduction to every 5th epoch, and a GPU argmax kernel for batched evaluation. The agent records summaries of failed experiments so it never repeats dead ends. That memory detail is what separates toy loops from loops that actually compound.
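The failure-memory detail is worth pinning down. The agent itself is written in Zig; this is a hypothetical Python sketch of the pattern (all names are mine), where failed experiments are recorded so the agent can check a proposed change against its dead-end list before spending another benchmark run:

```python
class ExperimentMemory:
    """Minimal 'never repeat dead ends' memory: a short summary of
    every failed experiment, keyed by a canonical description of the
    change that was tried.
    """
    def __init__(self):
        self._failed = {}

    def record_failure(self, change, summary):
        self._failed[change] = summary

    def already_failed(self, change):
        return change in self._failed

    def summaries(self):
        """Context to feed the agent before it proposes the next change."""
        return [f"{change}: {summary}"
                for change, summary in self._failed.items()]
```

Feeding `summaries()` back into each proposal step is what lets the loop compound instead of revisiting the same failed kernel variants.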
#7
@duanebester
https://x.com/duanebester/status/2039031530232696878
The nnzap inference benchmark follow-up pushes the results further: 1.58 million inferences per second GPU batched, 9.3x faster than MLX, 3.2x faster than PyTorch. Single-sample CPU latency dropped from 884 microseconds to 21 microseconds. All in Zig with zero-copy unified memory, comptime layouts, and Metal compute shaders. An autonomous agent ran 19 experiments, reading assembly, writing Metal kernels, and rolling back its own failures.
#8
@detectiveomee
https://x.com/detectiveomee/status/2039020976214557114
Applied autoresearch to firewall optimization with mathematical correctness proofs. The system has an LLM propose small changes (swap rules, remove shadowed rules, reorder), a BDD equivalence checker proves the new ruleset behaves identically, and a traffic-aware scorer measures performance. Classical algorithms first took 613 rules down to 190 (69% reduction in 6 seconds). Then the LLM found 45 cross-action shadows the classical code missed and achieved a further 3.5% weighted match depth reduction through traffic-aware reordering. The honest assessment: 3.5% probably is not worth 10 minutes of Opus API calls in production. But a verified optimization loop where every change carries a proof of correctness is a genuinely new pattern worth exploring.
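The propose-verify-score pattern generalizes beyond firewalls. A sketch under stated assumptions: `equivalent` stands in for the BDD equivalence checker, `propose` for the LLM, and lower scores mean shallower weighted match depth. All names are illustrative:

```python
def verified_optimize(ruleset, propose, equivalent, score, max_steps=100):
    """Accept a candidate only if the equivalence check proves it
    behaves identically AND it scores strictly better. Every accepted
    change therefore carries a proof of semantic equivalence.
    """
    best, best_score = ruleset, score(ruleset)
    for _ in range(max_steps):
        candidate = propose(best)
        if candidate is None:
            break                       # proposer has no more moves
        if not equivalent(best, candidate):
            continue                    # reject: behavior changed, no proof
        s = score(candidate)
        if s < best_score:              # lower weighted match depth wins
            best, best_score = candidate, s
    return best
```

The interesting property is that a buggy proposer cannot corrupt the ruleset: unsound proposals are filtered by the verifier, so only the score, never correctness, is at stake.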
#9
@christinetyip
https://x.com/christinetyip/status/2039040420286521693
Applied autoresearch beyond research problems to real-world system optimization. The result: up to 6.31x faster inference than Apple's official CoreML approach on the Apple Neural Engine. This is notable because it takes the loop out of the "optimize my model training" lane and into production inference infrastructure. When your autoresearch loop is finding 6x speedups over vendor-official implementations, the vendor should probably be running these loops too.
#10
@DeepValueBagger
https://x.com/DeepValueBagger/status/2038788897128198210
Built a stock research-first agentic AI platform from scratch after getting frustrated with OpenClaw and existing deep research tools. The core realization: major platforms have stale data, OpenClaw is bloated with features nobody needs, and multi-agent orchestration makes it hard to tell which agent is doing what half the time. So he rolled his own agent with an agentic loop, memory, first-class tools for equity research including free search, an agentic browser, direct SEC data access, a price database, RAG indexing, and a pipeline to ingest all transcripts and MD&A filings. Once the data platform was built, vibe coding the UI was easy.
#11
@koylanai
https://x.com/koylanai/status/2039027239304433767
Breaks down the Shopify case study on going from one-shot LLM to specialized sub-agents with DSPy and MIPRO. The key finding that matters for autoresearch: smaller model plus better architecture beats bigger model plus worse architecture. Self-hosted Qwen 3 outperformed GPT-5 in their pipeline. Prompt optimization with MIPRO did not work on the monolithic single-agent setup but worked extremely well on specialized sub-agents. The architecture determines the optimization surface. Clean modules with isolated objectives give optimizers clean signal to work with. This is directly applicable to anyone designing autoresearch loops: optimize after you architect, not before.
#12
@DanielMiessler
https://x.com/DanielMiessler/status/2039067001243771117
Raises a provocative early-stage idea: cryptocurrency may be at risk from autoresearch being applied to quantum encryption breakthroughs or attacks on existing algorithms. The broader claim is that autoresearch is about to reveal how primitively we are doing everything. Coming from a well-known security researcher, this is worth tracking as a signal of how seriously the infosec community is taking recursive optimization loops.
#13
@OpenBMB
https://x.com/OpenBMB/status/2038987177946980819
Tsinghua NLP published CPMobius, a Coach-Player paradigm that lets LLMs self-evolve reasoning capabilities in a completely data-free environment. The Coach generates math tasks, the Player learns to solve them via majority voting and self-training. The Coach uses dynamic difficulty filtering to keep tasks in a 20-80% success rate sweet spot. Progress-driven rewards mean the Coach only wins when the Player actually gets smarter. On Qwen2.5-Math-7B, CPMobius boosted overall accuracy by 4.9 points and out-of-distribution accuracy by 5.4 points. This is autoresearch applied to the training pipeline itself: zero external data, fully autonomous, and beating existing unsupervised methods.
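The Coach's difficulty filter and the Player's self-labeling step are simple to sketch. The thresholds follow the description above; the implementation is my own illustration, not the paper's code:

```python
from collections import Counter

def filter_tasks(tasks, player_success_rate, low=0.2, high=0.8):
    """Dynamic difficulty filtering: keep only tasks whose measured
    Player success rate falls in the 20-80% band. Too easy (>80%)
    teaches nothing; too hard (<20%) gives no learning signal.
    """
    return [t for t in tasks if low <= player_success_rate(t) <= high]

def majority_label(samples):
    """Self-training label via majority voting: the Player's most
    frequent answer over k samples, plus its vote share.
    """
    (answer, votes), = Counter(samples).most_common(1)
    return answer, votes / len(samples)
```

The progress-driven reward then pays the Coach only when the Player's success rate on held-out tasks actually rises, closing the loop without any external data.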
#14
@JIACHENLIU8
https://x.com/JIACHENLIU8/status/2038833454083883237
Makes a clean distinction that is easy to miss: autoresearch is about the research pipeline, while meta-harness is about optimizing the scaffold end-to-end. Meta-harness is harder because credit assignment spans code history and all prior traces. The question that matters: what happens when the evaluator is confidently wrong about which change caused a gain? This is the credit assignment problem for agent self-improvement, and nobody has solved it yet.
#15
@NandinoAI
https://x.com/NandinoAI/status/2038818569413156979
A sharp critique of self-improving coding tools. The improvement metric is usually "generates code that passes its own tests," which is a closed feedback loop. The agent gets better at satisfying itself, not at satisfying users. The real breakthrough, NandinoAI argues, happens when the improvement signal comes from production metrics rather than the model grading its own output. This is the fundamental validity question for any autoresearch setup: is your metric actually measuring what you care about?
#16
@drewsky1
https://x.com/drewsky1/status/2039007552243896476
Uses separate profiles for different autoresearch projects running overnight on a 3090. One feeds papers and specs into optimization loops for hardware design. Another manages a product prototype with API integration, search query refinement, and compliance checks against platform TOS. Both use custom skills written with Opus. This is a practical example of autoresearch being used for non-code applications: hardware design optimization and product development workflows running 24/7.
#17
@jaredgoering
https://x.com/jaredgoering/status/2039019004510056737
Pointed an open-source ML search engine at Parkinson's disease detection from smartphone tapping data. Ran 85 experiments in under 2 minutes, at least 10x faster than standard autoresearch. Beat the best published confound-controlled result by 8.8 percentage points. Medical AI research is exactly the kind of domain where autoresearch could have outsized impact because the metric (diagnostic accuracy) is well-defined and the search space of model architectures and features is enormous.
#18
@vin_asia
https://x.com/vin_asia/status/2038979358648639537
Running autoresearch loops 24 hours a day on old GPUs, continuously optimizing code, SQL queries, and business procedures, all unattended. Speed in tokens per second becomes irrelevant when you are running overnight loops. This is the "quiet deployment" pattern: people are already running autoresearch as always-on infrastructure for business process optimization, not just research experiments.
Eco Products Radar
#19
@PrimeIntellect
https://x.com/PrimeIntellect/status/2038787571795599549
Highlights Flywheel by Paradigm AI as one of the standout projects emerging in the spirit of autoresearch. PrimeIntellect is tracking the ecosystem of tools being built on top of the autoresearch concept, and Flywheel appears to be gaining traction as a more structured framework for running optimization loops.
#20
@needhelptho
https://x.com/needhelptho/status/2039030060783636991
Open-sourced a generalized version of Karpathy's autoresearch that lets you optimize any piece of code. The move from a specific research tool to a general-purpose code optimizer is the natural evolution, and having an open-source version lowers the barrier to entry.
#21
@iuditg
https://x.com/iuditg/status/2039030066496332171
Teasing a major update to the /autoresearch command. No details yet on what is changing, but active development of the tool suggests the community demand is real.
#22
@AI_Boilerplate
https://x.com/AI_Boilerplate/status/2038881727397888186
Released Autoresearch Agent as a free agent skill you can install with one command via promptcreek. This is the kind of packaging that makes autoresearch accessible to people who do not want to set up their own infrastructure from scratch.