June 7, 2026 · Research · Infrastructure · Agents

Vortex: agents wrote a faster attention kernel than the humans

Here's the part that should make you sit up: in this paper, AI agents automatically generated sparse-attention algorithms that hit up to 3.46x higher throughput than full attention, with accuracy intact. This wasn't a human researcher hand-tuning kernels; it was agents searching the design space and finding wins.

Vortex is the system that makes that possible. It's a serving framework for sparse attention with a Python frontend, a tensor abstraction, and a backend that slots into real LLM serving stacks. The point is to crush the engineering overhead of trying a new sparse-attention idea, so that both human researchers and agents can prototype dozens of variants fast instead of spending weeks per kernel. When you make the iteration loop cheap enough, an agent can just grind through the space, and that's exactly what happened.
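If "sparse attention" is fuzzy, here is a toy sketch of one common variant, top-k selection, for a single decode step. To be clear, this is not Vortex's API or any algorithm from the paper: the function names, shapes, and the choice of top-k are all assumptions made for illustration, and it uses plain NumPy rather than a real GPU kernel.

```python
import numpy as np


def full_attention(q, K, V):
    """Dense single-query attention: softmax(K q / sqrt(d)) applied to V."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ V


def topk_sparse_attention(q, K, V, k):
    """Attend only to the k cached keys with the highest scores.

    Illustrative only: this toy still scores every key exactly, so it saves
    no compute. Real sparse-attention methods typically pick the kept keys
    with a cheaper estimate and skip the rest.
    """
    scores = K @ q / np.sqrt(q.shape[0])
    k = min(k, scores.shape[0])
    keep = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores
    kept = scores[keep]
    weights = np.exp(kept - kept.max())
    weights /= weights.sum()
    return weights @ V[keep]


# Hypothetical sizes: 1,024 cached tokens, head dimension 64.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((1024, 64))
V = rng.standard_normal((1024, 64))

dense = full_attention(q, K, V)
sparse = topk_sparse_attention(q, K, V, k=64)
print("max abs difference vs. dense:", np.abs(dense - sparse).max())
```

Every design decision in there (which keys to keep, how many, how to score them cheaply) is a knob. That is the space a framework like this is meant to make cheap to explore.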

The numbers land on real models, not toys: 4.7x speedup on GLM-4.7-Flash, 1.37x on the 229-billion-parameter MiniMax-M2.7, running on NVIDIA B200s. That's production-scale long-context serving getting meaningfully cheaper from algorithms a machine discovered.

This is the same thread as the self-improving agent work: RHO fixing its own toolkit, MLEvolve out-evolving AlphaEvolve. The frontier keeps quietly moving from "humans design the system and models run in it" to "models design the system too." Paper: arxiv.org/abs/2606.06453