June 8, 2026 · Infrastructure · Research

Xiaomi Just Ran a 1T Model at 1000 Tokens a Second

Xiaomi's MiMo team, with their TileRT inference stack, just broke 1000 tokens per second on a one-trillion-parameter model. They did it on a single standard 8-GPU commodity node, no custom silicon, with demos peaking near 1200 tok/s. Decrypt put it bluntly: roughly 15 times faster than ChatGPT or Claude. That is not a typo.

They got there through extreme model-system codesign: FP4 quantization to shrink the weights, DFlash speculative decoding to predict ahead, and TileRT to squeeze the kernels. The UltraSpeed mode of MiMo-V2.5-Pro is gated behind a trial window from June 9 to 23 and costs 3x the standard rate for roughly 10x the speed. You pay a premium, but throughput per dollar still comes out about 3.3x ahead.
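To make the pricing trade concrete, here is a minimal sketch of the arithmetic, using only the two rough multiples quoted above. The function name is made up for illustration, and it assumes the 3x and 10x figures apply uniformly to the same workload, which the announcement does not spell out.

```python
# Rough figures quoted in the article.
PRICE_MULTIPLE = 3.0   # UltraSpeed costs 3x the standard rate
SPEED_MULTIPLE = 10.0  # for roughly 10x the speed


def speed_per_dollar_gain(speed_multiple: float, price_multiple: float) -> float:
    """Relative speed bought per unit of price, compared with the standard tier."""
    return speed_multiple / price_multiple


gain = speed_per_dollar_gain(SPEED_MULTIPLE, PRICE_MULTIPLE)
print(f"{gain:.2f}x")  # prints 3.33x
```

One caveat: if the rate is billed per token, each token still costs 3x as much. The win is in how much speed each dollar buys, not in a smaller bill.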

Why this matters for agents specifically: agent loops are dominated by latency, not intelligence. Every tool call, every reasoning step, every retry is a round trip you sit and wait on. A 10-to-15x speedup doesn't just make chat feel snappier, it changes what is economically possible in a long-horizon agent that fires hundreds of model calls per task. The frontier isn't only getting smarter, it's getting fast enough that autonomous agents stop feeling like they think in slow motion.
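A back-of-envelope sketch shows what that looks like for one long task. The call count and tokens per call are hypothetical numbers picked for illustration. The baseline speed is only what the "roughly 15 times faster" comparison implies, not a measured figure. The sketch counts decode time alone and assumes calls run one after another.

```python
def decode_seconds(calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Time spent generating tokens only; ignores prefill, network, and tool execution."""
    return calls * tokens_per_call / tokens_per_second


CALLS = 300               # hypothetical: "hundreds of model calls per task"
TOKENS_PER_CALL = 500     # hypothetical output length per call
FAST_TPS = 1000.0         # headline figure from the article
SLOW_TPS = FAST_TPS / 15  # implied by the "roughly 15 times faster" comparison

slow = decode_seconds(CALLS, TOKENS_PER_CALL, SLOW_TPS)
fast = decode_seconds(CALLS, TOKENS_PER_CALL, FAST_TPS)
print(f"baseline: {slow / 60:.1f} min, fast: {fast / 60:.1f} min")
# prints: baseline: 37.5 min, fast: 2.5 min
```

Under those assumptions, a task that spends more than half an hour waiting on generated tokens drops to a couple of minutes. Real agent runs also wait on tools and prompt processing, so the end-to-end gain would be smaller.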

And notice who set the record. A Chinese lab, on commodity hardware, beating the closed American frontier on raw speed. The interesting fights in AI are increasingly about the system around the model, not just the model. Link: https://mimo.xiaomi.com/blog/mimo-tilert-1000tps