July 5, 2026 | Infrastructure | Open Source

GLM-5.2 on AMD, at Half of Nvidia's Cost

Wafer put GLM-5.2 on AMD's MI355X and published the numbers, and the line that's going around is their own: the CUDA moat is eroding in real time. They got 2,626 tokens per second per node and 213 tokens per second on a single stream, roughly 80 percent of an Nvidia B200's throughput, at over 2x lower cost. The MI355X runs about 2.75x cheaper per GPU than a B300. So you give up a fifth of the speed and pay less than half per token. For anyone serving a model at scale, that math is not close.
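The "over 2x" figure falls out of the two ratios quoted above. Here is a minimal sketch of that arithmetic, assuming the 80 percent throughput ratio and the 2.75x price ratio can be applied to the same comparison (the post quotes throughput against a B200 and price against a B300, so treat this as a rough check, not Wafer's own calculation). The function and variable names are made up for illustration.

```python
# Figures quoted in the post. Combining them into one comparison is an assumption.
relative_throughput = 0.80  # MI355X throughput as a fraction of the Nvidia baseline
price_ratio = 2.75          # Nvidia price per GPU divided by MI355X price per GPU


def relative_cost_per_token(relative_throughput: float, price_ratio: float) -> float:
    """Cost per token on the cheaper GPU, as a fraction of the baseline's cost per token."""
    relative_price = 1.0 / price_ratio           # cheaper GPU's price vs. baseline
    return relative_price / relative_throughput  # fewer tokens per GPU raises cost per token


cost_fraction = relative_cost_per_token(relative_throughput, price_ratio)
savings_factor = 1.0 / cost_fraction

print(f"cost per token: {cost_fraction:.2f}x the baseline")
print(f"savings factor: {savings_factor:.2f}x cheaper per token")
```

Under those assumptions the cost per token comes out to about 0.45x the baseline, or roughly 2.2x cheaper, which lines up with the "over 2x lower cost" claim.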

The interesting claim isn't the speed, it's the diagnosis. Wafer says AMD's gap was never the silicon, it was software support. They quantized GLM-5.2 to MXFP4 with AMD's Quark, ran it on sglang, and fixed the bugs blocking speculative decoding and MoE kernel selection. Once the plumbing worked, the hardware was right there. SOTA on AMD, in their words, is becoming a matter of support, not silicon.

This is the story that keeps Nvidia's CFO up at night, told with a specific frontier-class open model instead of a slide. The entire premium on Nvidia hardware rests on the assumption that CUDA is a decade-deep software moat nobody can cross. Every time someone crosses it in public with real throughput numbers on a real model, that assumption gets cheaper, and so does inference.

And it's no coincidence that an open Chinese model is the one doing it. You can download GLM-5.2 and quantize it however you want. You can't run this experiment on a closed API. Open weights plus cheaper silicon is the combination that actually pressures the whole stack.

Link: wafer.ai/blog/glm52-amd