May 16, 2026 · Research · Open Source · Infrastructure

NVIDIA Open-Sources SANA-WM, a Minute of 720p Video on One GPU

NVIDIA Labs dropped SANA-WM yesterday. 2.6 billion parameters. Open source. A world model that takes one image plus a camera trajectory and generates a 60-second 720p video that obeys the camera path. The distilled variant runs the whole minute in 34 seconds on a single RTX 5090 with NVFP4 quantization. It jumped to 275 points on Hacker News.
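The back-of-envelope throughput, with one assumption flagged: the article gives duration (60 s) and wall clock (34 s) but no frame rate, so the 16 fps below is a placeholder, not a spec.

```python
# Distilled-variant generation speed on one RTX 5090, per the article.
# Assumption: 16 fps output frame rate (not stated in the source).
video_seconds = 60
wall_clock_seconds = 34
assumed_fps = 16

realtime_factor = video_seconds / wall_clock_seconds  # ~1.76x real time
frames = video_seconds * assumed_fps                  # 960 frames at 16 fps
ms_per_frame = wall_clock_seconds / frames * 1000     # ~35 ms per frame
print(f"{realtime_factor:.2f}x real time, {ms_per_frame:.1f} ms/frame")
```

Faster than real time is the threshold that matters for the agent use case below: the simulator can keep up with the thing being simulated.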

The architecture is the interesting part. Hybrid linear attention combining frame-wise Gated DeltaNet with softmax attention, so the memory cost does not blow up as the video gets long. A dual-branch camera control head enforces 6-DoF trajectory adherence, so the camera actually goes where you told it. Two-stage pipeline with a long-video refiner on top to keep quality consistent across the minute. Trained on roughly 213,000 public video clips with extracted pose labels. 15 days on 64 H100s. The headline claim against the closed industrial baselines LingBot-World and HY-WorldPlay: comparable visual quality, 36x higher throughput.
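A toy sketch of why the hybrid keeps memory flat: softmax attention handles tokens within a frame, while a gated delta-rule state carries information across frames in a fixed-size matrix. Everything here, the shapes, the scalar gates `alpha` and `beta`, the frame-summary key, is an illustration of the general Gated DeltaNet idea, not the paper's implementation.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Standard softmax attention for one query over a frame's tokens."""
    scores = K @ q / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

def gated_delta_step(S, k, v, q, alpha, beta):
    """Gated delta rule: decay the state, erase the old value bound to
    key k, write the new one. S stays (d_v, d_k) no matter how many
    frames have passed -- that is the constant-memory property."""
    S = alpha * S + beta * np.outer(v - S @ k, k)
    return S, S @ q

d_k, d_v, tokens = 8, 8, 16
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k))                # recurrent cross-frame state
for frame in range(100):                # 100 frames; state never grows
    K = rng.normal(size=(tokens, d_k))
    V = rng.normal(size=(tokens, d_v))
    q = rng.normal(size=d_k)
    local = softmax_attention(q, K, V)  # within-frame, quadratic in tokens
    k = K.mean(axis=0)                  # toy summary key for the frame
    k = k / np.linalg.norm(k)
    S, recurrent = gated_delta_step(S, k, V.mean(axis=0), q,
                                    alpha=0.95, beta=0.5)
    out = local + recurrent             # hybrid readout for this frame
```

Softmax attention over a whole minute of frames would mean a KV cache that grows linearly and compute that grows quadratically; here the only thing that persists across frames is the fixed `(d_v, d_k)` matrix `S`.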

Why this belongs in an agent newsletter. World models are how embodied and computer-use agents do counterfactual planning. If you can simulate one minute of consequence from a starting frame and a control signal, you can let an agent dream the next minute before committing to it. Open weights at the 2.6B scale plus single-GPU inference means academic labs and small teams can finally train world-model-conditioned agents without renting an H100 cluster.
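What that dreaming loop looks like in its simplest form is random-shooting model-predictive control. The `world_model` below is a toy linear stand-in for the actual video generator, and the goal-distance scoring is made up; only the loop structure, sample trajectories, roll them forward, keep the best, is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(frame, controls):
    """Stand-in for a learned simulator such as SANA-WM: start frame +
    control trajectory -> predicted future frames. A real world model
    renders video; this toy uses linear dynamics so it runs anywhere."""
    frames = [frame]
    for u in controls:
        frames.append(0.9 * frames[-1] + 0.1 * u)
    return np.stack(frames[1:])

def plan(frame, goal, n_candidates=64, horizon=8):
    """Random-shooting MPC: dream n_candidates rollouts, score each by
    how close its final frame lands to the goal, keep the best plan."""
    best_score, best_controls = -np.inf, None
    for _ in range(n_candidates):
        controls = rng.normal(size=(horizon,) + frame.shape)
        final = world_model(frame, controls)[-1]
        score = -np.linalg.norm(final - goal)
        if score > best_score:
            best_score, best_controls = score, controls
    return best_controls

start, goal = np.zeros(4), np.ones(4)
controls = plan(start, goal)
final = world_model(start, controls)[-1]
```

In practice the agent executes only `controls[0]`, observes the real environment, and re-plans (receding horizon). Every planning step costs `n_candidates` full rollouts, which is exactly why single-GPU, faster-than-real-time inference is the enabling detail.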

License is CC BY-NC-SA 4.0, so commercial deployment needs a separate license from NVIDIA, but research and personal use are clean. Project page at nvlabs.github.io/Sana/WM/. Paper is arXiv 2605.15178.

https://nvlabs.github.io/Sana/WM/
