May 16, 2026 · Infrastructure · Open Source · Research

Orthrus Makes Qwen3 Run 5x Faster Without Touching the Output

Orthrus surfaced on Hacker News today at 91 points: a small academic repo from chiennv2000 doing something most inference frameworks claim but few deliver, lossless decoding speedup. Three Qwen3 variants have been released. Orthrus-Qwen3-8B averages a 5.36x speedup, the 4B variant 5.20x, and the 1.7B variant 4.25x. The output distribution stays identical to the base model's, not approximately, not in expectation, but exactly.

The trick is a dual-view architecture. An autoregressive decoder and a diffusion decoder share the same KV cache, generate tokens in parallel, and reach intra-model consensus before any token leaves. The second decoder adds only O(1) memory overhead and fine-tunes just 16% of the base parameters. Code at github.com/chiennv2000/Orthrus, MIT-licensed, 104 stars and climbing.
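To make the consensus idea concrete, here is a toy sketch of a dual-view decode loop. This is an illustration under assumptions, not the Orthrus implementation or API: `ar_next` and `diff_block` are hypothetical stand-ins for the two views, and the real system batches the autoregressive check against the shared KV cache rather than calling it token by token.

```python
from typing import Callable, List

def consensus_decode(
    ar_next: Callable[[List[int]], int],                # autoregressive view: next token given context
    diff_block: Callable[[List[int], int], List[int]],  # diffusion view: drafts a block in parallel
    prompt: List[int],
    max_new: int,
    block: int = 4,
) -> List[int]:
    """Toy dual-view consensus loop (illustrative, NOT the Orthrus code).

    The diffusion view drafts a block of tokens in one shot; the
    autoregressive view checks them. Agreed tokens are accepted in bulk;
    on the first disagreement the AR token is kept and drafting restarts,
    so the output always equals pure AR decoding (losslessness).
    """
    out = list(prompt)
    produced = 0
    while produced < max_new:
        draft = diff_block(out, block)
        for tok in draft:
            expected = ar_next(out)  # what the base model would emit here
            out.append(expected)     # always emit the AR-consistent token
            produced += 1
            if produced >= max_new or tok != expected:
                break                # consensus broke: redraft from here
    return out[len(prompt):]
```

The speedup comes from the happy path: when the diffusion draft agrees with the autoregressive view, a whole block is committed per verification pass instead of one token per forward step, and the fallback on disagreement is exactly the base model's token.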

Why this matters for agents: token economics is the binding constraint of 2026. Voice agents, long-horizon coding agents, computer-use agents, deep research: all bleed tokens fast. Quantization gives you speed at a quality cost. Speculative decoding gives you speed with verification overhead and approximate output. A 5x lossless speedup on Qwen3-8B means the same agent harness can either run five times the wallclock workload on the same hardware or cut latency to 20% on the same workload, with no quality regression to negotiate.
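The throughput-versus-latency arithmetic above reduces to one division. A quick sketch, where the 40 tokens/s baseline is an assumption for illustration and the 5.36x figure is the reported 8B average:

```python
def agent_economics(baseline_tok_per_s: float, speedup: float):
    """Return (tokens/s after speedup, latency as a fraction of baseline)."""
    return baseline_tok_per_s * speedup, 1.0 / speedup

tps, latency = agent_economics(40.0, 5.36)  # assumed 40 tok/s baseline
# ~214 tok/s, or the same workload at ~19% of baseline wallclock
```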

If you are running Qwen3 in any production agent loop, this is worth trying this week. Repo at github.com/chiennv2000/Orthrus.
