Mamba-3: Open-Source State Space Model That Beats Transformers at 7x Speed
Together AI, Carnegie Mellon, Princeton, and Cartesia AI have released Mamba-3, a next-generation state space model (SSM) that outperforms Transformers by nearly 4% on language modeling benchmarks while running up to 7x faster at inference.
Mamba-3 introduces three key architectural advances over Mamba-2: an exponential-trapezoidal discretization scheme for more expressive recurrence, complex-valued state tracking for richer representations, and a MIMO (multi-input, multi-output) architecture that boosts accuracy without increasing decode latency. The model achieves comparable perplexity to Mamba-2 while using only half the state size.
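To build intuition for why the discretization rule matters, here is a minimal, hedged sketch of a diagonal linear SSM recurrence comparing the zero-order-hold rule (used by earlier SSMs) with a classical trapezoidal (bilinear) rule. This is not the Mamba-3 kernel and not its exact exponential-trapezoidal scheme; all function names and parameter values below are illustrative assumptions.

```python
import numpy as np

def zoh_step(a, b, dt):
    # Zero-order hold: assumes the input is constant over the step.
    # h_t = exp(dt*a) * h_{t-1} + ((exp(dt*a) - 1)/a) * b * x_t
    a_bar = np.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def trapezoidal_step(a, b, dt):
    # Trapezoidal (bilinear) rule: averages the dynamics at both ends
    # of the interval, a higher-order approximation than ZOH.
    a_bar = (1 + dt * a / 2) / (1 - dt * a / 2)
    b_bar = dt * b / (1 - dt * a / 2)
    return a_bar, b_bar

def run_ssm(x, a, b, c, dt, step):
    # Scalar (1-dim state) recurrence: h_t = a_bar*h_{t-1} + b_bar*x_t
    h, ys = 0.0, []
    for xt in x:
        a_bar, b_bar = step(a, b, dt)
        h = a_bar * h + b_bar * xt
        ys.append(c * h)
    return np.array(ys)

# Constant input; both rules should converge to the same steady state
# y* = -c * b / a = 1.0 for a=-1, b=c=1.
x = np.ones(100)
y_zoh = run_ssm(x, a=-1.0, b=1.0, c=1.0, dt=0.1, step=zoh_step)
y_trap = run_ssm(x, a=-1.0, b=1.0, c=1.0, dt=0.1, step=trapezoidal_step)
```

In this toy setting both rules reach the same steady state, but a higher-order rule tracks the underlying continuous dynamics more faithfully per step, which is the rough intuition behind using a more expressive discretization in the recurrence.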
At the 1.5B parameter scale, Mamba-3 SISO achieves the fastest prefill + decode latency across all sequence lengths, beating Mamba-2, Gated DeltaNet, and even Llama-3.2-1B (Transformer) when served with vLLM. The paper was accepted at ICLR 2026.
For the agentic ecosystem, Mamba-3 matters because inference efficiency directly translates to cheaper and faster agent operations. Agents that need to process long contexts — tool calls, multi-turn conversations, code analysis — benefit enormously from models that maintain quality while sharply cutting inference latency.
The full paper, code, and optimized Triton/TileLang/CuTe kernels are open-sourced at https://github.com/state-spaces/mamba. Blog post: https://www.together.ai/blog/mamba-3