Orca wants one brain for video, language and action
BAAI dropped Orca, and it's the number one paper on Hugging Face today by a wide margin: 176 upvotes, with the next paper at 21. The pitch is in the subtitle: the world is in your mind. Instead of training three separate reflexes (next token for text, next frame for video, next action for a robot), Orca learns one thing: next state, a single latent picture of what the world is about to do.
How it gets there is the clever bit. There are two modes. Unconscious learning eats raw, unlabeled video and soaks up the dense, boring transitions of how things move. Conscious learning uses language-described events and question answering to grab the sparse, meaningful moments. Freeze the resulting encoder and you can decode the same world-latent into text, into images, or into robot motor commands.
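The "one frozen encoder, several decoders" idea can be sketched in a few lines of plain Python. This is a toy illustration, not Orca's architecture: the class names (`FrozenEncoder`, `Head`), the dimensions, the random linear maps, and the squared-error next-state loss are all assumptions made for the example.

```python
import random

LATENT_DIM = 4  # hypothetical size of the shared world-latent


def make_matrix(rows, cols, seed):
    """Build a fixed random matrix; a stand-in for learned weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class FrozenEncoder:
    """Stand-in for the pretrained encoder; its weights are never updated."""

    def __init__(self, obs_dim, seed=0):
        self._weights = make_matrix(LATENT_DIM, obs_dim, seed)

    def encode(self, observation):
        return matvec(self._weights, observation)


class Head:
    """A small decoder that reads the shared latent and emits one modality."""

    def __init__(self, out_dim, seed):
        self.weights = make_matrix(out_dim, LATENT_DIM, seed)

    def decode(self, latent):
        return matvec(self.weights, latent)


def next_state_loss(predicted_latent, target_latent):
    """Mean squared error between a predicted and an actual next latent."""
    diffs = [(p - t) ** 2 for p, t in zip(predicted_latent, target_latent)]
    return sum(diffs) / len(diffs)


encoder = FrozenEncoder(obs_dim=6)
heads = {
    "text": Head(out_dim=10, seed=1),   # e.g. scores over a tiny vocabulary
    "image": Head(out_dim=6, seed=2),   # e.g. a reconstructed frame
    "action": Head(out_dim=3, seed=3),  # e.g. a 3-DoF motor command
}

frame_t = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # observation now
frame_t1 = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # observation one step later

latent_t = encoder.encode(frame_t)
latent_t1 = encoder.encode(frame_t1)

# Naive baseline predictor: "nothing changes". The loss measures how wrong
# that guess is; a trained predictor would aim to drive this down.
loss = next_state_loss(latent_t, latent_t1)

# Every head decodes the same latent; only the heads differ per modality.
outputs = {name: head.decode(latent_t) for name, head in heads.items()}
```

The point of the design is that the encoder is shared and fixed, so adding a new output modality in this sketch means adding one more `Head` rather than retraining the representation.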
The results are the reason to care. Orca beats comparable vision-language models on the temporal benchmarks MVBench and TemporalBench, and it pulls off real-robot manipulation after pretraining on video alone. No action labels in pretraining, and it still learns to act. That is the emergent behavior everyone chasing embodied agents wants.
This lands right in the middle of the world-model-for-agents wave: Odyssey just raised 310 million for exactly this, General Intuition is training on gameplay, and Qwen shipped AgentWorld. The thesis underneath all of it: if you want an agent that acts in the physical world, a model that predicts the next word is not enough. You need one that predicts what happens next. Paper at arxiv.org/abs/2606.30534