Tencent Open-Sources HY-Embodied — A 2B Brain for Robot Agents
Tencent just open-sourced HY-Embodied-0.5, and it might be the most practical embodied AI model released this year. It comes in two variants: a compact 2B model that runs on edge devices, and a 32B model for heavy reasoning. The 2B version already outperforms similarly sized models across 16 benchmarks.
The architecture is clever. They use Mixture-of-Transformers (MoT) with latent tokens for modality-specific computation. In plain English: the model efficiently processes different types of input — text instructions, visual scenes, spatial data — without wasting compute on irrelevant modalities. This is specifically designed for real-world robots that need to see, understand, plan, and act, all in real time on limited hardware.
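To make the routing idea concrete, here is a toy sketch of Mixture-of-Transformers-style modality routing. This is not HY-Embodied's actual implementation — the weights, layer shapes, and the three-modality split are illustrative assumptions. The key pattern: each token passes only through its own modality's feed-forward weights, while a shared mixing step stands in for the shared attention that lets modalities exchange information.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (illustrative, real models use thousands)

# Hypothetical modality-specific expert weights: 0=text, 1=vision, 2=spatial.
experts = {m: rng.standard_normal((D, D)) * 0.1 for m in range(3)}
shared = rng.standard_normal((D, D)) * 0.1  # stand-in for shared attention mixing

def mot_layer(tokens, modality_ids):
    """Route each token through its own modality's expert, then mix globally."""
    out = np.empty_like(tokens)
    for m, W in experts.items():
        mask = modality_ids == m
        if mask.any():
            # Only this modality's tokens touch this expert's parameters,
            # so compute is never spent running text weights on pixels.
            out[mask] = np.maximum(tokens[mask] @ W, 0.0)
    return out @ shared  # shared mixing across all tokens, all modalities

tokens = rng.standard_normal((5, D))
modality_ids = np.array([0, 0, 1, 2, 1])  # text, text, vision, spatial, vision
y = mot_layer(tokens, modality_ids)
print(y.shape)  # (5, 8)
```

The design choice to note: the per-modality experts keep parameters specialized, while the shared step is what makes it one model rather than three — spatial tokens can still inform how text tokens are interpreted.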
What makes HY-Embodied different from generic vision-language models is the focus on spatial-temporal perception and embodied reasoning. It doesn't just describe what it sees — it predicts what will happen, plans interactions, and reasons about physical constraints. The 32B variant hits frontier-level performance comparable to Gemini 3.0 Pro on embodied tasks.
The training approach is interesting too. They use a self-evolving post-training paradigm where the larger model teaches the smaller one through on-policy distillation. The result is a 2B model that punches way above its weight — small enough for a robot's onboard compute, smart enough to actually be useful.
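A minimal sketch of what on-policy distillation means in practice, using a single decoding step over a tiny vocabulary. Everything here — the vocabulary size, logits, loss, and learning rate — is a toy assumption, not HY-Embodied's training code. The defining feature is that the student samples from its own distribution and the teacher's probabilities on those samples supply the learning signal (a REINFORCE-style estimate of the reverse KL gradient), rather than the student imitating a fixed teacher-generated dataset.

```python
import math, random

random.seed(0)
VOCAB = 4

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits for one decoding step.
student_logits = [0.0, 0.0, 0.0, 0.0]    # untrained student: uniform
teacher_logits = [2.0, 0.0, -1.0, -1.0]  # teacher strongly prefers token 0

def on_policy_distill_step(student_logits, lr=0.5, n_samples=256):
    """One gradient step on reverse KL(student || teacher), estimated by sampling."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    grads = [0.0] * VOCAB
    for _ in range(n_samples):
        # On-policy: the sample comes from the STUDENT's own distribution.
        tok = random.choices(range(VOCAB), weights=p_s)[0]
        # Score-function gradient: (log p_s - log p_t) * d log p_s / d logits.
        advantage = math.log(p_s[tok]) - math.log(p_t[tok])
        for v in range(VOCAB):
            indicator = 1.0 if v == tok else 0.0
            grads[v] += advantage * (indicator - p_s[v]) / n_samples
    return [l - lr * g for l, g in zip(student_logits, grads)]

for _ in range(200):
    student_logits = on_policy_distill_step(student_logits)

print(softmax(student_logits))  # student mass shifts toward token 0
```

Because the student is corrected on the mistakes it actually makes at inference time, rather than on a static transcript of teacher behavior, the small model's capacity goes toward the decisions that matter — which is one plausible reason a 2B student can punch above its weight.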
343 upvotes on Hugging Face Papers. Weights, code, and training pipeline all open-sourced. If you're building anything in robotics or physical AI, this just became your baseline.
https://github.com/Tencent-Hunyuan/HY-Embodied