Microsoft Research Quietly Open-Sources Orchard
Orchard, Microsoft Research's open-source agentic modeling framework, dropped on arXiv on May 14 (2605.15040). The headline is not the paper itself but what they trained with it: three different agent recipes built on the same Kubernetes-native environment service, each hitting open-source SOTA on its respective benchmark.
Orchard-SWE on Qwen3-30B-A3B-Thinking lands 67.5% on SWE-bench Verified through SFT plus RL, using a Balanced Adaptive Rollout trick for the sparse-reward problem. Orchard-GUI on Qwen3-VL-4B-Thinking averages 68.4% across WebVoyager, Online-Mind2Web, and DeepShop with only 2.6K total training tasks, competitive with OpenAI and Gemini browsers. Orchard-Claw, their personal-assistant variant, hits 73.9% pass@3 on a ZeroClaw harness using 200 synthetic tasks.
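For readers comparing the pass@3 figure with their own runs, here is a minimal sketch of the widely used unbiased pass@k estimator. Whether Orchard-Claw's number was computed this way is an assumption, and the function names are illustrative rather than taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts sampled, c of them passed.

    Equals the probability that a random size-k subset of the n attempts
    contains at least one passing attempt.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds (n, c) pairs, one per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two hypothetical tasks, three attempts each: one solved once, one never.
    print(mean_pass_at_k([(3, 1), (3, 0)], k=3))
```

With exactly three attempts per task, pass@3 reduces to "did any attempt succeed," so the estimator only matters when more than k attempts are sampled.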
Orchard Env is the part to pay attention to. It is a thin K8s service that exposes sandbox lifecycle, command execution, file I/O, and network policy through REST. The reported numbers: 0.28s average command latency, 100% success at 1,000 concurrent sandboxes, and a 10x cost reduction versus Daytona or E2B with spot instances. This is the cheap, reusable layer that everyone training agent models has been hand-rolling for the past year.
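To make the shape of such a service concrete, here is a rough client sketch covering create, exec, file write, and delete. Since the code is not released, every route, field name, and class below is a hypothetical placeholder, not Orchard Env's actual API. The fake transport lets the sketch run offline.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional

# A transport takes (method, path, json_body) and returns a decoded JSON dict.
Transport = Callable[[str, str, Optional[dict]], dict]


def http_transport(base_url: str) -> Transport:
    """Real HTTP transport built on the standard library."""
    def send(method: str, path: str, body: Optional[dict]) -> dict:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            base_url + path, data=data, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            raw = resp.read()
        return json.loads(raw) if raw else {}
    return send


@dataclass
class SandboxClient:
    """Hypothetical client; all routes and fields are assumptions."""
    transport: Transport

    def create(self, image: str, allow_network: bool = False) -> str:
        body = {"image": image, "allow_network": allow_network}
        return self.transport("POST", "/sandboxes", body)["id"]

    def run(self, sandbox_id: str, command: str) -> dict:
        return self.transport(
            "POST", f"/sandboxes/{sandbox_id}/exec", {"command": command})

    def write_file(self, sandbox_id: str, path: str, content: str) -> dict:
        return self.transport(
            "PUT", f"/sandboxes/{sandbox_id}/files",
            {"path": path, "content": content})

    def delete(self, sandbox_id: str) -> dict:
        return self.transport("DELETE", f"/sandboxes/{sandbox_id}", None)


class FakeTransport:
    """In-memory stand-in that records calls and returns canned replies."""
    def __init__(self) -> None:
        self.calls: list[tuple[str, str, Optional[dict]]] = []

    def __call__(self, method: str, path: str, body: Optional[dict]) -> dict:
        self.calls.append((method, path, body))
        if method == "POST" and path == "/sandboxes":
            return {"id": "sb-1"}
        if path.endswith("/exec"):
            return {"exit_code": 0, "stdout": ""}
        return {}


if __name__ == "__main__":
    client = SandboxClient(FakeTransport())
    sid = client.create("python:3.12")
    print(client.run(sid, "pytest -q"))
    client.delete(sid)
```

Keeping the transport pluggable means a rollout worker can swap `FakeTransport` for `http_transport("http://...")` without touching the agent loop, which is the main appeal of putting the environment behind a plain REST boundary.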
The repo lives at github.com/microsoft/Orchard, though for now it says "release on hold, we will release code soon." The paper has enough detail in the meantime, and Microsoft has signaled its intent to ship. If you are training a coding or GUI agent, this is the new reference stack to clone once the code lands.