A Paper Just Argued LangGraph and CrewAI Are Already Obsolete
A new paper out of the agent benchmarking circuit dropped April 30 with a title that is going to make a lot of orchestration startups uncomfortable: In-Context Prompting Obsoletes Agent Orchestration. The claim is that for procedural workflows, just stuffing the entire procedure into a system prompt beats LangGraph and CrewAI on quality and on failure rate. With the same model.
The experiments are concrete. Three domains: travel booking with 14 workflow nodes, Zoom technical support with 14 nodes, insurance claims processing with 55 nodes. 200 conversations per condition. The in-context approach scored 4.53–5.00 on a five-point quality scale; LangGraph orchestration scored 4.17–4.84. The kicker is the failure rates: 11.5% versus 24% on travel, 0.5% versus 9% on Zoom, 5% versus 17% on insurance. Half the failure rate or better, on real procedural tasks, with no graph framework involved.
The argument they are actually making is more interesting than the title. They say the orchestration layer was solving a 2023 problem — small models couldn't hold long procedures in context, so you had to chop the work into nodes and route between them. Frontier models in 2026 can hold the whole procedure. The graph is no longer load-bearing. It's load. And the routing logic that used to live in the framework now lives in the model — which means you debug by editing the prompt, not by tracing through a DAG.
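To make the contrast concrete, here is a minimal sketch of what the in-context condition amounts to (my illustration, not the paper's code): the whole procedure sits in one system prompt, and each turn is a single model call. The OpenAI client, the model name, and the procedure text are placeholder assumptions.

```python
# Minimal sketch of the in-context approach, not the paper's code: the entire
# procedure goes into one system prompt and the model does its own routing.
# The OpenAI client, model name, and procedure text are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Illustrative fragment of a procedural prompt; the benchmark procedures spell out
# the equivalent of 14-55 workflow nodes in the same flat, numbered style.
PROCEDURE = """You are a travel-booking agent. Follow this procedure exactly:
1. Collect origin, destination, and travel dates; ask for anything missing.
2. Search flights. If nothing matches, offer alternate dates before giving up.
3. Read back the chosen flight and total price; book only after explicit confirmation.
4. If payment fails, explain the error, offer one retry, then escalate to a human.
"""

def run_turn(history: list[dict], user_message: str) -> str:
    """One conversation turn: no graph, no router node. The model sees the full
    procedure plus the dialogue so far and decides the next step itself."""
    messages = [
        {"role": "system", "content": PROCEDURE},
        *history,
        {"role": "user", "content": user_message},
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```

The debugging story falls out of the structure: if the agent skips the confirmation step, you edit step 3 of the prompt rather than tracing edges through a graph.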
If the result holds up at scale, it points in the same direction as a few other recent moves: Anthropic's Skills, OpenAI's harness simplifications, the general migration toward LLM-native procedural reasoning. The orchestration layer might be a transition technology that's already on the way out. The frameworks built on it are going to need a story for what they sell when the model is the orchestrator.
https://arxiv.org/abs/2604.27891