Cost-Aware Orchestration: AI Just Got Its Own Physics
This week only had one story.
On May 28, Anthropic shipped Opus 4.8, with Dynamic Workflows letting Claude Code spawn 100-1000 subagents from a single prompt. The same day, Salesforce published an engineering writeup — a migration originally scoped at 231 days shipped in 13, one PR delivered 21 endpoints at 100% test coverage. The next day, Bun's Jarred Sumner used the same machinery to port the entire Bun runtime from Zig to Rust — 750K lines, two independent reviewers per file, 11 days from first commit to merge, original test suite passing at 99.8%.
The first reaction most people had was "wow, AI got stronger again." The thing I actually saw was "the basic unit of work just changed."
For the past year, every "using AI" story was one person against one prompt. The boldest players were stringing a prompt into a few subtasks. Now you type one sentence, Claude writes a JS orchestration script, drops it into a Node vm sandbox, calls agent() parallel() pipeline() primitives, 128 subagents go concurrent on your laptop. The main conversation window only ever sees the final converged answer.
This is not "smarter AI." It's a new physical unit.
The clearest evidence is the token-consumption curve. @wshuyi ran a data analysis task — first stage alone spawned 108 subagents, Claude Code Max 5x's 5-hour quota drained in 20 minutes. @Jeremybtc ran a "read-only" codebase audit — 2 hours, 139 subagents, 4.7M tokens. @yoshio_nocode typed /deep-research — 101 agents, 3.4M tokens, single query. These aren't Anthropic researchers. These are normal users.
Happening on the same curve, in the opposite direction, is a collapse.
Microsoft canceled most internal Claude Code licenses. Uber burned through its entire 2026 AI budget by April — the COO publicly admitted he "can't draw a line between this spending and actual user-facing improvements." Some unnamed Fortune 20 company got a $500M Claude bill in a month because nobody set per-employee usage caps. Axios put Tokenmaxxing on its mainstream headline.
These two curves are not contradictions. They're two sides of the same thing.
Jensen Huang said something two months ago that keeps getting quoted: "If my $500K-a-year engineer didn't burn at least $250K worth of tokens, I'd be deeply alarmed." That isn't a sales pitch — that's the official endorsement of "industrial-scale token consumption = productivity signal." Today, the data backs him up: @mardehaym runs a 2-person AI-augmented engineering team at $200/dev/month, 6 months in, 330 PRs shipped, 90% AI-generated. Every token traces back to a ticket, every ticket traces back to shipped code. Same mechanism: @tomcrawshaw01 burned $2,123 worth of metered usage on a flat $100/month Max plan last month — a 21x value-to-price gap.
Burning tokens isn't wrong. **Burning them with no discipline is.**
So where's the governance layer? Every product shape that surfaced this week pointed to the same answer: bake cost awareness into the runtime, not as post-hoc alerts.
Modiqo (co-led by Heavybit + Seligman, $3M pre-seed) is shipping the most explicit version. Their product Rote captures successful agent runs and turns them into deterministic, reusable workflows. An agent rediscovers the same API, prompt, script, and edge case the next day, and burns tokens every time. Rote sits underneath the agent loop and preserves working paths as durable assets teams can repeat, inspect, and improve. This isn't "make the agent smarter." It's "teach it when to stop thinking."
Step 3.7 Flash's advisor mode is another path. The small executor (11B activated) runs the agentic loop and only escalates to a frontier-class advisor at planning or failure points. SWE-Bench Verified: 76.3% at $0.19 per task. Claude Opus 4.6: 78.7% at $1.76 per task. Roughly the same coding capability, 9% of the cost. The "frontier model on every loop iteration" era is ending fast.
@dunik_7 posted a recipe the day Opus 4.8 launched: Low effort on the 60% of prompts that are "format this" or "what does this return"; High on daily coding; Max on the hard 10% of architecture; Fast mode (3x cheaper now) on big mechanical refactors. Default $400/month → routed correctly $200/month. That's not technology. It's discipline. But he packaged the discipline as a recipe.
@Royal_Arse's grumpiest counter-take this week is maybe the most useful: "18 months hacking with frontier models, 50+ hours a week, billions of tokens — only 3 times spent over $100 in a single session. The major spenders are lazy morons who /loop forever hoping the machine sorts it out. That's a fireable offense in most cases." He built a cost-guard extension in 3 minutes, distributed it company-wide opt-in.
That's the whole shape of this week. Up top: the new productivity that "spawns 100 subagents from one prompt." Down below: the discipline case of "2 people, 6 months, 330 PRs." The collapsed middle layer — what traditional software didn't design, what traditional SaaS doesn't sell, what traditional finance can't audit — that's the largest product opening of the next 6 months.
I'm calling it cost-aware orchestration.
What does it look like, specifically? A few directions I'm watching.
First, per-call budget gates. Claude Code is doing it partially — the first time a workflow triggers, it shows "what's about to happen" and asks for confirmation. But that's a ceremonial prompt, not a real interception point before spawning 100 agents. The next step is budget-aware orchestration: the runtime knows at script time how many tokens it's been allocated, how many remain, how many branches haven't fired, whether to terminate a hypothesis, whether to degrade to a cheaper model. Not a new concept — Kubernetes has had resource quotas for years — but the AI agent runtime layer hasn't gotten this right.
Second, per-ticket token tracing. @mardehaym's setup should be the default. Every ticket should auto-emit a small report: "this feature used X tokens, Y API calls, spanned Z agents, produced N lines of code." Not just for audit — more importantly, this is the only way an organization learns how to estimate AI work. Right now everything is a black box. Most teams don't even know how many tokens Claude is using on their behalf, let alone match it to business outcomes.
Third, cache discipline as a first-class primitive. Anthropic's prompt caching is treated as a discount today. It should be a protocol. Every agent should silently account for "how reachable is the next caller's prefix" and a cache miss is something it has to explain. @GeorgeWzheng1's telos-sdk is doing exactly this — restructuring the agent loop around vLLM's prefix-caching contract — and claims ~90% end-to-end token cost reduction. That's the kind of plumbing win that fades into the stack in two years and nobody remembers how they lived without it.
Fourth, loop-as-resource accounting. When one loop runs 4 hours, burns 2M tokens, and produces 1 improvement PR, vs another loop runs 30 minutes, burns 50K tokens, and produces 1 improvement PR — those are different things. The first might be the necessary deep search, or it might be agent psychosis (see @TheChowdhary's $500 burn-and-fail story). What separates them isn't a smarter agent. It's an accounting layer. Each loop needs its own ROI tag — auto-computed at runtime, not human-labeled after the fact.
Where does this cost-aware orchestration layer end up? I see two paths.
One path: the big platforms ship it themselves. Anthropic already added effort control and fast mode in Dynamic Workflows. Budget primitives are obviously next. Vercel, Mastra, AGNT — every agent platform is moving in this direction. The advantage: close to the model. The disadvantage: it only serves the platform's own runtime.
Other path: independent middleware that becomes a cross-runtime cost governance layer. Modiqo/Rote is the early big player here. Add token-optimizer-mcp, trimcp, sqz, snip — the token compression repos @seelffff aggregated (real, working libraries that almost nobody is using yet) — and the cost middleware shelf is forming fast.
Which wins? I'm betting on the second. The big platforms will build it, but they can't go deep — their KPIs are token consumption itself.
Step back further. What this week actually marks is the moment AI got its own physics. The token is the basic unit. The loop is the motion. The cache is potential energy. The verifier is gravity. You used to write "software," with functions and files as the unit. Now you write "intelligent labor," with prompts and tokens and agents and workflows as the unit. Both measurement systems are legitimate. They apply to different domains.
@dessaigne's post that got 179K impressions on Twitter — "spend tokens, not headcount" — is the founder-facing version. For the rest of the industry, the more accurate version is: "spend tokens, but every token answers to a shipped artifact." The first half of that sentence is how you grow. The second half is how you survive.
While Claude Code quietly spawns 100 subagents in your terminal this Sunday evening to fix a bug, the first wave of people who know how to put financial discipline on those subagents are already building the next infrastructure layer.
This is going to be louder next week.
← Back to all articles
On May 28, Anthropic shipped Opus 4.8, with Dynamic Workflows letting Claude Code spawn 100-1000 subagents from a single prompt. The same day, Salesforce published an engineering writeup — a migration originally scoped at 231 days shipped in 13, one PR delivered 21 endpoints at 100% test coverage. The next day, Bun's Jarred Sumner used the same machinery to port the entire Bun runtime from Zig to Rust — 750K lines, two independent reviewers per file, 11 days from first commit to merge, original test suite passing at 99.8%.
The first reaction most people had was "wow, AI got stronger again." The thing I actually saw was "the basic unit of work just changed."
For the past year, every "using AI" story was one person against one prompt. The boldest players were stringing a prompt into a few subtasks. Now you type one sentence, Claude writes a JS orchestration script, drops it into a Node vm sandbox, calls agent() parallel() pipeline() primitives, 128 subagents go concurrent on your laptop. The main conversation window only ever sees the final converged answer.
This is not "smarter AI." It's a new physical unit.
The clearest evidence is the token-consumption curve. @wshuyi ran a data analysis task — first stage alone spawned 108 subagents, Claude Code Max 5x's 5-hour quota drained in 20 minutes. @Jeremybtc ran a "read-only" codebase audit — 2 hours, 139 subagents, 4.7M tokens. @yoshio_nocode typed /deep-research — 101 agents, 3.4M tokens, single query. These aren't Anthropic researchers. These are normal users.
Happening on the same curve, in the opposite direction, is a collapse.
Microsoft canceled most internal Claude Code licenses. Uber burned through its entire 2026 AI budget by April — the COO publicly admitted he "can't draw a line between this spending and actual user-facing improvements." Some unnamed Fortune 20 company got a $500M Claude bill in a month because nobody set per-employee usage caps. Axios put Tokenmaxxing on its mainstream headline.
These two curves are not contradictions. They're two sides of the same thing.
Jensen Huang said something two months ago that keeps getting quoted: "If my $500K-a-year engineer didn't burn at least $250K worth of tokens, I'd be deeply alarmed." That isn't a sales pitch — that's the official endorsement of "industrial-scale token consumption = productivity signal." Today, the data backs him up: @mardehaym runs a 2-person AI-augmented engineering team at $200/dev/month, 6 months in, 330 PRs shipped, 90% AI-generated. Every token traces back to a ticket, every ticket traces back to shipped code. Same mechanism: @tomcrawshaw01 burned $2,123 worth of metered usage on a flat $100/month Max plan last month — a 21x value-to-price gap.
Burning tokens isn't wrong. **Burning them with no discipline is.**
So where's the governance layer? Every product shape that surfaced this week pointed to the same answer: bake cost awareness into the runtime, not as post-hoc alerts.
Modiqo (co-led by Heavybit + Seligman, $3M pre-seed) is shipping the most explicit version. Their product Rote captures successful agent runs and turns them into deterministic, reusable workflows. An agent rediscovers the same API, prompt, script, and edge case the next day, and burns tokens every time. Rote sits underneath the agent loop and preserves working paths as durable assets teams can repeat, inspect, and improve. This isn't "make the agent smarter." It's "teach it when to stop thinking."
Step 3.7 Flash's advisor mode is another path. The small executor (11B activated) runs the agentic loop and only escalates to a frontier-class advisor at planning or failure points. SWE-Bench Verified: 76.3% at $0.19 per task. Claude Opus 4.6: 78.7% at $1.76 per task. Roughly the same coding capability, 9% of the cost. The "frontier model on every loop iteration" era is ending fast.
@dunik_7 posted a recipe the day Opus 4.8 launched: Low effort on the 60% of prompts that are "format this" or "what does this return"; High on daily coding; Max on the hard 10% of architecture; Fast mode (3x cheaper now) on big mechanical refactors. Default $400/month → routed correctly $200/month. That's not technology. It's discipline. But he packaged the discipline as a recipe.
@Royal_Arse's grumpiest counter-take this week is maybe the most useful: "18 months hacking with frontier models, 50+ hours a week, billions of tokens — only 3 times spent over $100 in a single session. The major spenders are lazy morons who /loop forever hoping the machine sorts it out. That's a fireable offense in most cases." He built a cost-guard extension in 3 minutes, distributed it company-wide opt-in.
That's the whole shape of this week. Up top: the new productivity that "spawns 100 subagents from one prompt." Down below: the discipline case of "2 people, 6 months, 330 PRs." The collapsed middle layer — what traditional software didn't design, what traditional SaaS doesn't sell, what traditional finance can't audit — that's the largest product opening of the next 6 months.
I'm calling it cost-aware orchestration.
What does it look like, specifically? A few directions I'm watching.
First, per-call budget gates. Claude Code is doing it partially — the first time a workflow triggers, it shows "what's about to happen" and asks for confirmation. But that's a ceremonial prompt, not a real interception point before spawning 100 agents. The next step is budget-aware orchestration: the runtime knows at script time how many tokens it's been allocated, how many remain, how many branches haven't fired, whether to terminate a hypothesis, whether to degrade to a cheaper model. Not a new concept — Kubernetes has had resource quotas for years — but the AI agent runtime layer hasn't gotten this right.
Second, per-ticket token tracing. @mardehaym's setup should be the default. Every ticket should auto-emit a small report: "this feature used X tokens, Y API calls, spanned Z agents, produced N lines of code." Not just for audit — more importantly, this is the only way an organization learns how to estimate AI work. Right now everything is a black box. Most teams don't even know how many tokens Claude is using on their behalf, let alone match it to business outcomes.
Third, cache discipline as a first-class primitive. Anthropic's prompt caching is treated as a discount today. It should be a protocol. Every agent should silently account for "how reachable is the next caller's prefix" and a cache miss is something it has to explain. @GeorgeWzheng1's telos-sdk is doing exactly this — restructuring the agent loop around vLLM's prefix-caching contract — and claims ~90% end-to-end token cost reduction. That's the kind of plumbing win that fades into the stack in two years and nobody remembers how they lived without it.
Fourth, loop-as-resource accounting. When one loop runs 4 hours, burns 2M tokens, and produces 1 improvement PR, vs another loop runs 30 minutes, burns 50K tokens, and produces 1 improvement PR — those are different things. The first might be the necessary deep search, or it might be agent psychosis (see @TheChowdhary's $500 burn-and-fail story). What separates them isn't a smarter agent. It's an accounting layer. Each loop needs its own ROI tag — auto-computed at runtime, not human-labeled after the fact.
Where does this cost-aware orchestration layer end up? I see two paths.
One path: the big platforms ship it themselves. Anthropic already added effort control and fast mode in Dynamic Workflows. Budget primitives are obviously next. Vercel, Mastra, AGNT — every agent platform is moving in this direction. The advantage: close to the model. The disadvantage: it only serves the platform's own runtime.
Other path: independent middleware that becomes a cross-runtime cost governance layer. Modiqo/Rote is the early big player here. Add token-optimizer-mcp, trimcp, sqz, snip — the token compression repos @seelffff aggregated (real, working libraries that almost nobody is using yet) — and the cost middleware shelf is forming fast.
Which wins? I'm betting on the second. The big platforms will build it, but they can't go deep — their KPIs are token consumption itself.
Step back further. What this week actually marks is the moment AI got its own physics. The token is the basic unit. The loop is the motion. The cache is potential energy. The verifier is gravity. You used to write "software," with functions and files as the unit. Now you write "intelligent labor," with prompts and tokens and agents and workflows as the unit. Both measurement systems are legitimate. They apply to different domains.
@dessaigne's post that got 179K impressions on Twitter — "spend tokens, not headcount" — is the founder-facing version. For the rest of the industry, the more accurate version is: "spend tokens, but every token answers to a shipped artifact." The first half of that sentence is how you grow. The second half is how you survive.
While Claude Code quietly spawns 100 subagents in your terminal this Sunday evening to fix a bug, the first wave of people who know how to put financial discipline on those subagents are already building the next infrastructure layer.
This is going to be louder next week.
Comments