Skills Pulled Workflow Generation From 44% to 83% — Same LLM, Just Markdown
A new arXiv paper (2604.21910) from a Krakow group quietly vindicates the entire Skills bet. The setup: convert a scientist's natural-language research question into an executable computational workflow. Hard problem, classic LLM territory, full of long-tail vocabulary the base model has never seen.
Their architecture is three layers. Top layer: an LLM extracts intent from the question. Middle layer: a deterministic generator turns intent into a workflow DAG. Bottom layer: domain experts author Skills as plain Markdown documents that encode the vocabulary, constraints, and optimization tricks the base model does not know. The trick is structural — they confine all LLM non-determinism to the top layer alone. Same intent always produces the same workflow.
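The three-layer split can be sketched in a few lines. This is a hedged illustration, not the paper's code: every function and field name below is my own invention, and the LLM call is stubbed with keyword matching so it runs offline.

```python
def extract_intent(question: str, skills: dict[str, str]) -> dict:
    """Layer 1, the only non-deterministic step: an LLM reads the question
    plus expert-authored Skill Markdown and emits structured intent.
    Stubbed here with keyword matching so the sketch runs offline."""
    op = "alignment" if "align" in question.lower() else "unknown"
    # A real system would include skills[op] in the LLM prompt here.
    return {"operation": op, "fields": ["sequence", "organism"]}


def generate_workflow(intent: dict) -> list[dict]:
    """Layer 2, fully deterministic: the same intent dict always maps to
    the same workflow DAG, so all LLM variance stays confined to layer 1."""
    if intent["operation"] == "alignment":
        return [
            {"task": "fetch", "fields": intent["fields"]},  # only needed fields
            {"task": "align", "after": ["fetch"]},
        ]
    return []


# Layer 3 is just data: a domain expert's Markdown document.
SKILLS = {"alignment": "# Alignment Skill\nFetch only `sequence` and `organism`."}

intent = extract_intent("Align these genomes", SKILLS)
workflow = generate_workflow(intent)
```

The design point the sketch preserves: the DAG generator takes no free-text input, so once intent is fixed, the workflow is reproducible byte for byte.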
The numbers are the editorial point. Without Skills: 44% intent accuracy. With Skills: 83%. Same base model. The Skills also produced a 92% reduction in data transfer because they encoded which fields were actually needed. End-to-end overhead under 15 seconds, query cost under one tenth of a cent.
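The data-transfer mechanism is just field projection. A toy sketch of the idea, with invented record sizes (the 92% figure is the paper's; the numbers below are demo values, not theirs):

```python
# A record as a naive fetch would ship it: one needed field, two bulky ones.
record = {
    "id": 7,
    "sequence": "ACGT" * 2000,     # the only field this step needs
    "raw_reads": "x" * 100_000,    # bulk a naive fetch would transfer anyway
    "provenance": "y" * 20_000,
}

# The Skill document names which fields the step actually needs.
needed = {"sequence"}

# Project before transfer instead of shipping whole records.
projected = {k: v for k, v in record.items() if k in needed}

full_bytes = sum(len(str(v)) for v in record.values())
slim_bytes = sum(len(str(v)) for v in projected.values())
reduction = 1 - slim_bytes / full_bytes
print(f"transfer cut by {reduction:.0%}")
```

Nothing clever happens at runtime; the saving comes entirely from the expert writing down, in the Skill, which fields matter.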
Why this matters beyond computational science. This is the third week in a row that the Skills format — Anthropic's Skills, mattpocock/skills, Composio's awesome-codex-skills, google/agents-cli skills, now this paper — is showing up in completely independent contexts. The bet underneath all of them is the same: you can give a frozen base model deep specialist competence by handing it a Markdown document an expert wrote. No fine-tuning, no RAG, no embeddings. Just words on a page.
If a 39-point accuracy lift on a real scientific task is what Markdown buys you, the question for everyone shipping agents is no longer whether to write Skills. It is who on your team is qualified to write the ones that matter.
Paper: https://arxiv.org/abs/2604.21910