SkillLearnBench Says Continual Learning Still Hasn't Solved It
SkillLearnBench is the first proper benchmark for the question every agent team has been hand-waving about for two years: can your agent actually learn skills from its own experience and reuse them? The answer in this paper is no, not really, not yet, and that's the useful part.
The setup is 20 verified skill-dependent tasks across 15 sub-domains, drawn from a real-world skill taxonomy rather than synthetic toy land. Evaluation runs at three layers — skill quality (is the skill the agent generated actually any good), execution trajectory (did it follow the plan), and final task outcome. The authors throw recent continual-learning techniques at it: one-shot learning, self-feedback, teacher feedback, dedicated skill creators.
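To make the three layers concrete, here's a minimal sketch of how a per-task result might be recorded and summarized. This is my own illustration, not the paper's code or schema; every name is made up.

```python
from dataclasses import dataclass

@dataclass
class LayeredResult:
    """Hypothetical per-task record covering the three evaluation layers."""
    skill_quality: float         # layer 1: is the generated skill itself sound?
    trajectory_adherence: float  # layer 2: did execution follow the intended plan?
    task_success: bool           # layer 3: did the final outcome pass verification?

def aggregate(results: list[LayeredResult]) -> dict:
    """Summarize a run; a method can look strong at one layer and weak at another."""
    if not results:
        return {}
    n = len(results)
    return {
        "mean_skill_quality": sum(r.skill_quality for r in results) / n,
        "mean_trajectory": sum(r.trajectory_adherence for r in results) / n,
        "success_rate": sum(r.task_success for r in results) / n,
    }
```

The point of splitting the layers is exactly what the benchmark exploits: a method can produce plausible-looking skills (layer 1) that still derail execution (layer 2) or fail the task outright (layer 3).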
Finding: every method beats the no-skill baseline, which confirms continual learning isn't worthless. But no method wins across all tasks and LLMs. Worse, scaling to bigger LLMs doesn't reliably help. That's the kind of result that sounds boring at first and then sticks. It means "just wait for GPT-6" isn't a continual-learning strategy. The bottleneck is the learning loop itself, not the model behind it.
The practical implication for anyone building an agent that's supposed to get better over time: stop assuming a smarter base model will rescue your skill library. The mechanism — when to extract a skill, how to test it, when to delete a bad one, how to retrieve at the right moment — is its own research problem. Karpathy's been saying this; SkillLearnBench is the first benchmark that lets you measure who's actually solving it. https://arxiv.org/abs/2604.20087
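If the learning loop is the bottleneck, it helps to see how many design decisions hide inside it. Below is a minimal, purely illustrative sketch of a skill library that makes each of those decisions explicit; it is not SkillLearnBench's method or anyone's published API, and all names and thresholds are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    steps: list[str]
    successes: int = 0
    failures: int = 0

@dataclass
class SkillLibrary:
    skills: dict[str, Skill] = field(default_factory=dict)

    def extract(self, task: str, trajectory: list[str]) -> None:
        # When to extract: here, naively after every completed task.
        self.skills[task] = Skill(name=task, steps=trajectory)

    def retrieve(self, task: str) -> Skill | None:
        # How to retrieve at the right moment: here, exact task-name match;
        # real systems need semantic matching and a decision about when NOT to reuse.
        return self.skills.get(task)

    def record_outcome(self, task: str, success: bool) -> None:
        # How to test a skill: track outcomes from each reuse attempt.
        skill = self.skills.get(task)
        if skill is None:
            return
        if success:
            skill.successes += 1
        else:
            skill.failures += 1

    def prune(self, min_trials: int = 3, min_rate: float = 0.5) -> None:
        # When to delete a bad one: drop skills that keep failing once tried enough.
        self.skills = {
            name: s for name, s in self.skills.items()
            if (s.successes + s.failures) < min_trials
            or s.successes / (s.successes + s.failures) >= min_rate
        }
```

Every "here, naively" comment above is a point where the benchmark's methods differ from each other, and none of the naive choices survives contact with 20 tasks across 15 sub-domains.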