SLIM Treats Agent Skills as a Live Inventory: Retain, Retire, Expand During RL Training.
SLIM dropped on arXiv today, from the CUHK Database Group with a U Florida co-author. The framing is that skill-based agentic RL has been stuck between two paradigms: accumulate skills forever, or eliminate them entirely. SLIM proposes a third: manage skills as a live inventory, with retain, retire, and expand as explicit operations driven by validation-time marginal contribution.
The mechanism is leave-one-skill-out validation. For every skill in the active set, the framework periodically measures how much that skill contributes to current task performance, smoothing the estimate with an exponential moving average (EMA) to filter noise. Skills with a stable positive contribution stay in. Skills whose marginal contribution drops below a threshold after enough exposure get retired. When the agent fails persistently on a slice of tasks, the set is expanded with a new skill to cover the gap.
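To make the retain/retire/expand loop concrete, here is a minimal Python sketch of what one audit could look like. It is not the SLIM code: the function names (`audit_step`, `maybe_expand`, `toy_evaluate`), the defaults (`alpha`, `threshold`, `min_audits`, `patience`), and the toy scoring function are all assumptions made for illustration. The post only says that contribution is measured leave-one-out, smoothed with an EMA, and compared to a threshold after enough exposure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set


@dataclass
class SkillStats:
    ema: float = 0.0   # EMA-smoothed marginal contribution
    audits: int = 0    # how many times this skill has been audited ("exposure")


def audit_step(
    active: Set[str],
    stats: Dict[str, SkillStats],
    evaluate: Callable[[FrozenSet[str]], float],
    alpha: float = 0.3,        # EMA weight on the newest measurement (assumed)
    threshold: float = 0.05,   # retire when the smoothed contribution is below this (assumed)
    min_audits: int = 3,       # minimum exposure before retirement is allowed (assumed)
) -> Set[str]:
    """Run one leave-one-skill-out audit and return the skills that are retained."""
    full_score = evaluate(frozenset(active))
    retained = set()
    for skill in sorted(active):
        # Marginal contribution: validation score with the skill minus score without it.
        contribution = full_score - evaluate(frozenset(active - {skill}))
        s = stats.setdefault(skill, SkillStats())
        s.ema = contribution if s.audits == 0 else alpha * contribution + (1 - alpha) * s.ema
        s.audits += 1
        if s.audits >= min_audits and s.ema < threshold:
            continue  # retire: enough exposure, contribution still below threshold
        retained.add(skill)
    return retained


def maybe_expand(
    active: Set[str],
    failure_streaks: Dict[str, int],
    propose_skill: Callable[[str], str],
    patience: int = 3,         # consecutive failing audits before expanding (assumed)
) -> Set[str]:
    """Add one new skill for every task slice that has failed persistently."""
    expanded = set(active)
    for task_slice, streak in failure_streaks.items():
        if streak >= patience:
            expanded.add(propose_skill(task_slice))
    return expanded


def toy_evaluate(skills: FrozenSet[str]) -> float:
    """Hypothetical validation score: only the 'heat' skill helps."""
    return 0.5 + (0.2 if "heat" in skills else 0.0)
```

With `toy_evaluate`, a skill that adds nothing survives the first two audits because it has not had enough exposure, then gets retired on the third, while the useful skill stays in. In a real setup `evaluate` would run validation episodes, so each audit costs one extra evaluation per active skill.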
The numbers: on ALFWorld, SLIM hits 87.5% versus 75.0% for the strongest skill-based RL baseline, a 12.5-point gap. On SearchQA, it reaches 41.0% versus 39.3% for Skill0, a 1.7-point gap. The average gain across both benchmarks is +7.1 points. The ablations are what nail the framing: turning off retirement drops performance to 73.4%, turning off expansion to 78.9%, and replacing contribution-aware decisions with random audits to 68.8%. The lifecycle has to be contribution-aware to work.
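As a quick sanity check on the headline figures, the per-benchmark gaps and the +7.1 average can be reproduced from the four scores quoted above. The snippet assumes the average is a plain unweighted mean over the two benchmarks, which is what the quoted numbers are consistent with.

```python
# Scores quoted in the text, in percent: (SLIM, baseline it is compared against).
results = {
    "ALFWorld": (87.5, 75.0),
    "SearchQA": (41.0, 39.3),
}

# Per-benchmark gain in percentage points, rounded to one decimal place.
gains = {name: round(slim - base, 1) for name, (slim, base) in results.items()}

# Unweighted mean of the two gains (an assumption about how the average was taken).
average_gain = round(sum(gains.values()) / len(gains), 1)
```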
The structurally important finding is that SLIM converges to non-empty, non-monotonic skill sets: the active set neither empties out nor only grows. The optimal active set is task-dependent and stage-dependent. On ALFWorld the with-skills versus without-skills gap stays large (87.5 vs 72.7) because procedural tasks externalize cleanly into skills. On SearchQA the gap nearly vanishes (41.0 vs 38.6) because the policy internalizes the benefit during training. Skills and policy are complementary, not redundant, and the boundary between them should be learned, not fixed.
Where this fits in the agent research cluster: SkillRL, Skill0, Ctx2Skill, SkillSynth, andrej-karpathy-skills, Anthropic Skills, addyosmani agent-skills. The skills category so far has been about how to acquire and apply skills. SLIM is the first explicit framework for managing them at runtime, the operations research piece of the skill stack. Code: github.com/ejhshen/SLIM. Paper: arxiv.org/abs/2605.10923.