May 13, 2026 · Research · Agents · RL

Continual Harness lets foundation agents improve themselves while still playing

Continual Harness landed on arXiv with 89 HF upvotes. Princeton team — Seth Karten, Chi Jin, Kiran Vodrahalli and others. The angle is the one most current "self-improving" papers dodge — improvement during continuous operation, no episode boundaries, no resets, no human evals between rounds.

The setup is novel. The agent doesn't train and then deploy. It alternates — act, refine its own prompt, refine its sub-agents, refine its skills, refine its memory, act again. All online. No training/eval split. They call the inner loop "process-reward co-learning" — the agent learns the reward function it should be optimizing at the same time as it optimizes against it.
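In loop form, that alternation looks roughly like the sketch below. Everything here is my own paraphrase of the paper's description: the class names, the refine steps, and the environment interface are placeholders of mine, not the authors' code or API.

```python
# Minimal sketch of an act/refine loop with process-reward co-learning.
# All names (Agent, RewardModel, the env interface) are placeholders used
# to illustrate the loop shape, not the paper's actual implementation.

class RewardModel:
    """Learned process reward, updated online from whatever sparse signal
    (e.g. game milestones) the environment exposes."""

    def score(self, trajectory):
        # Stand-in: rate intermediate steps, not just the final outcome.
        return float(len(trajectory))

    def update(self, trajectory, milestones_hit):
        # Co-learning: fit the reward to the sparse milestone signal while
        # the agent is simultaneously optimizing against it.
        pass


class Agent:
    def __init__(self):
        self.prompt = "initial system prompt"
        self.sub_agents, self.skills, self.memory = {}, {}, []

    def act(self, env, steps=100):
        # Acting phase: no training/eval split, just play and record.
        trajectory = []
        for _ in range(steps):
            obs = env.observe()
            action = self.decide(obs)
            trajectory.append((obs, action, env.step(action)))
        return trajectory

    def decide(self, obs):
        return "noop"  # stand-in for the actual policy call

    def refine(self, trajectory, reward_model):
        # Refinement phase: rewrite prompt, sub-agents, skills, and memory,
        # each guided by the learned process reward, not by a human eval.
        score = reward_model.score(trajectory)
        self.prompt += f"\n# lesson from last run (score {score})"
        self.memory.append(trajectory[-1])
        # sub-agent and skill refinement would follow the same pattern


def run_continual(agent, env, reward_model):
    while True:  # continuous operation: no episode boundary, no reset
        trajectory = agent.act(env)
        reward_model.update(trajectory, env.milestones_hit())
        agent.refine(trajectory, reward_model)
```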

Testbed: Pokemon. Yes, the games. Long-horizon, no clean reward, sub-tasks with dependencies, multiple game versions to test generalization. The Harness completes milestones across those versions while substantially cutting compute relative to baselines, and closes much of the gap to hand-engineered Pokemon-specific systems.

Why this matters even if you don't care about Pokemon. Every actually-deployed agent is operating in continuous mode. The current production playbook is "deploy, log, retrain, redeploy" — episodic offline RL with humans in the loop. Continual Harness is one of the first credible drafts of "deploy, and the agent gets better while it's deployed." If that works at scale, the entire agent-improvement org chart at every lab changes.
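For contrast, that production playbook reduces to roughly the loop below (the function names are illustrative stand-ins of mine); the continual alternative is the run_continual loop sketched earlier.

```python
# "Deploy, log, retrain, redeploy" as a loop shape, in my own shorthand.
# deploy_and_log and retrain_offline are hypothetical stand-ins, passed in
# as arguments only so the sketch stays self-contained.

def episodic_playbook(agent, deploy_and_log, retrain_offline):
    while True:
        logs = deploy_and_log(agent)          # serve traffic, collect logs
        agent = retrain_offline(agent, logs)  # offline RL, human evals gate this
        # redeploy the updated agent, then wait for the next batch of logs
```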

The Pokemon framing buries the lede. This is a continual-learning systems paper dressed as a fun benchmark.

https://arxiv.org/abs/2605.09998