TMAS Scales Test-Time Compute via Multi-Agent Synergy: Two Memory Banks, One Reasoning Loop.
TMAS dropped on arXiv yesterday. Ten authors from IQuest Research and Beihang University; 35 upvotes on HuggingFace Papers today. The pitch is that current test-time scaling — generating multiple trajectories, sampling more, refining over rounds — is wasteful because the trajectories barely talk to each other. TMAS organizes inference as collaboration among specialized agents, with hierarchical memories making the cross-trajectory wiring explicit.
The architecture has two memory banks. The experience bank holds low-level reliable intermediate conclusions and local feedback — anything that one trajectory verified that another could reuse. The guideline bank holds high-level strategies that have already been explored, so subsequent rollouts can steer away from redundant reasoning patterns instead of repeating them. The agents then run inference with structured cross-references to both banks during refinement iterations.
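A minimal sketch of how the two banks and a refinement round might fit together. All names here (`ExperienceBank`, `GuidelineBank`, `refinement_round`) are my own illustration of the described flow, not the paper's actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceBank:
    """Low-level bank: verified intermediate conclusions and local feedback
    that one trajectory produced and another can reuse."""
    entries: list = field(default_factory=list)

    def add(self, conclusion: str, source_agent: str):
        self.entries.append({"conclusion": conclusion, "source": source_agent})

    def retrieve(self) -> list:
        # A real system would rank/filter by relevance; this sketch returns all.
        return [e["conclusion"] for e in self.entries]

@dataclass
class GuidelineBank:
    """High-level bank: strategies already explored, so later rollouts
    can steer away from redundant reasoning patterns."""
    strategies: set = field(default_factory=set)

    def mark_explored(self, strategy: str):
        self.strategies.add(strategy)

    def is_redundant(self, strategy: str) -> bool:
        return strategy in self.strategies

def refinement_round(agents, problem, exp_bank, guide_bank):
    """One iteration: each agent reads both banks, reasons, then writes back.
    Each agent is assumed to expose reason(problem, context, avoid) returning
    a (strategy, conclusion) pair -- a hypothetical interface."""
    for agent in agents:
        context = exp_bank.retrieve()
        strategy, conclusion = agent.reason(problem, context,
                                            avoid=guide_bank.strategies)
        guide_bank.mark_explored(strategy)
        exp_bank.add(conclusion, source_agent=agent.name)
```

The design point the paper is making lives in the last function: the banks are the only channel between trajectories, so reuse and de-duplication are explicit rather than left to a final vote.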
The training side is a hybrid reward RL scheme designed for the framework — it jointly preserves base reasoning capability, rewards effective experience utilization (don't ignore the banks), and encourages exploration beyond previously attempted solutions (don't trivially copy). On challenging reasoning benchmarks the paper reports stronger iterative scaling than existing baselines, with hybrid reward training further lifting scaling effectiveness and stability. Code at github.com/george-QF/TMAS-code.
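A toy version of what a three-term hybrid reward could look like. The weights, signal shapes, and saturating utilization term are my assumptions for illustration; the paper's actual reward design will differ:

```python
def hybrid_reward(correct: bool,
                  n_bank_refs_used: int,
                  novelty: float,
                  w_task: float = 1.0,
                  w_use: float = 0.3,
                  w_explore: float = 0.3) -> float:
    """Combine three hedged terms:
      - task term: preserves base reasoning capability (correctness),
      - utilization term: rewards actually citing bank entries ("don't ignore the banks"),
      - exploration term: rewards distance from prior attempts ("don't trivially copy").
    All weights and shapes are illustrative, not from the paper."""
    task = 1.0 if correct else 0.0
    # Saturating utilization: citing a few bank entries is enough, spamming isn't rewarded.
    use = min(n_bank_refs_used, 3) / 3.0
    # novelty in [0, 1], e.g. 1 minus max similarity to previously attempted solutions.
    explore = max(0.0, min(novelty, 1.0))
    return w_task * task + w_use * use + w_explore * explore
```

The tension the paper describes is visible even in this sketch: the utilization and exploration terms pull in opposite directions, and the task term keeps either from dominating at the expense of correctness.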
Why this matters in the test-time scaling debate — pure self-consistency and best-of-N buy quality roughly linearly with compute at first, then plateau. Structured approaches like Tree-of-Thought, Forest-of-Thought, and debate-based methods extract more out of the budget but plateau higher up. TMAS argues the next gain comes from making the agent collective coordinate explicitly through shared memory rather than implicit consensus voting. Two memory banks instead of a vote tally.
Place this next to AutoTTS from this morning — agentic discovery of test-time scaling policies — and the broader cluster from the past two weeks. HyperEyes makes tool calls efficient. AutoTTS makes the meta-controller agentic. TMAS makes the trajectories collaborative. Each one is attacking a different efficiency leak in the same scaling regime. arxiv.org/abs/2605.10344.