GEMS: A 6B Model That Beats SOTA by Teaching Itself New Skills
Most image generation research focuses on making bigger models or better training data. GEMS takes a different approach. Instead of scaling the model, it wraps a small 6B parameter model in an agent loop that iteratively improves its own output, remembers what worked, and learns new skills along the way.
The framework has three components. Agent Loop runs a multi-agent system that critiques and refines generation quality through closed-loop optimization. Think of it as the model arguing with itself until the output is actually good. Agent Memory stores successful trajectories so the system does not repeat mistakes or rediscover solutions it already found. Agent Skill is an extensible library of domain-specific capabilities that the system accumulates over time.
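The loop-memory-skill pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the names (`AgentMemory`, `agent_loop`, the `generate`/`critique` callables, the score threshold) are all invented for the example. A critic scores each candidate, the generator retries with feedback until the critic is satisfied, and successful trajectories are cached so the same prompt never triggers redundant work.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Stores refinement trajectories for prompts that succeeded."""
    trajectories: dict = field(default_factory=dict)

    def recall(self, prompt):
        return self.trajectories.get(prompt)

    def store(self, prompt, steps):
        self.trajectories[prompt] = steps


def agent_loop(prompt, generate, critique, memory, threshold=0.9, max_iters=5):
    """Closed-loop refinement: regenerate until the critic is satisfied.

    generate(prompt, feedback) -> candidate output
    critique(prompt, output)   -> (score, feedback)
    """
    cached = memory.recall(prompt)  # reuse a solution the system already found
    if cached is not None:
        return cached

    steps = []
    output = generate(prompt, feedback=None)
    for _ in range(max_iters):
        score, feedback = critique(prompt, output)
        steps.append((output, score))
        if score >= threshold:
            memory.store(prompt, steps)  # remember the successful trajectory
            break
        output = generate(prompt, feedback=feedback)  # refine and try again
    return steps
```

In this sketch the "Agent Skill" library would slot in as additional tools the generator can call; the key structural point is that the critic closes the loop and the memory makes each success permanent.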
The result is a 6B model that exceeds state-of-the-art performance on GenEval2, a benchmark for text-to-image generation quality: a model roughly one-tenth the size of its competitors, winning through architecture rather than brute force.
This matters beyond image generation. The pattern of agent loop plus persistent memory plus skill accumulation is exactly how we want coding agents, research agents, and task agents to work. GEMS demonstrates that this architecture actually delivers measurable improvements even at small model scales. The paper is from a team of seven researchers and the code is open-source on GitHub.
https://arxiv.org/abs/2603.28088
https://github.com/lcqysl/GEMS
https://gems-gen.github.io