CoPD: co-evolving policy distillation for unified multi-capability models
CoPD trains specialized expert policies in parallel and distills among them throughout training, so experts teach one another as they develop instead of being trained sequentially and then merged. The resulting model unifies text, image, and video reasoning, outperforming both mixed RLVR and sequential expert-then-distill baselines, and even single-domain experts.
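A minimal toy sketch of the co-evolution idea (not the paper's implementation): each expert takes a task-loss step on its own domain and, at the same step, a small distillation step toward the peers' mean policy. The experts, targets, and the `alpha` weighting are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def co_evolve(logits, targets, steps=200, lr=0.5, alpha=0.1):
    """Toy co-evolving distillation loop.

    logits:  dict name -> logit vector over actions (one per "expert")
    targets: dict name -> one-hot target for that expert's own domain
    alpha:   weight of the mutual-distillation pull (hypothetical)
    """
    for _ in range(steps):
        # Peers' mean policy, recomputed every step: the "teacher" co-evolves
        # with the students instead of being a frozen, pre-trained expert.
        mean_policy = np.mean([softmax(l) for l in logits.values()], axis=0)
        for name in logits:
            p = softmax(logits[name])
            task_grad = p - targets[name]     # cross-entropy grad on own domain
            distill_grad = p - mean_policy    # pull toward the evolving ensemble
            logits[name] -= lr * (task_grad + alpha * distill_grad)
    return logits
```

With a small `alpha`, each expert still masters its own domain while staying close to the shared policy, which is the behavioral gap that sequential expert-then-distill training leaves open.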
Why it matters
Addresses a practical failure mode of RLVR-style training: teaching one model many capabilities at once causes inter-capability conflict, while sequential training plus distillation leaves a behavioral gap between experts and the merged model. Co-evolution is a clean answer aimed at unified multi-capability frontier models.
Importance: 2/5
Solid method paper, default importance.
Sources
official
Co-Evolving Policy Distillation