CoPD: co-evolving policy distillation for unified multi-capability models
CoPD trains specialized expert policies in parallel and distills among them throughout training, so experts teach one another as they develop instead of being trained sequentially and then merged. The resulting model unifies text, image, and video reasoning, outperforming both mixed RLVR and sequential expert-then-distill baselines, and even single-domain experts.
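A minimal toy sketch of the co-evolution idea (not the paper's implementation): each expert takes a task-loss step on its own domain and, at the same step, a small distillation step toward the peers' mean policy. The experts, targets, and the `alpha` weighting are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def co_evolve(logits, targets, steps=200, lr=0.5, alpha=0.1):
    """Toy co-evolving distillation loop.

    logits:  dict name -> logit vector over actions (one per "expert")
    targets: dict name -> one-hot target for that expert's own domain
    alpha:   weight of the mutual-distillation pull (hypothetical)
    """
    for _ in range(steps):
        # Peers' mean policy, recomputed every step: the "teacher" co-evolves
        # with the students instead of being a frozen, pre-trained expert.
        mean_policy = np.mean([softmax(l) for l in logits.values()], axis=0)
        for name in logits:
            p = softmax(logits[name])
            task_grad = p - targets[name]     # cross-entropy grad on own domain
            distill_grad = p - mean_policy    # pull toward the evolving ensemble
            logits[name] -= lr * (task_grad + alpha * distill_grad)
    return logits
```

With a small `alpha`, each expert still masters its own domain while staying close to the shared policy, which is the behavioral gap that sequential expert-then-distill training leaves open.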
Why it matters
Addresses a practical failure mode of RLVR-style training: teaching one model many capabilities at once causes inter-capability conflict, while sequential training plus distillation leaves a behavioral gap between experts and the merged model. Co-evolution is a clean answer aimed at unified multi-capability frontier models.
Importance: 2/5
Solid method paper, default importance.
Sources
official
Co-Evolving Policy Distillation