Flow-DPPO: Principled RL Alignment for Flow Matching Image and Video Models
Tencent Hunyuan
Flow-DPPO (arXiv:2606.11025) argues that ratio-clipping PPO variants (Flow-GRPO, CPS) are structurally ill-suited to flow matching models, because noisy per-step policy ratios enforce the trust region inconsistently across trajectory positions. It replaces ratio clipping with a divergence-based proximal constraint and exploits the Gaussian structure of per-step flow policies to compute exact KL divergences efficiently. The authors report higher reward, better KL efficiency, reduced catastrophic forgetting, and stable multi-epoch training on image and video generation tasks.
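To make the contrast between a sampled policy ratio and an exact Gaussian KL concrete, here is a minimal sketch. It is not the paper's implementation. It assumes that the old and new per-step policies are isotropic Gaussians sharing one standard deviation `sigma`, which is an illustrative assumption rather than a detail stated in the summary. All function names are hypothetical.

```python
import math

def gaussian_log_prob(x, mu, sigma):
    """Log-density of an isotropic Gaussian N(mu, sigma^2 I) at point x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return (-0.5 * sq_dist / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2.0 * math.pi))

def sampled_ratio(x, mu_old, mu_new, sigma):
    """PPO-style ratio pi_new(x) / pi_old(x); its value depends on the sample x."""
    return math.exp(gaussian_log_prob(x, mu_new, sigma)
                    - gaussian_log_prob(x, mu_old, sigma))

def gaussian_kl_shared_std(mu_old, mu_new, sigma):
    """Closed-form KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)).

    With a shared isotropic covariance (an assumption of this sketch) the KL
    reduces to ||mu_old - mu_new||^2 / (2 sigma^2); no sampling is involved.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)

if __name__ == "__main__":
    mu_old, mu_new, sigma = [0.0, 0.0], [1.0, 1.0], 1.0
    # Same pair of policies, two different samples, two very different ratios.
    print(sampled_ratio(mu_old, mu_old, mu_new, sigma))
    print(sampled_ratio(mu_new, mu_old, mu_new, sigma))
    # The KL is a single deterministic number for the same pair of policies.
    print(gaussian_kl_shared_std(mu_old, mu_new, sigma))
```

In this toy setting the ratio varies with the drawn sample while the KL depends only on the two means and `sigma`, which is the intuition behind constraining a divergence instead of clipping a ratio.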
Why it matters
RL alignment of generative image and video models is an active research frontier. Flow-DPPO offers a theoretically principled alternative to ratio clipping, designed specifically for the continuous-time flow matching paradigm now used in most SOTA diffusion models.
Importance: 3/5
Notable research paper from Tencent Hunyuan. Principled RL for flow matching models fills a theoretical gap and is relevant to video and image generation alignment.