Flow-DPPO: Principled RL Alignment for Flow Matching Image and Video Models

Tencent Hunyuan

Research, official, 1 source, ~1 min read

The Flow-DPPO paper (arXiv:2606.11025) argues that ratio-clipping PPO variants (Flow-GRPO, CPS) are structurally ill-suited to flow matching models: per-step policy ratios are noisy, so clipping them enforces the trust region inconsistently across trajectory positions. Flow-DPPO replaces ratio clipping with a divergence-based proximal constraint and exploits the Gaussian structure of per-step flow policies to compute exact KL divergences efficiently. The authors report higher reward, better KL efficiency, reduced catastrophic forgetting, and stable multi-epoch training on image and video generation tasks.
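To make the contrast between a sampled policy ratio and an exact divergence concrete, here is a minimal sketch. It is not the paper's implementation. It assumes that the old and new per-step policies are isotropic Gaussians sharing the same standard deviation `sigma` and differing only in their means, which is an illustrative simplification. The function names and the numeric values are hypothetical.

```python
import math
import random


def gaussian_kl_shared_sigma(mu_old, mu_new, sigma):
    """Exact KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)).

    With a shared isotropic covariance (an assumption of this sketch),
    the KL reduces to ||mu_old - mu_new||^2 / (2 * sigma^2).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def sampled_log_ratio(mu_old, mu_new, sigma, rng):
    """log(pi_new(x) / pi_old(x)) for a single sample x ~ pi_old.

    This is the quantity a ratio-clipping objective works with. It depends
    on the random draw x, so it fluctuates from sample to sample.
    """
    x = [rng.gauss(m, sigma) for m in mu_old]
    two_var = 2.0 * sigma ** 2
    log_new = -sum((xi - m) ** 2 for xi, m in zip(x, mu_new)) / two_var
    log_old = -sum((xi - m) ** 2 for xi, m in zip(x, mu_old)) / two_var
    # Normalizing constants cancel because sigma is shared.
    return log_new - log_old


if __name__ == "__main__":
    rng = random.Random(0)
    mu_old, mu_new, sigma = [0.0, 0.0], [0.3, 0.4], 0.5  # illustrative values

    kl = gaussian_kl_shared_sigma(mu_old, mu_new, sigma)
    ratios = [sampled_log_ratio(mu_old, mu_new, sigma, rng) for _ in range(5)]

    print("exact KL:", kl)
    print("five sampled log-ratios:", [round(r, 3) for r in ratios])
```

For these values the squared mean distance is 0.25 and the KL is 0.25 / (2 · 0.25) = 0.5, a deterministic number for the step. The sampled log-ratios scatter around -0.5 with a standard deviation of about 1, which illustrates why a constraint built on individual ratios can behave inconsistently while one built on the closed-form divergence does not depend on the draw.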

Why it matters

RL alignment of generative image and video models is an active research frontier. Flow-DPPO offers a theoretically principled alternative to ratio clipping, designed specifically for the continuous-time flow matching paradigm now used in most state-of-the-art diffusion models.

Importance: 3/5

Notable research paper from Tencent Hunyuan; principled RL for flow-matching models fills a theoretical gap; relevant to video/image generation alignment.

rl flow-matching diffusion policy-optimization video-generation

Sources

official arXiv:2606.11025 — Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models