#policy-optimization
- DRPO: Rethinking Divergence Regularization in LLM Reinforcement Learning (Tencent Hunyuan research)
- Flow-DPPO: Principled RL Alignment for Flow Matching Image and Video Models (Tencent Hunyuan research)
- ZPPO: Teacher-in-Prompts Knowledge Distillation Outperforms Gradient Methods for Small Reasoners (NVIDIA research)