policy-optimization — AI Digest

10 Jun DRPO: Rethinking Divergence Regularization in LLM Reinforcement Learning Tencent Hunyuan research
10 Jun Flow-DPPO: Principled RL Alignment for Flow Matching Image and Video Models Tencent Hunyuan research
17 Jun ZPPO: Teacher-in-Prompts Knowledge Distillation Outperforms Gradient Methods for Small Reasoners NVIDIA research