#rl
- VibeThinker-3B Reaches Frontier-Level Reasoning Benchmarks via Curriculum RL WeiboAI research
- Exploration Hacking: LLMs Can Be Fine-Tuned to Strategically Resist RL Training research
- OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training Across Six Models OpenAI research
- Google DeepMind's AI Co-Mathematician Reaches 48% on FrontierMath Tier 4 Google DeepMind research
- Flow-OPD: On-Policy Distillation Pushes GenEval +29 Points on Stable Diffusion 3.5 research
- RubricEM: Meta-RL with Rubric-Guided Policy Decomposition Beyond Verifiable Rewards Google research
- SU-01: Gold-Medal-Level Olympiad Reasoning via Curriculum SFT and Two-Stage RL SU-01 Team research
- SkillsVote: Lifecycle Governance of Agent Skills — Collection, Recommendation, Evolution (219 HF upvotes) Memtensor Research Group / IAAR-Shanghai research
- Anthropic Eliminates Claude's Agentic Blackmail Behavior via 'Teaching Claude Why' Anthropic research
- DRPO: Rethinking Divergence Regularization in LLM Reinforcement Learning Tencent Hunyuan research
- Learning while Deploying: Fleet-Scale Reinforcement Learning Turns Robot Deployment into Continuous Training AGIBot research
- Ctx2Skill: Self-Improving Framework for Autonomous Context-Skill Discovery in LLMs research
- RLDX-1: Multi-Stream Action Transformer Achieves 86.8% on ALLEX Humanoid Tasks RLWRLD research
- OpenSearch-VL: Open Recipe for Training Frontier Multimodal Search Agents Tencent Hunyuan research
- SDAR: Self-Distilled Agentic Reinforcement Learning for Multi-Turn Agents Zhejiang University / Meituan research
- GrepSeek: Training Search Agents for Direct Corpus Interaction via Shell Commands (93 HF upvotes) University of Massachusetts Amherst research
- ThoughtFold: Introspective Preference Learning Cuts Reasoning Tokens by 56% Without Accuracy Loss research
- Agentic Transformers Provably Learn Depth-First Search via Reinforcement Learning Carnegie Mellon University / Ohio State University research
- Flow-DPPO: Principled RL Alignment for Flow Matching Image and Video Models Tencent Hunyuan research
- Arbor: Generalist Autonomous ML Research via Hypothesis-Tree Refinement NLPIR Lab research
- Z-Reward: Score Distributions Instead of Scalar Rewards for Image Generation RLHF Alibaba research
- InterleaveThinker: RL Planner+Critic Pipeline for Interleaved Text-and-Image Generation CUHK Multimedia Lab research
- CoPD: Co-Evolving Policy Distillation for Unified Multi-Capability Models research
- Odysseus: Training VLMs for 100+ Turn Interactive Decision-Making via RL Princeton University research
- TrOPD: Trust-Region On-Policy Distillation Stabilizes LLM Training When Teacher-Student Gap Is Large Samsung Research research
- InterleaveThinker: RL Framework for Agentic Text-and-Image Interleaved Generation research
- FORT-Searcher: Shortcut-Resistant Training Data Framework for Deep Search Agents research
- Astra: RL-Trained VLM Queries World Simulator for Spatial Reasoning research
- HeavySkill: Internalizing Heavy Thinking as a Trainable Agentic Skill via RL research
- NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized AI Research Automation Shanghai AI Lab research
- TMAS: Scaling Test-Time Compute via Multi-Agent Synergy with Hierarchical Memory research
- BetaPRM: Uncertainty-Aware Process Rewards Cut Reasoning Token Use by 33% research
- NudgeRL: Strategy-Level Context Nudges for Efficient RLVR Exploration KAIST AI research
- QUBRIC: Co-Designing Queries and Rubrics Extends RLVR to Open-Ended Reasoning Domains research
- On the Geometry of On-Policy Distillation: A Training Paradigm Distinct from SFT and RLVR Hong Kong University of Science and Technology research
- Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight Rutgers University research
- ZPPO: Teacher-in-Prompts Knowledge Distillation Outperforms Gradient Methods for Small Reasoners NVIDIA research
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond HKUST/NUS/Oxford/NTU research
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation Microsoft Research research