News

Apr 08, 2026 Out For-Value is accepted by ACL Main 2026, where we delve into the learning dynamics of SFT and introduce a forward-only data valuation framework that enables scalable and efficient value estimation for both LLMs and VLMs.
Jan 26, 2026 Two papers were accepted to ICLR 2026: one on token hidden rewards in reinforcement learning, and the other on resolving gradient explosion and vanishing in text-based models. Many thanks to my collaborators!
Dec 03, 2025 Our paper On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral is online! We investigate why GRPO fails within Search-R1 (a recent multi-turn agentic workflow powered by DeepSeek-R1), showing that LLD is also the root cause of GRPO failure in multi-turn, tool-integrated RL.
Sep 18, 2025 Out paper On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization is accepted by NeurIPS 2025, where we delve into the learning dynamics of GRPO and conduct an in-depth analysis of negative gradients.