Leverage the Average: an Analysis of KL Regularization in RL • Paper • 2003.14089 • Published Mar 31, 2020 • 2
Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice • Paper • 2305.13185 • Published May 22, 2023
Gemini: A Family of Highly Capable Multimodal Models • Paper • 2312.11805 • Published Dec 19, 2023 • 49
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes • Paper • 2306.13649 • Published Jun 23, 2023 • 32
Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View • Paper • 2401.11237 • Published Jan 20, 2024
MusicRL: Aligning Music Generation to Human Preferences • Paper • 2402.04229 • Published Feb 6, 2024 • 17
Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games • Paper • 2212.14449 • Published Dec 29, 2022
Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion • Paper • 2406.19185 • Published Jun 27, 2024
Imitating Language via Scalable Inverse Reinforcement Learning • Paper • 2409.01369 • Published Sep 2, 2024
Solving robust MDPs as a sequence of static RL problems • Paper • 2410.06212 • Published Oct 8, 2024
Understanding Likelihood Over-optimisation in Direct Alignment Algorithms • Paper • 2410.11677 • Published Oct 15, 2024 • 1