LLM Fine-Tuning
RVPO: Stabilizing LLM Fine-Tuning Through Variance Control
New research introduces Ratio-Variance Regularized Policy Optimization (RVPO), a method that stabilizes reinforcement learning from human feedback by regularizing the variance of the importance-sampling ratios used in policy updates during LLM fine-tuning.
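
To make the core idea concrete, the sketch below shows one way a ratio-variance penalty could be attached to a PPO-style clipped surrogate loss. The exact RVPO objective, coefficient names (`clip_eps`, `var_coef`), and clipping scheme are assumptions for illustration, not details taken from the paper.

```python
import torch

def rvpo_loss(logp_new, logp_old, advantages, clip_eps=0.2, var_coef=0.1):
    """PPO-style clipped surrogate loss plus a penalty on the variance of
    the importance-sampling ratios.

    NOTE: illustrative sketch only; the actual RVPO formulation may differ.
    """
    # Importance-sampling ratios between the updated and the behavior policy.
    ratios = torch.exp(logp_new - logp_old)

    # Standard PPO clipped surrogate objective (maximized, so negated below).
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Variance regularizer: penalize spread of the ratios across the batch,
    # discouraging updates that drift far from the sampling policy.
    ratio_variance = ratios.var(unbiased=False)

    return -surrogate + var_coef * ratio_variance
```

In this form, a larger `var_coef` keeps the new policy's token probabilities closer to those of the sampling policy, trading off update speed for stability.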