Continual learning with reinforcement learning for large language models

Researchers are finding that on-policy reinforcement learning can help large language models learn new tasks over time while preserving prior skills, outperforming supervised finetuning in continual learning setups. A wave of recent work attributes this effect to the lower distributional shift of on-policy data and to token-level entropy properties that naturally curb catastrophic forgetting.
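The distributional-shift argument can be made concrete with a toy sketch (illustrative only, not code from any of the cited papers). For a softmax policy over a small vocabulary with a one-hot reward, the expected REINFORCE update with a value baseline turns out to be the supervised cross-entropy gradient scaled by the policy's current probability of the rewarded token, so a single on-policy step at the same learning rate moves the policy strictly less far, in KL divergence, from where it started:

```python
import math

def softmax(logits):
    """Convert logits to a categorical probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_reinforce_step(logits, rewards, lr):
    """One expected on-policy policy-gradient (REINFORCE) step with a
    value baseline: E_pi[(r(a) - b) * grad log pi(a)], computed exactly."""
    p = softmax(logits)
    b = sum(pa * ra for pa, ra in zip(p, rewards))  # baseline = expected reward
    grad = [0.0] * len(logits)
    for a, (pa, ra) in enumerate(zip(p, rewards)):
        adv = ra - b
        for k in range(len(logits)):
            # grad of log pi(a) w.r.t. logit k is (1[k==a] - p[k])
            grad[k] += pa * adv * ((1.0 if k == a else 0.0) - p[k])
    return [x + lr * g for x, g in zip(logits, grad)]

def sft_step(logits, target, lr):
    """One cross-entropy (supervised finetuning) step toward a fixed label."""
    p = softmax(logits)
    return [x + lr * ((1.0 if k == target else 0.0) - p[k])
            for k, x in enumerate(logits)]

# A 3-token vocabulary, uniform base policy; token 0 is the "correct" answer.
base = [0.0, 0.0, 0.0]
rewards = [1.0, 0.0, 0.0]  # one-hot reward on the correct token
lr = 1.0

rl_logits = expected_reinforce_step(base, rewards, lr)
sft_logits = sft_step(base, target=0, lr=lr)

p0 = softmax(base)
kl_rl = kl(p0, softmax(rl_logits))
kl_sft = kl(p0, softmax(sft_logits))
# The RL update equals the SFT direction scaled by pi(correct token) < 1,
# so the on-policy step shifts the policy less far from its start: kl_rl < kl_sft.
```

This is only a single-step, single-token caricature of the effect; the cited work studies it at the level of full sequence distributions, but the scaling intuition is the same.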