Continual learning with reinforcement learning for large language models

Researchers are finding that on-policy reinforcement learning can help large language models learn new tasks over time while preserving prior skills, outperforming supervised finetuning in continual learning setups. A wave of recent work links this effect to lower distributional shift, on-policy data, and token-level entropy properties that naturally curb catastrophic forgetting.

The article explores how continual learning, defined as an AI model’s ability to absorb new tasks and data without losing existing skills, intersects with modern large language model training. Classical continual learning research highlights catastrophic forgetting, where performance on old tasks degrades as new tasks are learned. Large language models complicate this further because they are trained at massive scale and typically lack access to their original pretraining data, making standard replay and buffer-based approaches difficult to apply. The author surveys experimental frameworks such as batch-incremental and streaming setups, explains how non-IID data induces forgetting, and reviews common mitigation techniques including replay buffers, knowledge distillation, regularization, architectural extensions like LoRA or mixture-of-experts, and multi-task joint training as a performance ceiling.
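To make the evaluation protocol concrete, here is a minimal sketch (not any specific paper's code) of the two summary statistics quoted below, average accuracy and the forgetting measure (FM), computed from a matrix of per-task accuracies logged after each sequential training stage; exact definitions vary slightly across the surveyed work, for instance in whether the most recently learned task is included.

```python
import numpy as np

def continual_metrics(acc: np.ndarray):
    """acc[i, j] = accuracy on task j measured after finishing training stage i.

    Returns average accuracy after the final stage and a forgetting measure
    defined as the mean of (final accuracy - peak accuracy) per task, so
    negative values indicate forgetting.
    """
    final = acc[-1]                        # accuracies after the last stage
    avg_acc = final.mean()
    fm = (final - acc.max(axis=0)).mean()  # gap to each task's best-ever accuracy
    return avg_acc, fm

# Toy example: three tasks learned sequentially; task 0 degrades over time.
acc = np.array([
    [0.95, 0.10, 0.12],   # after training on task 0
    [0.90, 0.88, 0.15],   # after training on task 1
    [0.84, 0.85, 0.91],   # after training on task 2
])
avg_acc, fm = continual_metrics(acc)
print(f"average accuracy {avg_acc:.3f}, forgetting measure {fm:+.3f}")
# -> average accuracy 0.867, forgetting measure -0.047
```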

The core focus is on how standard post-training methods behave under continual learning: supervised finetuning versus on-policy reinforcement learning. The article analyzes both through the lens of Kullback-Leibler divergence: supervised finetuning is a forward-KL, mode-covering objective over an offline dataset, while reinforcement learning is a reverse-KL, mode-seeking objective that optimizes rewards on on-policy samples, optionally regularized by a KL penalty toward a reference model. In continual post-training experiments on multi-modal benchmarks, supervised finetuning reaches an average accuracy of 54% versus 62.9% for multi-task training and exhibits a forgetting measure (FM) of -10.4%, while reinforcement learning with GRPO attains an average accuracy of 60% and an FM of -2.3%, retaining ScienceQA accuracy at 93% compared to a peak of 95.6%. Reinforcement learning also maintains or slightly improves general benchmark performance, for example raising MMMU accuracy from 52.1% to 54.2%, whereas supervised finetuning degrades general capabilities.
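Written out in simplified form (GRPO's group-relative advantages, clipping, and baseline terms are omitted here), the two objectives being contrasted look roughly as follows, where the trained model is written pi_theta, the frozen reference pi_ref, the offline dataset D with conditional data distribution p_D, the reward r, and the KL coefficient beta:

```latex
% Supervised finetuning: maximum likelihood on offline data, i.e. (up to a
% constant) minimizing the forward KL from the data distribution to the model,
% a mode-covering objective.
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,y)\sim D}\!\left[\log \pi_\theta(y \mid x)\right]
  = \mathbb{E}_{x\sim D}\!\left[\mathrm{KL}\!\left(p_D(\cdot \mid x)\,\big\|\,\pi_\theta(\cdot \mid x)\right)\right] + \mathrm{const.}

% KL-regularized RL: maximize reward on the model's own samples, with a
% reverse-KL penalty toward the reference model, a mode-seeking objective.
\mathcal{J}_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x\sim D,\; y\sim \pi_\theta(\cdot \mid x)}\!\left[r(x, y)\right]
  \;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\big\|\, \pi_{\mathrm{ref}}\right)
```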

Follow-up work isolates why reinforcement learning forgets less by comparing domain-adaptation setups and systematically varying the learning algorithm. Across Qwen-2.5 and Llama-3 models up to 8B parameters, reinforcement learning yields less than a 1% average accuracy drop on non-target tasks, while supervised finetuning can cause a drop of nearly 30%, even when its target-task performance is higher. Experiments show that this robustness is not primarily due to explicit KL regularization or chain-of-thought reasoning, but to the use of on-policy data, which biases updates toward low KL divergence between the base and finetuned models on the target distribution. A metric termed distributional shift, measured as the KL divergence between base and finetuned models over target-task data, reliably predicts forgetting across settings. On-policy supervised finetuning and iterative supervised finetuning on approximately on-policy samples approach reinforcement learning’s forgetting profile, reinforcing the conclusion that online sampling, not the optimization algorithm per se, drives the effect.
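The distributional-shift diagnostic is straightforward to sketch. The snippet below is a rough illustration rather than the papers' exact estimator: it averages the per-token KL divergence between a base model and a finetuned checkpoint over target-task prompts using Hugging Face transformers, and the checkpoint paths, prompt, and KL direction are placeholder assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-1.5B-Instruct"     # placeholder base checkpoint
TUNED = "path/to/finetuned-checkpoint"  # placeholder finetuned checkpoint

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

@torch.no_grad()
def distributional_shift(texts):
    """Mean per-token KL(base || tuned) over a list of target-task texts."""
    kls = []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        logp_base = F.log_softmax(base(ids).logits, dim=-1)
        logp_tuned = F.log_softmax(tuned(ids).logits, dim=-1)
        kl = (logp_base.exp() * (logp_base - logp_tuned)).sum(-1)  # per position
        kls.append(kl.mean().item())
    return sum(kls) / len(kls)

print(distributional_shift(["Question: what is 2 + 2? Answer: 4."]))
```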

The article further investigates token-level dynamics to explain why offline supervised finetuning is more destructive. Measuring token probabilities and predictive entropy reveals a cluster of “confident conflicts” in supervised data: tokens where the base model’s predictive entropy is low yet the supervised target token receives low probability, a strong mismatch between the external supervision and the model’s prior beliefs. Fitting these tokens requires large, disruptive updates that overwrite general representations, whereas reinforcement learning’s on-policy rollouts either avoid such conflicts or encounter them as exploratory, high-entropy cases with gentler gradients. Masking confident-conflict tokens during supervised finetuning significantly reduces forgetting, motivating Entropy-Adaptive Finetuning (EAFT), which scales the cross-entropy loss by normalized token entropy and thereby down-weights low-entropy conflicts. EAFT, implemented with a Top-K entropy approximation for efficiency, improves math-domain performance on datasets like NuminaMath, BigMathVerified, and Nemotron-CrossThink while preserving general benchmark scores, and similar gains appear in medical and tool-use domains.
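As a rough illustration of the entropy-weighting idea rather than the paper's exact recipe, the sketch below scales each token's cross-entropy by a normalized predictive entropy estimated from the top-k log-probabilities; the choice of k, the normalization by log k, and computing entropy from the model being trained are all assumptions made here for concreteness.

```python
import math
import torch
import torch.nn.functional as F

def entropy_adaptive_loss(logits, targets, k=20, ignore_index=-100):
    """logits: [batch, seq, vocab], already aligned with targets: [batch, seq].

    Tokens where the model is confident (low entropy) get small weights, so
    confident conflicts contribute little gradient; high-entropy tokens are
    trained essentially as in standard cross-entropy.
    """
    logp = F.log_softmax(logits, dim=-1)

    # Top-K entropy approximation for efficiency, as the article mentions.
    topk_logp, _ = logp.topk(k, dim=-1)
    entropy = -(topk_logp.exp() * topk_logp).sum(-1)      # [batch, seq]
    weight = (entropy / math.log(k)).clamp(0.0, 1.0)      # crude [0, 1] scale

    ce = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(),
        reduction="none", ignore_index=ignore_index,
    ).view_as(targets)

    mask = (targets != ignore_index).float()
    return (weight.detach() * ce * mask).sum() / mask.sum().clamp(min=1.0)
```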

Beyond retention, the article reviews evidence that reinforcement learning enhances cross-domain generalization. In synthetic language-only and vision-language tasks such as GeneralPoints and V-IRL, supervised finetuning boosts in-domain performance but can degrade out-of-distribution accuracy by as much as 79.5%, whereas reinforcement learning improves out-of-distribution performance by 3.5% and 11.0% on the language-only variants and by up to 9.3% on the vision-language variants. Reinforcement learning also strengthens underlying perceptual abilities, such as recognizing card values from images, indicating that reward-driven updates refine both reasoning and perception. In complementary reasoning setups built from controlled knowledge graphs, reinforcement learning applied after an initial supervised finetuning phase synthesizes atomic reasoning skills into robust compositional and zero-shot reasoning strategies, while pure supervised finetuning tends to memorize patterns and fails to generalize to unseen relational paths. Large-scale math reasoning studies similarly find that math-focused supervised finetuning transfers weakly to non-reasoning tasks, whereas reinforcement-learning-based math training improves both reasoning and non-reasoning benchmarks, again linked to on-policy data and the presence of negative gradients in the reward objective.

Across these studies, a consistent picture emerges: on-policy reinforcement learning offers a naturally conservative update regime that implicitly regularizes toward low distributional shift, avoids high-impact conflicts from misaligned supervision, and thereby mitigates catastrophic forgetting in large language models. While real-world continual learning remains far messier than current structured benchmarks, the combination of on-policy sampling, reverse-KL-style optimization, and entropy-aware updates appears to give reinforcement learning a structural advantage for building adaptable systems that retain and generalize their capabilities over time. This alignment between reinforcement learning’s mechanics and the demands of continual learning suggests that continuing to scale and refine reinforcement-learning-based post-training could play a central role in pushing large language models toward more general, continually improving artificial intelligence.
