Adaptive Layerwise Perturbation, or ALP, is presented as a way to address policy staleness and mismatches between training and inference in reinforcement learning for large language models. These off-policy problems emerge as inference-efficiency improvements widen the distribution gap between the policy that generated the rollouts and the policy being updated, producing heavy-tailed importance ratios that can destabilize training. The resulting instability threatens the reliability of systems that depend on consistent model behavior across updates.
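To make the failure mode concrete, here is a minimal toy illustration (not from the paper) of how a modest logit shift between the inference policy and the training policy can produce an extreme importance ratio. The 3-action softmax policies and logit values are invented for the example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits: the inference policy that sampled the data,
# and a stale training policy whose third logit has drifted upward.
inference_logits = [2.0, 0.5, -1.0]
training_logits = [2.0, 0.5, 2.0]

p_inf = softmax(inference_logits)
p_train = softmax(training_logits)

# Importance ratios r(a) = pi_train(a) / pi_inf(a). The action that was
# rare under the inference policy but common under the training policy
# yields a ratio an order of magnitude above 1 -- the heavy tail.
ratios = [pt / pi for pt, pi in zip(p_train, p_inf)]
```

A logit gap of 3 on a single action is enough to push its ratio above 10, while the other ratios stay near 1; in a long rollout, such outliers dominate the gradient.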
ALP works by introducing small, learnable perturbations into the hidden states of each layer during updates, creating a perturbed policy that serves as the numerator of the importance ratio while the inference policy remains unaltered. By injecting controlled noise into intermediate representations, the method keeps the updated policy from drifting too far from the inference policy. This expands the policy family to absorb inference-time mismatches, narrows the gap between the updated and inference policies, and tames the heavy-tailed importance ratios that can otherwise undermine training.
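The mechanism can be sketched on a toy two-layer policy network. This is an illustrative sketch, not the paper's implementation: the network shapes, the additive per-layer deltas, and the perturbation scale are all assumptions made for the example; only the structure (perturb each layer's hidden state, use the perturbed policy as the ratio numerator, leave the inference policy untouched) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP policy over 4 actions.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, deltas=None):
    """Forward pass; `deltas` are small additive perturbations applied to
    each intermediate representation (layerwise, as ALP describes)."""
    h = np.tanh(x @ W1)
    if deltas is not None:
        h = h + deltas[0]            # perturb layer-1 hidden state
    logits = h @ W2
    if deltas is not None:
        logits = logits + deltas[1]  # perturb the final representation
    z = logits - logits.max()        # stable softmax
    p = np.exp(z)
    return p / p.sum()

x = rng.normal(size=8)
# Small perturbations; in ALP these would be learnable, here they are
# fixed random draws at an assumed scale of 0.01.
deltas = [0.01 * rng.normal(size=16), 0.01 * rng.normal(size=4)]

pi_inference = forward(x)           # inference policy, unaltered
pi_perturbed = forward(x, deltas)   # perturbed policy: ratio numerator

a = 2  # some sampled action
ratio = pi_perturbed[a] / pi_inference[a]
```

Because the perturbations are small, the ratio stays close to 1 even where the two forward passes differ, which is the intended effect on the ratio's tails.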
The practical significance lies in its effect on training dynamics. Policy staleness can inflate gradients and push updates outside acceptable trust regions, producing erratic outcomes. ALP is designed to counter those effects and preserve stable optimization. Stable training is framed as essential for dependable deployment of large language models in real-world reinforcement learning settings, where poor update behavior can directly weaken system performance.
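The trust-region framing above is commonly operationalized with a clipped surrogate objective in the style of PPO; the sketch below shows that standard mechanism, not ALP's exact objective (which is not specified here), to illustrate how an inflated importance ratio gets capped rather than amplifying the update.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate (illustrative; not ALP's objective).
    Ratios outside [1 - eps, 1 + eps] contribute no extra signal, which
    caps the update size and keeps it inside the trust region."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -min(unclipped, clipped)

# A heavy-tailed ratio of 12.0 is treated no differently from the clip
# boundary of 1.2, while a near-on-policy ratio passes through unclipped.
loss_extreme = ppo_clip_loss(12.0, advantage=1.0)
loss_normal = ppo_clip_loss(1.1, advantage=1.0)
```

By shrinking the ratios themselves, ALP reduces how often updates hit this clipping boundary in the first place, rather than relying on clipping alone to absorb the mismatch.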
Experimental results on single-turn math and multi-turn tool-integrated reasoning tasks suggest that ALP improves performance while also mitigating the blow-up of importance-ratio tails and dampening spikes in KL divergence during iterative training. The findings also indicate that perturbations applied across all layers outperform partial-layer and logits-only alternatives. That result points to the value of representation-level intervention throughout the model, rather than narrower adjustments focused on only part of the network.