ALP targets stability in large language model reinforcement learning
Adaptive Layerwise Perturbation aims to reduce policy staleness and training mismatches in large language model reinforcement learning. The method improves stability by constraining policy drift and reducing harmful importance-ratio tails.
