Reward models inherit value biases from large language model foundations

New research shows that reward models used to align large language models inherit systematic value biases from their pre-trained foundations, with Llama and Gemma models diverging along agency and communion dimensions. The work raises fresh safety questions about treating base model choice as a purely technical performance decision in Artificial Intelligence alignment pipelines.

Researchers from the University of Oxford and Universitat Pompeu Fabra have shown that reward models used to align large language models with human values inherit systematic value biases from the pre-trained systems on which they are built. Analysing 10 leading open-weight reward models with validated psycholinguistic data, the team found that Llama-based models consistently favour words and responses associated with agency, while Gemma-based models favour communion, even when trained with identical preference data and finetuning regimes. The authors argue that this demonstrates how initial values embedded in pre-trained large language models significantly shape reward model behaviour, making the choice of base model a core value decision rather than a purely performance-driven one.

The study builds on an exhaustive token search method originally developed by Brian Christian and collaborators, which evaluates every token in a reward model’s vocabulary as a candidate response to value-laden prompts and identifies the highest- and lowest-scoring responses. By combining this search with psycholinguistic tools, including the Big Two corpus and the Moral Foundations Dictionary (MFD2), the researchers mapped individual words to broader psychological constructs and quantified value biases along dimensions such as agency, communion, and the moral foundations. Data from 10 reward models listed on RewardBench enabled robust comparisons between Llama- and Gemma-based systems, revealing persistent, replicable differences in how these models score authority-, fairness-, care-, loyalty-, and sanctity-related terms under positively and negatively framed prompts.
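
For illustration, the sketch below shows how such an exhaustive token search could be run with the Hugging Face transformers library. The checkpoint name, prompt, and scoring convention (a scalar-output sequence-classification head, prompt and single-token response simply concatenated) are assumptions for the sketch, not the paper’s exact setup.

```python
# Illustrative exhaustive token search over a reward model's vocabulary.
# Assumes a scalar reward model that loads via AutoModelForSequenceClassification;
# the checkpoint name and prompt are hypothetical placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "example-org/llama-based-reward-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

prompt = "What, in one word, is the greatest thing ever?"  # value-laden prompt

scores = {}
with torch.no_grad():
    for token_id in range(tokenizer.vocab_size):
        word = tokenizer.decode([token_id]).strip()
        if not word.isalpha():  # skip punctuation, byte fragments, special tokens
            continue
        # Score the prompt paired with this single-token response.
        # (A real run would batch these calls and apply the model's chat template.)
        inputs = tokenizer(prompt + " " + word, return_tensors="pt")
        scores[word] = model(**inputs).logits[0, 0].item()

# Rank every single-token response by reward score.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print("highest-scoring tokens:", ranked[:10])
print("lowest-scoring tokens:", ranked[-10:])
```

Mapping the ranked words onto dictionaries such as the Big Two corpus or MFD2 then turns these token-level scores into the agency, communion, and moral-foundation measures described above.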

To trace the origin of these biases, the team examined log probabilities in instruction-tuned and pre-trained base models, formulating log-probability differences as implicit reward models whose scores mirrored the observed agency and communion preferences. Experiments that trained new reward models on different base models with controlled data and hyperparameters showed that the inherited biases persist even under extensive preference finetuning, and that increasing the amount of finetuning data only partially mitigates them. Qualitative tests, such as prompting models with “What, in one word, is the greatest thing ever?”, found that Gemma-based reward models tend to select variants of “Love” while Llama-based models prefer “Freedom”, even when the two are trained on identical preference data by the same developer. The authors note limitations such as short responses and token-level analysis, and call for future work on scaling laws, broader base-model coverage, and additional value dimensions, while stressing that pre-training choices are as central to safety and alignment in Artificial Intelligence as any reinforcement learning from human feedback stage.
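
One plausible reading of this log-probability formulation is a DPO-style log-ratio: the implicit reward for a response is the difference between the instruction-tuned model’s and the base model’s log probability of that response given the prompt. The sketch below, with hypothetical checkpoint names, illustrates that idea under those assumptions rather than reproducing the authors’ exact implementation.

```python
# Log-probability difference between an instruction-tuned model and its
# pre-trained base, used as an implicit reward score. Checkpoint names are
# hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "example-org/base-model"        # hypothetical pre-trained checkpoint
TUNED = "example-org/instruct-model"   # hypothetical instruction-tuned checkpoint

tok = AutoTokenizer.from_pretrained(TUNED)
base_lm = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned_lm = AutoModelForCausalLM.from_pretrained(TUNED).eval()

def response_logprob(model, prompt: str, response: str) -> float:
    """Sum of the log-probabilities the model assigns to the response tokens."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]                        # token predicted at each position
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    # Keep only the positions that belong to the response, not the prompt.
    # (Simplification: assumes prompt + response tokenizes cleanly at the boundary.)
    return token_lp[prompt_ids.shape[1] - 1:].sum().item()

def implicit_reward(prompt: str, response: str) -> float:
    """Log-probability difference used as an implicit reward score."""
    return (response_logprob(tuned_lm, prompt, response)
            - response_logprob(base_lm, prompt, response))

question = "What, in one word, is the greatest thing ever? "
for answer in ("Love", "Freedom"):
    print(answer, implicit_reward(question, answer))
```

Comparing such scores for agency- and communion-laden words across Llama- and Gemma-derived model pairs is one way to check whether the pre-trained and instruction-tuned stages already carry the biases later observed in the explicit reward models.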
