Reward models inherit value biases from large language model foundations

New research shows that reward models used to align large language models inherit systematic value biases from their pre-trained foundations, with Llama-based and Gemma-based models diverging along agency and communion dimensions. The work raises fresh safety questions about treating base model choice as a purely technical performance decision in artificial intelligence alignment pipelines.

Researchers from the University of Oxford and Universitat Pompeu Fabra have shown that reward models used to align large language models with human values inherit systematic value biases from the pre-trained systems on which they are built. Analysing 10 leading open-weight reward models with validated psycholinguistic data, the team found that Llama-based models consistently favour words and responses associated with agency, while Gemma-based models favour communion, even when trained with identical preference data and finetuning regimes. The authors argue that this demonstrates how initial values embedded in pre-trained large language models significantly shape reward model behaviour, making the choice of base model a core value decision rather than a purely performance-driven one.

The study applies an exhaustive token search method, originally developed by Brian Christian and collaborators, which scores every token in a reward model’s vocabulary against value-laden prompts to identify the highest- and lowest-scoring responses. By combining this search with tools from psycholinguistics, including the Big Two corpus and the Moral Foundations Dictionary (MFD2), the researchers mapped specific words to broader psychological constructs and quantified value biases across dimensions such as agency, communion, and moral foundations. Data from 10 reward models on RewardBench enabled robust comparisons between Llama- and Gemma-based systems, revealing persistent and replicable differences in how these models score terms related to authority, fairness, care, loyalty, and sanctity under positively and negatively framed prompts.
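The search described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the `score` callable stands in for a real reward model, and `TOY_SCORES` and `TOY_LEXICON` are invented placeholders rather than values from the study, the Big Two corpus, or MFD2.

```python
from typing import Callable, Dict, List, Tuple

Scored = List[Tuple[str, float]]


def exhaustive_token_search(
    prompt: str,
    vocab: List[str],
    score: Callable[[str, str], float],
    k: int = 3,
) -> Tuple[Scored, Scored]:
    """Score every vocabulary token as a one-token response to the prompt.

    Returns the k highest-scoring tokens (best first) and the
    k lowest-scoring tokens (worst first).
    """
    scored = sorted(
        ((tok, score(prompt, tok)) for tok in vocab),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:k], scored[-k:][::-1]


def category_counts(tokens: Scored, lexicon: Dict[str, str]) -> Dict[str, int]:
    """Count how many of the given tokens fall into each lexicon category."""
    counts: Dict[str, int] = {}
    for tok, _ in tokens:
        category = lexicon.get(tok.strip().lower())
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    return counts


# Hypothetical stand-in for a reward model: invented scores, illustration only.
TOY_SCORES = {
    "freedom": 2.1, "love": 1.7, "power": 0.9,
    "table": 0.0, "failure": -1.2, "betrayal": -1.8,
}

# Hypothetical word-to-construct mapping, not taken from any real lexicon.
TOY_LEXICON = {
    "freedom": "agency", "power": "agency",
    "love": "communion", "betrayal": "communion",
}


def toy_score(prompt: str, response: str) -> float:
    return TOY_SCORES[response]


prompt = "What, in one word, is the greatest thing ever?"
top, bottom = exhaustive_token_search(prompt, list(TOY_SCORES), toy_score, k=2)
top_categories = category_counts(top, TOY_LEXICON)
print(top, bottom, top_categories)
```

With a real reward model, `score` would wrap a forward pass over the prompt and the candidate token, and the loop would cover the full tokenizer vocabulary rather than six words.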

To trace the origin of these biases, the team examined log probabilities in instruction-tuned and pre-trained base models, formulating log-probability differences as implicit reward models that produced usable scores mirroring the observed agency and communion preferences. Experiments that trained new reward models on different base models with controlled data and hyperparameters showed that these inherited biases persist even under extensive preference finetuning, and that increasing finetuning data only partially mitigates them. Qualitative tests, such as prompting models with “What, in one word, is the greatest thing ever?”, revealed that Gemma-based reward models tend to select variants of “Love” while Llama-based models prefer “Freedom”, even when the models share the same training data and developer. The authors note limitations such as short responses and token-level analysis, and call for future work on scaling laws, broader base model coverage, and additional value dimensions, while stressing that pre-training choices are as central to safety and alignment in artificial intelligence as any reinforcement learning from human feedback stage.
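The implicit-reward idea can be illustrated with a small sketch. The article does not give the exact formula, so the form used here (instruction-tuned log probability minus base log probability, scaled by a hypothetical `beta`) is an assumption modelled on the common DPO-style construction, and the logits are invented toy numbers.

```python
import math
from typing import Dict


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert raw logits over candidate tokens into log probabilities."""
    peak = max(logits.values())
    # Subtract the maximum before exponentiating for numerical stability.
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def implicit_reward(
    tuned_logp: Dict[str, float],
    base_logp: Dict[str, float],
    beta: float = 1.0,
) -> Dict[str, float]:
    """Assumed implicit reward: beta * (log p_tuned - log p_base) per token."""
    return {
        tok: beta * (tuned_logp[tok] - base_logp[tok])
        for tok in tuned_logp
        if tok in base_logp
    }


# Invented next-token logits for one prompt, illustration only.
base_logp = log_softmax({"love": 1.0, "freedom": 1.0, "table": 1.0})
tuned_logp = log_softmax({"love": 2.0, "freedom": 1.0, "table": 0.0})

rewards = implicit_reward(tuned_logp, base_logp)
ranking = sorted(rewards, key=rewards.get, reverse=True)
print(ranking)
```

Under this assumed form, a token scores highly when instruction tuning raised its probability relative to the base model, which is what lets a pair of language models be read as a reward model without any separate reward head.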


