The makeup of pretraining data strongly shapes the capabilities and limitations of large language models, yet that underlying data mixture often remains opaque. That lack of disclosure makes independent auditing difficult and limits efforts to understand model behavior and provenance. A new framework called LLMSurgeon is presented as a post-hoc method for estimating a large language model's pretraining data mixture using only text the model generates.
The approach is built around Data Mixture Surgery, a formalization for estimating the domain-level distribution of a model’s pretraining corpus. Rather than relying on direct access to training data, the method treats the task as an inverse problem: infer the mixture from what the model generates. It does so under a label-shift assumption, meaning the proportions of the domains may differ between settings while the text characteristic of each domain stays the same. LLMSurgeon uses a calibrated soft confusion matrix to account for systematic confusion between domains, then recovers the latent mixture prior. The goal is to identify what kinds of data shaped the model while working from outputs alone.
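The correction described above can be illustrated with a minimal sketch of generic label-shift estimation. It is not LLMSurgeon's actual implementation: the function names, the three-domain toy matrix, and the solve-then-clip step are all assumptions made for illustration. The idea is that if `C[i, j]` is the average probability a domain classifier assigns to domain `i` for text truly from domain `j`, then the classifier's average output `q` on generated text satisfies `q ≈ C @ pi`, and the mixture `pi` can be recovered by inverting that relation.

```python
import numpy as np

def soft_confusion_matrix(probs, labels, n_domains):
    """Build C where C[i, j] is the mean probability assigned to domain i
    for reference texts whose true domain is j (hypothetical helper)."""
    C = np.zeros((n_domains, n_domains))
    for j in range(n_domains):
        C[:, j] = probs[labels == j].mean(axis=0)
    return C

def estimate_mixture(C, q):
    """Solve q ~= C @ pi in the least-squares sense, then clip negatives
    and renormalize so the estimate is a valid probability vector.
    This is a simple stand-in for a properly constrained solver."""
    pi, *_ = np.linalg.lstsq(C, q, rcond=None)
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()

# Toy setup (all numbers are illustrative, not from the paper).
# Each column sums to 1: it is the classifier's average output for one true domain.
C = np.array([
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])
true_pi = np.array([0.5, 0.3, 0.2])   # the hidden mixture we pretend not to know

# What an auditor would observe: the classifier's average output on generated text.
q = C @ true_pi

naive = q                             # reading the classifier output directly is biased
recovered = estimate_mixture(C, q)    # correcting for confusion recovers the mixture
```

In this noise-free toy case `recovered` matches `true_pi` up to floating-point error, while `naive` does not, which is the bias the confusion matrix is meant to remove. With real, noisy estimates of `C` and `q`, a solver that enforces the simplex constraint directly would likely be a better choice than clipping.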
To evaluate the framework, the researchers created LLMScan, a recipe-verifiable benchmark built with open-source large language models whose pretraining mixtures are known. The benchmark is intended to test whether LLMSurgeon can recover domain mixtures under standardized and reproducible conditions. Reported results indicate high fidelity in recovering those mixtures, supporting the case for practical auditing of foundation models after training.
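As a rough illustration of what checking a recovered mixture against a known recipe could look like, the sketch below computes the total variation distance between the two. The summary above does not say which metric LLMScan reports, so the metric, the domain names, and all numbers here are assumptions chosen only to make the comparison concrete.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions:
    0 means identical mixtures, 1 means completely disjoint ones."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Hypothetical known recipe and hypothetical recovered estimate.
known = {"web": 0.60, "code": 0.25, "papers": 0.15}
estimate = {"web": 0.55, "code": 0.30, "papers": 0.15}

# Align both mixtures on the same domain order before comparing.
domains = sorted(known)
tv = total_variation([known[d] for d in domains],
                     [estimate[d] for d in domains])
```

Here the estimate misplaces five percentage points of mass, moving it from web to code, so `tv` comes out at about 0.05. A distance near zero across many models would correspond to the high-fidelity recovery the article describes.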
The work frames data opacity as a core obstacle for foundation model transparency and positions auditing as a necessary response. By linking generated text back to likely pretraining domains, LLMSurgeon aims to provide a more verifiable basis for examining how foundation models are built and what influences their behavior. The broader contribution is a structured path toward transparency in large language model auditing without requiring direct disclosure of the original corpus.
