Participants debate whether synthetic data produced by large language models represents genuinely new information or simply a recombination of existing material. One commenter describes the process as “compression and filtering,” arguing that models use training data to distill and restate what is already known, rather than generating fundamentally new information. From this perspective, progress in model capabilities ultimately requires more raw information, such as new empirical measurements, proofs, media streams, or original human work, not just more synthetic outputs derived from a fixed corpus.
Another commenter counters that synthetic outputs do constitute new data that did not exist before and encourages closer examination of recent research suggesting value in this approach. A detailed thought experiment is presented: an AI model is trained on a small but comprehensive corpus, such as all content in the Library of Congress, then used to author new works. A second model is trained on the original corpus plus these outputs, prompting the question of whether this loop truly addresses the scaling problem, even with more parameters and more GPU resources. The critic argues that the generated texts contain no new information content and therefore cannot support unbounded scaling of model capacity.
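The critic's "no new information content" claim can be made concrete with a compression proxy, in the spirit of the "compression and filtering" framing from the first comment: if synthetic text largely restates the corpus it was derived from, appending it barely grows the compressed size, whereas genuinely new observations do. A minimal sketch (the three corpora below are toy stand-ins invented for illustration, not anything from the thread):

```python
import zlib


def compressed_size(text: str) -> int:
    """Bytes after DEFLATE compression: a rough proxy for information content."""
    return len(zlib.compress(text.encode("utf-8"), level=9))


# Toy stand-ins: a raw corpus, a synthetic restatement of it,
# and genuinely new content (e.g. a fresh empirical measurement).
original = "The cat sat on the mat. " * 50
paraphrase = "On the mat sat the cat. " * 50
novel = "Quasar redshift z=6.3 measured at 2024-11-02T04:17Z. " * 50

base = compressed_size(original)
with_paraphrase = compressed_size(original + paraphrase)
with_novel = compressed_size(original + novel)

# Restated text reuses the original's patterns, so it adds few compressed
# bytes; novel text introduces symbols and structure the compressor has
# never seen, so it adds many more.
print(with_paraphrase - base, with_novel - base)
```

This is only a proxy: DEFLATE measures surface redundancy, not semantic novelty. Still, it illustrates why, under the critic's view, looping a model over its own outputs cannot grow the information available for training.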
Supporters of synthetic training data shift the focus from information quantity to information structure, especially for logical reasoning. They argue that the core difficulty is that models learn spurious correlations between context and next tokens during standard training, which undermines robust, general reasoning. Synthetic data can help because randomized, constructed examples break these spurious associations and pressure the model to internalize the correct generic reasoning steps. Citing DeepSeek's distilled Qwen 8B model as evidence of the power of such techniques, one commenter still concedes that the total volume of useful synthetic data is inherently bounded by the underlying raw information, so it cannot fully resolve the scaling problem. Others note that any benefit from synthetic data depends on having scalable ways to verify the quality of generated outputs, which is itself challenging without introducing genuinely new external information sources.
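The two ideas in this paragraph, randomized construction and scalable verification, can be sketched together. In domains where the answer can be computed programmatically, each example's ground truth is known by construction, so verification needs no external information source; the names, numbers, and template below are hypothetical choices for illustration, not examples from the thread:

```python
import random


def make_example(rng: random.Random) -> dict:
    """Generate one synthetic reasoning example with a computed ground truth.

    Randomizing the surface details (name, quantities) breaks spurious
    correlations: no particular name or number predicts the answer, so a
    model trained on many such examples is pushed toward the generic
    addition step rather than memorized context-answer pairs.
    """
    name = rng.choice(["Ada", "Bo", "Chen", "Dara"])
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    question = f"{name} has {a} apples and buys {b} more. How many apples now?"
    return {"question": question, "answer": a + b}  # truth computed, not sampled


def verified(example: dict) -> bool:
    """Cheap, scalable check: re-derive the answer from the question's numbers."""
    nums = [int(tok) for tok in example["question"].split() if tok.isdigit()]
    return sum(nums) == example["answer"]


rng = random.Random(0)
dataset = [make_example(rng) for _ in range(1000)]
assert all(verified(ex) for ex in dataset)
```

The caveat in the thread applies directly: this works because arithmetic is mechanically checkable. For open-ended outputs there is no such verifier, and the template itself encodes only the raw information its author already had, which is why the commenters agree the approach is bounded by the underlying corpus.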
