Insurance pricing is emerging as a rich field for applying large language models in feature engineering, allowing actuaries to generate new predictive variables from existing data and external sources. Feature engineering is framed as adding new columns to datasets that better describe each observation, with large language models expanding what is possible by embedding their own learned knowledge and by processing unstructured inputs such as text and images. Despite concerns about hallucination, the inherent validation steps in model fitting provide a safeguard, shifting the key risks from data scarcity and anti-selection toward governance, bias and operational controls.
The approach is broken into four main types of large language model-derived features. First, factual descriptors use models as scalable domain experts to assign ordinal risk groupings or answer detailed questions about attributes, such as model-specific car features, across thousands of levels in seconds instead of hours. Second, subjective descriptors extract broad, socially informed judgments, for example identifying “boy racer” cars that carry higher risk but lack an explicit, stable list, replacing weeks of manual sentiment analysis. Third, interaction-style features classify observations across combinations of existing variables, flagging high-risk patterns and helping close the interaction gap between generalised linear models and tree-based machine learning methods. Fourth, multimodal models can distil large volumes of unstructured external data, such as property images similar to Google Street View, into rich signals about roof condition, maintenance, surroundings or even lifestyle proxies that traditional pricing models cannot easily capture.
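
The third feature type, classifying combinations of existing variables, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `age_band` and `vehicle_type` columns, their levels, and `classify_combination`, which stands in for a large language model call and uses a fixed rule so the example runs offline. The point is only the shape of the workflow: score each combination once, then attach the result as a single new column that a generalised linear model can use as a main effect.

```python
from itertools import product

# Hypothetical stand-in for a large language model call. A real version would
# send a prompt describing the combination to an API; this stub applies a fixed
# illustrative rule so the sketch runs without any external service.
def classify_combination(age_band: str, vehicle_type: str) -> int:
    """Return 1 if the combination is judged high risk, else 0."""
    return int(age_band == "17-24" and vehicle_type == "performance")

def build_interaction_table(age_bands, vehicle_types):
    """Score every combination once and return a lookup keyed by the pair."""
    return {
        (age, vehicle): classify_combination(age, vehicle)
        for age, vehicle in product(age_bands, vehicle_types)
    }

def add_interaction_flag(policies, table):
    """Return copies of the policy records with the flag added as a new column."""
    return [
        {**p, "high_risk_combo": table[(p["age_band"], p["vehicle_type"])]}
        for p in policies
    ]

# Assumed factor levels, invented for this example.
age_bands = ["17-24", "25-49", "50+"]
vehicle_types = ["performance", "family", "city"]
table = build_interaction_table(age_bands, vehicle_types)

policies = [
    {"policy_id": 1, "age_band": "17-24", "vehicle_type": "performance"},
    {"policy_id": 2, "age_band": "50+", "vehicle_type": "performance"},
]
flagged = add_interaction_flag(policies, table)
```

Scoring the distinct combinations rather than every policy keeps the number of model calls proportional to the product of the level counts, not to the size of the portfolio.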
Implementation starts with clear thinking about true underlying risk factors, such as driving ability or propensity to take risks in motor insurance, and then crafting prompts that tie new features intuitively to those factors. Practically, actuaries send factor levels to an application programming interface with prescribed response scales, then convert results into mapping tables that can be merged into modelling datasets, as illustrated by a car model example where a copilot tool outputs risk, “boy racer” likelihood and coolness scores. Static mappings are usually preferred for cost and speed, with real-time scoring reserved for cases where new levels appear frequently, such as addresses. New features are tested with standard statistical validation and dropped if they do not improve performance, though incorrect classification at the individual level can increase price volatility. The method also demands stringent ethical and legal scrutiny: letting models infer features from names or personal information is flagged as unacceptable, and there is explicit concern that stereotypes and protected class differences embedded in training data will taint features like perceived speeding propensity. As pricing sophistication and segmentation increase, affordability pressure on higher-risk groups is expected to rise, likely inviting more regulatory attention, while reliance on external data vendors may fall as internal teams use large language model tooling. At the same time, opaque, biased large language model-derived factors are identified as a significant operational risk for fair pricing, even as they offer a potential lifeline for traditional generalised linear models by enriching their feature space.
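
The mapping-table workflow described above (send each factor level with a prescribed response scale, validate the reply, then merge the static mapping into the modelling data) might look roughly like the following Python sketch. It is not the tooling from the car model example. The 1 to 5 scale, the feature keys, the prompt wording, the car model names and their scores are all invented for illustration, and `stub_llm` is a hypothetical stand-in for a real API call so that the code runs offline.

```python
import json

SCALE = range(1, 6)  # assumed prescribed response scale: integers 1 to 5
FEATURES = ("risk", "boy_racer", "coolness")  # hypothetical feature names

def build_prompt(car_model):
    """Tie the request to a fixed scale and a machine-readable reply format."""
    keys = ", ".join(FEATURES)
    return (
        f"Car model: {car_model}\n"
        f"Rate each of these from 1 (lowest) to 5 (highest): {keys}.\n"
        "Reply with a JSON object only."
    )

def stub_llm(prompt):
    """Hypothetical stand-in for an API call; returns canned, made-up replies."""
    canned = {
        "Alpha GT": {"risk": 4, "boy_racer": 5, "coolness": 4},
        "Beta Estate": {"risk": 2, "boy_racer": 1, "coolness": 2},
    }
    for name, scores in canned.items():
        if name in prompt:
            return json.dumps(scores)
    return "I am not sure."  # simulates a reply that ignores the format

def parse_scores(raw):
    """Return validated scores, or None if the reply breaks the prescribed scale."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    scores = {}
    for key in FEATURES:
        value = data.get(key)
        if type(value) is not int or value not in SCALE:
            return None
        scores[key] = value
    return scores

def build_mapping(levels, llm):
    """Score each distinct level once; collect levels that could not be mapped."""
    mapping, unmapped = {}, []
    for level in sorted(set(levels)):
        scores = parse_scores(llm(build_prompt(level)))
        if scores is None:
            unmapped.append(level)
        else:
            mapping[level] = scores
    return mapping, unmapped

def merge_features(rows, mapping):
    """Left-join the static mapping onto the rows; unmapped levels get None."""
    missing = {key: None for key in FEATURES}
    return [{**row, **mapping.get(row["car_model"], missing)} for row in rows]

policies = [
    {"policy_id": 1, "car_model": "Alpha GT"},
    {"policy_id": 2, "car_model": "Beta Estate"},
    {"policy_id": 3, "car_model": "Gamma Unknown"},
    {"policy_id": 4, "car_model": "Alpha GT"},
]
mapping, unmapped = build_mapping([p["car_model"] for p in policies], stub_llm)
enriched = merge_features(policies, mapping)
```

Rejecting any reply outside the prescribed scale, rather than coercing it, keeps malformed or hallucinated answers out of the mapping table. The `unmapped` list gives a natural place for manual review before the new columns go through the usual statistical validation.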
