Researchers from Northwestern University and American University found that large language models do not reliably agree when asked to judge which occupations artificial intelligence (AI) is likely to disrupt. In a working paper published by the National Bureau of Economic Research, the team argued that current estimates of job exposure can shift dramatically depending on which model produces them, which makes those estimates a fragile basis for decisions by policymakers, economists, and workforce planners.
The team tested four frontier AI systems, GPT-4, ChatGPT-5, Gemini 2.5, and Claude 4.5, using the same rubric to rate nearly 19,000 work tasks. The results showed deep disagreement. Mean exposure scores ranged from 0.14 (GPT-4 and Gemini) to 0.51 (Claude), a 3.6-fold difference. Pairwise agreement between models fell as low as 57%, a level the researchers called only “fair.” The largest disagreements occurred in occupations that mix cognitive and physical duties, such as management, teaching, and sales. Scores for management roles ranged from roughly 0.08 (Gemini) to 0.83 (Claude), and scores for computer and mathematical occupations ranged from 0.42 (Gemini) to 0.95 (Claude). Educational instruction, life sciences, and sales all showed spreads of 0.30 or more across the four models.
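To make the arithmetic behind these figures concrete, the following is a minimal sketch of how a fold difference in mean exposure and a pairwise agreement rate could be computed. The ratings, the model names, and the helper functions are hypothetical illustrations, not data or code from the paper. The sketch also assumes binary task ratings and simple percent agreement, which may differ from the rubric and the agreement statistic the researchers actually used.

```python
from itertools import combinations

# Hypothetical task-level ratings (1 = exposed, 0 = not exposed).
# These are made-up values for illustration, not data from the paper.
ratings = {
    "model_a": [0, 0, 1, 0, 0, 1, 0, 0],
    "model_b": [1, 0, 1, 1, 0, 1, 1, 0],
    "model_c": [0, 0, 1, 0, 1, 0, 0, 0],
}


def mean_exposure(scores):
    """Average exposure score across all rated tasks."""
    return sum(scores) / len(scores)


def percent_agreement(a, b):
    """Share of tasks on which two raters gave the same rating."""
    if len(a) != len(b):
        raise ValueError("both raters must score the same tasks")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Mean exposure per model and the ratio of the highest to the lowest mean.
means = {name: mean_exposure(scores) for name, scores in ratings.items()}
fold_difference = max(means.values()) / min(means.values())

# Agreement rate for every pair of models.
agreement = {
    (m1, m2): percent_agreement(ratings[m1], ratings[m2])
    for m1, m2 in combinations(ratings, 2)
}

# The same ratio applied to the means reported in the article.
reported_fold = round(0.51 / 0.14, 1)

if __name__ == "__main__":
    print(means)
    print(fold_difference)
    print(agreement)
    print(reported_fold)
```

Percent agreement is used here only because it is the simplest measure to read. Chance-corrected statistics are a common alternative when ratings are heavily skewed toward one category.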
The models were more aligned at the extremes. Physical jobs like construction were generally rated as relatively safe, while coding-related work was broadly seen as vulnerable. The sharpest uncertainty appeared in white-collar occupations in the middle, where model judgments diverged substantially and produced conflicting pictures of likely disruption.
The inconsistency also altered downstream economic conclusions. At the county level, ratings from Claude 4.5 produced a statistically significant negative relationship between AI exposure and employment. In contrast, ratings from GPT-4, ChatGPT-5, and Gemini 2.5 all showed no significant effect, with Gemini even yielding a positive, though insignificant, coefficient. At the individual level, all models gave significant negative results, but magnitudes varied: Gemini showed the largest effect, 2.4 times the original GPT-4 estimate.
The researchers said conclusions about whether large language model exposure reduces employment, and by how much, depend on an often unreported choice: which model performed the task ratings. They argued that asking AI systems to assess their own capabilities is circular and called for a shift toward measures based on actual AI usage data rather than self-referential model judgments.
