Artificial Intelligence models split on job disruption estimates

A new working paper finds that leading Artificial Intelligence models give sharply different answers when asked which jobs they are most likely to disrupt. The findings raise doubts about using model-generated exposure scores to guide labor policy or economic analysis.

Researchers from Northwestern University and American University found that large language models do not reliably agree when asked to judge which occupations Artificial Intelligence is likely to disrupt. In a working paper published by the National Bureau of Economic Research, the team argued that current estimates of job exposure can shift dramatically depending on which model is used, making the results fragile for policymakers, economists, and workforce planners.

The team tested four frontier Artificial Intelligence systems (GPT-4, ChatGPT-5, Gemini 2.5, and Claude 4.5), using the same rubric to rate nearly 19,000 work tasks. The results showed deep disagreement. Mean exposure scores ranged from 0.14 (GPT-4 and Gemini) to 0.51 (Claude), a 3.6-fold difference. Pairwise agreement between models fell as low as 57%, a level the researchers described as only “fair.”

The largest disagreements occurred in occupations that mix cognitive and physical duties, such as management, teaching, and sales. Management roles ranged from roughly 0.08 (Gemini) to 0.83 (Claude). Computer and mathematical occupations ranged from 0.42 (Gemini) to 0.95 (Claude). Educational instruction, life sciences, and sales all showed spreads of 0.30 or more across the four models.
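For readers who want to check the arithmetic behind these figures, here is a minimal sketch in Python. It uses only the low and high values quoted above; the function and variable names are illustrative and do not come from the working paper, which may compute its summary statistics differently.

```python
# Low and high values as quoted in this article. Per-model scores in
# between are not reported here, so only the extremes are used.
MEAN_EXPOSURE_LOW = 0.14   # GPT-4 and Gemini
MEAN_EXPOSURE_HIGH = 0.51  # Claude

# Occupation group -> (lowest reported score, highest reported score)
occupation_ranges = {
    "management": (0.08, 0.83),
    "computer and mathematical": (0.42, 0.95),
}


def fold_difference(low: float, high: float) -> float:
    """Ratio of the highest score to the lowest score."""
    return high / low


def spread(low: float, high: float) -> float:
    """Absolute gap between the highest and lowest score."""
    return high - low


fold = fold_difference(MEAN_EXPOSURE_LOW, MEAN_EXPOSURE_HIGH)
spreads = {
    name: round(spread(low, high), 2)
    for name, (low, high) in occupation_ranges.items()
}

print(f"Fold difference in mean exposure: {fold:.1f}x")
for name, gap in spreads.items():
    print(f"Spread for {name}: {gap:.2f}")
```

The ratio of 0.51 to 0.14 is the 3.6-fold difference. The two occupation groups with quoted ranges show gaps of 0.75 and 0.53, both well above the 0.30 threshold mentioned for the other groups.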

The models were more aligned at the extremes. Physical jobs like construction were generally rated as relatively safe, while coding-related work was broadly seen as vulnerable. The sharpest uncertainty appeared in white-collar occupations in the middle, where model judgments diverged substantially and produced conflicting pictures of likely disruption.

The inconsistency also altered downstream economic conclusions. At the county level, Claude 4.5 produced a statistically significant negative relationship between Artificial Intelligence exposure and employment. In contrast, GPT-4, ChatGPT-5, and Gemini 2.5 all found no significant effect, with Gemini even yielding a positive, though insignificant, coefficient. At the individual level, all models gave significant negative results, but magnitudes varied: Gemini showed the largest effect, 2.4 times the original GPT-4 estimate.

The researchers said conclusions about whether large language model exposure reduces employment, and by how much, depend on an often unreported choice of which model performed the task ratings. They argued that asking Artificial Intelligence systems to assess their own capabilities is circular and called for a shift toward measures based on actual Artificial Intelligence usage data rather than self-referential model judgments.


