The human engine of artificial intelligence: datawork and LLM performance

Adriana Alvarado of IBM argues that datawork - the human processes of collecting, curating and documenting datasets - is the core driver of artificial intelligence performance and fairness. Her talk explores why Large Language Models amplify the stakes of these human choices and how synthetic data and robust governance fit into the solution.

“Datawork is deeply human,” said Adriana Alvarado, staff research scientist at IBM, in her presentation “LLM + Data: Building Artificial Intelligence with Real & Synthetic Data.” She framed data as the engine behind every model, from simple algorithms to the largest Large Language Models. Alvarado stressed that ongoing human decisions about collection, annotation and preparation shape an Artificial Intelligence system’s performance, fairness and utility, even though technical narratives often obscure those decisions.

Alvarado used the term datawork to describe the lifecycle activities that produce and maintain datasets: collecting, annotating, curating, deploying and iteratively refining data. She cautioned founders and investors that data is not a static commodity but a dynamic, human-shaped resource. Because these choices have downstream effects on model behavior, datawork is a strategic differentiator, yet one that remains undervalued and frequently invisible within broader Artificial Intelligence development processes.

The human element in datawork can introduce representational bias. Alvarado noted that many datasets do not represent the world equally, over-representing some regions, languages and perspectives while under-representing others, and that labeling and categorization decisions implicitly determine who is represented. As a result, even advanced Large Language Models trained on skewed data can perpetuate inequalities and perform poorly for underrepresented users and applications, raising ethical and practical risks for product deployments and global markets.
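To make the representational concern concrete, here is a minimal sketch, not drawn from Alvarado’s talk, of how a team might audit a corpus for skew across metadata fields; the records and field names ("language", "region") are illustrative assumptions.

```python
from collections import Counter

# Hypothetical corpus records; in practice these would come from your
# dataset's metadata. The "language" and "region" fields are illustrative.
corpus = [
    {"text": "...", "language": "en", "region": "north_america"},
    {"text": "...", "language": "en", "region": "europe"},
    {"text": "...", "language": "es", "region": "latin_america"},
    {"text": "...", "language": "en", "region": "north_america"},
]

def representation_report(records, field):
    """Return each field value's share of the corpus, largest first."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common()}

for field in ("language", "region"):
    print(field, representation_report(corpus, field))
# e.g. language {'en': 0.75, 'es': 0.25} -- a skew worth flagging
# before the dataset is used for training or evaluation.
```

Even a report this simple makes the implicit representation choices visible, which is the precondition for deciding whether a skew is acceptable for the intended application.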

To address data scarcity, privacy constraints and coverage gaps, practitioners increasingly turn to synthetic data generated by Large Language Models. Alvarado warned that synthetic data is not a cure-all: every synthetically generated dataset requires detailed documentation of its seed data, prompts and parameter settings to preserve provenance and enable accountability. She concluded with three forward-looking points: specialized datasets matter, scale alone does not ensure diversity or quality, and dataset categories must reflect real user needs and application conditions. The future of robust, ethical and performant Artificial Intelligence, she argued, depends as much on meticulous datawork as on algorithmic advances.
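Alvarado’s documentation point can be illustrated with a small provenance record. The Python sketch below is our illustration, not IBM tooling: it captures the seed data, prompt and parameter settings she says every synthetically generated dataset should carry, and all identifiers in it are hypothetical.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class SyntheticDataRecord:
    """Provenance metadata for one synthetic-data generation run."""
    seed_dataset: str        # identifier or hash of the seed data
    generator_model: str     # model used to generate the data
    prompt_template: str     # prompt used, recorded verbatim
    parameters: dict = field(default_factory=dict)  # sampling settings
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# All names below are hypothetical placeholders.
record = SyntheticDataRecord(
    seed_dataset="support-tickets-v3",
    generator_model="example-llm-7b",
    prompt_template="Paraphrase the following ticket: {ticket}",
    parameters={"temperature": 0.7, "top_p": 0.9, "n_samples": 5000},
)

# Store the record alongside the generated dataset so provenance
# survives handoffs between teams and enables later accountability.
print(json.dumps(asdict(record), indent=2))
```

Keeping such a record with the dataset is one lightweight way to satisfy the provenance and accountability requirement Alvarado describes, whatever tooling a team actually uses.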
