Organisations are sitting on large quantities of structured data in relational databases and spreadsheets that remain underused, because specialists still spend much of their time on repetitive work such as cleaning tables, extracting features and linking datasets. Dutch researcher Madelon Hulsebos, based at the Centrum Wiskunde & Informatica (CWI) in the Netherlands, is tackling this by developing “table representation learning”, a method that enables artificial intelligence to interpret what tables mean rather than simply search them by column names. After a PhD at the University of Amsterdam and postdoctoral work at the University of California, Berkeley, she now leads the Table Representation Learning Lab at CWI, guiding a team of three PhD students, two postdocs and six master’s students.
Backed by an NWO AiNed Fellowship Grant under the National Growth Fund programme, Hulsebos launched the DataLibra project, which runs from 2024 to 2029 and aims to build practical tools that make querying organisational data as simple as a web search. She argues that artificial intelligence can lower the barrier by letting users ask questions in natural language rather than having to master programming, business intelligence tools and relational database concepts. The challenge is that each system uses different column names and logic, which limits traditional techniques such as SQL and pattern matching. Her work therefore focuses on models that generalise from context to identify and combine relevant tables, moving from basic information retrieval to what she terms “insight retrieval”. Hulsebos stresses that full automation is not the goal, because users must be able to understand and explain why a specific answer was produced, making transparency, iteration and robustness central design requirements.
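To make the idea concrete, the minimal sketch below shows what context-based table retrieval could look like: a question in plain language is matched against tables by their context (column names plus sample values) rather than by exact column-name lookup, and the best matches would then be handed to a query generator whose draft remains visible to the user. The catalogue, token-overlap scoring and function names are illustrative assumptions, not DataLibra code.

```python
# Illustrative sketch (not the DataLibra implementation): match a natural
# language question to candidate tables using their context, then pass the
# top matches on for query generation.

from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    columns: list[str]
    sample_values: list[str] = field(default_factory=list)

    def context_tokens(self) -> set[str]:
        # Context = table name, schema and a few cell values, as lowercase tokens.
        text = " ".join([self.name, *self.columns, *self.sample_values])
        return set(text.lower().replace("_", " ").split())

def rank_tables(question: str, tables: list[Table]) -> list[tuple[Table, float]]:
    """Score tables by overlap between question tokens and table context.
    A real system would use learned table representations instead."""
    q_tokens = set(question.lower().split())
    scored = []
    for t in tables:
        overlap = len(q_tokens & t.context_tokens())
        scored.append((t, overlap / max(len(q_tokens), 1)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    catalogue = [
        Table("orders", ["order_id", "customer_id", "total_amount", "order_date"],
              ["2024-03-01", "129.90"]),
        Table("hr_employees", ["emp_id", "department", "salary"]),
    ]
    question = "What was the total order amount per customer in March?"
    for table, score in rank_tables(question, catalogue):
        print(f"{table.name}: {score:.2f}")
    # The top-ranked tables and their schemas would then be passed to a
    # language model to draft the SQL, keeping the answer explainable.
```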
Hulsebos sees table representation learning as a way to automate the commonly cited “80% data work and 20% modelling” split in data science, freeing experts to focus on more critical questions while also empowering non-specialists to query relational databases directly in plain language. She is sceptical of many current vendor claims about artificial intelligence-powered analytics, pointing to benchmarks where success rates are often zero and highlighting the need for systems that can justify their outputs rather than simply generate confident responses. The importance of context and explanation came into focus in a recent collaboration with the United Nations Humanitarian Data Centre on detecting sensitive data in humanitarian datasets. Together with master’s student Liang Telkamp, she developed two mechanisms: one that reasons over the full data context to reduce false positives, and a “retrieve then detect” approach that dynamically links datasets to relevant policies and protocols so that assessments change as conflicts or situations evolve. Quality Assessment Officers at the UN found the contextualised explanations from large language models particularly valuable for navigating long information-sharing protocols, and Telkamp’s work was recognised with the Amsterdam AI Thesis Award.
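The retrieve-then-detect pattern can be sketched in a few lines: first retrieve the protocol passages most relevant to a dataset, then run the sensitivity check with that retrieved context attached, so the verdict shifts when the applicable policies change. The policy snippets, keyword check and function names below are invented for illustration; in the actual work a large language model performs the reasoning and produces the explanation.

```python
# Illustrative "retrieve then detect" sketch (names and policy text invented,
# not taken from the UN project): retrieve relevant protocol passages first,
# then detect sensitivity with that context attached.

def retrieve_policies(dataset_description: str, policies: dict[str, str],
                      top_k: int = 2) -> list[tuple[str, str]]:
    """Rank policy passages by token overlap with the dataset description;
    a production system would use embedding-based retrieval."""
    d_tokens = set(dataset_description.lower().split())
    scored = sorted(
        policies.items(),
        key=lambda item: len(d_tokens & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def detect_sensitivity(dataset_description: str, policies: dict[str, str]) -> dict:
    """Attach retrieved policy context to the detection step. The keyword
    check below is a placeholder; in the approach described above a large
    language model reasons over the dataset and the retrieved protocol text
    and explains its verdict."""
    context = retrieve_policies(dataset_description, policies)
    flagged = any(word in dataset_description.lower()
                  for word in ("location", "household", "gps"))
    return {
        "sensitive": flagged,
        "applied_policies": [name for name, _ in context],
        "explanation": "Placeholder for an LLM-generated, policy-grounded rationale.",
    }

if __name__ == "__main__":
    policies = {
        "geodata_protocol": "Precise location and GPS coordinates require aggregation before sharing.",
        "survey_protocol": "Household survey microdata must be reviewed for re-identification risk.",
    }
    print(detect_sensitivity(
        "Household survey with GPS location of respondents in a conflict area",
        policies,
    ))
```

Because the policies are retrieved at assessment time rather than baked into the detector, updating a protocol changes the outcome without retraining anything, which is what allows assessments to track evolving situations.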
For Hulsebos, the UN project illustrates the broader organisational problem of making data both accessible and comprehensible, including understanding sensitivities before information is published on data-sharing portals that could feed model training sets. She wants to surface unknown datasets and combinations so people can uncover insights they did not realise were possible, reducing the need to route every question through business intelligence or data science teams. In her view, dependence on dashboards and SQL queries introduces delays that can leave an insight stale by the time it is delivered, so she focuses on artificial intelligence-powered systems that shorten “speed to insight” by allowing everyone from sales staff to CEOs to query data directly. Concrete tools are in development: one PhD student is building components to automate dataset retrieval and support SQL generation, with the first open-source versions expected within the next two months. An earlier tool, DataScout, created during her time at the University of California, Berkeley, already showed in user studies that task-based search with large language models helped data scientists find relevant datasets faster than traditional keyword-based data platforms, addressing situations where gathering the right data for a machine learning model could otherwise take two weeks to a month.
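As a rough illustration of task-based dataset search of the kind DataScout demonstrated, the sketch below expands a task description into the attributes a suitable dataset should contain and ranks catalogued datasets by how well they cover them. The hard-coded expansion table, dataset names and scoring are assumptions standing in for the large language model step, not DataScout’s actual implementation.

```python
# Illustrative task-based dataset search (not the DataScout code): expand the
# machine learning task into needed attributes, then rank datasets by coverage
# instead of matching the user's raw keywords.

def expand_task(task: str) -> set[str]:
    """Stand-in for LLM-based task expansion: map a task description to the
    kinds of columns and entities a suitable dataset should contain."""
    expansions = {
        "churn": {"customer", "subscription", "cancellation", "tenure", "usage"},
        "demand": {"sales", "date", "store", "product", "quantity"},
    }
    needed = set(task.lower().split())
    for keyword, attrs in expansions.items():
        if keyword in task.lower():
            needed |= attrs
    return needed

def rank_datasets(task: str, datasets: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank catalogued datasets by how many task-relevant attributes they cover."""
    needed = expand_task(task)
    return sorted(
        ((name, len(needed & attrs)) for name, attrs in datasets.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

if __name__ == "__main__":
    catalogue = {
        "crm_customers": {"customer", "tenure", "plan", "cancellation"},
        "web_logs": {"session", "page", "timestamp"},
    }
    print(rank_datasets("predict customer churn next quarter", catalogue))
```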
