Recent research led by Alex Lu and collaborators at Microsoft highlights important limitations of zero-shot foundation models in single-cell biology, challenging the prevailing excitement surrounding their use. Foundation models, inspired by the success of systems like ChatGPT, have garnered interest within scientific communities for their potential to make sense of complex biological data. However, findings from the study "Assessing the Limits of Zero-shot Foundation Models in Single-cell Biology," discussed on the Abstracts podcast, show that when these models are evaluated in zero-shot settings, meaning they are used without any fine-tuning, they perform worse than simpler, established statistical and machine learning approaches long used by biologists.
Lu explains that single-cell foundation models are claimed to offer deeper insights into the large datasets generated by measuring gene expression in individual cells, a process essential to understanding cellular differentiation and to drug discovery. Despite these claims, the team's zero-shot evaluations reveal that the models often fail to outperform baseline techniques such as scVI and Harmony, methods biologists have long relied upon to interpret complex gene expression data. The study emphasizes that, unlike typical machine learning tasks where models can be fine-tuned on labeled data, biological discovery frequently involves unknowns that rule out standard fine-tuning, making zero-shot performance a critical benchmark.
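The episode itself does not include code, but a minimal sketch may help make the setup concrete. The snippet below, assuming the scvi-tools and scanpy packages, fits scVI on a dataset as the established baseline and reserves a slot for a pretrained foundation model's cell embeddings used with no fine-tuning; the file name, the obsm keys, and the PCA stand-in for those embeddings are illustrative assumptions, not the authors' pipeline.

```python
import scanpy as sc
import scvi

# Hypothetical dataset: raw counts with annotated cell types in adata.obs["cell_type"].
adata = sc.read_h5ad("single_cell_counts.h5ad")

# Established baseline: fit scVI on the dataset itself, the standard unsupervised workflow.
scvi.model.SCVI.setup_anndata(adata)
baseline = scvi.model.SCVI(adata, n_latent=30)
baseline.train()
adata.obsm["X_scvi"] = baseline.get_latent_representation()

# Zero-shot slot: in the actual evaluation this would hold the frozen foundation
# model's cell embeddings, extracted without any task-specific training.
# PCA on normalized counts is used here purely as a runnable placeholder.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=30)
adata.obsm["X_zero_shot"] = adata.obsm["X_pca"]
```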
The research represents a methodological shift for the field: prior assessments of single-cell models focused heavily on fine-tuning, potentially inflating claims of practical benefit. By extracting and analyzing the models' internal representations without any adaptation, Lu's team found that the expected foundational biological knowledge was lacking, with classic approaches offering more reliable results for discovery. The study underscores the importance of tailoring AI evaluation to the unique challenges of biology and encourages methodologists to adopt more rigorous, context-aware benchmarks. As more groups follow these practices, the findings pave the way for improved model development and more credible applications of AI in the life sciences, driving home the message that context, methodology, and evaluation must evolve together for meaningful progress.
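Continuing the sketch above, and again as an illustration rather than the paper's exact protocol, one common way to compare such representations without any fine-tuning is to cluster each embedding and score agreement with the annotated cell types, for example with the adjusted Rand index:

```python
import scanpy as sc
from sklearn.metrics import adjusted_rand_score

# Evaluate each representation by unsupervised clustering, then compare the
# resulting clusters to the known cell-type labels.
for rep in ["X_scvi", "X_zero_shot"]:
    sc.pp.neighbors(adata, use_rep=rep)             # kNN graph built on the embedding
    sc.tl.leiden(adata, key_added=f"leiden_{rep}")  # Leiden clustering, no labels used
    ari = adjusted_rand_score(adata.obs["cell_type"], adata.obs[f"leiden_{rep}"])
    print(f"{rep}: adjusted Rand index vs. cell types = {ari:.3f}")
```

A higher score for the simpler baseline than for the frozen foundation-model embedding is the kind of result the study reports.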