AI-driven discovery can shorten design cycles, but models that lack laboratory priors risk proposing candidates that violate experimental constraints. The authors report that, in a new analysis of diverse generative models on the GuacaMol benchmark, supervised classifiers flagged over 60% of proposed molecules as likely mutagenic. To address this gap, they introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in laboratory settings.
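As a hedged illustration of this kind of safety screen (not the paper's exact pipeline), one can score generated SMILES with a supervised mutagenicity classifier over Morgan fingerprints. The classifier, its training data, and the 0.5 threshold below are assumptions:

    # Sketch: estimate the share of likely-mutagenic molecules among proposals.
    # X_train / y_train would come from an Ames-style mutagenicity dataset
    # (hypothetical here; not the paper's setup).
    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestClassifier

    def fingerprint(smiles):
        """2048-bit Morgan fingerprint, or None for unparseable SMILES."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        arr = np.zeros((2048,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        return arr

    def fraction_likely_mutagenic(clf, proposed_smiles, threshold=0.5):
        """Share of parseable proposals the classifier flags as mutagenic."""
        fps = [f for f in map(fingerprint, proposed_smiles) if f is not None]
        probs = clf.predict_proba(np.array(fps))[:, 1]  # P(mutagenic)
        return float((probs > threshold).mean())

    # clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    # print(fraction_likely_mutagenic(clf, generated_smiles))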
Medex is constructed with LLM pipelines that identify therapeutic entities in relevant paragraphs and summarize the surrounding information into concise, fair-use facts. The dataset comprises 32.3 million pairs of natural-language facts and corresponding entity representations, such as SMILES strings or RefSeq IDs. The paper frames these pairs as knowledge priors that can be incorporated into model pretraining or used as constraints during optimization. The authors publish the dataset on HuggingFace at https://huggingface.co/datasets/DocAndDesign/Medex and plan to release expanded versions as the available literature grows.
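A minimal sketch of loading the released pairs with the HuggingFace datasets library follows; the split name and record fields are assumptions, so check the dataset card for the actual schema:

    # Stream Medex rather than downloading all 32.3M fact/entity pairs at once.
    from datasets import load_dataset

    ds = load_dataset("DocAndDesign/Medex", split="train", streaming=True)

    # Field names such as "fact" and "entity" are assumed for illustration.
    for record in ds.take(3):
        print(record)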
To demonstrate the dataset’s utility, the team trains LLM, CLIP, and LLaVA architectures to reason jointly about text and design targets and evaluates them on tasks from the Therapeutics Data Commons (TDC). In supervised prediction tasks that use Medex for pretraining, their best models, with only 15M learnable parameters, outperform the 2B-parameter TxGemma on both regression and classification TDC tasks and perform comparably to 9B models on average. The authors also show that models built with Medex can serve as constraints while optimizing for novel molecules in GuacaMol, yielding proposals that are safer yet nearly as effective. The release aims to enable models with stronger experimental priors and to improve safety in computational therapeutic design.
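The constraint idea can be sketched as a wrapper around a GuacaMol-style objective: proposals that a Medex-derived safety model flags are rejected so the optimizer steers toward safer candidates. The safety_model interface and the 0.5 threshold are assumptions, not the paper's exact mechanism:

    # Sketch: a safety-constrained scoring function for molecule optimization.
    def constrained_score(smiles, task_score, safety_model, threshold=0.5):
        """Task objective, zeroed out when the safety prior flags the molecule.

        safety_model.predict_unsafe_probability is a hypothetical API for a
        classifier pretrained with Medex facts (e.g., mutagenicity risk);
        task_score is the original GuacaMol objective for the given SMILES.
        """
        if safety_model.predict_unsafe_probability(smiles) > threshold:
            return 0.0  # hard rejection; the optimizer learns to avoid these
        return task_score(smiles)

A softer variant would multiply the task score by (1 - p_unsafe) instead of rejecting outright, trading a small amount of objective value for a smoother optimization landscape.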
