Medex: a dataset for distilling knowledge priors from literature for therapeutic design

Artificial intelligence-driven discovery can shorten design time, but its proposals may violate experimental constraints. Medex is a literature-derived dataset of design priors, containing 32.3 million pairs of natural language facts and entity representations, intended to support safer therapeutic design.

Artificial intelligence-driven discovery can shorten design cycles, but models that lack laboratory priors risk proposing candidates that violate experimental constraints. In a new analysis of diverse models on the GuacaMol benchmark, the authors report that supervised classifiers assigned a high probability of mutagenicity to over 60% of the proposed molecules. To address this gap, they introduce Medex, a dataset of priors for design problems, extracted from literature describing compounds used in lab settings.

Medex is constructed with LLM pipelines that identify therapeutic entities in relevant paragraphs and summarize the information into concise, fair-use facts. The dataset consists of 32.3 million pairs, each combining a natural language fact with an appropriate entity representation, such as a SMILES string or a RefSeq ID. The paper frames these pairs as knowledge priors that can be incorporated into model pretraining or used as constraints during optimization. The authors have published the dataset on HuggingFace at https://huggingface.co/datasets/DocAndDesign/Medex and say they will release expanded versions as the available literature grows.
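As a rough illustration of what one such pair might look like in code, the sketch below models a fact-entity pair and uses a simple heuristic to tell a RefSeq-style accession from a SMILES string. The class name, field names, placeholder facts, and regular expression are assumptions made for illustration; they are not the dataset's actual schema.

```python
import re
from dataclasses import dataclass

# Rough heuristic for RefSeq-style accessions such as NM_000546.6:
# two uppercase letters, an underscore, an alphanumeric body, optional version.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


@dataclass(frozen=True)
class FactEntityPair:
    """Hypothetical container for one (fact, entity) pair; not the real schema."""

    fact: str    # natural language fact summarized from the literature
    entity: str  # entity representation, e.g. a SMILES string or a RefSeq ID

    @property
    def entity_kind(self) -> str:
        """Guess the representation type; anything not RefSeq-like is treated as SMILES."""
        return "refseq" if REFSEQ_PATTERN.match(self.entity) else "smiles"


# Placeholder facts only; the entity strings are a SMILES and a RefSeq accession.
pairs = [
    FactEntityPair("Placeholder fact about a small molecule.", "CC(=O)Oc1ccccc1C(=O)O"),
    FactEntityPair("Placeholder fact about a transcript.", "NM_000546.6"),
]
kinds = [pair.entity_kind for pair in pairs]
print(kinds)
```

Routing on the representation type matters in practice because a molecule encoder and a sequence encoder would typically consume different inputs.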

To demonstrate the dataset’s utility, the team trains LLM, CLIP, and LLaVA architectures to reason jointly about text and design targets, then evaluates them on tasks from the Therapeutic Data Commons (TDC). In supervised prediction problems that use Medex for pretraining, their best models, with 15M learnable parameters, outperform the larger 2B-parameter TxGemma on both regression and classification TDC tasks and perform comparably to 9B-parameter models on average. The authors also show that models built with Medex can be applied as constraints when optimizing for novel molecules in GuacaMol, producing proposals that are safer and nearly as effective. The release aims to enable models with stronger experimental priors and to improve safety in computational therapeutic design.
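One simple way to picture "a model applied as a constraint" is a filter-then-rank step: discard candidates whose predicted risk exceeds a threshold, then rank the rest by the design objective. The sketch below is a minimal illustration under that assumption, not the authors' method; the function name, the threshold, the candidate identifiers, and all scores are hypothetical toy values rather than outputs of any real model.

```python
from typing import Callable, List, Tuple


def select_with_constraint(
    candidates: List[str],
    objective: Callable[[str], float],
    risk: Callable[[str], float],
    max_risk: float = 0.5,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Keep candidates with predicted risk <= max_risk, then rank by objective."""
    allowed = [c for c in candidates if risk(c) <= max_risk]
    ranked = sorted(allowed, key=objective, reverse=True)
    return [(c, objective(c)) for c in ranked[:top_k]]


# Toy stand-ins for an objective model and a safety model (hypothetical values).
toy_objective = {"mol_a": 0.4, "mol_b": 0.7, "mol_c": 0.9, "mol_d": 0.2}
toy_risk = {"mol_a": 0.1, "mol_b": 0.2, "mol_c": 0.8, "mol_d": 0.05}

selected = select_with_constraint(
    list(toy_objective),
    objective=toy_objective.__getitem__,
    risk=toy_risk.__getitem__,
)
print(selected)  # [('mol_b', 0.7), ('mol_a', 0.4)]
```

In this toy run the highest-scoring candidate, `mol_c`, is dropped because its predicted risk exceeds the threshold, which mirrors the trade-off described above: slightly lower objective scores in exchange for safer proposals.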


