Medex: a dataset for distilling knowledge priors from literature for therapeutic design

Artificial Intelligence-driven discovery can reduce design time, but models that lack experimental priors may propose candidates that violate laboratory constraints. Medex is a literature-derived dataset of design priors containing 32.3 million pairs of natural-language facts and entity representations, intended to support safer therapeutic design.

Artificial Intelligence-driven discovery can shorten design cycles but risks proposing candidates that violate experimental constraints when models lack laboratory priors. In a new analysis of diverse models on the GuacaMol benchmark, the authors find that supervised classifiers flag over 60% of proposed molecules as having a high probability of being mutagenic. To address this gap they introduce Medex, a dataset of priors for design problems, extracted from literature describing compounds used in laboratory settings.

Medex is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. The dataset consists of 32.3 million pairs of natural language facts and appropriate entity representations, including SMILES or RefSeq IDs. The paper frames these pairs as knowledge priors that can be incorporated into model pretraining or used as constraints during optimization. The authors publish the dataset on HuggingFace at https://huggingface.co/datasets/DocAndDesign/Medex and say they will provide expanded versions as the available literature grows.
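The fact–entity pairs can be thought of as simple records pairing one natural-language statement with one machine-readable entity identifier. The following is a minimal sketch of that structure; the field names, example facts, and helper function are illustrative assumptions, not the dataset's actual schema or contents:

```python
from dataclasses import dataclass

@dataclass
class KnowledgePrior:
    """A literature-derived fact paired with an entity representation."""
    fact: str          # concise natural-language summary of one finding
    entity: str        # SMILES string (small molecule) or RefSeq ID (sequence)
    entity_type: str   # "smiles" or "refseq"

# Invented example records for illustration (not actual Medex entries)
priors = [
    KnowledgePrior(
        fact="Exhibited hepatotoxicity at high doses in rodent studies.",
        entity="CC(=O)Nc1ccc(O)cc1",  # acetaminophen
        entity_type="smiles",
    ),
    KnowledgePrior(
        fact="Loss-of-function variants impair drug metabolism.",
        entity="NM_000771",  # a RefSeq mRNA identifier
        entity_type="refseq",
    ),
]

def facts_for(entity_type: str, priors: list) -> list:
    """Collect all facts attached to a given representation type."""
    return [p.fact for p in priors if p.entity_type == entity_type]

smiles_facts = facts_for("smiles", priors)
```

Keeping the entity in a standard representation (SMILES or RefSeq) is what lets the same fact be attached to a model input during pretraining or looked up as a constraint at design time.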

To demonstrate the dataset’s utility, the team trains LLM, CLIP, and LLaVA architectures to reason jointly about text and design targets and evaluates them on tasks from the Therapeutic Data Commons. In supervised prediction problems that use Medex for pretraining, their best models, with only 15M learnable parameters, outperform the much larger 2B-parameter TxGemma on both regression and classification TDC tasks and perform comparably on average with 9B-parameter models. The authors also show that models built with Medex can be applied as constraints while optimizing for novel molecules in GuacaMol, producing proposals that are safer and nearly as effective. The release aims to enable models with stronger experimental priors and to improve safety in computational therapeutic design.
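Using a learned safety model as a constraint during optimization amounts to a generate-and-filter loop: score candidates on the design objective, but discard any that the prior-informed predictor flags as unsafe. The sketch below illustrates that pattern with deterministic stand-in functions; `mutagenicity_risk` and `objective` are placeholders invented for this example, not the paper's actual Medex-pretrained classifier or GuacaMol scoring function:

```python
def mutagenicity_risk(smiles: str) -> float:
    """Stand-in for a safety classifier returning P(mutagenic).
    A real system would use a model pretrained on knowledge priors;
    here a deterministic character-sum stub keeps the sketch runnable."""
    return (sum(map(ord, smiles)) % 100) / 100.0

def objective(smiles: str) -> float:
    """Stand-in design objective (e.g. a GuacaMol scoring function).
    Rewards character diversity purely for illustration."""
    return len(set(smiles)) / max(len(smiles), 1)

def constrained_best(candidates: list, risk_threshold: float = 0.4):
    """Keep only candidates below the risk threshold, then return the
    highest-scoring survivor (None if every candidate is filtered out)."""
    safe = [s for s in candidates if mutagenicity_risk(s) < risk_threshold]
    return max(safe, key=objective, default=None)

best = constrained_best(["CCO", "c1ccccc1", "CC(=O)O"])
```

Filtering before ranking makes the safety prior a hard constraint rather than a soft penalty, which is one way a model could trade a small amount of objective score for proposals that respect experimental knowledge.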

Impact Score: 62

Have large language models plateaued?

A Hacker News thread debates whether large language models have plateaued or whether recent gains come from better tooling and applications, with autonomous Artificial Intelligence agents showing striking demos and notable failures.

China eyes chip-stacking to narrow gap with NVIDIA

Wei Shaojun said China could narrow its technology gap with NVIDIA by stacking 14 nm logic chips with 18 nm DRAM and new compute architectures. The approach is aimed at improving Artificial Intelligence performance and energy efficiency while relying on a fully domestic supply chain.

Pat Gelsinger’s xLight gets tentative U.S. support for EUV FELs

The U.S. Department of Commerce has signed a non-binding letter of intent to support xLight, a venture-backed startup focused on EUV Free Electron Lasers, under the CHIPS and Science Act, paving the way for government funding of an unspecified amount. The company, which added Pat Gelsinger as executive chairman, plans to build its first system at the Albany Nanotech Complex.
