Medex: a dataset for distilling knowledge priors from literature for therapeutic design

Artificial Intelligence-driven discovery can reduce design time but may violate experimental priors. Medex is a literature-derived dataset of design priors containing 32.3 million pairs of natural language facts and entity representations to support safer therapeutic design.

Artificial Intelligence-driven discovery can shorten design cycles, but models that lack laboratory priors risk proposing candidates that violate experimental constraints. In a new analysis of diverse models on the GuacaMol benchmark, the authors report that supervised classifiers assigned a high probability of mutagenicity to over 60% of the proposed molecules. To address this gap, they introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings.

Medex is constructed with LLM pipelines that identify therapeutic entities in relevant paragraphs and summarize the information into concise, fair-use facts. The dataset consists of 32.3 million pairs of natural language facts and appropriate entity representations, including SMILES strings or RefSeq IDs. The paper frames these pairs as knowledge priors that can be incorporated into model pretraining or used as constraints during optimization. The authors have published the dataset on Hugging Face at https://huggingface.co/datasets/DocAndDesign/Medex and say they will provide expanded versions as the available literature grows.

To demonstrate the dataset’s utility, the team trains LLM, CLIP, and LLaVA architectures to reason jointly about text and design targets and evaluates them on tasks from the Therapeutics Data Commons (TDC). On supervised prediction problems, their best Medex-pretrained models, with 15M learnable parameters, outperform the larger 2B-parameter TxGemma model on both regression and classification TDC tasks and perform comparably to 9B-parameter models on average. The authors also show that models built with Medex can be applied as constraints while optimizing for novel molecules in GuacaMol, producing proposals that are safer and nearly as effective. The release aims to enable models with stronger experimental priors and to improve safety in computational therapeutic design.
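The idea of applying a learned model as a constraint during optimization can be illustrated with a minimal sketch. This is not the authors' method: the `Candidate` type, the `constrained_score` and `rank_candidates` helpers, the `max_risk` and `penalty` parameters, and the `toy_risk_model` stand-in are all hypothetical names introduced here for illustration. The sketch assumes a risk model that maps a SMILES string to a probability (for example, of mutagenicity) and subtracts a penalty from the optimizer's objective whenever that probability exceeds a tolerance.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    smiles: str       # molecule written as a SMILES string
    objective: float  # score from the optimizer, higher is better


def constrained_score(
    candidate: Candidate,
    risk_model: Callable[[str], float],
    max_risk: float = 0.5,
    penalty: float = 1.0,
) -> float:
    """Objective minus a penalty for predicted risk above max_risk."""
    risk = risk_model(candidate.smiles)
    excess = max(0.0, risk - max_risk)  # only risk beyond the tolerance counts
    return candidate.objective - penalty * excess


def rank_candidates(
    candidates: List[Candidate],
    risk_model: Callable[[str], float],
    max_risk: float = 0.5,
    penalty: float = 1.0,
) -> List[Candidate]:
    """Sort candidates by penalized score, best first."""
    return sorted(
        candidates,
        key=lambda c: constrained_score(c, risk_model, max_risk, penalty),
        reverse=True,
    )


def toy_risk_model(smiles: str) -> float:
    """Hypothetical placeholder, not a real predictor: flags nitro groups."""
    return 0.9 if "[N+](=O)[O-]" in smiles else 0.1


# Made-up candidates and scores, for illustration only.
candidates = [
    Candidate("c1ccccc1[N+](=O)[O-]", 0.80),  # higher raw score, flagged as risky
    Candidate("CCO", 0.70),                   # lower raw score, low predicted risk
]
ranked = rank_candidates(candidates, toy_risk_model)
print([c.smiles for c in ranked])
```

In this toy setup the flagged molecule drops below the unflagged one despite its higher raw objective, which mirrors the trade-off the article describes: proposals that are safer and nearly as effective. In practice the placeholder would be replaced by a trained predictor, and a hard filter could be used instead of a soft penalty.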


