Artificial intelligence model boosts yeast-based protein drug production

MIT chemical engineers trained a species-specific language model to design yeast-friendly gene sequences, improving production of several therapeutic proteins and highlighting the limits of traditional codon optimization metrics.

MIT chemical engineers have developed a large language model that learns the codon "grammar" of the industrial yeast Komagataella phaffii and uses it to design gene sequences that the organism translates more efficiently. Codons are three-letter DNA units that encode amino acids, and because the genetic code includes 64 possible codons for only 20 amino acids, organisms evolve strong preferences for particular synonymous codons and local sequence patterns. Standard codon optimization tools typically emphasize the most frequent codons, but they often overlook how codon context, transfer RNA availability, and regulatory motifs shape real-world protein expression. The new model, built as a GRU-based encoder-decoder network, was trained on amino acid sequences and matching coding DNA from roughly 5,000 native K. phaffii proteins sourced from a public National Center for Biotechnology Information dataset, allowing it to infer species-specific usage patterns without hand-coded rules.
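The codon redundancy described above is what gives optimizers room to work: every protein has many synonymous encodings, and each organism favors some over others. The sketch below, a hypothetical illustration rather than the paper's method, shows the kind of per-amino-acid codon frequency table that simple frequency-based tools rely on (the learned model goes further by also capturing codon context). The tiny codon table and example sequences are placeholders.

```python
from collections import Counter, defaultdict

# Fragment of the standard codon table (the full table has 64 entries);
# only a few amino acids are included here, for illustration.
CODON_TO_AA = {
    "TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "ATG": "M",
}

def synonymous_usage(coding_sequences):
    """Count how often each synonymous codon is used for each amino acid,
    then normalize to per-amino-acid frequencies."""
    counts = defaultdict(Counter)
    for seq in coding_sequences:
        # Walk the sequence codon by codon (3 bases at a time).
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            aa = CODON_TO_AA.get(codon)
            if aa is not None:
                counts[aa][codon] += 1
    return {aa: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for aa, ctr in counts.items()}

# Toy "genome": two short coding sequences.
usage = synonymous_usage(["ATGCTGGCTTTA", "ATGCTGGCC"])
```

In this toy example, leucine is encoded by CTG twice and TTA once, so a frequency-based optimizer would pick CTG for every leucine; the article's point is that such global preferences miss context effects the language model can learn.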

After training, the researchers used the model to generate codon-optimized DNA sequences for six recombinant proteins of varying size and complexity: human growth hormone, human granulocyte colony-stimulating factor, a VHH nanobody called 3B2, an engineered SARS-CoV-2 receptor-binding domain, human serum albumin, and the IgG1 monoclonal antibody trastuzumab. They compared these sequences against versions produced by four commercial tools from Azenta, IDT, GenScript, and Thermo Fisher by inserting each construct into K. phaffii and measuring the resulting protein titers. Across the six proteins, the MIT model produced the highest titer for five and ranked second for the remaining one. For human growth hormone and human granulocyte colony-stimulating factor, the team observed about a 25% improvement, while human serum albumin showed about a threefold improvement when comparing optimized constructs to the native coding sequence. With native coding sequences, human serum albumin reached a titer of 45 mg/L, while bovine serum albumin and mouse serum albumin reached 60 mg/L and 100 mg/L, respectively; codon optimization raised bovine and mouse serum albumin titers further, to 75 mg/L and 135 mg/L.

The study also dissected what the model learned internally and how its designs differ from conventional metrics. Visualizations of learned amino acid embeddings showed clusters organized by physicochemical traits, including aliphatic, aromatic, basic, acid/amide, and alcohol groups, with hydrophobic residues grouping together and polar residues grouping together. Constructs designed by the model for the six tested proteins contained no negative cis-regulatory elements in the analysis described and also avoided negative repeat elements, despite not being explicitly trained to filter these features. In contrast, global codon usage measures such as the Codon Adaptation Index and codon pair metrics did not consistently correlate with final titers, and in some cases higher Codon Adaptation Index scores were associated with lower yields. The researchers note that the model is trained for a single host species and that models trained on other organisms, including humans and cows, produce different predictions, underscoring the need for species-specific approaches. They position the work as one lever among many in biomanufacturing, arguing that more predictive codon design can cut process development uncertainty and help move new protein-based drugs into production more quickly, while acknowledging that cellular engineering, media formulation, and process optimization remain critical components.
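For readers unfamiliar with the Codon Adaptation Index mentioned above: CAI scores a gene as the geometric mean of per-codon "relative adaptiveness" weights, where each codon's weight is its frequency divided by the frequency of the most common synonymous codon in a reference set of highly expressed genes. The sketch below uses a made-up weight table purely to show the calculation; real weights are organism-specific and derived from reference genes.

```python
import math

# Hypothetical relative-adaptiveness weights w = f(codon) / f(best synonym);
# the values here are illustrative, not real K. phaffii weights.
WEIGHTS = {
    "CTG": 1.0, "TTA": 0.4,   # leucine: CTG assumed preferred
    "GCT": 1.0, "GCC": 0.7,   # alanine: GCT assumed preferred
    "ATG": 1.0,               # methionine has no synonyms
}

def cai(sequence):
    """Codon Adaptation Index: geometric mean of per-codon weights,
    computed in log space for numerical stability."""
    ws = [WEIGHTS[sequence[i:i + 3]]
          for i in range(0, len(sequence) - len(sequence) % 3, 3)]
    return math.exp(sum(math.log(w) for w in ws) / len(ws))

best = cai("ATGCTGGCT")    # all preferred codons -> CAI of 1.0
mixed = cai("ATGTTAGCC")   # rarer synonyms lower the score
```

A sequence built entirely from preferred codons scores 1.0 by construction, which is why CAI is a popular optimization target; the study's observation is that maximizing this single global score does not reliably maximize expressed protein titer.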


