Artificial intelligence model boosts yeast-based protein drug production

MIT chemical engineers trained a species-specific language model to design yeast-friendly gene sequences, improving production of several therapeutic proteins and highlighting the limits of traditional codon optimization metrics.

MIT chemical engineers have developed a large language model that learns the codon “grammar” of the industrial yeast Komagataella phaffii and uses it to design gene sequences that the organism translates more efficiently. Codons are three-letter DNA units that encode amino acids. Because the genetic code has 64 possible codons but only 20 amino acids, most amino acids can be written with several synonymous codons, and organisms evolve strong preferences for particular synonymous codons and local sequence patterns. Standard codon optimization tools typically favor the most frequent codons, but they often overlook how codon context, transfer RNA availability, and regulatory motifs shape real-world protein expression. The new model, built as a GRU-based encoder-decoder network, was trained on amino acid sequences and the matching coding DNA from roughly 5,000 native K. phaffii proteins sourced from a public National Center for Biotechnology Information dataset, allowing it to infer species-specific usage patterns without hand-coded rules.
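
To make the contrast concrete, the frequency-based strategy that conventional tools are described as favoring can be sketched in a few lines of Python. This is an illustration of that baseline only, not the MIT model and not any vendor's algorithm; the function names and the toy reference sequence are assumptions introduced here.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_usage(coding_sequences):
    """Count how often each codon is used for each amino acid ('*' = stop)."""
    counts = defaultdict(Counter)
    for seq in coding_sequences:
        seq = seq.upper()
        for pos in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[pos:pos + 3]
            amino_acid = CODON_TABLE.get(codon)  # skip codons with non-ACGT letters
            if amino_acid is not None:
                counts[amino_acid][codon] += 1
    return counts


def most_frequent_codon_design(protein, usage):
    """Back-translate a protein by always picking the most common codon."""
    return "".join(usage[aa].most_common(1)[0][0] for aa in protein)


# Toy reference "genome" (made up for illustration, not K. phaffii data).
usage = codon_usage(["ATGGCTGCTGCCAAGAAGAAATAA"])
print(most_frequent_codon_design("MAK", usage))  # ATGGCTAAG
```

A design rule like this looks at each codon in isolation, which is exactly the limitation the article attributes to standard tools: it cannot account for neighboring codons or sequence-level motifs.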

After training, the researchers used the model to generate codon-optimized DNA sequences for six recombinant proteins of varying size and complexity: human growth hormone, human granulocyte colony-stimulating factor, a VHH nanobody called 3B2, an engineered SARS-CoV-2 receptor-binding domain, human serum albumin, and the IgG1 monoclonal antibody trastuzumab. They compared these sequences against versions produced by four commercial tools from Azenta, IDT, GenScript, and Thermo Fisher by inserting each construct into K. phaffii and measuring the resulting protein titers. Across the six proteins, the MIT model produced the highest titer for five and ranked second for the remaining one. For human growth hormone and human granulocyte colony-stimulating factor, the team observed about a 25% improvement, while human serum albumin showed about a threefold improvement when the optimized construct was compared with the native coding sequence. With native coding sequences, human serum albumin reached a titer of 45 mg/L, while bovine serum albumin and mouse serum albumin reached 60 mg/L and 100 mg/L, respectively. Codon optimization raised the bovine and mouse serum albumin titers further, to 75 mg/L and 135 mg/L, gains of 25% and 35%, respectively.
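
The relative gains for the bovine and mouse serum albumin constructs can be recomputed directly from the titers quoted above. The short Python sketch below does only that arithmetic; the helper name `percent_gain` and the dictionary layout are assumptions made for illustration.

```python
def percent_gain(baseline_mg_per_l, optimized_mg_per_l):
    """Percentage increase of the optimized titer over the baseline titer."""
    if baseline_mg_per_l <= 0:
        raise ValueError("baseline titer must be positive")
    return 100 * (optimized_mg_per_l - baseline_mg_per_l) / baseline_mg_per_l


# (native titer, codon-optimized titer) in mg/L, as reported in the article.
REPORTED_TITERS = {
    "bovine serum albumin": (60, 75),
    "mouse serum albumin": (100, 135),
}

for name, (native, optimized) in REPORTED_TITERS.items():
    gain = percent_gain(native, optimized)
    print(f"{name}: {native} -> {optimized} mg/L ({gain:.0f}% gain)")
```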

The study also dissected what the model learned internally and how its designs differ from conventional metrics. Visualizations of the learned amino acid embeddings showed clusters organized by physicochemical traits, including aliphatic, aromatic, basic, acid/amide, and alcohol groups, with hydrophobic residues grouping together and polar residues grouping together. The constructs the model designed for the six tested proteins contained no negative cis-regulatory elements in the reported analysis and also avoided negative repeat elements, even though the model was not explicitly trained to filter out these features. In contrast, global codon usage measures such as the Codon Adaptation Index and codon-pair metrics did not consistently correlate with final titers, and in some cases higher Codon Adaptation Index scores were associated with lower yields. The researchers note that the model is trained for a single host species and that models trained on other organisms, including humans and cows, produce different predictions, underscoring the need for species-specific approaches. They position the work as one lever among many in biomanufacturing, arguing that more predictive codon design can reduce uncertainty in process development and help move new protein-based drugs into production more quickly, while acknowledging that cellular engineering, media formulation, and process optimization remain critical components.
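
For readers unfamiliar with the metric, the Codon Adaptation Index is conventionally defined as the geometric mean of each codon's relative adaptiveness, meaning its frequency divided by the frequency of the most used synonymous codon in a reference gene set. The Python sketch below follows that textbook definition; the reference counts are hypothetical values chosen for illustration, not K. phaffii data, and the function names are assumptions.

```python
import math

# Hypothetical reference codon counts for two amino acids (illustrative only).
REFERENCE_COUNTS = {
    "A": {"GCT": 40, "GCC": 20, "GCA": 30, "GCG": 10},
    "K": {"AAG": 60, "AAA": 40},
}


def relative_adaptiveness(reference_counts):
    """Weight of each codon = its count / count of the best synonymous codon."""
    weights = {}
    for codon_counts in reference_counts.values():
        best = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / best
    return weights


def codon_adaptation_index(sequence, weights):
    """Geometric mean of the weights of all scored codons in the sequence.

    Codons without a weight are skipped; this sketch assumes every weight
    is greater than zero, since log(0) is undefined.
    """
    sequence = sequence.upper()
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - len(sequence) % 3, 3)]
    scored = [weights[c] for c in codons if c in weights]
    if not scored:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(math.log(w) for w in scored) / len(scored))


weights = relative_adaptiveness(REFERENCE_COUNTS)
print(codon_adaptation_index("GCTAAG", weights))  # only top codons: index is 1.0
print(codon_adaptation_index("GCGAAA", weights))  # rarer codons give a lower index
```

Because the index rewards frequent codons one at a time, a sequence can score close to 1.0 and still express poorly, which is consistent with the weak correlation between the index and titer reported in the study.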
