MIT chemical engineers have developed a large language model that learns the codon "grammar" of the industrial yeast Komagataella phaffii and uses it to design gene sequences that the organism translates more efficiently. Codons are three-letter DNA units that encode amino acids; because the genetic code maps 64 possible codons onto only 20 amino acids, organisms evolve strong preferences for particular synonymous codons and local sequence patterns. Standard codon-optimization tools typically emphasize the most frequent codons, but they often overlook how codon context, transfer RNA availability, and regulatory motifs shape real-world protein expression. The new model, built as a GRU-based encoder-decoder network, was trained on amino acid sequences and their matching coding DNA from roughly 5,000 native K. phaffii proteins drawn from a public National Center for Biotechnology Information dataset, allowing it to infer species-specific usage patterns without hand-coded rules.
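The degeneracy described above, and the frequency-only baseline that conventional tools rely on, can be sketched in a few lines of Python. The codon lists below follow the standard genetic code, but the per-codon frequencies are hypothetical illustration values, not measured K. phaffii data:

```python
# Synonymous codons for a few amino acids (standard genetic code):
# 64 codons encode 20 amino acids, so most amino acids have several codons.
SYNONYMS = {
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],  # six-fold degenerate
    "Ser": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],  # six-fold degenerate
    "Met": ["ATG"],                                     # single codon
}

# HYPOTHETICAL per-codon usage frequencies, standing in for a real
# host-specific codon usage table.
FREQ = {
    "TTA": 0.15, "TTG": 0.30, "CTT": 0.17, "CTC": 0.08, "CTA": 0.12, "CTG": 0.18,
    "TCT": 0.30, "TCC": 0.16, "TCA": 0.19, "TCG": 0.09, "AGT": 0.15, "AGC": 0.11,
    "ATG": 1.00,
}

def frequency_only_design(peptide):
    """Frequency-only codon optimization: always pick the most common
    synonymous codon, ignoring codon context, tRNA pools, and motifs."""
    return "".join(max(SYNONYMS[aa], key=FREQ.get) for aa in peptide)

print(frequency_only_design(["Met", "Leu", "Ser"]))  # ATGTTGTCT
```

The learned model differs from this baseline precisely in that its codon choice is conditioned on surrounding sequence context rather than on a single frequency table.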
After training, the researchers used the model to generate codon-optimized DNA sequences for six recombinant proteins of varying size and complexity: human growth hormone, human granulocyte colony-stimulating factor, a VHH nanobody called 3B2, an engineered SARS-CoV-2 receptor-binding domain, human serum albumin, and the IgG1 monoclonal antibody trastuzumab. They compared these sequences against versions produced by four commercial tools from Azenta, IDT, GenScript, and Thermo Fisher by inserting each construct into K. phaffii and measuring the resulting protein titers. Across the six proteins, the MIT model produced the highest titer for five and ranked second for the remaining one. For human growth hormone and human granulocyte colony-stimulating factor, the team observed about a 25% improvement, while human serum albumin showed about a threefold improvement when comparing optimized constructs to the native coding sequence. Expressed from their native coding sequences, human, bovine, and mouse serum albumin reached titers of 45, 60, and 100 mg/L, respectively; codon optimization raised the bovine and mouse titers by a further 25% and 35%, to 75 mg/L and 135 mg/L.
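As a sanity check, the relative gains implied by the reported serum albumin titers can be recomputed directly from the mg/L values given above:

```python
# Native-sequence and codon-optimized titers for serum albumin, in mg/L,
# as reported in the study.
native = {"bovine": 60, "mouse": 100}
optimized = {"bovine": 75, "mouse": 135}

for species in native:
    gain = optimized[species] / native[species] - 1
    print(f"{species}: {native[species]} -> {optimized[species]} mg/L (+{gain:.0%})")
```

This reproduces a 25% gain for bovine serum albumin and a 35% gain for mouse serum albumin.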
The study also dissected what the model learned internally and how its designs differ from conventional metrics. Visualizations of the learned amino acid embeddings showed clusters organized by physicochemical traits, including aliphatic, aromatic, basic, acid/amide, and alcohol groups, with hydrophobic residues grouping together and polar residues doing likewise. Constructs the model designed for the six tested proteins contained no negative cis-regulatory elements in the analysis described and also avoided negative repeat elements, despite never being explicitly trained to filter out these features. In contrast, global codon usage measures such as the Codon Adaptation Index and codon-pair metrics did not consistently correlate with final titers, and in some cases higher Codon Adaptation Index scores were associated with lower yields. The researchers note that the model is trained for a single host species, and that models trained on other organisms, including humans and cows, produce different predictions, underscoring the need for species-specific approaches. They position the work as one lever among many in biomanufacturing: more predictive codon design can cut process-development uncertainty and help move new protein-based drugs into production more quickly, while cellular engineering, media formulation, and process optimization remain critical components.
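For reference, the Codon Adaptation Index mentioned above is defined as the geometric mean of each codon's relative adaptiveness, its usage frequency divided by that of the most-used synonymous codon. A minimal sketch, using a tiny hypothetical usage table rather than real K. phaffii counts:

```python
import math

# HYPOTHETICAL codon usage counts per synonymous family (not real data).
USAGE = {
    "Leu": {"TTG": 30, "CTG": 60},
    "Lys": {"AAA": 20, "AAG": 80},
}

# Relative adaptiveness w = count / max count within each synonymous family.
WEIGHT = {codon: n / max(family.values())
          for family in USAGE.values() for codon, n in family.items()}

def cai(codons):
    """Codon Adaptation Index: geometric mean of relative adaptiveness."""
    return math.exp(sum(math.log(WEIGHT[c]) for c in codons) / len(codons))

print(cai(["CTG", "AAA"]))  # sqrt(1.0 * 0.25), about 0.5
```

A high CAI rewards globally frequent codons but says nothing about local sequence context, which is consistent with the study's observation that it did not track final titers reliably.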
