CodonBERT: A Novel Machine Learning Approach to Improve Protein Semantic Understanding

Researcher(s)

  • Nikolaos Rafailidis, Biological Sciences, University of Delaware

Faculty Mentor(s)

  • Jason Gleghorn, Biomedical Engineering, University of Delaware

Abstract

Recent studies have shown that codon usage patterns can indicate taxonomic identity, organelle type, and even structural differences in protein domains. The biological rationale for these counterintuitive properties, which seemingly challenge the conventional view of synonymous codons as interchangeable, is often attributed to evolutionary selection pressures. Because of the robust predictive power of codon usage and its newfound relevance to protein structure, we decided to investigate this quirk of biological regulation. Here, we extended a pretrained protein language model (pLM) with an expanded vocabulary, creating the first codon-based pLM focused on protein-coding regions. Our approach extended traditional transfer learning by incorporating token embedding matrix seeding. To ensure high-quality and diverse data, we obtained protein sequences from NIH’s CCDS library for human and mouse, as well as from the Ensembl project for over 300 additional organisms. We further trained the ProtBERT model on these 9.5 million protein sequences, represented as codon sequences using a novel single-letter code for each codon; we call the resulting model CodonBERT. A PCA plot of CodonBERT embeddings showed distinct groupings of synonymous codons according to the charge and polarity of their amino acids. Notably, although synonymous codons clustered closely in the embedding space, they remained discernibly separate from one another. This observation indicates a nontrivial shift in the embeddings and demonstrates CodonBERT’s ability to capture and differentiate patterns of synonymous codon usage. These findings open avenues for enhanced annotation of genomic data, improved prediction of protein structure, and more sophisticated protein design methodologies. By elucidating the complexities of codon sequences, we pave the way for refined pLMs, ultimately driving advances in structural bioinformatics and molecular biology.
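The codon encoding and embedding seeding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the single-letter codon alphabet, so the characters assigned here (Unicode U+0100 onward) are hypothetical. The seeding rule is also only one plausible reading of "token embedding matrix seeding": each new codon token is assumed to start from the pretrained vector of the amino acid it encodes. The names `encode_cds` and `seed_codon_embeddings` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1) in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))

# Hypothetical alphabet: one unique character per codon (assumption, not the authors' code).
CODON_TO_TOKEN = {codon: chr(0x0100 + i) for i, codon in enumerate(CODONS)}


def encode_cds(cds):
    """Rewrite a coding sequence as a string with one character per codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA input
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TO_TOKEN[cds[i:i + 3]] for i in range(0, len(cds), 3))


def seed_codon_embeddings(aa_embeddings):
    """Give each codon token a copy of its amino acid's pretrained vector.

    Codons whose amino acid (or stop signal) has no pretrained vector are skipped.
    """
    return {
        CODON_TO_TOKEN[codon]: list(aa_embeddings[aa])
        for codon, aa in CODON_TO_AA.items()
        if aa in aa_embeddings
    }


# Toy two-dimensional vectors stand in for rows of a pretrained embedding matrix.
toy_aa_embeddings = {"M": [0.1, 0.2], "A": [0.3, 0.4]}
seeded = seed_codon_embeddings(toy_aa_embeddings)
tokens = encode_cds("ATGGCTGCC")  # Met, Ala (GCT), Ala (GCC)
```

Under this assumed scheme, synonymous codons such as GCT and GCC begin with identical vectors, so any separation between them after further training would reflect learned codon usage rather than initialization.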