Phylogenetically Balanced CDS Datasets for Improved Expression Modeling

Researcher(s)

  • Tom Le, Biomedical Engineering, University of Delaware

Faculty Mentor(s)

  • Jason Gleghorn, Biomedical Engineering, University of Delaware

Abstract

Proteins are ubiquitous macromolecules that govern biochemistry. Their integral role in homeostasis makes them key drug targets, or potential drugs as therapeutic peptides. Typically, genetically modified bacteria and yeast are used to produce desired peptides for said therapeutic applications. Within these modified organisms are sections of DNA sequences that translate to codons, for which there can be many synonymous codons for a given amino acid. However, designing a coding region for a valuable peptide naively, without considering the optimal codon sequence for expression, can lead to inefficient protein production. This can cost the pharmaceutical industry a small fortune every year. Our lab has previously produced codon language models that describe proteins in a numerical feature space. To improve our models to optimize codon sequences for expression, as well as predict expression, we propose the creation of a novel phylogenetically balanced codon sequence dataset. In order to construct the dataset, almost all available genome CoDing Sequences (CDS) with corresponding taxonomic information were queried from NCBI using command-line tools. Coding sequences were converted into codon sequences and the taxonomic information was used to map the sequence to a broader phylogeny. Sequences were mapped to be either Animal, Plant, Fungi, Protist, Archaea or one of Bacteria’s phylums. The dataset was curated to ensure the number of sequences mapped to a taxonomic group were roughly equal to prevent computational models biasing one taxonomic group over another. Homo sapien and Mus musculus codon sequences were excluded from the dataset because both species have been curated with more rigor by the Consensus CoDing Sequence project (CCDS). Overall, this dataset aims to train machine learning models that can lead to an increase in therapeutic peptide production and decreased drug production costs, ultimately lowering drug costs and medical expenses and improving patient access to these vital treatments.