Optimization of Sequence Generation Schemes for Advancements in Generative Protein Language Modeling

Researcher(s)

  • Colin Horger, Biomedical Engineering, University of Delaware

Faculty Mentor(s)

  • Jason Gleghorn, Department of Biomedical Engineering, University of Delaware

Abstract

Proteins are vital macromolecules involved in regulating biological processes. Designing proteins with specific properties has immense potential for therapeutics and industrial catalysis. In recent years, advances in AI have made it possible to treat proteins as a biological language written in amino acid characters. These language modeling techniques have given rise to Protein Language Models (pLMs), in which transformer neural networks are trained to design protein sequences. Our lab recently developed the Annotation Transformer (AT), which represents protein properties concisely and gives us a strong vocabulary of protein property annotations for training a pLM. Building on it, we developed a Generative Sequence Model (GSM), which takes protein properties as prompts and generates plausible sequences with those characteristics. However, not all generated sequences are accurate, and many do not resemble natural sequences. To improve the generative capabilities of GSM, this project explores generation parameters for BERT-like transformers. Using the bidirectional context these models provide, we compute a probability for each candidate amino acid at a masked position and apply different sampling methods to choose among them. Choosing the appropriate amino acid is a challenging task that can greatly affect the downstream potential of a pLM. To generate sequences, we explore iterative sampling, nucleus sampling, and temperature scaling across various masking percentages. Overall, our work shows that GSM outperforms its sequence-only counterpart, ESM (Evolutionary Scale Modeling), by using protein property prompts to guide generation. This presents a promising avenue for generating proteins for valuable tasks with significant downstream potential.
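
The abstract names temperature scaling and nucleus sampling as ways of picking an amino acid from model probabilities. The Python sketch below illustrates those two techniques in their generic form. It is not the GSM implementation: the function names, the 20-letter alphabet ordering, and the default values for `temperature` and `top_p` are assumptions made for illustration, and the logits are placeholders for whatever a masked language model would output at one position.

```python
import math
import random

# Assumed alphabet: the 20 standard amino acids, in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_residue(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one amino acid for a single masked position (hypothetical helper)."""
    probs = softmax_with_temperature(logits, temperature)
    probs = nucleus_filter(probs, top_p)
    return rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
```

In an iterative scheme, a helper like `sample_residue` would be called once per masked position, with the model re-run between rounds so that later choices can condition on residues already filled in. Lower `temperature` and lower `top_p` both make the choice more conservative, which is the kind of trade-off the parameter exploration described above would examine.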