Autocompleting Protein Sequences with Protein Language Models

Researcher(s)

  • Colin Horger, Biomedical Engineering, University of Delaware

Faculty Mentor(s)

  • Jason Gleghorn, Department of Biomedical Engineering, University of Delaware
  • Logan Hallee, Department of Bioinformatics, University of Delaware

Abstract

Colin Horger1, Logan Hallee2, Jason Gleghorn3
Department of Biomedical Engineering1,3 and Bioinformatics2,3, University of Delaware, Newark, DE 19716

Artificial intelligence (AI) has transformed how we analyze and interpret biological data. In biomedical research, AI applied to protein sequences has enabled high-throughput annotation of structure, function, and more. In this project we propose fundamental research into how well machine learning models can “autocomplete” protein sequences. By treating amino acid sequences as a semantic language, protein language models (pLMs) allow complex mathematics to act on strings of letters. ANKH, a highly effective and efficient pLM, has shown remarkable sequence recovery, generation, and understanding. What remains unknown, however, is what fraction of a sequence a pLM needs in order to complete the remaining residues. In our experiments we provide various percentages of full sequences and fine-tune ANKH for autocompletion. Generated sequences are scored against their corresponding labels using scaled sequence alignment, a novel metric that we compare against a distribution drawn from hundreds of thousands of random proteins. By elucidating the generative qualities of pLMs, we open doors in biomedical research toward protein design and optimization. We hope to extend this work to validate the qualities of generated proteins and to explore whether autocompletion techniques uncover more thermodynamically stable or active molecules.
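The abstract does not spell out how scaled sequence alignment is computed. One plausible interpretation is a global alignment score normalized by the length of the reference (label) sequence, so that scores are comparable across proteins of different sizes. The sketch below is an assumption for illustration, not the authors' actual implementation; the scoring parameters (match, mismatch, gap) and the normalization choice are placeholders.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score of two sequences.

    Scoring scheme (match/mismatch/gap values) is an illustrative
    assumption, not taken from the abstract.
    """
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] against b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


def scaled_alignment(generated: str, label: str) -> float:
    """Alignment score scaled by the label length.

    A perfect reconstruction of the label scores 1.0; errors and gaps
    pull the value down. Normalizing by len(label) is one possible
    scaling choice.
    """
    return global_alignment_score(generated, label) / len(label)
```

In practice the same scaled score would be computed for a large set of random protein pairs to build the null distribution the abstract compares against.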