Deep Learning for Functional Annotation of Short Proteins: A novel neural network to assign Gene Ontology functional annotations to proteins under 150 amino acids long

Hall, Michael (2025) Deep Learning for Functional Annotation of Short Proteins: A novel neural network to assign Gene Ontology functional annotations to proteins under 150 amino acids long. Master of Research (MRes) thesis, University of Kent. (doi:10.22024/UniKent/01.02.109145) (Access to this publication is currently restricted. You may be able to access a copy if URLs are provided) (KAR id:109145)

PDF. Language: English. Restricted to repository staff only until March 2028. This work is licensed under a Creative Commons Attribution 4.0 International License.
Official URL: https://doi.org/10.22024/UniKent/01.02.109145

Abstract

Understanding a protein’s function is vital in many fields, including drug discovery and therapeutic development, agriculture, and environmental science. Next-generation sequencing has populated databases such as UniProt[1], but many of the proteins in them have no experimentally determined functional annotations, and experimental techniques for determining protein function are time-consuming and expensive. Traditionally, protein function prediction has relied on sequence homology tools such as the Basic Local Alignment Search Tool (BLAST)[2]. These tools are limited in scope: they can only infer the function of a query protein from a well-characterised protein with a similar primary structure. In this thesis, I present a novel deep learning method to predict the function of short proteins from their amino acid sequences. The model leverages sequence homology, protein family domain, co-expression and protein-protein interaction data to make a binary prediction for each Gene Ontology (GO) term [3, 4] for a given protein. The model was trained on the large corpus of annotated sequence data in UniProtKB/Swiss-Prot; specifically, small proteins of 150 residues or fewer were used to capture patterns across compact functional domains, in an attempt to produce a more robust model. The model can distinguish between negative and positive GO terms for a given protein, achieving an area under the receiver operating characteristic curve (AUROC) of 0.96. Furthermore, its recall of 88% shows that the model can identify a large proportion of a given protein’s annotations. However, because of the significant class imbalance across the dataset, the model tends to over-assign functional annotations, which is reflected in a precision of 54%. The results suggest that deep learning can significantly improve protein function prediction, and I explore the current limitations of this method.
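The metrics quoted in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the thesis: the labels, scores and the 0.5 decision threshold are hypothetical toy values, and the helper names `precision_recall_f1` and `auroc` are invented for this example. It shows how, when negative (protein, GO term) pairs heavily outnumber positive ones, a classifier can rank positives well (high AUROC) and recover all of them (high recall) while still over-assigning terms (low precision).

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for binary labels (1 = GO term applies)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half (pairwise formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 10 (protein, GO term) pairs, only 2 of them positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.7, 0.55, 0.3, 0.2, 0.1, 0.1, 0.05, 0.02]

# Assumed decision threshold of 0.5 (not a value stated in the abstract).
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = auroc(y_true, scores)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auroc={auc:.4f}")
```

On this toy data both true annotations are recovered, but two of the eight negatives are also predicted positive, so precision is half of recall even though the ranking is nearly perfect. Raising the threshold would trade recall for precision, which is one common way to temper over-assignment on imbalanced data.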

Item Type:	Thesis (Master of Research (MRes))
Thesis advisor:	Wass, Mark
Thesis advisor:	Michaelis, Martin
DOI/Identification number:	10.22024/UniKent/01.02.109145
Subjects:	Q Science > QH Natural history
Divisions:	Divisions > Division of Natural Sciences > Biosciences
SWORD Depositor:	System Moodle
Depositing User:	System Moodle
Date Deposited:	14 Mar 2025 08:21 UTC
Last Modified:	17 Mar 2025 12:46 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/109145 (The current URI for this page, for reference purposes)
