Hall, Michael (2025) Deep Learning for Functional Annotation of Short Proteins: A novel neural network to assign Gene Ontology functional annotations to proteins under 150 amino acids long. Master of Research (MRes) thesis, University of Kent. (doi:10.22024/UniKent/01.02.109145) (Access to this publication is currently restricted. You may be able to access a copy if URLs are provided) (KAR id:109145)
PDF
Language: English. Restricted to Repository staff only until March 2028.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Official URL: https://doi.org/10.22024/UniKent/01.02.109145
Abstract
Understanding a protein’s function is vital in many fields, including drug discovery and therapeutic development, agriculture, and environmental science. Next-generation sequencing has populated databases such as UniProt[1], but many of these proteins lack experimentally determined functional annotations, and the experimental techniques for determining protein function are time-consuming and expensive. Traditionally, protein function prediction has relied on sequence homology tools such as the Basic Local Alignment Search Tool (BLAST)[2], which are limited in scope: they can only infer the function of a query protein from a well-characterised protein with a similar primary structure. In this thesis, I present a novel deep learning method to predict the function of short proteins from their amino acid sequences. The model leverages sequence homology, protein family domain, co-expression and protein-protein interaction data to make a binary prediction for each Gene Ontology (GO) term [3, 4] for a given protein. The model was trained on the large corpus of annotated sequence data in UniProtKB/Swiss-Prot; specifically, small proteins of 150 residues or fewer were used to capture patterns across compact functional domains in an attempt to produce a more robust model. The model can distinguish between negative and positive GO terms for a given protein, achieving an area under the receiver operating characteristic curve (AUROC) of 0.96. Furthermore, a recall of 88% shows that the model identifies a large proportion of a given protein’s annotations. However, owing to the significant class imbalance across the dataset, the model tends to over-assign functional annotations, which is reflected in a precision of 54%. These results suggest that deep learning can significantly improve protein function prediction, and I explore the current limitations of this method.
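To make the relationship between recall, precision and class imbalance concrete, the following is a minimal Python sketch of how these metrics are computed from binary predictions. It is not the thesis code: the function name `precision_recall_f1` and the toy labels are hypothetical, chosen only to show how a model that over-assigns a rare positive class can combine high recall with modest precision.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = GO term assigned)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy, made-up data (NOT from the thesis): 100 protein/GO-term pairs,
# of which only 8 are true annotations, i.e. a strongly imbalanced set.
y_true = [1] * 8 + [0] * 92
# Hypothetical predictions: 7 of the 8 true annotations are recovered,
# but 6 negatives are also (wrongly) assigned the term.
y_pred = [1] * 7 + [0] * 1 + [1] * 6 + [0] * 86

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case only six false positives among 92 negatives are enough to pull precision down to roughly one half while recall stays high, because the positive class is so small. This is the general pattern the abstract attributes to class imbalance; the actual counts behind the reported figures are not given here.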
Item Type: | Thesis (Master of Research (MRes)) |
---|---|
Thesis advisor: | Wass, Mark |
Thesis advisor: | Michaelis, Martin |
DOI/Identification number: | 10.22024/UniKent/01.02.109145 |
Subjects: | Q Science > QH Natural history |
Divisions: | Divisions > Division of Natural Sciences > Biosciences |
SWORD Depositor: | System Moodle |
Depositing User: | System Moodle |
Date Deposited: | 14 Mar 2025 08:21 UTC |
Last Modified: | 17 Mar 2025 12:46 UTC |
Resource URI: | https://kar.kent.ac.uk/id/eprint/109145 (The current URI for this page, for reference purposes) |