Jungjit, Suwimol (2016) New Multi-Label Correlation-Based Feature Selection Methods for Multi-Label Classification and Application in Bioinformatics. Doctor of Philosophy (PhD) thesis, University of Kent,. (KAR id:58873)
PDF
Language: English |
|
Download this file (PDF/3MB) |
Abstract
The very large dimensionality of real world datasets is a challenging problem for classification algorithms, since often many features are redundant or irrelevant for classification. In addition, a very large number of features leads to a high computational time for classification algorithms. Feature selection methods are used to deal with the large dimensionality of data by selecting a relevant feature subset according to an evaluation criterion. The vast majority of research on feature selection involves conventional single-label classification problems, where each instance is assigned a single class label; but there has been growing research on more complex multi-label classification problems, where each instance can be assigned multiple class labels.
This thesis proposes three types of new Multi-Label Correlation-based Feature Selection (ML-CFS) methods, namely: (a) methods based on hill-climbing search, (b) methods that exploit biological knowledge (still using hill-climbing search), and (c) methods based on genetic algorithms as the search method.
Firstly, we proposed three versions of ML-CFS methods based on hill climbing search. In essence, these ML-CFS versions extend the original CFS method by extending the merit function (which evaluates candidate feature subsets) to the multi-label classification scenario, as well as modifying the merit function in other ways. A conventional search strategy, hill-climbing, was used to explore the space of candidate solutions (candidate feature subsets) for those three versions of ML-CFS. These ML-CFS versions are described in detail in Chapter 4.
Secondly, in order to try to improve the performance of ML-CFS in cancer-related microarray gene expression datasets, we proposed three versions of the ML-CFS method that exploit biological knowledge. These ML-CFS versions are also based on hill-climbing search, but the merit function was modified in a way that favours the selection of genes (features) involved in pre-defined cancer-related pathways, as discussed in detail in Chapter 5.
Lastly, we proposed two more sophisticated versions of ML-CFS based on Genetic Algorithms (rather than hill-climbing) as the search method. The first version of GA-based ML-CFS is based on a conventional single-objective GA, where there is only one objective to be optimized; while the second version of GA-based ML-CFS performs lexicographic multi-objective optimization, where there are two objectives to be optimized, as discussed in detail in Chapter 6.
In this thesis, all proposed ML-CFS methods for multi-label classification problems were evaluated by measuring the predictive accuracies obtained by two well-known multi-label classification algorithms when using the selected features? namely: the Multi-Label K-Nearest neighbours (ML-kNN) algorithm and the Multi-Label Back Propagation Multi-Label Learning Neural Network (BPMLL) algorithm.
In general, the results obtained by the best version of the proposed ML-CFS methods, namely a GA-based ML-CFS method, were competitive with the results of other multi-label feature selection methods and baseline approaches. More precisely, one of our GA-based methods achieved the second best predictive accuracy out of all methods being compared (both with ML-kNN and BPMLL used as classifiers), but there was no statistically significant difference between that GA-based ML-CFS and the best method in terms of predictive accuracy. In addition, in the experiment with ML-kNN (the most accurate) method selects about twice as many features as our GA-based ML-CFS; whilst in the experiments with BPMLL the most accurate method was a baseline method that does not perform any feature selection, and runs the classifier once (with all original features) for each of the many class labels, which is a very computationally expensive baseline approach.
In summary, one of the proposed GA-based ML-CFS methods managed to achieve substantial data reduction, (selecting a smaller subset of relevant features) without a significant decrease in predictive accuracy with respect to the most accurate method.
Item Type: | Thesis (Doctor of Philosophy (PhD)) |
---|---|
Thesis advisor: | Freitas, Alex |
Uncontrolled keywords: | Feature Selection, Multi-Label Feature Selection Method, Multi-Label Classification |
Subjects: |
Q Science T Technology |
Divisions: | Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing |
Depositing User: | Users 1 not found. |
Date Deposited: | 22 Nov 2016 16:00 UTC |
Last Modified: | 05 Nov 2024 10:50 UTC |
Resource URI: | https://kar.kent.ac.uk/id/eprint/58873 (The current URI for this page, for reference purposes) |
- Link to SensusAccess
- Export to:
- RefWorks
- EPrints3 XML
- BibTeX
- CSV
- Depositors only (login required):