Freitas, Alex A. (2019) Investigating the Role of Simpson’s Paradox in the Analysis of Top-Ranked Features in High-Dimensional Bioinformatics Datasets. Briefings in Bioinformatics, 21 (2). pp. 421-428. E-ISSN 1477-4054. (doi:10.1093/bib/bby126) (KAR id:72582)
PDF
Author's Accepted Manuscript
Language: English
Official URL: https://doi.org/10.1093/bib/bby126
Abstract
An important problem in bioinformatics is identifying the most important features (or predictors) among the large number of features in a given classification dataset. This problem is often addressed by using a machine learning-based feature ranking method to identify a small set of top-ranked predictors (i.e. the most relevant features for classification). The many studies in this area share, however, an important limitation: they ignore the possibility that the top-ranked predictors occur in an instance of Simpson’s paradox, where the positive or negative association between a predictor and a class variable reverses sign upon conditioning on each of the values of a third (confounder) variable. In this work, we review and investigate the role of Simpson’s paradox in the analysis of top-ranked predictors in high-dimensional bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a predictor and the class variable. We perform computational experiments using four well-known feature ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes, where the predictors are Gene Ontology terms. The results show that occurrences of Simpson’s paradox involving top-ranked predictors are much more common for one of the feature ranking methods than for the others.
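The sign reversal described in the abstract can be illustrated with a minimal sketch. It assumes a binary predictor, a binary class and a discrete confounder, and it measures association as a difference in class proportions; that measure, the function names and the counts are all illustrative assumptions, not the procedure or data used in the paper.

```python
from typing import Dict, Tuple

# counts[x] = (number of positive-class instances, total instances)
# for predictor value x (1 = predictor present, 0 = absent).
Counts = Dict[int, Tuple[int, int]]


def risk_difference(counts: Counts) -> float:
    """P(class=1 | predictor=1) - P(class=1 | predictor=0)."""
    pos1, n1 = counts[1]
    pos0, n0 = counts[0]
    return pos1 / n1 - pos0 / n0


def aggregate(strata: Dict[str, Counts]) -> Counts:
    """Pool the counts over all confounder values."""
    total: Counts = {}
    for counts in strata.values():
        for x, (pos, n) in counts.items():
            p, t = total.get(x, (0, 0))
            total[x] = (p + pos, t + n)
    return total


def is_simpson_reversal(strata: Dict[str, Counts]) -> bool:
    """True if the pooled association has the opposite sign to the
    association within every stratum of the confounder."""
    overall = risk_difference(aggregate(strata))
    within = [risk_difference(c) for c in strata.values()]
    all_positive = all(d > 0 for d in within)
    all_negative = all(d < 0 for d in within)
    return (overall < 0 and all_positive) or (overall > 0 and all_negative)


# Hypothetical toy counts (not taken from the paper), keyed by confounder value.
toy_strata: Dict[str, Counts] = {
    "confounder=0": {1: (81, 87), 0: (234, 270)},
    "confounder=1": {1: (192, 263), 0: (55, 80)},
}

if __name__ == "__main__":
    for name, counts in toy_strata.items():
        print(name, round(risk_difference(counts), 3))
    print("pooled", round(risk_difference(aggregate(toy_strata)), 3))
    print("reversal:", is_simpson_reversal(toy_strata))
```

In these toy counts the predictor is positively associated with the class within each confounder value, yet negatively associated once the strata are pooled, because the predictor is unevenly distributed across the confounder values.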
| Item Type: | Article |
|---|---|
| DOI/Identification number: | 10.1093/bib/bby126 |
| Uncontrolled keywords: | Gene Ontology, machine learning, classification, feature ranking, ageing-related genes |
| Subjects: | Q Science > Q Science (General) > Q335 Artificial intelligence |
| Institutional Unit: | Schools > School of Computing |
| Former Institutional Unit: | Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing |
| Depositing User: | Alex Freitas |
| Date Deposited: | 18 Feb 2019 17:55 UTC |
| Last Modified: | 20 May 2025 10:23 UTC |
| Resource URI: | https://kar.kent.ac.uk/id/eprint/72582 (The current URI for this page, for reference purposes) |

https://orcid.org/0000-0001-9825-4700