
Investigating the Role of Simpson’s Paradox in the Analysis of Top-Ranked Features in High-Dimensional Bioinformatics Datasets

Freitas, Alex A. (2019) Investigating the Role of Simpson’s Paradox in the Analysis of Top-Ranked Features in High-Dimensional Bioinformatics Datasets. Briefings in Bioinformatics, 21 (2). pp. 421-428. E-ISSN 1477-4054. (doi:10.1093/bib/bby126) (KAR id:72582)

PDF Author's Accepted Manuscript
Language: English


An important problem in bioinformatics is identifying the most important features (or predictors) among a large number of features in a given classification dataset. This problem is often addressed by using a machine learning-based feature ranking method to identify a small set of top-ranked predictors (i.e. the most relevant features for classification). The many studies in this area have, however, an important limitation: they ignore the possibility that the top-ranked predictors occur in an instance of Simpson’s paradox, where the positive or negative association between a predictor and a class variable reverses sign upon conditioning on each of the values of a third (confounder) variable. In this work, we review and investigate the role of Simpson’s paradox in the analysis of top-ranked predictors in high-dimensional bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a predictor and the class variable. We perform computational experiments using four well-known feature ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes, where the predictors are Gene Ontology terms. The results show that occurrences of Simpson’s paradox involving top-ranked predictors are much more common for one of the feature ranking methods.
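
The sign reversal described in the abstract can be illustrated with a minimal Python sketch. It is not the procedure used in the paper: the function names (`association_sign`, `is_simpsons_paradox`, `expand`), the choice of a difference in class proportions as the association measure, and the toy counts are all assumptions made purely for illustration. It assumes a binary predictor `x`, a binary class `y`, and a categorical confounder `z`.

```python
from collections import defaultdict


def association_sign(rows):
    """Sign (+1, -1 or 0) of P(y=1 | x=1) - P(y=1 | x=0) over (x, y) pairs.

    Returns 0 when either group is empty or the two proportions are equal.
    """
    with_x = [y for x, y in rows if x == 1]
    without_x = [y for x, y in rows if x == 0]
    if not with_x or not without_x:
        return 0
    diff = sum(with_x) / len(with_x) - sum(without_x) / len(without_x)
    return (diff > 0) - (diff < 0)


def is_simpsons_paradox(records):
    """True if the overall x-y association is non-zero and every stratum of z
    shows the opposite sign. `records` is an iterable of (x, y, z) triples."""
    records = list(records)
    overall = association_sign([(x, y) for x, y, _ in records])
    strata = defaultdict(list)
    for x, y, z in records:
        strata[z].append((x, y))
    signs = [association_sign(rows) for rows in strata.values()]
    return overall != 0 and all(s == -overall for s in signs)


def expand(z, x, n_pos, n_total):
    """Hypothetical helper: turn counts into individual (x, y, z) records."""
    return [(x, 1, z)] * n_pos + [(x, 0, z)] * (n_total - n_pos)


# Invented toy counts (not from the paper). Within each stratum of z the
# predictor is positively associated with the class, yet the pooled data
# show a negative association.
toy = (
    expand("a", 1, 9, 10) + expand("a", 0, 80, 100)
    + expand("b", 1, 30, 100) + expand("b", 0, 2, 10)
)
print(is_simpsons_paradox(toy))
```

In this toy data the class proportion is higher with the predictor in both strata (0.9 vs 0.8 and 0.3 vs 0.2), but lower once the strata are pooled (39/110 vs 82/110), because the predictor occurs mostly in the stratum where the class is rare. A real analysis would also need to decide how to handle strata with too few instances and whether a reversal is statistically meaningful, which this sketch ignores.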

Item Type: Article
DOI/Identification number: 10.1093/bib/bby126
Uncontrolled keywords: Gene Ontology, machine learning, classification, feature ranking, ageing-related genes
Subjects: Q Science > Q Science (General) > Q335 Artificial intelligence
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User: Alex Freitas
Date Deposited: 18 Feb 2019 17:55 UTC
Last Modified: 16 Feb 2021 14:02 UTC

