A data-driven missing value imputation approach for longitudinal datasets

Ribeiro, Caio and Freitas, Alex A. (2021) A data-driven missing value imputation approach for longitudinal datasets. Artificial Intelligence Review. ISSN 0269-2821. E-ISSN 1573-7462. (doi:10.1007/s10462-021-09963-5) (KAR id:88186)

Publisher PDF. Language: English. This work is licensed under a Creative Commons Attribution 4.0 International License.
Official URL: https://doi.org/10.1007/s10462-021-09963-5

Abstract

Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected, based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that can be achieved through the proposed data-driven approach.
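The feature-wise selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not name the five imputation methods, so the three candidates below (mean, median, and the value of the same variable at the previous wave), the hold-out fraction, the use of mean absolute error, and every function name are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical candidate methods. Each takes a feature column (NaN = missing)
# and the same variable measured at the previous wave, and returns one
# estimate per row (NaN where the method is not applicable).
def impute_mean(column, previous):
    return np.full(column.shape, np.nanmean(column))

def impute_median(column, previous):
    return np.full(column.shape, np.nanmedian(column))

def impute_previous_wave(column, previous):
    return previous.copy()

METHODS = {
    "mean": impute_mean,
    "median": impute_median,
    "previous_wave": impute_previous_wave,
}

def rank_methods(column, previous, methods, rng, holdout_fraction=0.2):
    """Rank methods for one feature by their error on hidden known values."""
    known = np.flatnonzero(~np.isnan(column))
    if len(known) < 2:
        return list(methods)  # too little known data to rank; keep given order
    n_hide = max(1, int(len(known) * holdout_fraction))
    hidden = rng.choice(known, size=n_hide, replace=False)
    masked = column.copy()
    masked[hidden] = np.nan  # pretend these known values are missing
    errors = {}
    for name, method in methods.items():
        estimates = method(masked, previous)[hidden]
        usable = ~np.isnan(estimates)
        if usable.any():
            diff = estimates[usable] - column[hidden][usable]
            errors[name] = float(np.mean(np.abs(diff)))
        else:
            errors[name] = float("inf")  # method not applicable here
    return sorted(errors, key=errors.get)

def impute_feature(column, previous, methods, rng):
    """Fill missing values with the best-ranked applicable method."""
    ranking = rank_methods(column, previous, methods, rng)
    result = column.copy()
    for name in ranking:
        missing = np.isnan(result)
        if not missing.any():
            break
        estimates = methods[name](column, previous)
        result[missing] = estimates[missing]  # NaN estimates stay missing
    return result, ranking

# Usage with made-up data: the previous wave closely tracks the current one.
rng = np.random.default_rng(0)
previous = np.arange(1, 11, dtype=float)
column = np.array([1.1, 2.1, np.nan, 4.1, 5.1, 6.1, np.nan, 8.1, 9.1, 10.1])
filled, ranking = impute_feature(column, previous, METHODS, rng)
```

Because the ranking is computed separately for each feature, different features in the same dataset can end up with different imputation methods, and a lower-ranked method is used only where a higher-ranked one cannot produce an estimate.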

Item Type:	Article
DOI/Identification number:	10.1007/s10462-021-09963-5
Uncontrolled keywords:	machine learning, data mining, missing values, longitudinal classification, longitudinal datasets
Subjects:	Q Science > Q Science (General) > Q335 Artificial intelligence
Institutional Unit:	Schools > School of Computing
Former Institutional Unit:	Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Depositing User:	Alex Freitas
Date Deposited:	16 May 2021 10:01 UTC
Last Modified:	28 Apr 2026 09:21 UTC
Resource URI:	https://kar.kent.ac.uk/id/eprint/88186 (The current URI for this page, for reference purposes)

University of Kent Author Information

Ribeiro, Caio.

Creator's ORCID:	https://orcid.org/0000-0002-8125-8059

Freitas, Alex A.

Creator's ORCID:	https://orcid.org/0000-0001-9825-4700
