Ribeiro, Caio, Freitas, Alex A. (2021) A data-driven missing value imputation approach for longitudinal datasets. Artificial Intelligence Review. ISSN 0269-2821. E-ISSN 1573-7462. (doi:10.1007/s10462-021-09963-5) (KAR id:88186)
Language: English
This work is licensed under a Creative Commons Attribution 4.0 International License.
Official URL: https://doi.org/10.1007/s10462-021-09963-5
Abstract
Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected by their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that this can be achieved through the proposed data-driven approach.
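The feature-wise selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three candidate imputers (`global_mean`, `global_median`, `previous_wave`), the mean-absolute-error score, and the fallback to the next-ranked method when one is not applicable are all assumptions made for illustration, and the abstract does not name the five methods actually used. A feature is represented here as one row per individual and one column per wave, with `None` marking a missing value.

```python
from statistics import mean, median

def _known_others(rows, i, t):
    # All known values of the feature except the cell (i, t) being estimated.
    return [v for r, row in enumerate(rows) for w, v in enumerate(row)
            if v is not None and (r, w) != (i, t)]

def global_mean(rows, i, t):
    vals = _known_others(rows, i, t)
    return mean(vals) if vals else None

def global_median(rows, i, t):
    vals = _known_others(rows, i, t)
    return median(vals) if vals else None

def previous_wave(rows, i, t):
    # Longitudinal imputer: carry forward the same individual's most
    # recent earlier known value; None means "not applicable".
    for w in range(t - 1, -1, -1):
        if rows[i][w] is not None:
            return rows[i][w]
    return None

# Hypothetical candidate set (an assumption, not the paper's five methods).
METHODS = {
    "global_mean": global_mean,
    "global_median": global_median,
    "previous_wave": previous_wave,
}

def rank_methods(rows, methods=METHODS):
    """Rank methods for one feature by mean absolute error on known values."""
    errors = {name: [] for name in methods}
    for i, row in enumerate(rows):
        for t, truth in enumerate(row):
            if truth is None:
                continue
            # Pretend the known value is missing and estimate it.
            for name, fn in methods.items():
                est = fn(rows, i, t)
                if est is not None:
                    errors[name].append(abs(est - truth))
    ranked = [(name, mean(errs)) for name, errs in errors.items() if errs]
    return sorted(ranked, key=lambda pair: pair[1])

def impute_feature(rows, methods=METHODS):
    """Fill missing cells using the best-ranked applicable method."""
    ranking = [name for name, _ in rank_methods(rows, methods)]
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for t, value in enumerate(row):
            if value is not None:
                continue
            for name in ranking:
                est = methods[name](rows, i, t)
                if est is not None:
                    filled[i][t] = est
                    break
    return ranking, filled

# Toy feature: three individuals, three waves, stable within individual.
feature = [[1.0, 1.0, None], [5.0, None, 5.0], [9.0, 9.0, 9.0]]
ranking, filled = impute_feature(feature)
```

In this toy feature each individual's value is stable across waves, so the longitudinal imputer has the lowest error and is chosen for that feature. Running `impute_feature` once per feature gives a per-feature choice of method, which is the core idea the abstract describes.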
Item Type: | Article |
---|---|
DOI/Identification number: | 10.1007/s10462-021-09963-5 |
Uncontrolled keywords: | machine learning, data mining, missing values, longitudinal classification, longitudinal datasets |
Subjects: | Q Science > Q Science (General) > Q335 Artificial intelligence |
Divisions: | Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing |
Depositing User: | Alex Freitas |
Date Deposited: | 16 May 2021 10:01 UTC |
Last Modified: | 05 Nov 2024 12:54 UTC |
Resource URI: | https://kar.kent.ac.uk/id/eprint/88186 (The current URI for this page, for reference purposes) |