Kent Academic Repository

New variants of random forest-based methods for survival analysis and applications to biomedical datasets

Pomsuwan, Tossapol (2023) New variants of random forest-based methods for survival analysis and applications to biomedical datasets. Doctor of Philosophy (PhD) thesis, University of Kent. (KAR id:104296)

PDF
Language: English



Abstract

Survival analysis problems involve predicting the time passed until the occurrence of an event of interest (the target variable), based on the values of some predictive features. Survival analysis is a specific type of supervised machine learning problem where the value of the target variable can be censored: for some individuals, it is known only that they survived (did not experience the event of interest) until a certain date, and it is unknown whether the event occurred after that date. Traditional supervised learning methods cannot directly cope with censored data, so they need to be modified to properly address survival analysis problems.
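To make the effect of censoring concrete, the sketch below represents each individual as a pair (observed time, event flag) and computes Harrell's concordance index, a common measure of predictive accuracy in survival analysis. The abstract does not state which accuracy measure the thesis uses, so the choice of the concordance index is an assumption here, and the names `Record` and `concordance_index` are illustrative only.

```python
from typing import List, Tuple

# (observed_time, event_observed); event_observed=False means right-censored,
# i.e. the individual is only known to have survived up to observed_time.
Record = Tuple[float, bool]


def concordance_index(records: List[Record], risk_scores: List[float]) -> float:
    """Harrell's C-index, ignoring ties in time (illustrative sketch).

    A pair (i, j) is comparable only if i experienced the event and did so
    before j's observed time. Pairs whose earlier member is censored are
    skipped, because the true ordering of their event times is unknown.
    """
    concordant = 0.0
    comparable = 0
    for i, (t_i, event_i) in enumerate(records):
        if not event_i:
            continue  # a censored individual cannot be the earlier member
        for j, (t_j, _) in enumerate(records):
            if t_i < t_j:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # higher risk, earlier event: correct
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


records = [(2.0, True), (4.0, False), (5.0, True), (7.0, False)]
print(concordance_index(records, [0.9, 0.5, 0.6, 0.1]))
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ranking. Note how the two censored individuals still contribute as the later member of a pair, which a method that simply discards censored rows would lose.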

This thesis focuses on the random forest algorithm, a popular and powerful supervised learning algorithm, and proposes new variants of random forest (RF) or RF-based algorithms for survival analysis.

The proposed RF or RF-based variants are evaluated on 11 survival analysis datasets created for this research, where the target variable is the time passed until an individual is diagnosed with a certain age-related disease. Most of these datasets were created by extracting relevant data from databases of longitudinal studies of ageing.

This thesis has three main contributions, which involve proposing three new types of variants of RF or RF-based algorithms to cope with censored data in survival analysis problems, as follows.

The first contribution is to propose new RF variants with a modified procedure for creating subsets of training data to be used for learning the decision trees in a RF. This involves replacing the censored value of a target variable by another value which is then treated as an uncensored target value, allowing the other parts of a traditional RF algorithm to be applied without modification. Experiments with the 11 survival analysis datasets have shown that the proposed RF variants improved predictive accuracy in general when compared with the standard RF and some standard statistical methods for survival analysis, with statistical significance in some cases. However, the proposed RF variants were outperformed by a standard random survival forest (RSF) algorithm, a powerful RF-based algorithm developed specifically for survival analysis.
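The idea of replacing a censored target value so that a standard regression RF can be trained unchanged can be sketched as follows. The abstract does not say which replacement value the thesis uses, so the rule below (the mean of the observed event times later than the censoring time) is purely a hypothetical stand-in, and `impute_censored_targets` is an illustrative name.

```python
from typing import List, Tuple

Record = Tuple[float, bool]  # (observed_time, event_observed)


def impute_censored_targets(records: List[Record]) -> List[float]:
    """Return one target value per record, all treated as uncensored.

    Hypothetical rule (NOT taken from the thesis): a censored time t is
    replaced by the mean of the observed event times greater than t. If no
    later event time exists, t itself is kept as a lower bound.
    """
    event_times = [t for t, event in records if event]
    targets = []
    for t, event in records:
        if event:
            targets.append(t)  # uncensored values are left untouched
            continue
        later = [s for s in event_times if s > t]
        targets.append(sum(later) / len(later) if later else t)
    return targets


records = [(2.0, True), (4.0, False), (5.0, True), (9.0, True), (10.0, False)]
print(impute_censored_targets(records))  # [2.0, 7.0, 5.0, 9.0, 10.0]
```

The resulting list can be passed as the target vector to any ordinary regression learner, which is what allows the rest of a traditional RF to be applied without modification.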

Motivated by the good performance of the RSF algorithm in the previously mentioned experiments, the second contribution of this thesis is to propose several new variants of the RSF algorithm. The proposed RSF variants focus on modifying two major components of the standard RSF algorithm: the criterion used for feature selection at each node of each tree in the forest, and the procedure used for computing the target variable value predicted by each leaf node of each tree. Experiments with the 11 survival analysis datasets have shown that, although the variations in the feature-selection criterion did not lead to significant differences in predictive accuracy, one of the variations in the procedure for computing the values predicted at leaf nodes achieved in general significantly higher accuracies than the standard RSF algorithm and the popular Cox Proportional Hazards (PH) algorithm.
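For reference, the leaf-node prediction in the standard RSF algorithm is based on the Nelson-Aalen estimate of the cumulative hazard, computed from the training instances that fall into the leaf. The sketch below shows that baseline computation only; the modified leaf procedures proposed in the thesis are not described in this abstract and are not reproduced here. The function name `nelson_aalen` is illustrative.

```python
from typing import List, Tuple

Record = Tuple[float, bool]  # (observed_time, event_observed)


def nelson_aalen(leaf_records: List[Record]) -> List[Tuple[float, float]]:
    """Nelson-Aalen cumulative hazard for the instances in one leaf.

    At each distinct event time t, the estimate grows by d / n, where d is
    the number of events at t and n is the number of instances still at
    risk (observed time >= t). Censored instances add no events but do
    count as at risk until their censoring time.
    """
    event_times = sorted({t for t, event in leaf_records if event})
    cumulative = 0.0
    curve = []
    for t in event_times:
        deaths = sum(1 for s, event in leaf_records if event and s == t)
        at_risk = sum(1 for s, _ in leaf_records if s >= t)
        cumulative += deaths / at_risk
        curve.append((t, cumulative))
    return curve


leaf = [(2.0, True), (3.0, False), (5.0, True), (5.0, True), (8.0, False)]
for time, hazard in nelson_aalen(leaf):
    print(time, round(hazard, 4))
```

In this example the hazard rises by 1/5 at time 2 and by 2/3 at time 5, because the instance censored at time 3 has left the risk set by then.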

The third contribution is to propose several new variants of the Deep Survival Forest (DSF) algorithm, which learns a more complex survival analysis model by stacking several learned RSF models into layers, inspired by deep learning principles. The proposed DSF variants focus on the base RSF algorithm used to learn the RSF models at each layer. More precisely, the proposed DSF variants replace the standard RSF algorithm with one of the RSF variants proposed earlier in this thesis, as base learners in each layer. Experiments with the 11 survival analysis datasets have shown that one of the proposed DSF variants achieved significantly higher predictive accuracy than the popular Cox PH algorithm and somewhat higher accuracy than the standard DSF in general.
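The layered structure can be sketched as a cascade in which each layer receives the original features plus the outputs of the previous layer. This wiring follows the usual deep-forest style and is an assumption, since the abstract does not give the exact DSF architecture. The names `fit_cascade`, `predict_cascade` and `toy_learner` are hypothetical, and `toy_learner` is a trivial stand-in for an RSF so that the example runs without external libraries.

```python
from typing import Callable, List, Tuple

Features = List[float]
Target = Tuple[float, bool]               # (observed_time, event_observed)
Model = Callable[[Features], float]       # maps features to a risk score
Learner = Callable[[List[Features], List[Target]], Model]


def toy_learner(X: List[Features], y: List[Target]) -> Model:
    """Hypothetical stand-in for an RSF: ignores y, scores by feature sum."""
    return lambda row: sum(row)


def fit_cascade(X: List[Features], y: List[Target],
                layers_spec: List[List[Learner]]) -> List[List[Model]]:
    """Train each layer on the original features plus the previous outputs.

    Simplification: in-sample predictions feed the next layer; deep forests
    normally use cross-validated predictions to limit overfitting.
    """
    layers = []
    current = [list(row) for row in X]
    for learners in layers_spec:
        models = [learn(current, y) for learn in learners]
        layers.append(models)
        outputs = [[m(row) for m in models] for row in current]
        current = [list(orig) + out for orig, out in zip(X, outputs)]
    return layers


def predict_cascade(layers: List[List[Model]], x: Features) -> float:
    """Pass x through every layer and average the last layer's scores."""
    current = list(x)
    scores: List[float] = []
    for models in layers:
        scores = [m(current) for m in models]
        current = list(x) + scores
    return sum(scores) / len(scores)


X = [[1.0, 3.0], [2.0, 0.5]]
y = [(5.0, True), (7.0, False)]
cascade = fit_cascade(X, y, [[toy_learner, toy_learner], [toy_learner]])
print(predict_cascade(cascade, [1.0, 3.0]))
```

Swapping the base learner in each layer, as the proposed DSF variants do, corresponds here to passing a different `Learner` in `layers_spec`.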

In summary, this research has proposed new variants of RF or RF-based algorithms for coping with censored data in survival analysis problems; and in general the proposed algorithm variants have been shown to be competitive with (sometimes significantly more accurate than) standard methods for survival analysis.

Item Type: Thesis (Doctor of Philosophy (PhD))
Thesis advisor: Freitas, Alex
Uncontrolled keywords: survival analysis; machine learning; random forest; deep learning; biomedical data; age-related disease
Subjects: Q Science > QA Mathematics (inc Computing science) > QA 76 Software, computer programming
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
SWORD Depositor: System Moodle
Depositing User: System Moodle
Date Deposited: 14 Dec 2023 15:10 UTC
Last Modified: 05 Nov 2024 13:10 UTC
Resource URI: https://kar.kent.ac.uk/id/eprint/104296 (The current URI for this page, for reference purposes)

University of Kent Author Information

Pomsuwan, Tossapol.