Kent Academic Repository

A population-based study exploring phenotypic clusters and clinical outcomes in stroke using unsupervised machine learning approach

Akyea, Ralph K., Ntaios, George, Kontopantelis, Evangelos, Georgiopoulos, Georgios, Soria, Daniele, Asselbergs, Folkert W., Kai, Joe, Weng, Stephen F., Qureshi, Nadeem (2023) A population-based study exploring phenotypic clusters and clinical outcomes in stroke using unsupervised machine learning approach. PLOS Digital Health, 2 (9). Article Number e0000334. E-ISSN 2767-3170. (doi:10.1371/journal.pdig.0000334) (KAR id:102765)


Individuals developing stroke have varying clinical, demographic, and biochemical profiles. This heterogeneity in phenotypic characteristics can affect cardiovascular disease (CVD) morbidity and mortality outcomes. This study uses a novel clustering approach to stratify individuals with incident stroke into phenotypic clusters and evaluates the differential burden of recurrent stroke and other cardiovascular outcomes. We used linked clinical data from primary care, hospitalisations, and death records in the UK. A data-driven clustering analysis (kamila algorithm) was applied to 48,114 patients aged ≥ 18 years with incident stroke between 1 January 1998 and 31 December 2017 and no prior history of serious vascular events. Cox proportional hazards regression was used to estimate hazard ratios (HRs) for subsequent adverse outcomes for each of the generated clusters. Adverse outcomes included coronary heart disease (CHD), recurrent stroke, peripheral vascular disease (PVD), heart failure, CVD-related mortality, and all-cause mortality. Four distinct phenotypes with varying underlying clinical characteristics were identified in patients with incident stroke. Compared with cluster 1 (n = 5,201, 10.8%), the risk of composite recurrent stroke and CVD-related mortality was higher in the other 3 clusters (cluster 2 [n = 18,655, 38.8%]: HR, 1.07; 95% CI, 1.02–1.12; cluster 3 [n = 10,244, 21.3%]: HR, 1.20; 95% CI, 1.14–1.26; and cluster 4 [n = 14,014, 29.1%]: HR, 1.44; 95% CI, 1.37–1.50). Similar trends in risk were observed for the composite outcome of recurrent stroke and all-cause mortality, and for subsequent recurrent stroke. However, results were not consistent for subsequent risk of CHD, PVD, heart failure, CVD-related mortality, and all-cause mortality.
In this proof-of-principle study, we demonstrated how a heterogeneous population of patients with incident stroke can be stratified into four relatively homogeneous phenotypes with differential risk of recurrent and major cardiovascular outcomes. This offers an opportunity to revisit the stratification of care for patients with incident stroke to improve patient outcomes.
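The figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' analysis code: it recomputes each cluster's share of the 48,114 patients, and it back-calculates an approximate standard error of the log hazard ratio from each reported 95% CI. The second step assumes a Wald-type interval that is symmetric on the log scale, which the abstract does not state; the names `clusters`, `cluster_shares`, and `log_hr_standard_error` are hypothetical.

```python
import math

# Cluster sizes, hazard ratios, and 95% CIs as reported in the abstract.
# Cluster 1 is the reference group, so it has no hazard ratio.
clusters = {
    1: {"n": 5201, "hr": None, "ci": None},
    2: {"n": 18655, "hr": 1.07, "ci": (1.02, 1.12)},
    3: {"n": 10244, "hr": 1.20, "ci": (1.14, 1.26)},
    4: {"n": 14014, "hr": 1.44, "ci": (1.37, 1.50)},
}

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def cluster_shares(data):
    """Return the total sample size and each cluster's percentage share."""
    total = sum(c["n"] for c in data.values())
    shares = {k: round(100 * c["n"] / total, 1) for k, c in data.items()}
    return total, shares


def log_hr_standard_error(ci):
    """Approximate SE of log(HR), ASSUMING a Wald interval on the log scale."""
    lower, upper = ci
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


total, shares = cluster_shares(clusters)
print(f"total patients: {total}")
for k, c in clusters.items():
    line = f"cluster {k}: n = {c['n']}, share = {shares[k]}%"
    if c["ci"] is not None:
        se = log_hr_standard_error(c["ci"])
        line += f", HR = {c['hr']}, approx. SE(log HR) = {se:.4f}"
    print(line)
```

The recomputed shares match the percentages quoted in the abstract, and the four cluster sizes sum to the stated cohort size. The back-calculated standard errors are only as precise as the two-decimal rounding of the published interval bounds.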

Item Type: Article
DOI/Identification number: 10.1371/journal.pdig.0000334
Additional information: For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Uncontrolled keywords: cardiovascular disease risk; clustering algorithms; stroke; peripheral vascular disease; heart failure; cardiovascular diseases; cholesterol; morbidity
Subjects: Q Science > Q Science (General)
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Computing
Funders: National Institute for Health Research (
SWORD Depositor: JISC Publications Router
Depositing User: JISC Publications Router
Date Deposited: 18 Sep 2023 13:26 UTC
Last Modified: 10 Jan 2024 05:34 UTC
