Skip to main content
Kent Academic Repository

Bayesian variable selection in cluster analysis

Dimitrakopoulou, Vasiliki (2012) Bayesian variable selection in cluster analysis. Doctor of Philosophy (PhD) thesis, University of Kent. (doi:10.22024/UniKent/01.02.94305) (KAR id:94305)


Statistical analysis of data sets of high-dimensionality has met great interest over the past years, with great applications on disciplines such as medicine, neuroscience, pattern recognition, image analysis and many others. The vast number of available variables though, contrary to the limited sample size, often mask the cluster structure of the data. It is often that some variables do not help in distinguishing the different clusters in the data; patterns over the sampled observations are, thus, usually confined to a small subset of variables. We are therefore interested in identifying the variables that best discriminate the sample, simultaneously to recovering the actual cluster structure of the objects under study. With the Markov Chain Monte Carlo methodology being widely established, we investigate the performance of the combined tasks of variable selection and clustering procedure within the Bayesian framework. Motivated by the work of Tadesse et al. (2005), we identify the set of discriminating variables with the use of a latent vector and form the clustering procedure within the finite mixture models methodology. Using Markov chains we draw inference on, not just the set of selected variables and the cluster allocations, but also on the actual number of components: using the f:teversible Jump MCMC sampler (Green, 1995) and a variation of t he SAMS sampler of Dahl (2005). However, sensitivity to the hyperparameters settings of the covariance structure of the suggested model motivated our interest in an Empirical Bayes procedure to pre-specify the crucial hyper parameters. Further on addressing the problem of hyperparameters' sensitivity, we suggest several different covariance structures for the mixture components. Developing MATLAB codes for all models introduced in this thesis, we apply and compare the various models suggested on a set of simulated data, as well as on three real data sets; the iris, the crabs and the arthritis data sets.

Item Type: Thesis (Doctor of Philosophy (PhD))
DOI/Identification number: 10.22024/UniKent/01.02.94305
Additional information: This thesis has been digitised by EThOS, the British Library digitisation service, for purposes of preservation and dissemination. It was uploaded to KAR on 25 April 2022 in order to hold its content and record within University of Kent systems. It is available Open Access using a Creative Commons Attribution, Non-commercial, No Derivatives ( licence so that the thesis and its author, can benefit from opportunities for increased readership and citation. This was done in line with University of Kent policies ( If you feel that your rights are compromised by open access to this thesis, or if you would like more information about its availability, please contact us at and we will seriously consider your claim under the terms of our Take-Down Policy (
Subjects: Q Science > QA Mathematics (inc Computing science)
Q Science > QA Mathematics (inc Computing science) > QA276 Mathematical statistics
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Mathematics, Statistics and Actuarial Science
SWORD Depositor: SWORD Copy
Depositing User: SWORD Copy
Date Deposited: 15 Jun 2023 09:45 UTC
Last Modified: 15 Jun 2023 09:45 UTC
Resource URI: (The current URI for this page, for reference purposes)

University of Kent Author Information

  • Depositors only (login required):

Total unique views for this document in KAR since July 2020. For more details click on the image.