
Statistical Methods For Detecting Genetic Risk Factors of a Disease with Applications to Genome-Wide Association Studies

Ali, Fadhaa (2015) Statistical Methods For Detecting Genetic Risk Factors of a Disease with Applications to Genome-Wide Association Studies. Doctor of Philosophy (PhD) thesis, University of Kent. (KAR id:47963)

Language: English


This thesis develops statistical methods for analysing data from genome-wide association studies (GWAS).

A GWAS typically involves genotyping thousands of individuals with high-throughput genome-wide single nucleotide polymorphism (SNP) arrays and testing for association between the genotyped variants and a given disease, under the common disease/common variant assumption.

Although GWAS have identified many potential genetic factors in the genome that affect the risk of complex diseases, much of the genetic heritability remains unexplained. The power to detect new genetic risk variants can be improved by considering multiple genetic variants simultaneously with novel statistical methods.

Improving the analysis of the GWAS data has received much attention from statisticians and other scientific researchers over the past decade.

Several challenges arise in analysing GWAS data. First, determining the risk SNPs can be difficult because non-random correlation between SNPs can inflate type I and type II errors in statistical inference. Second, when a group of SNPs is considered together in the context of haplotypes/genotypes, the distribution of the haplotypes/genotypes is sparse, which makes it difficult to detect risk haplotypes/genotypes in terms of disease penetrance.

In this work, we propose four new methods to identify risk haplotypes/genotypes based on their frequency differences between cases and controls. To evaluate the performance of our methods, we simulated datasets under a wide range of scenarios according to both retrospective and prospective designs.

In the first method, we reconstruct haplotypes from unphased genotypes, then cluster and threshold the inferred haplotypes into risk and non-risk groups with a two-component binomial-mixture model. In this method, the parameters were estimated with a modified Expectation-Maximization algorithm in which the maximisation step was replaced by posterior sampling of the component parameters. We also elucidated the relationships between risk and non-risk haplotypes under different modes of inheritance and genotypic relative risk.
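As a rough illustration of the two-component binomial-mixture idea, the sketch below fits such a mixture with the standard EM algorithm. It is not the modified algorithm of the thesis, which replaces the maximisation step with posterior sampling; the standard M-step is swapped in here for brevity. The function name, the input layout (per-haplotype counts among cases and among cases plus controls) and the initialisation are all assumptions made for this example.

```python
import math

def _clamp(p, eps=1e-6):
    # Keep probabilities strictly inside (0, 1) so the logs stay finite.
    return min(max(p, eps), 1.0 - eps)

def _binom_logpmf(k, n, p):
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)

def fit_two_component_binomial(case_counts, totals, n_iter=200):
    """Standard EM for a two-component binomial mixture (illustrative only).

    case_counts[i]: copies of haplotype i seen in cases (hypothetical input).
    totals[i]:      copies of haplotype i seen in cases plus controls (> 0).
    Returns (weight of high component, p_low, p_high, responsibilities).
    """
    props = [k / n for k, n in zip(case_counts, totals)]
    p_low, p_high = _clamp(min(props)), _clamp(max(props))
    weight = 0.5
    resp = [0.5] * len(props)
    for _ in range(n_iter):
        # E-step: posterior probability that each haplotype is in the high group.
        resp = []
        for k, n in zip(case_counts, totals):
            log_hi = math.log(weight) + _binom_logpmf(k, n, p_high)
            log_lo = math.log(1.0 - weight) + _binom_logpmf(k, n, p_low)
            top = max(log_hi, log_lo)
            hi, lo = math.exp(log_hi - top), math.exp(log_lo - top)
            resp.append(hi / (hi + lo))
        # M-step: responsibility-weighted updates of the mixture parameters.
        weight = _clamp(sum(resp) / len(resp))
        hi_n = sum(r * n for r, n in zip(resp, totals))
        lo_n = sum((1.0 - r) * n for r, n in zip(resp, totals))
        hi_k = sum(r * k for r, k in zip(resp, case_counts))
        lo_k = sum((1.0 - r) * k for r, k in zip(resp, case_counts))
        p_high = _clamp(hi_k / max(hi_n, 1e-12))
        p_low = _clamp(lo_k / max(lo_n, 1e-12))
    return weight, p_low, p_high, resp
```

Haplotypes whose responsibility for the high component exceeds a chosen cut-off (for instance 0.5, again an assumption) would be labelled as candidate risk haplotypes.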

In the second method, we fitted a three-component mixture model to genotype data directly, followed by an odds-ratio thresholding.
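The odds-ratio thresholding step can be pictured with a small sketch. The abstract does not state how the odds ratio is computed or which threshold is used, so the 0.5 continuity correction (a common device for sparse tables), the function names and the threshold value below are assumptions for illustration only.

```python
def odds_ratio(case_with, case_without, ctrl_with, ctrl_without):
    """Odds ratio of carrying a genotype in cases versus controls.

    A 0.5 continuity correction is added to every cell so that sparse
    tables with zero counts still give a finite value (an assumption,
    not necessarily the correction used in the thesis).
    """
    a, b = case_with + 0.5, case_without + 0.5
    c, d = ctrl_with + 0.5, ctrl_without + 0.5
    return (a * d) / (b * c)

def is_risk_genotype(case_with, case_without, ctrl_with, ctrl_without,
                     threshold=2.0):
    # threshold=2.0 is a hypothetical cut-off chosen for this example.
    return odds_ratio(case_with, case_without, ctrl_with, ctrl_without) > threshold
```

For example, a genotype carried by 30 of 100 cases and 10 of 100 controls has a corrected odds ratio of roughly 3.7 and would be flagged under this hypothetical threshold.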

In the third method, we combined the existing haplotype reconstruction software PHASE with a permutation method to infer risk haplotypes.
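A generic permutation test for a single haplotype might look like the sketch below. It assumes haplotypes have already been reconstructed (for example by PHASE, which is not called here) and reduced to a 0/1 carrier indicator per individual. The test statistic, function names and number of permutations are assumptions; the thesis may use a different statistic.

```python
import random

def carrier_freq_diff(carrier, is_case):
    """Absolute difference in carrier frequency between cases and controls."""
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    case_carriers = sum(c for c, y in zip(carrier, is_case) if y)
    ctrl_carriers = sum(c for c, y in zip(carrier, is_case) if not y)
    return abs(case_carriers / n_case - ctrl_carriers / n_ctrl)

def permutation_pvalue(carrier, is_case, n_perm=999, seed=0):
    """Permutation p-value obtained by shuffling case/control labels.

    carrier[i]: 1 if individual i carries the haplotype, else 0.
    is_case[i]: 1 for a case, 0 for a control.
    """
    rng = random.Random(seed)
    observed = carrier_freq_diff(carrier, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between haplotype and disease
        if carrier_freq_diff(carrier, labels) >= observed:
            hits += 1
    # Counting the observed labelling itself keeps the p-value above zero.
    return (hits + 1) / (n_perm + 1)
```

Shuffling the labels keeps the number of cases and controls fixed, so the permuted statistics are directly comparable with the observed one.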

In the fourth method, we proposed a new way to score the genotypes by clustering and combined it with a logistic regression approach to infer risk haplotypes.

The simulation studies showed that the first three methods outperformed the multiple testing method of Zhu (2010) in terms of average specificity and sensitivity (AVSS) in all scenarios considered. The fourth method, based on clustering and logistic regression, also outperformed the standard logistic regression method.
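For readers unfamiliar with the metric, the sketch below computes AVSS on the assumption that it is the plain average of sensitivity and specificity over the candidate haplotypes in a simulated dataset. The abstract does not spell out the formula, so this definition and the function name are assumptions.

```python
def avss(true_risk, predicted_risk):
    """Average of sensitivity and specificity (assumed definition of AVSS).

    true_risk[i] and predicted_risk[i] are truthy when haplotype i is,
    respectively, a true risk haplotype and one flagged by the method.
    Requires at least one true risk and one true non-risk haplotype.
    """
    pairs = list(zip(true_risk, predicted_risk))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)
```

With two true risk haplotypes of which one is detected, and two non-risk haplotypes of which neither is flagged, sensitivity is 0.5 and specificity is 1.0, giving an AVSS of 0.75.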

We applied our methods to two GWAS datasets on coronary artery disease (CAD) and hypertension (HT), detecting several new risk haplotypes and recovering a number of disease-associated genetic variants already reported in the literature.

Item Type: Thesis (Doctor of Philosophy (PhD))
Thesis advisor: Zhang, Jian
Thesis advisor: Wang, Xue
Uncontrolled keywords: EM algorithm, mixture model, permutation test, logistic regression, clustering, risk haplotypes, risk genes, coronary artery disease, hypertension, genome wide association, disease-risk haplotypes, WTCCC
Subjects: Q Science > QA Mathematics (inc Computing science) > QA276 Mathematical statistics
Divisions: Divisions > Division of Computing, Engineering and Mathematical Sciences > School of Mathematics, Statistics and Actuarial Science
Date Deposited: 13 Apr 2015 10:14 UTC
Last Modified: 08 Dec 2022 15:22 UTC