Uncertainty of the dependent variable in genome-wide association studies
Smith, Shannon Nicole
High-throughput technology has the potential to provide insight into the biological mechanisms underlying several important complex traits. Unfortunately, this information comes at a cost, sometimes in the form of low accuracy and poor data quality. One application of genomics in humans is the diagnosis of diseases such as bipolar disorder, Alzheimer’s disease, and cancer. However, to correctly predict disease status, methods should be trained on datasets containing no errors. For most conditions this is seldom the case, as misdiagnoses are prevalent due to overlapping symptoms and a lack of precise diagnostic tools. Therefore, a new approach for dealing with misclassification was developed and applied to simulated data sets in which case and control observations were randomly switched and the odds ratios of influential SNPs were varied, to examine the effects of potential misclassification on diagnostic accuracy. When misclassification was ignored, the predictive power of the model was limited. When the misclassification algorithm was applied, predictive power increased across all scenarios, demonstrating the effectiveness of the algorithm. Additionally, in livestock applications, genomic technology is used to detect genetic variants associated with economically important traits and to estimate genomic-enhanced breeding values for use in animal selection. For genome-wide association studies (GWAS) in animal applications, the dependent variable is often a ‘pseudo’ phenotype (estimated breeding values, de-regressed breeding values, etc.). Being estimates, these pseudo-phenotypes carry a certain level of inaccuracy or uncertainty. In some situations, such uncertainty is large and is not constant across observations. Consequently, using these estimates directly as dependent variables in the GWAS can be problematic because the residual variance of the model is composed of two components (the sampling variance and the error variance) that current methods are unable to accommodate. Thus, we developed and implemented a new procedure that correctly accounts for both components of the residual variance, leading to an increase in the accuracy of the estimated genomic breeding values. The proposed method was evaluated with real and simulated data.
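To illustrate why a two-component residual variance matters, the following is a minimal sketch of a single-SNP regression in which each pseudo-phenotype is weighted by the inverse of its total residual variance. It is not the procedure developed in this work, only a generic weighted least-squares example. It assumes that the per-observation sampling variance and a common error variance are already known; the function name `weighted_snp_effect` and all parameter names are hypothetical.

```python
import numpy as np


def weighted_snp_effect(y, x, sampling_var, error_var):
    """Weighted least-squares fit of y = mu + x * b + residual.

    Assumption (illustrative only): the residual variance of observation i
    is error_var + sampling_var[i], and both quantities are known.
    Returns the estimates (mu, b) and their standard errors.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    sampling_var = np.asarray(sampling_var, dtype=float)

    # Total residual variance per observation: two components added together.
    total_var = error_var + sampling_var
    weights = 1.0 / total_var

    # Design matrix with an intercept column and the SNP genotype column.
    X = np.column_stack([np.ones_like(x), x])

    # Weighted normal equations: (X' W X) beta = X' W y.
    XtW = X.T * weights
    lhs = XtW @ X
    beta = np.linalg.solve(lhs, XtW @ y)

    # With known variances, Var(beta) = (X' W X)^{-1}.
    std_err = np.sqrt(np.diag(np.linalg.inv(lhs)))
    return beta, std_err


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 500
    genotypes = rng.integers(0, 3, size=n).astype(float)  # SNP coded 0/1/2
    samp_var = rng.uniform(0.1, 4.0, size=n)              # varies by animal
    err_var = 1.0
    noise = rng.normal(0.0, np.sqrt(err_var + samp_var))
    pseudo_phenotype = 2.0 + 0.3 * genotypes + noise

    estimates, errors = weighted_snp_effect(
        pseudo_phenotype, genotypes, samp_var, err_var
    )
    print("intercept and SNP effect:", estimates)
    print("standard errors:", errors)
```

When every observation has the same sampling variance, the weights are constant and the fit reduces to ordinary least squares; the weighting only matters when uncertainty differs across observations, which is the situation described above.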