Statistical methods for the analysis of complex genomic data
Robbins, Kelly R.
Genomic technology has the potential to provide invaluable insight into the mechanisms underlying several important traits. Unfortunately, this information comes at a cost: the data are high-dimensional and sometimes of poor quality. One potential application of genomics is the diagnosis of diseases, such as Alzheimer’s disease, with ambiguous and confounding clinical markers. To predict disease status, an algorithm must first be trained on a data set in which disease status is known without error; for incipient Alzheimer’s disease, this is rarely the case. To this end, a misclassification algorithm was applied to a data set containing healthy individuals and incipient Alzheimer’s patients to examine the effects of potential misclassification on diagnostic accuracy. Without the misclassification algorithm, the model showed limited predictive power. When the misclassification algorithm was invoked, a significant increase in the model’s predictive ability was obtained. These results demonstrate the utility of the misclassification algorithm in data sets containing potential misdiagnoses. In addition to potential misdiagnosis, the high dimensionality of genomic data sets can pose substantial problems for statistical analysis. Because of the large number of features in many genomic data sets, explicit modeling of gene interactions is often infeasible. To eliminate the need for simplifying assumptions, a machine learning algorithm, referred to as the ant colony algorithm (ACA), was adapted for the analysis of high-dimensional genomic data. In a study examining the selection of predictive gene expression patterns, the performance of the ACA was compared to several standard methodologies. When applied to high-dimensional data sets, the ACA identified small subsets of highly predictive genes, yielding superior prediction accuracy compared with several standard feature selection methods.
In an application involving single nucleotide polymorphism (SNP) marker data, a modified ACA was implemented to identify markers associated with a binary trait under the influence of interacting loci. Compared with marginal effects models, the ACA demonstrated superior performance under several simulation scenarios: p-values for associated SNPs were more significant with the ACA, resulting in substantial increases in power.
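The abstract does not describe the internals of the ACA, so the following is only a minimal sketch of the general ant colony idea applied to feature selection, not the author's implementation. Everything in it is an assumption made for illustration: the synthetic data, the nearest-centroid classifier used to score a feature subset, the evaporation rate `rho`, the pheromone update rule, and all function names. Each "ant" samples a small feature subset with probability proportional to pheromone, the subset is scored on held-out data, and features in well-scoring subsets receive more pheromone.

```python
import random

def make_data(n=120, p=30, informative=(0, 1, 2), seed=0):
    """Hypothetical synthetic data: Gaussian noise, with a mean shift for
    class 1 on the informative features only."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate classes so every split is balanced
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in informative:
            row[j] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y

def centroid_accuracy(X_tr, y_tr, X_te, y_te, feats):
    """Held-out accuracy of a nearest-centroid classifier restricted to feats."""
    cents = {}
    for c in (0, 1):
        rows = [x for x, lab in zip(X_tr, y_tr) if lab == c]
        cents[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    correct = 0
    for x, lab in zip(X_te, y_te):
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in (0, 1)}
        pred = 0 if dist[0] <= dist[1] else 1
        correct += int(pred == lab)
    return correct / len(y_te)

def weighted_sample(weights, k, rng):
    """Draw k distinct indices with probability proportional to weights."""
    idx = list(range(len(weights)))
    w = list(weights)
    chosen = []
    for _ in range(k):
        pick = rng.choices(range(len(idx)), weights=w, k=1)[0]
        chosen.append(idx.pop(pick))
        w.pop(pick)
    return chosen

def ant_colony_select(X, y, k=3, n_ants=20, n_iter=30, rho=0.1, seed=1):
    """Return the k features with the most pheromone, plus all pheromone levels."""
    rng = random.Random(seed)
    p = len(X[0])
    split = len(X) * 2 // 3  # first two thirds train, remainder held out
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
    tau = [1.0] * p  # uniform initial pheromone
    for _ in range(n_iter):
        deposits = [0.0] * p
        for _ in range(n_ants):
            feats = weighted_sample(tau, k, rng)
            acc = centroid_accuracy(X_tr, y_tr, X_te, y_te, feats)
            for j in feats:
                deposits[j] += acc / n_ants  # better subsets deposit more
        # evaporate old pheromone, then add this iteration's deposits
        tau = [(1.0 - rho) * t + d for t, d in zip(tau, deposits)]
    ranked = sorted(range(p), key=lambda j: tau[j], reverse=True)
    return ranked[:k], tau

if __name__ == "__main__":
    X, y = make_data()
    selected, tau = ant_colony_select(X, y)
    print("selected features:", selected)
```

Because subsets are scored jointly rather than one feature at a time, a scheme of this kind can in principle reward features that are predictive only in combination, which is the motivation the abstract gives for avoiding marginal effects models.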