Sample size determination in multi-class classification and prediction based on single-nucleotide polymorphisms
Single-nucleotide polymorphisms (SNPs), believed to underlie differences among humans, are widely used to predict disease risk and the class membership of subjects. Several supervised machine learning methods, such as support vector machines, neural networks, and logistic regression, are available in the literature for classification. Typically, however, training samples are limited and/or the sampling cost is high. It is therefore essential to determine the minimum sample size needed to construct a classifier based on SNP data. Such a classifier would facilitate correct classification while keeping the sample size to a minimum, thereby making studies cost-effective. In this dissertation, for coded SNP data from two classes, an optimal classifier and an approximation to its probability of correct classification (PCC) are derived. A linear classifier is constructed, and an approximation to its PCC is also derived. These approximations are validated through a variety of Monte Carlo simulations. A sample size determination algorithm is given, based on the criterion that the difference between the two approximate PCCs is below a threshold. For the HapMap data on Chinese and Japanese populations, a linear classifier is built using 51 independent SNPs, and the required total sample sizes are determined using our algorithm. For coded SNP data from D (≥ 2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the PCC for each classifier. These approximations are used to evaluate the associated areas under the receiver operating characteristic (ROC) curve (AUCs) or volumes under the ROC hyper-surface (VUSs). We give an algorithm for sample size determination that ensures that the difference between the two approximate AUCs (or VUSs) is below a pre-specified threshold. The performance of this algorithm is also illustrated via simulations.
For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs, and the required total sample sizes are determined. We also illustrate the usefulness of our sample size determination algorithm in a prediction problem using a Heterogeneous Stock mice data set.
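
The two-class sample size criterion described above can be illustrated with a minimal sketch. It is not the dissertation's algorithm, and every modeling choice in it is an assumption made for illustration: genotypes are coded 0/1/2 and drawn as Binomial(2, p) at each independent SNP, the two classes have equal prior probabilities, the linear rule is a plug-in log-likelihood ratio built from smoothed allele-frequency estimates, and its PCC is averaged over simulated training sets rather than obtained in closed form. The function names (`llr_weights`, `approx_pcc`, `required_n`) are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def llr_weights(p1, p2):
    """Slopes and intercept of the log-likelihood ratio (class 1 vs class 2).

    Assumes genotype x_j ~ Binomial(2, p) per SNP, so the ratio is linear in x_j.
    """
    w = [math.log(u * (1 - v) / (v * (1 - u))) for u, v in zip(p1, p2)]
    c = sum(2 * math.log((1 - u) / (1 - v)) for u, v in zip(p1, p2))
    return w, c


def approx_pcc(w, c, p1, p2):
    """Normal approximation to the PCC of the rule 'class 1 if sum(w*x) + c > 0'.

    Evaluated under the true allele frequencies p1, p2 with equal priors.
    """
    def moments(p):
        mean = c + sum(wj * 2 * pj for wj, pj in zip(w, p))
        var = sum(wj ** 2 * 2 * pj * (1 - pj) for wj, pj in zip(w, p))
        return mean, math.sqrt(var)

    m1, s1 = moments(p1)
    m2, s2 = moments(p2)
    if s1 == 0.0 or s2 == 0.0:
        return 0.5  # constant rule: one class is always chosen
    return 0.5 * normal_cdf(m1 / s1) + 0.5 * normal_cdf(-m2 / s2)


def estimate_freqs(p, n, rng):
    """Allele frequencies estimated from n subjects (2n alleles) per SNP.

    Adding 0.5 to the count keeps estimates strictly inside (0, 1);
    this smoothing is an illustrative choice.
    """
    est = []
    for pj in p:
        count = sum(1 for _ in range(2 * n) if rng.random() < pj)
        est.append((count + 0.5) / (2 * n + 1))
    return est


def required_n(p1, p2, threshold, grid, reps=200, seed=0):
    """Smallest per-class n in grid whose average plug-in PCC is within
    threshold of the optimal PCC; returns (n or None, optimal PCC)."""
    rng = random.Random(seed)
    w_opt, c_opt = llr_weights(p1, p2)
    target = approx_pcc(w_opt, c_opt, p1, p2)
    for n in grid:
        total = 0.0
        for _ in range(reps):
            h1 = estimate_freqs(p1, n, rng)
            h2 = estimate_freqs(p2, n, rng)
            w, c = llr_weights(h1, h2)
            total += approx_pcc(w, c, p1, p2)
        if target - total / reps < threshold:
            return n, target
    return None, target
```

Calling `required_n(p1, p2, 0.02, [5, 10, 20, 40, 80])` with two lists of made-up allele frequencies returns the first per-class sample size on the grid that meets the 0.02 threshold, or `None` if no grid value does. The total sample size under this sketch would be twice the per-class value.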