Design and analysis issues in high dimension, low sample size problems
Safo, Sandra Esi
Advances in technology and computing power have led to the generation of data with an enormous number of variables relative to the number of observations. Data of this type, known as high dimension, low sample size data, pose challenges that require either modifying existing traditional methods or developing new statistical methods. One of these challenges is the development of sparse methods, which use only a fraction of the variables. Sparse methods have been shown to perform better at making predictions on real high dimensional problems, which justifies their study and use in practice. This dissertation considers three novel methods for designing and analyzing high dimensional studies. First, we propose a new sample size method to estimate the number of samples required in a training set when allocating new entities into two groups. The methodology exploits the structural similarity between logistic regression prediction and errors-in-variables models. Second, we consider the problem of assigning future observations to known classes using linear discriminant analysis. We propose a new classification approach that generalizes existing binary linear discriminant methods to multiclass methods. Our methodology utilizes the equivalence between the discriminant subspace obtained from Fisher's linear discriminant analysis and the basis vectors of the between-class scatter. We apply the proposed method to two sparse methods. Third, we develop a general framework that yields sparse vectors for many multivariate statistical methods. The framework uses the relationship between many multivariate statistical problems and the generalized eigenvalue problem. We illustrate this framework with two multivariate statistical methods: linear discriminant analysis for classifying new entities into more than two groups, and canonical correlation analysis for studying associations between two different high dimensional data types.
The effectiveness of the proposed methods in this dissertation is evaluated by various simulated processes and real data analyses on microarray and RNA sequencing (RNA-seq) data.
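
For readers unfamiliar with the generalized eigenvalue problem mentioned in the abstract, the block below gives the standard textbook (non-sparse) forms for the two illustrated methods. It is a sketch for orientation only and does not reproduce the dissertation's sparse formulation. The symbols are assumptions not taken from the abstract: $S_b$ and $S_w$ denote the between-class and within-class scatter matrices, $\Sigma_{xx}$, $\Sigma_{yy}$, and $\Sigma_{xy}$ denote the covariance and cross-covariance matrices of the two data types, and $\rho$ is a canonical correlation.

```latex
% Generalized eigenvalue problem (A symmetric, B symmetric positive definite)
A v = \lambda B v

% Fisher's linear discriminant analysis: A = S_b, B = S_w
S_b v = \lambda S_w v

% Canonical correlation analysis for data types x and y
\begin{pmatrix} 0 & \Sigma_{xy} \\ \Sigma_{yx} & 0 \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
= \rho
\begin{pmatrix} \Sigma_{xx} & 0 \\ 0 & \Sigma_{yy} \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
```

When the number of variables exceeds the number of observations, the sample versions of $S_w$, $\Sigma_{xx}$, and $\Sigma_{yy}$ are singular, so these textbook forms cannot be solved directly.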