Enhancing the quality of high dimensional noisy data for classification and regression problems
Abstract
In classification and regression problems, predictive models trained on high dimensional noisy data suffer from the concurrent negative effects of noise and high dimensionality. Noise corrupts the data, and high dimensionality prevents the model from focusing on the relevant features, which can reduce classification and regression accuracy. However, most noise detection techniques cannot be used on high dimensional data, and many dimensionality reduction methods are not applicable to noisy data. The goal of this dissertation is to enhance the quality of high dimensional noisy data by simultaneously removing noise and selecting relevant features. To this end, we propose the NDFS algorithm, which relies on two genetic algorithms, one for noise detection (GA-ND) and one for feature selection (GA-FS), that exchange their results periodically at regular generation intervals. In addition, prototype selection (PS-ND) is used together with the genetic algorithm to improve the performance of the noise detection part. Our experimental results show that, while the sequential application of noise detection and feature selection methods may not overcome the concurrent negative effects of noise and high dimensionality, the NDFS algorithm does overcome them and achieves high performance through simultaneous noise removal and feature selection. We demonstrate that the NDFS algorithm substantially increases classification accuracy and reduces error rates, and show that it significantly enhances the quality of high dimensional noisy data for both classification and regression problems.
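The interplay between the two genetic algorithms can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the leave-one-out 1-NN fitness, the penalty on discarded instances, the genetic operators, the toy data, and every function name and parameter value below are assumptions chosen for illustration, and the prototype selection step (PS-ND) is omitted. The sketch only shows the structure described above: one population evolves an instance mask, the other evolves a feature mask, and each is scored against the other's most recently exchanged best result.

```python
import random

def loo_1nn_accuracy(X, y, rows, cols):
    """Leave-one-out 1-NN accuracy restricted to the given rows and columns."""
    if len(rows) < 2 or not cols:
        return 0.0
    hits = 0
    for i in rows:
        nearest = min((j for j in rows if j != i),
                      key=lambda j: sum((X[i][c] - X[j][c]) ** 2 for c in cols))
        hits += y[nearest] == y[i]
    return hits / len(rows)

def fitness(X, y, inst_mask, feat_mask):
    rows = [i for i, keep in enumerate(inst_mask) if keep]
    cols = [c for c, keep in enumerate(feat_mask) if keep]
    # Assumed penalty: discourage discarding most instances just to inflate accuracy.
    return loo_1nn_accuracy(X, y, rows, cols) - 0.5 * (1 - len(rows) / len(X))

def evolve(pop, score, rng, mutation_rate=0.05):
    """One generation: elitism, then tournament selection, one-point crossover, bit-flip mutation."""
    ranked = sorted(pop, key=score, reverse=True)
    children = [ranked[0][:]]  # keep the current best individual unchanged
    while len(children) < len(pop):
        a = max(rng.sample(pop, 2), key=score)
        b = max(rng.sample(pop, 2), key=score)
        cut = rng.randrange(1, len(a))
        child = a[:cut] + b[cut:]
        children.append([bit ^ (rng.random() < mutation_rate) for bit in child])
    return children

def ndfs_sketch(X, y, generations=12, exchange_every=3, pop_size=8, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Each population starts with a "keep everything" mask plus random masks.
    inst_pop = [[1] * n] + [[int(rng.random() < 0.9) for _ in range(n)]
                            for _ in range(pop_size - 1)]
    feat_pop = [[1] * d] + [[rng.randint(0, 1) for _ in range(d)]
                            for _ in range(pop_size - 1)]
    best_inst, best_feat = [1] * n, [1] * d
    for gen in range(1, generations + 1):
        # Noise detection GA: scored with the last exchanged feature mask.
        inst_pop = evolve(inst_pop, lambda m: fitness(X, y, m, best_feat), rng)
        # Feature selection GA: scored with the last exchanged instance mask.
        feat_pop = evolve(feat_pop, lambda f: fitness(X, y, best_inst, f), rng)
        if gen % exchange_every == 0:  # periodic exchange of results
            best_inst = max(inst_pop, key=lambda m: fitness(X, y, m, best_feat))
            best_feat = max(feat_pop, key=lambda f: fitness(X, y, best_inst, f))
    return best_inst, best_feat

def make_toy_data(n=24, d=6, flips=4, seed=1):
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    # Feature 0 carries the class signal; the remaining features are pure noise.
    X = [[y[i] * 3 + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(d - 1)]
         for i in range(n)]
    flipped = rng.sample(range(n), flips)
    for i in flipped:
        y[i] = 1 - y[i]  # inject label noise
    return X, y, flipped

X, y, flipped = make_toy_data()
inst_mask, feat_mask = ndfs_sketch(X, y)
print("instances flagged as noise:", [i for i, keep in enumerate(inst_mask) if not keep])
print("features selected:", [c for c, keep in enumerate(feat_mask) if keep])
print("instances with flipped labels:", sorted(flipped))
```

In this sketch, a sequential pipeline would correspond to running one loop to completion before starting the other. The periodic exchange instead lets each search be evaluated on data already partly cleaned by the other, which is the idea the abstract attributes to NDFS.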