Improving learning outcomes by using clustering validity analysis to reduce label uncertainty
U, Man Chon
When people make critical decisions, they often consider the opinions of multiple experts from different domains rather than committing to a single expert or relying solely on their own judgment. The same principle applies in machine learning to the construction of robust predictive models. However, the output of each expert or model may carry some degree of uncertainty, which affects the robustness of the combined predictive model. In this dissertation, we present two effective machine learning frameworks, ROMP and VAMO, that tackle challenging problems in cancer prediction and malware clustering, respectively. Both frameworks improve the learning outcomes obtained from multiple sources by using cluster validity analysis to reduce label uncertainty. In ROMP, we introduce novel features for identifying cancer-causing mutations in the cancer kinome and use several feature selection methods to evaluate them. We combine multiple classifiers to predict rare oncogenic mutations, then apply the Expectation Maximization (EM) clustering algorithm together with our own cluster validity metrics to improve the learning outcomes of the ensemble classifier and to identify suspicious mutations in the dataset. The framework discovered rare causative mutations that were later confirmed by lab experiments. The second framework, VAMO, provides a fully automated assessment of the quality of malware clustering results. VAMO does not require a manual mapping between the malware family labels output by different AV scanners. Furthermore, VAMO does not discard malware samples for which a majority voting-based consensus cannot be reached. Instead, VAMO explicitly deals with the inconsistencies typical of multiple AV labels to build a more representative reference set.
Our evaluation, which includes extensive experiments in a controlled setting and a real-world application, shows that VAMO performs better than majority voting-based approaches and provides a way for malware analysts to automatically assess the quality of their malware clustering results.
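
For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a majority voting-based reference set, not of VAMO itself. The sample names, family labels, the `majority_vote` helper, and the 50% threshold are all hypothetical, and the labels are assumed to have already been mapped to common family names (the manual step that VAMO is described as avoiding). It only illustrates how samples without a consensus end up discarded.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_vote(labels: List[str], threshold: float = 0.5) -> Optional[str]:
    """Return the label given by more than `threshold` of the scanners, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > threshold else None


# Hypothetical, already-normalized family labels from three AV scanners.
av_labels: Dict[str, List[str]] = {
    "sample_a": ["zbot", "zbot", "zbot"],
    "sample_b": ["zbot", "zbot", "conficker"],
    "sample_c": ["zbot", "conficker", "vundo"],  # no label reaches a majority
}

# Reference family per sample; None marks a sample with no consensus.
reference = {sample: majority_vote(labels) for sample, labels in av_labels.items()}

# A majority voting-based approach drops these samples from the reference set.
discarded = [sample for sample, family in reference.items() if family is None]
```

In this toy input, `sample_c` has no consensus label and would be left out of the reference set, which is the kind of loss the abstract says VAMO is designed to avoid.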