Application of machine learning in malware file classification
In this thesis, multiple advanced machine learning algorithms, including random forests, boosted trees, and support vector machines, were applied to the problem of malware file classification. Feature engineering was performed on a large dataset (∼400 GB) of malware files provided by Kaggle.com. Four feature sets were generated: file sizes and header string frequencies, byte-sequence n-grams, opcode n-grams, and image features. Each feature set was first studied individually, and then different combinations of them were investigated in detail. The importance of individual features was also studied and discussed.
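As an illustration of one of the feature sets named above, byte-sequence n-grams can be computed by sliding a window of n bytes over a file's raw contents and counting how often each window occurs. The following is a minimal sketch, not the thesis's actual pipeline: the function names `byte_ngram_counts` and `ngram_feature_vector`, the choice of n = 2, the normalization by total count, and the sample bytes are all assumptions made for illustration.

```python
from collections import Counter


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous n-byte window in `data` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Slicing a bytes object yields bytes, which is hashable and usable as a key.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_feature_vector(data: bytes, vocabulary: list, n: int = 2) -> list:
    """Relative frequency of each n-gram in a fixed vocabulary (assumed design)."""
    counts = byte_ngram_counts(data, n)
    total = max(sum(counts.values()), 1)  # avoid division by zero on tiny inputs
    return [counts[gram] / total for gram in vocabulary]


# Made-up sample bytes standing in for the contents of one file.
sample = bytes.fromhex("558bec558bec90")
counts = byte_ngram_counts(sample, n=2)
features = ngram_feature_vector(sample, [b"\x55\x8b", b"\x00\x00"], n=2)
```

A fixed vocabulary keeps the feature vector the same length for every file, which is what tree ensembles and support vector machines expect as input. How the vocabulary is selected is left open here.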