Genetic sequence classification and phylogenetic construction with N-gram methods
Kincaid, Chandler Jordan
This work presents two papers in which machine learning techniques are used to analyze genetic sequence information. In the first paper, feature vectors were created from influenza virus protein data using N-gram methods common in text classification. Several classifiers were trained on these feature vectors and successfully predicted the host organisms of the corresponding viral strains. The best classifier achieved an accuracy of 97.2% on a set of over 700,000 sequences, the largest experiment of its kind to date. The second paper explores N-gram feature vectors in phylogenetic construction. Methods are presented that speed up feature vector creation by 26% over the state of the art. GPU-optimized functions were examined for the distance matrix calculation task of phylogenetic construction and showed up to a 33x speedup compared with CPU-based methods.
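As a rough illustration of the N-gram approach described in the abstract, the sketch below turns protein sequences into fixed-length N-gram frequency vectors and computes a pairwise distance matrix from them. It is not the thesis's implementation: the 20-letter amino acid alphabet, the relative-frequency normalization, the Euclidean distance, and the function names `ngram_vector` and `distance_matrix` are all assumptions chosen for the example, and it runs on the CPU only.

```python
from collections import Counter
from itertools import product
import math

# Assumption: the standard 20 amino acid letters; other symbols are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_vector(sequence, n=2):
    """Return relative frequencies of overlapping n-grams in a fixed vocabulary order."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    # Slide a window of width n across the sequence and count each n-gram.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # Normalize only over n-grams that belong to the vocabulary.
    total = sum(counts[g] for g in vocab)
    return [counts[g] / total if total else 0.0 for g in vocab]


def distance_matrix(vectors):
    """Pairwise Euclidean distances (an assumed metric) between feature vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


# Hypothetical toy sequences, not real influenza data.
sequences = ["MKTIIALSYI", "MKAILVVLLY", "MNPNQKIITI"]
features = [ngram_vector(s, n=2) for s in sequences]
distances = distance_matrix(features)
```

With 2-grams over 20 letters, each vector has 400 entries. Such vectors could be passed to a classifier for host prediction, and the distance matrix could be passed to a tree-building method such as neighbor joining.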