Analyzing the performance of machine learning algorithms on metagenomic data
Metagenomics is a branch of bioinformatics that deals with the study and analysis of micro-organisms in their natural environments. Many micro-organisms, including many species of bacteria, archaea and viruses, must be studied in their natural habitat because they cannot be cultivated in the laboratory using standard techniques. Machine learning techniques are being applied in this field to predict novel genes. In this thesis, we address the problem of classifying metagenomic sequences. First, we compare the performance of several machine learning approaches, including ensemble learners, to identify which algorithms can bin metagenomic data into taxa-specific bins with high accuracy. We then carry out scalability studies to investigate how the performance of those algorithms degrades as the number of species in the metagenomic sample increases. We also study how performance degrades as the number of unknown sequences in the data increases. The results are promising and show that machine learning algorithms perform well in this domain. Furthermore, performance degrades gracefully as the number of species and the number of unknown sequences increase.
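To make the binning task concrete, the following is a minimal sketch of one way a sequence could be assigned to a taxon-specific bin. It is not the method evaluated in the thesis: the k-mer frequency features, the nearest-centroid classifier, the choice of k = 2, the function names and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math

K = 2  # k-mer length; an illustrative choice, not taken from the thesis
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 16 dinucleotides


def kmer_profile(seq, k=K):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS) or 1  # avoid division by zero
    return [counts[m] / total for m in KMERS]


def train_centroids(labeled_seqs):
    """Average the k-mer profiles of each taxon's training sequences."""
    sums, n = {}, Counter()
    for seq, taxon in labeled_seqs:
        acc = sums.setdefault(taxon, [0.0] * len(KMERS))
        for i, value in enumerate(kmer_profile(seq)):
            acc[i] += value
        n[taxon] += 1
    return {t: [v / n[t] for v in acc] for t, acc in sums.items()}


def assign_bin(seq, centroids):
    """Assign a sequence to the taxon whose centroid is closest (Euclidean)."""
    profile = kmer_profile(seq)
    return min(centroids, key=lambda t: math.dist(profile, centroids[t]))


# Hypothetical toy data: two made-up taxa with very different composition.
training = [
    ("ATATATTTAATATAAATTTA", "taxon_AT_rich"),
    ("TTTAAATATATTAATTATAT", "taxon_AT_rich"),
    ("GCGCGGCCGCGGGCCGCGCC", "taxon_GC_rich"),
    ("CCGGGCGCGCCGGCGGCCGC", "taxon_GC_rich"),
]
centroids = train_centroids(training)
predicted = assign_bin("GGCCGCGCGGGCCCGCGGCG", centroids)
```

In a scalability study of the kind described above, the same pipeline would be rerun with progressively more taxa in the training set, and with query sequences from taxa absent from it, while tracking how accuracy changes.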