Comparison of machine learning techniques to predict missing cyanobacteria data and trophic states of lakes
Cyanobacteria are harmful blue-green algae that deplete oxygen levels in lakes and release toxins that adversely affect aquatic and human life. Hence, cyanobacteria monitoring is crucial. Scientists use satellite data to track cyanobacteria concentrations in lakes. However, cloud cover and fog hinder satellite data collection, creating a need to predict the missing data values. We investigate machine learning approaches on 10 years of satellite data from 99 lakes in the southeastern US. We formulate the missing data problem as a classification problem and compare the performance of various classifiers that use historical data from other lakes for classification. In addition, we conduct a spatio-temporal analysis in which we leverage matrix factorization techniques to predict missing data. We achieve 88.9% accuracy with Random Forests when 5% of the data is missing from the target lake, and observe that Random Forests and k-Nearest Neighbors are highly effective at addressing the missing data problem.
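
The classification framing described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a plain k-Nearest Neighbors vote written in pure Python, made-up toy readings, and two illustrative class labels ("low" and "high"). The function names `knn_predict` and `impute_target` are hypothetical. Each week in which the target lake was observed becomes a training example whose features are the readings from other lakes in that week; weeks in which the target lake is missing are then classified from those same features.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training rows closest to `query`."""
    # Squared Euclidean distance is enough for ranking neighbours.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def impute_target(other_lakes, target, k=3):
    """Fill the None entries of `target` using the weeks where it was observed."""
    observed = [t for t, y in enumerate(target) if y is not None]
    train_X = [other_lakes[t] for t in observed]
    train_y = [target[t] for t in observed]
    return [
        y if y is not None else knn_predict(train_X, train_y, other_lakes[t], k)
        for t, y in enumerate(target)
    ]


# Toy data (made up): one row per week, one column per other lake.
other_lakes = [
    [0.1, 0.2, 0.1],
    [0.2, 0.1, 0.2],
    [0.1, 0.1, 0.2],
    [0.9, 0.8, 0.9],
    [0.8, 0.9, 0.8],
    [0.9, 0.9, 0.8],
    [0.15, 0.15, 0.15],  # target lake hidden by cloud this week
    [0.85, 0.85, 0.85],  # target lake hidden by cloud this week
]
target = ["low", "low", "low", "high", "high", "high", None, None]

filled = impute_target(other_lakes, target, k=3)
print(filled)
# ['low', 'low', 'low', 'high', 'high', 'high', 'low', 'high']
```

Observed entries are passed through unchanged, so only the missing weeks receive predicted labels. A Random Forest could be substituted for the neighbor vote without changing how the training examples are built.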