Open source distributed spectral clustering
Joshi, Ankita Prashant
In this thesis, we propose an approximation of the spectral clustering algorithm for clustering large, high-dimensional datasets. Spectral clustering is limited in its application because it scales poorly in both memory use and computation time when the data is very large. We address this by developing an open source framework for approximate spectral clustering that runs in a distributed fashion on Apache Spark. To cluster large datasets, we implement a distributed design of the algorithm and construct a sparse affinity matrix using approximate nearest neighbors. This design reduces the affinity matrix's construction time and storage requirements without significantly affecting clustering accuracy. Experimental results on synthetic and real datasets demonstrate the effectiveness of our framework with respect to both running time and accuracy.
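
To illustrate the sparse affinity matrix mentioned in the abstract, the following is a minimal single-machine sketch in Python, not the thesis's Spark implementation. It makes several assumptions that do not come from the thesis: the function name `knn_affinity`, the Gaussian kernel with bandwidth `sigma`, the toy data, and an exact brute-force neighbor search standing in for the approximate nearest-neighbor search. The idea it shows is that keeping only each point's `k` nearest neighbors stores on the order of `n * k` weights instead of `n * n`.

```python
import math

def knn_affinity(points, k, sigma=1.0):
    """Build a symmetric sparse affinity matrix as a dict {(i, j): weight}.

    Hypothetical helper for illustration only. Neighbours are found by
    exact brute force here; the thesis uses approximate nearest neighbours.
    """
    n = len(points)
    affinity = {}
    for i in range(n):
        # Distances from point i to every other point, closest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            # Gaussian kernel weight (an assumed choice of similarity).
            w = math.exp(-d * d / (2.0 * sigma * sigma))
            # Store both directions so the matrix stays symmetric.
            affinity[(i, j)] = w
            affinity[(j, i)] = w
    return affinity

# Toy data (assumed): two well-separated groups of three points each.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]
A = knn_affinity(points, k=2)
stored = len(A)          # entries actually kept
dense = len(points) ** 2  # entries a dense matrix would hold
```

On this toy input, every stored entry links two points from the same group, so the sparse matrix already reflects the cluster structure that the later eigenvector step of spectral clustering would recover.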