Cluster analysis for symbolic interval data using linear regression method
Symbolic data are becoming an increasingly powerful tool for handling large data sets. Interval-valued data are a special type of symbolic data, in which each observation is a vector of intervals. The typical K-means methods for interval-valued data assume that the data separate into spherical clusters, and they usually fail to converge to the correct clusters when the data are not spherically clustered. We propose a K-regressions based clustering method for interval-valued data to recover more complicated data structures. Assuming the response and predictor variables follow K different linear relationships, the data are initially split into K groups at random. We then apply the newly developed "symbolic variation" least squares method to estimate the parameters of the K symbolic regressions. Each data point is then reassigned to its closest group in terms of its symbolic distance to the regression lines. This two-step dynamic clustering algorithm continues until the clusters are stable. Further, we introduce an orthogonal regression clustering algorithm (ORCA) for interval-valued data to avoid specifying a response variable. Two orthogonal regression methods are proposed: the simple orthogonal regression method and the general orthogonal regression method. We use four different methods to determine the optimal number of clusters. A simulation study is conducted to investigate the performance of the ORCA algorithm. We use the Iris data (Fisher, 1936) to test the effectiveness of the ORCA algorithm.
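The two-step loop described above (fit K regressions, then reassign each observation to its closest line) can be illustrated with a minimal sketch for a single interval-valued predictor and response. It is not the method of this work: ordinary least squares on the interval midpoints stands in for the symbolic variation least squares estimator, and the distance to a line is taken as the sum of squared residuals at the lower and upper endpoints. Both choices, as well as the function name `k_regressions` and its parameters, are assumptions made for illustration only.

```python
import numpy as np

def k_regressions(x_lo, x_hi, y_lo, y_hi, k, n_iter=100, seed=0):
    """Hypothetical K-regressions clustering sketch for interval data.

    Each observation i is the pair of intervals [x_lo[i], x_hi[i]] and
    [y_lo[i], y_hi[i]]. Returns (labels, coefs), where coefs[j] holds
    the (intercept, slope) of the line fitted to cluster j.
    """
    x_lo, x_hi, y_lo, y_hi = (np.asarray(a, dtype=float)
                              for a in (x_lo, x_hi, y_lo, y_hi))
    rng = np.random.default_rng(seed)
    n = len(x_lo)

    # Interval midpoints, used here as a stand-in for the symbolic fit.
    xc = (x_lo + x_hi) / 2.0
    yc = (y_lo + y_hi) / 2.0

    # Step 0: split the data into k groups at random.
    labels = rng.integers(0, k, size=n)
    coefs = np.zeros((k, 2))

    for _ in range(n_iter):
        # Step 1: fit one line per cluster (assumption: OLS on midpoints).
        for j in range(k):
            mask = labels == j
            if mask.sum() < 2:
                continue  # too few points: keep the previous coefficients
            slope, intercept = np.polyfit(xc[mask], yc[mask], 1)
            coefs[j] = (intercept, slope)

        # Step 2: reassign each observation to the closest line.
        # Assumed distance: squared residuals at both interval endpoints.
        a = coefs[:, 0][:, None]  # intercepts, shape (k, 1)
        b = coefs[:, 1][:, None]  # slopes, shape (k, 1)
        dist = (y_lo - (a + b * x_lo)) ** 2 + (y_hi - (a + b * x_hi)) ** 2
        new_labels = dist.argmin(axis=0)

        # Stop once the clusters are stable.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, coefs
```

Because the initial split is random, a run of this sketch may settle in a local optimum; trying several values of `seed` and keeping the partition with the smallest total distance is a common precaution.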