Regression Leaf Forest
Abstract
Many learning methods address classification and regression problems, including Linear Regression, Decision Trees, KNN, and SVMs. These methods work well in many applications, but they struggle with real-world problems that are noisy, non-linear, or high-dimensional. Furthermore, current approaches do not handle missing data well (e.g., missing historical features of companies in stock data). We present an implementation of a hybrid learning system that combines an ensemble of decision trees (Random Forest) with Linear Regression. Linear Regression (LR) is fast but often inaccurate because it assumes linearity, while Random Forests are slower than LR but have been shown to be accurate on high-dimensional and large data sets. By combining the two, we address the weaknesses of each approach and exploit their strengths in both real-time performance and accuracy.
In this thesis, we evaluate a hybrid Random Forest and Linear Regression implementation called "Regression Leaf Forest", a forest of trees with regression leaves for supervised learning problems. The approach extends Random Forests by placing Linear Regression learners at the leaf nodes of the trees to predict functions. Our empirical analysis on both real and artificial data shows that the proposed algorithm requires less computation time on both large and high-dimensional datasets while providing comparable or better accuracy than the Single Tree, Single Linear Regression Tree, and Random Forest algorithms.
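The idea of a forest whose leaves hold Linear Regression learners can be illustrated with a minimal sketch. It is not the thesis implementation. The function names, the random-feature median split, the bootstrap sampling, the small ridge term, and all parameter defaults are assumptions chosen for brevity, since the abstract does not specify them.

```python
import random

def fit_linear(X, y, ridge=1e-8):
    """Least-squares fit with intercept via the normal equations.
    The tiny ridge term (an assumption) keeps the system solvable."""
    rows = [[1.0] + list(x) for x in X]
    d = len(rows[0])
    # Augmented system [A^T A + ridge*I | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(d)] + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(d):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][d] / M[i][i] for i in range(d)]

def build_tree(X, y, rng, min_leaf=10, max_depth=3, depth=0):
    """Grow one tree; every leaf stores linear-regression weights."""
    if len(X) < 2 * min_leaf or depth >= max_depth:
        return {"w": fit_linear(X, y)}
    f = rng.randrange(len(X[0]))                 # random split feature (assumption)
    t = sorted(x[f] for x in X)[len(X) // 2]     # median threshold (assumption)
    lo = [i for i, x in enumerate(X) if x[f] < t]
    hi = [i for i, x in enumerate(X) if x[f] >= t]
    if len(lo) < min_leaf or len(hi) < min_leaf:
        return {"w": fit_linear(X, y)}
    return {
        "f": f, "t": t,
        "lo": build_tree([X[i] for i in lo], [y[i] for i in lo],
                         rng, min_leaf, max_depth, depth + 1),
        "hi": build_tree([X[i] for i in hi], [y[i] for i in hi],
                         rng, min_leaf, max_depth, depth + 1),
    }

def build_forest(X, y, n_trees=10, seed=0, **tree_args):
    """Train each tree on a bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 rng, **tree_args))
    return forest

def predict_tree(node, x):
    """Route x to a leaf, then evaluate that leaf's linear model."""
    while "w" not in node:
        node = node["lo"] if x[node["f"]] < node["t"] else node["hi"]
    w = node["w"]
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def predict_forest(forest, x):
    """Average the leaf-model predictions over all trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)
```

In this sketch, `build_forest(X, y)` takes a list of feature lists and a list of targets, and `predict_forest(forest, x)` returns the averaged prediction for one feature list. The trees partition the input space and each leaf fits a local linear model, so a non-linear target is approximated piecewise linearly.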
URI
http://purl.galileo.usg.edu/uga_etd/ganesan_sivanesan_201105_ms
http://hdl.handle.net/10724/27128