Existing decision tree (DT) algorithms are usually implemented with a “black-box” approach: the user specifies the input data and the parameters used to define an appropriate model, while the induction procedure itself is hidden. The user has no possibility of changing the algorithm in order to improve its results. Such algorithms are very hard to analyze and evaluate, because it is hard to determine which part of the algorithm influenced its performance. A certain part of one algorithm may be the best on one data set, while on another data set the corresponding part of some other algorithm may be better. In this approach, testing the performance of different parts of algorithms across data sets, and combining the most efficient parts of different algorithms, is not possible. Better performance with these algorithms can be achieved through incremental improvement of existing algorithms.
One of the first “black-box” DT algorithms is ID3 (Quinlan, 1986). This algorithm works only with categorical variables, is based on “multi-way” splits, and uses the “Information Gain” measure to evaluate split quality. This evaluation measure is biased towards choosing attributes with more categories. Breiman, Friedman, Olshen and Stone (1984) proposed the CART algorithm, which works with both categorical and numerical variables and uses the “Gini” measure for split evaluation. The algorithm supports only “binary” splits. The C4.5 algorithm (Quinlan, 1993) is an improvement of ID3 that can work with both categorical and numerical data. It uses “multi-way” splits for categorical data and “binary” splits for numerical data. For split evaluation it uses the “Gain Ratio” measure, which normalizes Information Gain by the split information to counter the bias towards attributes with many categories. It also includes three pruning algorithms: reduced error pruning, pessimistic error pruning and error-based pruning. The CHAID algorithm was proposed by Kass (1980); it uses the chi-square test to evaluate split quality. The QUEST algorithm (Loh & Shih, 1997) removes insignificant attributes using the chi-square test for categorical data and the ANOVA F-test for numerical data.
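The split evaluation measures named above can be illustrated with a minimal Python sketch. It is not taken from any of the cited implementations. The function names, the toy data, and the attributes `row_id` and `outlook` are assumptions made for illustration, and each attribute is evaluated as a multi-way split on its categorical values.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gini(items):
    """Gini impurity of a list of categorical values."""
    n = len(items)
    return 1.0 - sum((c / n) ** 2 for c in Counter(items).values())


def partition(values, labels):
    """Multi-way split: group the class labels by attribute value."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def weighted_impurity(groups, impurity):
    """Size-weighted average impurity of the groups produced by a split."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * impurity(g) for g in groups)


def information_gain(values, labels):
    """Entropy reduction achieved by the split (the measure used in ID3)."""
    return entropy(labels) - weighted_impurity(partition(values, labels), entropy)


def gain_ratio(values, labels):
    """Information gain divided by split information (the measure used in C4.5)."""
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return information_gain(values, labels) / split_info


def gini_gain(values, labels):
    """Gini impurity reduction; shown here for a multi-way split,
    whereas CART itself considers only binary splits."""
    return gini(labels) - weighted_impurity(partition(values, labels), gini)


# Hypothetical toy data: 8 rows, 4 "yes" and 4 "no".
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
row_id = list(range(8))                 # one category per row
outlook = ["sun"] * 3 + ["rain"] * 5    # two categories

for name, attr in [("row_id", row_id), ("outlook", outlook)]:
    print(name,
          round(information_gain(attr, labels), 3),
          round(gain_ratio(attr, labels), 3),
          round(gini_gain(attr, labels), 3))
```

On this toy data Information Gain ranks the uninformative `row_id` attribute above `outlook`, simply because it has one category per row, while Gain Ratio ranks `outlook` first. The normalization reduces the bias but does not guarantee this ordering on every data set.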