1. Introduction
The support vector machine (SVM) is a machine learning method proposed by Vapnik and others on the basis of statistical learning theory. Because it effectively avoids the local-minimum problem and offers good generalization performance and classification accuracy, SVM has in recent years been applied more and more widely to pattern recognition, regression analysis, and feature extraction, and has become a new international research hotspot in the fields of artificial intelligence and machine learning. However, large training sets bring slow learning speed and large storage demands, which directly hinder the application of the SVM technique. Moreover, when the training data are mingled with outliers relative to their class, training on them often fails to improve classification capability; on the contrary, the outliers greatly increase the computational burden of training and may cause over-learning, which raises the VC dimension of the classification discriminant function, enlarges the confidence interval, and ultimately harms the generalization ability of SVM. Many improved SVM algorithms have therefore been proposed (Agarwal, 2002; Daniael & Cao, 2004; Luo, 2007; Xiao, Li, & Zhang, 2006; Li, Wang, & Yuan, 2003; Zeng, 2007; Tan & Ding, 2008; Cao, Liu, & Zhang, 2006).
The reduction strategies in Zeng (2007) and Cao, Liu, and Zhang (2006) are based on the idea of the class center: after obtaining the clustering centers of the positive and negative samples, the training set is reduced according to a prescribed radius relationship between each sample and its cluster center. This method, however, is suitable only for convex sample sets. Xiao, Li, and Zhang (2006) restrict the training set by C-means clustering: if all samples in a cluster come from the same class, the cluster center replaces them; otherwise, all samples of the cluster are retained. That method can effectively reduce non-convex training sets, but when the number of clusters is less than 1/20 of the number of samples, the reduction effect is not obvious; for larger sample sets, the computation time grows with the number of clusters and outweighs the benefit of the reduction, so the method loses practical significance. The NN-SVM algorithm proposed by Li, Wang, and Yuan (2003) accepts or rejects each sample according to whether its class agrees with that of its nearest neighbors. This method not only reduces the size of the training set but also reduces the influence of outliers on SVM generalization performance; however, searching for the nearest neighbor of every sample point is very time-consuming, so for larger sample sets the algorithm's efficiency is extremely low and it, too, loses practical significance. Another reduction strategy, PSCC, is presented in Luo (2007): it exploits the geometric characteristics of training samples that are linearly separable in the high-dimensional feature space, reducing the sample set by computing the angle between each sample and the line connecting the positive and negative cluster centers in that space. This algorithm has practical significance, but it requires a great deal of kernel computation, so its efficiency is not high.
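To illustrate the class-center idea underlying the simplest of these strategies, the following is a minimal sketch in NumPy. It keeps, for each class, only the samples farthest from their class centroid (likely margin candidates) and discards interior points; the `keep_ratio` parameter and the Euclidean-distance criterion are assumptions for illustration, not the exact rules of the cited papers.

```python
import numpy as np

def class_center_reduce(X, y, keep_ratio=0.6):
    """Reduce a training set by the class-center idea (illustrative sketch).

    For each class, compute the centroid, rank samples by distance to it,
    and keep only the farthest keep_ratio fraction, which are the samples
    most likely to lie near the decision boundary.
    Returns the sorted indices of the retained samples.
    """
    keep = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        center = X[idx].mean(axis=0)                      # class centroid
        dist = np.linalg.norm(X[idx] - center, axis=1)    # distance to centroid
        n_keep = max(1, int(round(keep_ratio * len(idx))))
        # samples farthest from the centroid are the boundary candidates
        keep.extend(idx[np.argsort(dist)[-n_keep:]])
    return np.sort(np.array(keep))
```

As the surrounding discussion notes, a rule of this kind works only when each class forms a roughly convex region around its centroid; for non-convex classes, interior/exterior distance to a single center no longer identifies boundary samples.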
In view of the above analysis of the basic ideas, advantages, and disadvantages of the existing improved algorithms, this paper puts forward a new reduction strategy for large-scale SVM training samples based on point-set theory. For large-scale training sets mingled with class outliers, it can effectively reduce both the sample size and the influence of the outliers on the classification discriminant function, thereby increasing training speed without degrading SVM classification performance.