An Empirical Study on Initializing Centroid in K-Means Clustering for Feature Selection

Amit Saxena, John Wang, Wutiphol Sintunavarat
DOI: 10.4018/IJSSCI.2021010101

Abstract

One of the main problems in K-means clustering is the setting of initial centroids, which can cause misclustering of patterns and thereby degrade clustering accuracy. Recently, a density and distance-based technique for determining initial centroids has claimed faster convergence of clusters. Motivated by this key idea, the authors study the impact of initial centroids on clustering accuracy for unsupervised feature selection. Three metrics are used to rank the features of a data set. The centroids of the clusters in the data sets, to be applied in K-means clustering, are initialized randomly as well as by density and distance-based approaches. Extensive experiments are performed on 15 datasets. The main significance of the paper is that K-means clustering yields higher accuracies on the majority of these datasets using the proposed density and distance-based approach. As an impact of the paper, a good clustering accuracy can be achieved with fewer features, which can be useful in data mining of data sets with thousands of features.
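To make the density and distance-based seeding idea concrete, the following Python sketch shows one common way such an initializer can be built: the first centroid is the pattern with the highest local density, and each subsequent centroid is a pattern that is both dense and far from the centroids already chosen. This is a minimal illustration only; the neighbourhood radius default and the density-times-distance score are assumptions for the sketch and are not claimed to be the exact procedure evaluated in the paper.

```python
import numpy as np

def density_distance_init(X, k, radius=None):
    """Choose k initial centroids by combining each pattern's local density
    (number of neighbours within a radius) with its distance to the centroids
    already chosen, so that seeds lie in dense regions and are well separated.
    Illustrative sketch, not the paper's exact algorithm."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    if radius is None:
        radius = d.mean()                      # crude default neighbourhood size (assumption)
    density = (d < radius).sum(axis=1)         # local density of each pattern

    chosen = [int(np.argmax(density))]         # densest pattern seeds the first centroid
    while len(chosen) < k:
        nearest = d[:, chosen].min(axis=1)     # distance to the nearest chosen centroid
        score = density * nearest              # favour patterns that are dense AND far away
        score[chosen] = -np.inf                # never reselect an already chosen pattern
        chosen.append(int(np.argmax(score)))
    return X[chosen]
```

The returned array can be passed to scikit-learn as, for example, `KMeans(n_clusters=k, init=density_distance_init(X, k), n_init=1)`, whereas the random baseline corresponds to `init='random'`.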

1. Introduction

The curse of dimensionality is a major problem in large datasets. A dimension is commonly referred to as a feature, attribute, property, or simply a column of a dataset. In order to preserve as much information as possible, many irrelevant features are retained in a dataset; these features may contribute nothing when the dataset is classified to draw inferences from it, and can even add to the misclassification of patterns. A dataset with high dimensionality may also increase the time and space complexity of classifying it. More specifically, the performance of a classifier depends on several factors: (i) the number of training instances, (ii) the dimensionality, i.e., the number of features, and (iii) the complexity of the classifier (Saxena et al., 2010).

Feature selection is an important component of pattern recognition (Duda et al., 2001). It can be carried out in a supervised or an unsupervised manner. When feature selection techniques use the class labels given in the data sets, the process is called supervised feature selection; feature selection without using class information is called unsupervised feature selection. For unsupervised feature selection, Mitra et al. (2010) proposed a method that partitions the original feature set into distinct subsets or clusters so that features within one cluster are highly similar while those in different clusters are dissimilar; a single feature is then selected from each cluster to form a reduced feature subset. Feature selection for clustering is discussed in Dash et al. (2000). Dy and Brodley (2000) presented a wrapper framework that performs feature selection, clustering, and order identification concurrently. Basu et al. (2000) discussed several methods for feature selection based on maximum entropy and maximum likelihood criteria, but the proposed strategy depends on the method used to estimate univariate data. Pal et al. (2000) proposed an unsupervised neuro-fuzzy feature ranking method. They used a criterion that measures the similarity between two patterns in the original feature space and in a transformed feature space, where the transformed space is obtained by multiplying each feature by a coefficient w in the interval [0,1] learned through a feed-forward neural network. After training, the features are ranked according to the values of these weights: higher values of w indicate higher importance and hence higher ranks, and the required number of features is selected using this ranking. A correlation-based approach to feature selection (CFS) is presented by Hall (2000). CFS uses the features' predictive performance and inter-correlations to guide its search for a good subset of features; experiments on discrete- and continuous-class datasets show that CFS can drastically reduce the dimensionality of datasets while maintaining or improving the performance of learning algorithms. The redundancy between two random variables X and Y is used to define a test of redundancy in (Heydon, 1971); this test can be used to eliminate redundant features without degrading classifier performance. Features that are linearly dependent on other features do not contribute to pattern classification by linear techniques, and a measure of linear dependence for detecting such features is proposed in (Das, 1971).
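As an illustration of the feature-clustering idea described above (grouping mutually similar features and keeping one representative per group), the following Python sketch uses absolute Pearson correlation as the similarity measure and a fixed similarity threshold of 0.9. Both choices, and the helper name itself, are assumptions made for illustration; they are not the similarity index or threshold used by Mitra et al.

```python
import numpy as np

def cluster_representative_features(X, n_select, threshold=0.9):
    """Toy feature-clustering selection: repeatedly pick the remaining feature
    most similar to the others as a cluster representative, then discard the
    features that are highly correlated with it. Illustrative sketch only."""
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))   # |Pearson correlation| between features
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < n_select:
        # Representative: the remaining feature with the highest average
        # similarity to the other remaining features ("centre" of a dense group).
        mean_sim = corr[np.ix_(remaining, remaining)].mean(axis=1)
        rep = remaining[int(np.argmax(mean_sim))]
        selected.append(rep)
        # Drop features that are strongly correlated with the representative.
        remaining = [f for f in remaining
                     if f != rep and corr[rep, f] < threshold]
    return selected
```

The indices returned by the sketch can be used to form the reduced dataset, e.g. `X[:, cluster_representative_features(X, 10)]`, before any clustering or classification step.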
