An Empirical Study on Initializing Centroid in K-Means Clustering for Feature Selection

Amit Saxena, John Wang, Wutiphol Sintunavarat
DOI: 10.4018/IJSSCI.2021010101

Abstract

One of the main problems in K-means clustering is the setting of initial centroids, which can cause misclustering of patterns and thereby degrade clustering accuracy. Recently, a density and distance-based technique for determining initial centroids has claimed faster convergence of clusters. Motivated by this key idea, the authors study the impact of initial centroids on clustering accuracy for unsupervised feature selection. Three metrics are used to rank the features of a data set. The centroids of the clusters in the data sets, to be applied in K-means clustering, are initialized randomly as well as by density and distance-based approaches. Extensive experiments are performed on 15 datasets. The main significance of the paper is that K-means clustering yields higher accuracies on the majority of these datasets using the proposed density and distance-based approach. As an impact of the paper, a good clustering accuracy can be achieved with fewer features, which can be useful in data mining of data sets with thousands of features.
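To make the density and distance-based seeding idea concrete, the following Python sketch shows one common way such an initializer can be built: the first centroid is the pattern with the highest local density, and each subsequent centroid is a pattern that is both dense and far from the centroids already chosen. This is a minimal illustration only; the neighbourhood radius default and the density-times-distance score are assumptions for the sketch and are not claimed to be the exact procedure evaluated in the paper.

```python
import numpy as np

def density_distance_init(X, k, radius=None):
    """Choose k initial centroids by combining each pattern's local density
    (number of neighbours within a radius) with its distance to the centroids
    already chosen, so that seeds lie in dense regions and are well separated.
    Illustrative sketch, not the paper's exact algorithm."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    if radius is None:
        radius = d.mean()                      # crude default neighbourhood size (assumption)
    density = (d < radius).sum(axis=1)         # local density of each pattern

    chosen = [int(np.argmax(density))]         # densest pattern seeds the first centroid
    while len(chosen) < k:
        nearest = d[:, chosen].min(axis=1)     # distance to the nearest chosen centroid
        score = density * nearest              # favour patterns that are dense AND far away
        score[chosen] = -np.inf                # never reselect an already chosen pattern
        chosen.append(int(np.argmax(score)))
    return X[chosen]
```

The returned array can be passed to scikit-learn as, for example, `KMeans(n_clusters=k, init=density_distance_init(X, k), n_init=1)`, whereas the random baseline corresponds to `init='random'`.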

1. Introduction

The curse of dimensionality is a major problem in large datasets. A dimension is commonly referred to as a feature, attribute, property, or simply a column of a dataset. In order to preserve as much information as possible, many irrelevant features are retained in a dataset; these features may contribute nothing when the dataset is classified to draw inferences from it, and can even add to the misclassification of patterns. A dataset with high dimensionality may also increase the time and space complexity of classifying it. More specifically, the performance of a classifier depends on several factors: (i) the number of training instances, (ii) the dimensionality, i.e., the number of features, and (iii) the complexity of the classifier (Saxena et al., 2010).

Feature selection is an important component of pattern recognition (Duda et al., 2001). It can be carried out in a supervised or an unsupervised manner. When feature selection techniques use the class labels given in the data sets, the process is called supervised feature selection; feature selection without using class information is called unsupervised feature selection. For unsupervised feature selection, Mitra et al. (2010) proposed a method that partitions the original feature set into distinct subsets or clusters so that features within one cluster are highly similar while those in different clusters are dissimilar; a single feature is then selected from each cluster to form a reduced feature subset. Feature selection for clustering is discussed in Dash et al. (2000). Dy and Brodley (2000) presented a wrapper framework that performs feature selection, clustering, and order identification concurrently. Basu et al. (2000) discussed several methods for feature selection based on maximum entropy and maximum likelihood criteria, but the proposed strategy depends on the method used to estimate univariate data. Pal et al. (2000) proposed an unsupervised neuro-fuzzy feature ranking method. They used a criterion that measures the similarity between two patterns in the original feature space and in a transformed feature space, where the transformed space is obtained by multiplying each feature by a coefficient w in the interval [0,1] learned through a feed-forward neural network. After training, the features are ranked according to the values of these weights: higher values of w indicate higher importance and hence higher ranks, and the required number of features is selected using this ranking. A correlation-based approach to feature selection (CFS) is presented by Hall (2000). CFS uses the features' predictive performance and inter-correlations to guide its search for a good subset of features; experiments on discrete- and continuous-class datasets show that CFS can drastically reduce the dimensionality of datasets while maintaining or improving the performance of learning algorithms. The redundancy between two random variables X and Y is used to define a test of redundancy in (Heydon, 1971); this test can be used to eliminate redundant features without degrading classifier performance. Features that are linearly dependent on other features do not contribute to pattern classification by linear techniques, and a measure of linear dependence for detecting such features is proposed in (Das, 1971).
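As an illustration of the feature-clustering idea described above (grouping mutually similar features and keeping one representative per group), the following Python sketch uses absolute Pearson correlation as the similarity measure and a fixed similarity threshold of 0.9. Both choices, and the helper name itself, are assumptions made for illustration; they are not the similarity index or threshold used by Mitra et al.

```python
import numpy as np

def cluster_representative_features(X, n_select, threshold=0.9):
    """Toy feature-clustering selection: repeatedly pick the remaining feature
    most similar to the others as a cluster representative, then discard the
    features that are highly correlated with it. Illustrative sketch only."""
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))   # |Pearson correlation| between features
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < n_select:
        # Representative: the remaining feature with the highest average
        # similarity to the other remaining features ("centre" of a dense group).
        mean_sim = corr[np.ix_(remaining, remaining)].mean(axis=1)
        rep = remaining[int(np.argmax(mean_sim))]
        selected.append(rep)
        # Drop features that are strongly correlated with the representative.
        remaining = [f for f in remaining
                     if f != rep and corr[rep, f] < threshold]
    return selected
```

The indices returned by the sketch can be used to form the reduced dataset, e.g. `X[:, cluster_representative_features(X, 10)]`, before any clustering or classification step.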
