An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering


S. Vengadeswaran, S. R. Balasundaram
Copyright: © 2018 | Pages: 16
DOI: 10.4018/IJACI.2018070102

Abstract

This article describes how the time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates the data blocks randomly across the cluster of nodes without considering any of the execution parameters. As a result, the blocks required for execution are often not available on the local machine, so the data has to be transferred across the network for execution, leading to a data locality issue. It is also commonly observed that most data-intensive applications show grouping semantics; hence, during query execution only a part of the Big-Data set is utilized. Since such execution parameters and grouping behavior are not considered, the default placement does not perform well, resulting in several lacunas such as decreased local map task execution, increased query execution time, and query latency. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, the user history log is dynamically analyzed to identify the access pattern, which is depicted as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings among the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on statistical measures estimated from the clustered graph. This in turn re-organizes the default data layouts in HDFS to achieve improved performance for Big-Data sets in a heterogeneous distributed environment. The proposed strategy was tested on a 15-node cluster placed in a single-rack topology. The results prove it to be more efficient for massive datasets, reducing query execution time by 26% and significantly improving data locality by 38% compared to HDDPS.
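The clustering step at the heart of ODPS can be illustrated with a minimal sketch of Markov clustering applied to an access co-occurrence graph. The matrix values, the expansion/inflation parameters, and the cluster-extraction details below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def markov_cluster(adjacency, expansion=2, inflation=2.0, iterations=50, tol=1e-6):
    """Basic Markov Clustering (MCL) over a weighted access-pattern graph.

    adjacency[i][j] holds how often data items i and j are requested
    together in the user history log (an assumed co-occurrence weight).
    """
    A = np.array(adjacency, dtype=float)
    A += np.eye(len(A))                            # add self-loops for stability
    M = A / A.sum(axis=0)                          # column-normalise -> stochastic matrix
    for _ in range(iterations):
        previous = M.copy()
        M = np.linalg.matrix_power(M, expansion)   # expansion step
        M = M ** inflation                         # inflation step
        M = M / M.sum(axis=0)                      # re-normalise columns
        if np.allclose(M, previous, atol=tol):
            break
    # Read clusters from the attractors: each surviving (non-zero) row
    # groups the columns (data items) it attracts.
    clusters = []
    for row in M:
        members = {int(i) for i in np.nonzero(row > tol)[0]}
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Toy co-occurrence graph: items 0-2 are queried together, as are items 3-4.
graph = [[0, 5, 4, 0, 0],
         [5, 0, 6, 0, 0],
         [4, 6, 0, 1, 0],
         [0, 0, 1, 0, 7],
         [0, 0, 0, 7, 0]]
print(markov_cluster(graph))   # expected: roughly {0, 1, 2} and {3, 4}
```

The resulting clusters would then feed a placement step such as ODPA, which decides where the blocks belonging to each group should reside; those details are covered in the full article.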

Introduction

Large volumes of data are being generated every day in a variety of domains such as social networks, health care, finance, telecom, and government sectors. The data these domains generate are voluminous (GB, TB, and PB), varied (structured, semi-structured, or unstructured), and ever increasing at an unprecedented pace (Jain & Bhatnagar, 2016; Manogaran & Lopez, 2017). Big-Data is thus the term applied to such large data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time (Bihl et al., 2016; White, 2012). Processing such large volumes of data and retrieving usable information can be a strenuous computational job; this has led to the use of Hadoop to analyze and gain insights from the data (Baumgarten et al., 2013). The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models (https://hadoop.apache.org/; Narayanapppa et al., 2016; Sammer, 2012). Local storage and computation are achieved through the two core components of Hadoop, namely the Hadoop Distributed File System (HDFS) and MapReduce (MR). HDFS is a distributed file system designed for storing massive data reliably and streaming it at high bandwidth (Shvachko et al., 2010). By optimizing the storage and computation of HDFS, queries can be answered sooner.
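As a rough illustration of the "simple programming models" mentioned above, the sketch below shows a word-count job written in the MapReduce style and runnable through Hadoop Streaming, which pipes HDFS input through stdin/stdout. The file name, paths, and invocation are assumptions for illustration only.

```python
#!/usr/bin/env python3
# wordcount.py -- minimal MapReduce-style word count for Hadoop Streaming.
# Illustrative invocation (jar path and HDFS paths are assumptions):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#       -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#       -input /data/logs -output /data/wordcount-out
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")                 # emit (word, 1)

def reducer():
    # Hadoop sorts mapper output by key before the reduce phase,
    # so identical words arrive as consecutive lines.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```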

Hadoop follows a master-slave architecture with one Name-Node and multiple Data-Nodes. Whenever a file is pushed into HDFS for storage, it is split into a number of data blocks of the desired size, which are placed randomly across the available Data-Nodes. When a query is executed, meta-information about the location of the required blocks is obtained from the Name-Node, and the query is then executed on the Data-Nodes where those blocks reside. The most important feature of Hadoop is this movement of the computation to the data rather than the other way around (Dean & Ghemawat, 2008). Hence the position of data across the Data-Nodes plays a significant role in efficient query processing. We therefore focus on finding an innovative data placement strategy so that queries are solved in the earliest possible time, enabling quick decisions as well as maximum utilization of resources. The real value of analyzing Big-Data lies in accelerating the time-to-answer, especially for streaming data where an immediate response is much desired for taking better decisions (Lee et al., 2014; Wang et al., 2014; Yuan et al., 2010).
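To make the data-locality concern concrete, the following toy simulation (not the paper's code) mimics HDFS-style random block placement and measures how many of the blocks needed by a query are local to any single node. The block count, replication factor, and locality measure are assumptions chosen purely for illustration.

```python
import random

random.seed(7)

DATA_NODES = [f"dn{i}" for i in range(1, 16)]     # 15-node cluster, as in the paper
FILE_BLOCKS = [f"blk_{i}" for i in range(64)]     # assumed block count for one file
REPLICATION = 3                                    # HDFS default replication factor

# Default-style placement: each block's replicas land on random Data-Nodes.
placement = {blk: random.sample(DATA_NODES, REPLICATION) for blk in FILE_BLOCKS}

# A query that touches only a subset of the blocks (grouping semantics).
query_blocks = FILE_BLOCKS[:16]

# A simple locality measure: for each candidate execution node, the fraction
# of required blocks that already have a replica on that node.
locality = {
    node: sum(node in placement[blk] for blk in query_blocks) / len(query_blocks)
    for node in DATA_NODES
}
best_node, best_ratio = max(locality.items(), key=lambda kv: kv[1])
print(f"best node {best_node}: {best_ratio:.0%} of required blocks are local")
```

Under random placement the best achievable ratio is typically low, which is why the rest of the blocks must be fetched over the network; grouping-aware placement aims to raise this fraction.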

Figure 1. Need for an optimal data placement - illustration

The time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. The complexity of query execution is influenced by the volume of data, the amount of data requested by the query, the type of data, the complexity of the data, and so on. The wait time can range from minutes to hours, and in some worst cases to days or weeks. One of the major reasons for slow execution is the non-availability of the required blocks on the local machine, so that the data has to be transferred across the network for execution, leading to increased execution time as shown in Figure 1.
