Design of a Parallel and Scalable Crawler for the Hidden Web

Sonali Gupta, Komal Kumar Bhatia
Copyright: © 2022 |Pages: 23
DOI: 10.4018/IJIRR.289612

Abstract

The WWW contains a huge amount of information from different areas. This information may be present virtually in the form of web pages, media, articles (research journals/magazines), blogs, etc. A major portion of this information resides in web databases and can be retrieved only by issuing queries at the interface offered by the specific database; it is therefore called the Hidden Web. An important issue is to efficiently retrieve and provide access to this enormous amount of information through crawling. In this paper, we present the architecture of a parallel crawler for the Hidden Web that avoids download overlaps by following a domain-specific approach. The experimental results further show that the proposed parallel Hidden Web crawler (PSHWC) not only effectively but also efficiently extracts and downloads the contents of Hidden Web databases.
Article Preview

Introduction

Due to the colossal size of the WWW, search engines have become the imperative tool to search and retrieve information from it (Lawrence & Giles, 1998). The WWW can be divided into two parts: the Surface Web (or Publicly Indexable Web, PIW) and the Hidden Web. The Surface Web comprises the information on the surface of the Web, reachable by a crawler purely by following hyperlinks, whereas the contents of the Hidden Web are generally stored in web databases and are available only through search form interfaces. The Hidden Web is estimated to be significantly larger and to contain much more valuable information than the Surface Web (Bergholz & Chidlovskii, 2003; Ntoulas et al., 2005).

Accessing the contents of the Hidden Web is often time-consuming and frustrating for an average user, as he/she has to issue queries against all potentially relevant Hidden Web databases (Madhavan et al., 2008; Sharma, 2008).

For example, if a user wants to search for information about a flight, then in order to get the required information, he/she must go to an airline site and fill in the details in the search form acting as an interface to the web database. As a result, he/she gets the details of the available flights. These types of pages are often referred to as dynamic or hidden web pages.

Figure 1.

A typical search form and a dynamic (or hidden) web page


Figure 1(a) shows an example of such a search form that offers a search over the flights between two cities. Figure 1(b) shows an example of a dynamic or hidden web page.
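To make the notion of a dynamic page concrete, the following sketch shows how a crawler could programmatically fill in and submit such a search form to retrieve the hidden result page. The URL and field names are hypothetical placeholders; a real Hidden Web crawler would first parse the form's HTML to discover its action URL, method, and field names.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def query_hidden_web_form(form_action_url, field_values):
    """Submit a filled-in search form and return the dynamic result page.

    field_values maps form field names to the values the crawler assigns,
    mimicking what a human user types into the search interface.
    """
    # Encode the field assignments as a POST body, as most search forms use.
    data = urlencode(field_values).encode("utf-8")
    request = Request(form_action_url, data=data)
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical flight-search form, analogous to Figure 1(a):
# page = query_hidden_web_form(
#     "https://example-airline.com/search",
#     {"source": "Delhi", "destination": "Mumbai", "date": "2022-01-15"},
# )
```

The returned HTML corresponds to the dynamic page of Figure 1(b): it exists only as a response to this specific query and has no static URL a hyperlink-following crawler could discover.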

The crawlers employed by conventional search engines are not intelligent enough to penetrate these search interfaces and extract information from the underlying databases, and hence do not allow the search engine to build an index over it. The shift in the Web's structure from a hyperlinked graph to form-based search interfaces presents the biggest challenge to a crawler's capability.

To help users better access the Hidden Web, efforts have been made to build intelligent crawlers that can pass through these search interfaces by automatically assigning suitable values to their form fields. But visiting this huge fraction (an estimated 80% of the World Wide Web) under the specified time bounds demands efficient and intelligent crawling techniques. A single crawling process cannot scale to the hundreds of thousands of Hidden Web databases on the WWW (Cho & Garcia-Molina, 2002; Diligenti et al., 2000). An alternative to sequential crawling is to run multiple crawling processes in parallel, where each process loads a distinct page in tandem with the others. Thus, the main concern of this paper lies in developing a parallel crawler that can retrieve contents from both the Surface and the Hidden Web.

So, a parallel architecture of the Hidden Web crawler that seems to be an improved option in comparison to a single-process crawler has been proposed in this paper. It offers the following advantages:

  • The proposed Parallel Hidden Web crawler not only effectively but also efficiently extracts and downloads the contents of Hidden Web databases.

  • The proposed crawler minimizes the communication overhead needed to coordinate the parallel processes along with reduced network bandwidth consumption by adopting a domain-specific approach that overcomes the problem of heterogeneity across numerous domains.
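The low coordination overhead described above can be achieved by making page ownership a pure function of each URL's domain, so that every crawling process can decide independently which pages are its responsibility. The sketch below illustrates this generic partitioning idea under that assumption; it is not the paper's exact assignment scheme.

```python
import hashlib


def assign_to_crawler(url_domain, num_crawlers):
    """Statically map a web domain to one of num_crawlers parallel processes.

    Because the mapping is a deterministic function of the domain alone,
    every process computes the same answer independently: no inter-process
    messages are needed to decide ownership, and no two processes ever
    download pages from the same site, avoiding download overlap.
    """
    # Hash the domain so that sites spread evenly across the processes.
    digest = hashlib.md5(url_domain.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers


# Each process keeps only the URLs assigned to its own index:
# my_urls = [u for u in frontier if assign_to_crawler(domain_of(u), 4) == my_id]
```

Keeping all pages of one domain on one process also fits the domain-specific approach: each process can specialize in the form vocabularies of the databases it owns.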


System Architecture

Figure 2 presents the design of the proposed crawler in detail. To maximize the benefits, the design of the proposed Parallel and Scalable Hidden Web Crawler (PSHWC) employs a number of components that work in parallel to offer scalability.

Figure 2.

Proposed Architecture of PSHWC


The proposed Hidden Web crawler, PSHWC, consists of the following functional components.
