Design of a Parallel and Scalable Crawler for the Hidden Web

Sonali Gupta, Komal Kumar Bhatia
Copyright: © 2022 |Pages: 23
DOI: 10.4018/IJIRR.289612

Abstract

The WWW contains a huge amount of information from different areas. This information may be present virtually in the form of web pages, media, articles (research journals/magazines), blogs, etc. A major portion of this information resides in web databases and can be retrieved only by issuing queries at the interface offered by the specific database; it is therefore called the Hidden Web. An important issue is to efficiently retrieve and provide access to this enormous amount of information through crawling. In this paper, we present the architecture of a parallel crawler for the Hidden Web that avoids download overlaps by following a domain-specific approach. The experimental results further show that the proposed parallel Hidden Web crawler (PSHWC) not only effectively but also efficiently extracts and downloads the contents of Hidden Web databases.
Article Preview

Introduction

Due to the colossal size of the WWW, search engines have become the imperative tool to search and retrieve information from it (Lawrence & Giles, 1998). The WWW can be divided into two parts: the Surface Web (or Publicly Indexable Web, PIW) and the Hidden Web. The Surface Web comprises the information on the surface of the Web, reachable by a crawler purely by following hyperlinks, whereas the contents of the Hidden Web are generally stored in web databases and are available only through search form interfaces. The Hidden Web is estimated to be significantly larger and to contain much more valuable information than the Surface Web (Bergholz & Chidlovskii, 2003; Ntoulas et al., 2005).

Accessing the contents of the Hidden Web is often time-consuming and frustrating for an average user, as he/she has to issue queries against all potentially relevant Hidden Web databases (Madhavan et al., 2008; Sharma, 2008).

For example, if a user wants to search for information about a flight, then in order to get the required information, he/she must go to an airline site and fill in the details in the search form acting as an interface to the web database. As a result, he/she gets the details of the available flights. These types of pages are often referred to as dynamic or hidden web pages.

Figure 1.

A typical search form and a dynamic (or hidden) web page


Figure 1(a) shows an example of such a search form that offers a search over the flights between two cities. Figure 1(b) shows an example of a dynamic or hidden web page.
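To make the notion of a dynamic page concrete, the following sketch shows how a crawler could programmatically fill in and submit such a search form to retrieve the hidden result page. The URL and field names are hypothetical placeholders; a real Hidden Web crawler would first parse the form's HTML to discover its action URL, method, and field names.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def query_hidden_web_form(form_action_url, field_values):
    """Submit a filled-in search form and return the dynamic result page.

    field_values maps form field names to the values the crawler assigns,
    mimicking what a human user types into the search interface.
    """
    # Encode the field assignments as a POST body, as most search forms use.
    data = urlencode(field_values).encode("utf-8")
    request = Request(form_action_url, data=data)
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical flight-search form, analogous to Figure 1(a):
# page = query_hidden_web_form(
#     "https://example-airline.com/search",
#     {"source": "Delhi", "destination": "Mumbai", "date": "2022-01-15"},
# )
```

The returned HTML corresponds to the dynamic page of Figure 1(b): it exists only as a response to this specific query and has no static URL a hyperlink-following crawler could discover.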

The crawlers employed by conventional search engines are not intelligent enough to penetrate these search interfaces and extract information from the underlying databases, and hence do not allow the search engine to build an index over it. The shift in the Web's structure from a hyperlinked graph to form-based search interfaces presents the biggest challenge to a crawler's capability.

To help users better access the Hidden Web, efforts have been made to build intelligent crawlers that can pass through these search interfaces by automatically assigning suitable values to their form fields. But visiting this huge fraction (an estimated 80% of the World Wide Web) under the specified time bounds demands efficient and intelligent crawling techniques. A single crawling process cannot scale to the hundreds of thousands of Hidden Web databases on the WWW (Cho & Garcia-Molina, 2002; Diligenti et al., 2000). An alternative to sequential crawling is to run multiple crawling processes in parallel, where each process loads a distinct page in tandem with the others. Thus, the main concern of this paper lies in developing a parallel crawler that can retrieve contents from both the Surface and the Hidden Web.

So, a parallel architecture of the Hidden Web crawler that seems to be an improved option in comparison to a single-process crawler has been proposed in this paper. It offers the following advantages:

  • The proposed Parallel Hidden Web crawler not only effectively but also efficiently extracts and downloads the contents of Hidden Web databases.

  • The proposed crawler minimizes the communication overhead needed to coordinate the parallel processes along with reduced network bandwidth consumption by adopting a domain-specific approach that overcomes the problem of heterogeneity across numerous domains.
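The low coordination overhead described above can be achieved by making page ownership a pure function of each URL's domain, so that every crawling process can decide independently which pages are its responsibility. The sketch below illustrates this generic partitioning idea under that assumption; it is not the paper's exact assignment scheme.

```python
import hashlib


def assign_to_crawler(url_domain, num_crawlers):
    """Statically map a web domain to one of num_crawlers parallel processes.

    Because the mapping is a deterministic function of the domain alone,
    every process computes the same answer independently: no inter-process
    messages are needed to decide ownership, and no two processes ever
    download pages from the same site, avoiding download overlap.
    """
    # Hash the domain so that sites spread evenly across the processes.
    digest = hashlib.md5(url_domain.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers


# Each process keeps only the URLs assigned to its own index:
# my_urls = [u for u in frontier if assign_to_crawler(domain_of(u), 4) == my_id]
```

Keeping all pages of one domain on one process also fits the domain-specific approach: each process can specialize in the form vocabularies of the databases it owns.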


System Architecture

Figure 2 presents the design of the proposed crawler in detail. To maximize the benefits, the design of the proposed Parallel and Scalable Hidden Web Crawler (PSHWC) employs a number of components that work in parallel to offer scalability.

Figure 2.

Proposed Architecture of PSHWC


The proposed Hidden Web crawler, PSHWC, consists of the following functional components.
