An Algorithm for Multi-Domain Website Classification


Mohammad Aman Ullah, Anika Tahrin, Sumaiya Marjan
DOI: 10.4018/IJWLTT.2020100104

Abstract

The web is the largest worldwide communication system of computers, comprising local, academic, commercial, and government sites. As the number of website types increases, manual classification becomes costly and cumbersome and cannot satisfy growing internet service demands; automated classification has therefore become important for better and more accurate search engine results. This research proposes an algorithm for classifying different websites automatically using randomly collected textual data from their webpages. It also contributes ten dictionaries covering different domains, which are used as training data in the classification process. Finally, classification was carried out with both the proposed algorithm and Naïve Bayes, and the proposed algorithm outperformed Naïve Bayes in accuracy by 1.25%. This research suggests that the proposed algorithm could be applied to any number of domains if the related dictionaries are available.

Introduction

A website is an assortment of web content, images, videos, or other digital assets hosted on one or more web servers and usually accessible via the Internet. Websites are frequently devoted to a particular subject, ranging from entertainment and social networking to news and education. With the growth in the number and variety of sites, the need for website classification gains traction (Wang et al., 2010). Website classification is a very challenging task and requires human expertise if done manually. The labor cost of such standard classification is also becoming progressively high, and the classification itself increasingly troublesome (Deng, 2012). To overcome the usual classification problems of websites, researchers have applied many machine learning algorithms, such as Naïve Bayes, support vector machines, and random forests. In most of these works, classic algorithms were used for classification, and only a single domain was classified.
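The paper does not reproduce the baseline implementation, but the Naïve Bayes classifier it compares against can be sketched minimally as follows. The training samples, word lists, and labels below are purely illustrative, not the authors' data:

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokenized page text, domain label).
TRAIN = [
    ("recipe chef food kitchen menu".split(), "food"),
    ("course exam student lecture campus".split(), "education"),
    ("flight hotel booking tour destination".split(), "travel"),
]

def train_nb(samples):
    """Collect per-label word frequencies, label counts, and the vocabulary."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for words, label in samples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(words, word_counts, label_counts, vocab):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + sum of smoothed log likelihoods
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

wc, lc, vocab = train_nb(TRAIN)
print(classify("student lecture food".split(), wc, lc, vocab))  # → education
```

Laplace smoothing keeps words unseen for a label from zeroing out its probability, which matters for short, randomly sampled webpage text.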

This research has proposed an algorithm for classifying different websites automatically using randomly collected textual data from the web pages. It also contributed ten dictionaries covering different domains, which were used as training data in the classification process. The classification was done with both the proposed algorithm and Naïve Bayes, and the proposed algorithm was found to outperform Naïve Bayes in accuracy by 1.25%. This study suggests that the proposed algorithm could be applied to any number of domains provided that the related dictionaries are available.
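The exact scoring rule of the proposed algorithm is not given in this preview, so the following is only a plausible sketch of dictionary-based domain classification: score each domain by how many page words appear in its dictionary. The word lists and the size normalization are assumptions for illustration, not the authors' published method:

```python
# Hypothetical per-domain dictionaries (the paper's ten dictionaries are
# not reproduced here; these word lists are illustrative only).
DICTIONARIES = {
    "food":      {"recipe", "chef", "menu", "restaurant", "kitchen"},
    "education": {"course", "exam", "student", "lecture", "campus"},
    "shopping":  {"cart", "checkout", "discount", "price", "order"},
}

def classify_page(words):
    """Assign the domain whose dictionary matches the most page words.

    Scores are normalized by dictionary size so that larger dictionaries
    do not dominate -- an assumed design choice, not taken from the paper.
    """
    scores = {
        domain: sum(w in vocab for w in words) / len(vocab)
        for domain, vocab in DICTIONARIES.items()
    }
    return max(scores, key=scores.get)

print(classify_page("recipe chef discount menu".split()))  # → food
```

Under this scheme, extending the classifier to a new domain only requires supplying one more dictionary, which matches the paper's claim that the approach scales to any number of domains.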

Therefore, the contributions of this research are:

  1. Proposal of an algorithm to classify websites of different domains, such as food, business, education, shopping, travel, and social media;

  2. Creation of different dictionaries to characterize the said domains;

  3. Improvement of the accuracy of web search.

This paper is structured as follows: section 2 includes a narrative of related works; section 3 presents the problem statement. In section 4, the description of the methodology is provided. Section 5 contains the details of data collection and preprocessing. Section 6 describes the experiments and the proposed algorithm. Section 7 presents the experimental results and analysis. The comparison is discussed in section 8. Finally, in section 9, conclusions and future work directions are discussed.

Related Works

Most of the work done so far emphasized classification using classic classifiers and classified at most two to three domains. Patil et al. (2012) applied a Naïve Bayes algorithm to categorize websites using the content of their homepages; according to them, web pages could be classified into a more specific category using different feature sets. Roul et al. (2014) classified web documents using an association mining technique: classification was based on frequent itemsets created by the Frequent Pattern (FP) Growth algorithm, with final classification performed on the feature set by a Naïve Bayes classifier. Slamet et al. (2018) proposed a simple web scraping method for finding job vacancies from a search engine using a Naïve Bayes classifier. Klassen et al. (2010) worked on web document classification by keywords using random forests; their experiments showed that increasing the number of domains reduces the accuracy of the classifier.
