Efficient Identification of Similar XML Fragments Based on Tree Edit Distance

Hongzhi Wang, Jianzhong Li, Fei Li

Source Title: XML Data Mining: Models, Methods, and Applications

DOI: 10.4018/978-1-61350-356-0.ch004

Abstract

Similarity detection between large sets of XML fragments is widely used in applications such as data integration and XML de-duplication. Many methods have been proposed to find similar XML fragments; among them, the state-of-the-art pq-gram method offers relatively high join quality and efficiency. In this chapter, we propose pq-hash as an improvement on pq-grams. pq-hash is built on pq-array, a randomized data structure that represents large trees as small, fixed-size arrays. To perform similarity joins on XML fragment sets efficiently, we also propose a cluster-based partition strategy and a sort-merge and hash join strategy that avoid a nested loop join. Both our theoretical analysis and our experimental results confirm that pq-hash is much more efficient than pq-grams while retaining high join quality, and that our strategies for approximate join are effective.

Chapter Preview

Introduction

Thanks to its ability to represent data from a wide variety of sources, XML has rapidly emerged as the new standard for data representation and exchange on the Internet. Because XML is flexible, data in autonomous sources that represent the same real-world object may not be exactly the same. Similarity detection techniques are therefore often applied to find XML fragments that represent the same real-world object. In this chapter, we use a real-world application in the Municipality of Bozen to illustrate techniques for detecting similar XML fragments. The GIS office of the municipality maintains maps of the city area and would like to enrich these maps with information retrieved from various databases of the municipality and of external institutions. Residential addresses play a pivotal role in this process, since they are used to access and link the relevant information. However, an exact join on street names would yield poor results, because street names differ across databases due to, e.g., spelling mistakes, different naming conventions, and renamed streets that are not always updated in all databases. Moreover, in the bilingual region of Bozen, each street typically has two names, which are often used interchangeably. Since these data can be modeled as ordered, labeled trees, methods for detecting similar XML fragments can be used to match the data representing the same real-world object.

A widely used approach to evaluating the similarity between XML documents is to compute their edit distance (Cobena, Abiteboul, & Marian, 2002; Guha, Jagadish, Koudas, Srivastava, & Yu, 2002; Lee, Choy, & Cho, 2004). Since XML documents are often modeled as ordered, labeled trees, the tree edit distance between two such trees is defined as the minimum number of node insertions, deletions, and relabelings needed to transform one tree into the other (Tai, 1979). It is well known that the edit distance is a good measure of similarity but is computationally expensive. Many algorithms (Zhang & Shasha, 1989; Klein, 1998; Chen, 2001; Demaine, Mozes, Rossman, & Weimann, 2007) have been proposed to compute the tree edit distance more efficiently. However, all of them require at least O(n³) time, where n is the tree size, and hence they do not scale to large trees. Since it is hard to improve efficiency fundamentally by optimizing the tree edit distance algorithms alone, transformation-based methods are often adopted: trees are transformed into other data structures whose similarity is easier to evaluate.

An example of tree edit distance computation is shown in Figure 1. To transform the XML fragment T into the XML fragment T₃, the following operations are applied: insertion of node i, relabeling of node c to x, and deletion of node g. Since this is a minimum-cost transformation sequence, the tree edit distance between T and T₃ is 3.

Figure 1.

Example of tree edit distance computation: the left-most tree is transformed into the right-most tree
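
To make the definition concrete, the following minimal Python sketch computes the tree edit distance with unit costs, using the plain recursive formulation over ordered forests with memoization. It is not one of the optimized algorithms cited above and is only practical for small trees. The tuple-based tree encoding, the helper names `tree`, `forest_distance`, and `tree_edit_distance`, and the two example trees are assumptions made for illustration; the trees are not those of Figure 1.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an ordered, labeled tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        # Only insertions remain: insert the root of the rightmost tree of g.
        _, g_kids = g[-1]
        return 1 + forest_distance(f, g[:-1] + g_kids)
    if not g:
        # Only deletions remain: delete the root of the rightmost tree of f.
        _, f_kids = f[-1]
        return 1 + forest_distance(f[:-1] + f_kids, g)

    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = 1 + forest_distance(f[:-1] + f_kids, g)
    # Insert the rightmost root of g; its children take its place.
    insert = 1 + forest_distance(f, g[:-1] + g_kids)
    # Map the two rightmost roots onto each other (relabel if labels differ).
    match = (
        (0 if f_label == g_label else 1)
        + forest_distance(f_kids, g_kids)
        + forest_distance(f[:-1], g[:-1])
    )
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    """Tree edit distance between two trees built with tree()."""
    return forest_distance((t1,), (t2,))


# Hypothetical example: insert node i above b, relabel c to x, delete d.
t = tree("a", tree("b"), tree("c", tree("d")))
t_edited = tree("a", tree("i", tree("b")), tree("x"))
print(tree_edit_distance(t, t_edited))  # 3
```

As in the example of the text, one insertion, one relabeling, and one deletion are needed, and no cheaper sequence exists for these two trees.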

The pq-gram method is an effective and efficient transformation-based method for detecting similar trees (Augsten, Böhlen, & Gamper, 2005). In this method, each tree is split into a bag of small subtrees, called pq-grams. The pq-distance between two such bags is used to approximate the distance between the corresponding trees. Both theoretical analysis and experimental results confirm the detection quality of pq-grams. An “optimized join” (Augsten, Böhlen, Dyreson, & Gamper, 2008) was also presented, which accelerates the detection process by taking advantage of the diversity of the trees in a forest. However, this optimized join cannot always improve efficiency. In many cases, for example when a large fraction of the tree pairs in the Cartesian product of two tree sets are highly similar, the optimized join is nearly as slow as the nested loop join.
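
The idea of splitting trees into bags of small subtrees and comparing the bags can be sketched in Python as follows. This is a minimal illustration based on the usual description of pq-grams (a stem of p ancestors plus a base of q consecutive children, padded with a dummy label), not the implementation evaluated in the chapter. The padding label `*`, the defaults p = 2 and q = 3, the normalized form of the distance, the threshold parameter `tau`, and all function names are assumptions. The join shown is the naive nested loop join that the text refers to as the baseline.

```python
from collections import Counter

NULL = "*"  # assumed dummy label used to pad stems and bases


def tree(label, *children):
    """Build an ordered, labeled tree as a (label, children) tuple."""
    return (label, tuple(children))


def pq_profile(root, p=2, q=3):
    """Return the bag (Counter) of pq-grams of a tree as label tuples."""
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        stem = ancestors[1:] + (label,)  # the node and its p-1 ancestors
        base = (NULL,) * q               # sliding window over the children
        if not children:
            profile[stem + base] += 1    # a leaf gets one all-null base
            return
        for child in children:
            base = base[1:] + (child[0],)
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):           # slide the window past the last child
            base = base[1:] + (NULL,)
            profile[stem + base] += 1

    visit(root, (NULL,) * p)
    return profile


def pq_distance(t1, t2, p=2, q=3):
    """Normalized bag distance in [0, 1]; 0 means identical profiles."""
    bag1, bag2 = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((bag1 & bag2).values())  # bag intersection
    total = sum(bag1.values()) + sum(bag2.values())
    return 1.0 - 2.0 * shared / total


def nested_loop_join(forest1, forest2, tau, p=2, q=3):
    """Compare every pair of trees; keep index pairs with distance <= tau."""
    return [
        (i, j)
        for i, t1 in enumerate(forest1)
        for j, t2 in enumerate(forest2)
        if pq_distance(t1, t2, p, q) <= tau
    ]


t1 = tree("a", tree("b"), tree("c", tree("d")))
t2 = tree("a", tree("b"), tree("x", tree("d")))
print(pq_distance(t1, t1))  # 0.0
print(pq_distance(t1, t2))  # strictly between 0 and 1
```

The nested loop join evaluates the distance for every pair in the Cartesian product of the two forests, which is the cost that the join strategies discussed in this chapter aim to avoid.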
