A New Similarity Metric for Sequential Data

Pradeep Kumar, Bapi S. Raju, P. Radha Krishna
DOI: 10.4018/978-1-61350-474-1.ch014

Abstract

In many data mining applications, both classification and clustering algorithms require a distance or similarity measure. The central problem in similarity-based clustering and classification of sequential data is choosing an appropriate similarity metric. Existing metrics such as Euclidean, Jaccard, and Cosine do not explicitly exploit the sequential nature of the data. In this chapter, the authors propose a similarity-preserving function called the Sequence and Set Similarity Measure (S3M), which captures both the order of occurrence of items in sequences and the constituent items of sequences. The authors demonstrate the usefulness of the proposed measure for classification and clustering tasks. Experiments were conducted on two benchmark datasets, DARPA’98 and msnbc, for a classification task in intrusion detection and a clustering task in web mining, respectively. Results show the usefulness of the proposed measure.
Chapter Preview

Sequence Similarity

A sequence is made up of a set of items that occur in time or one after another, that is, in a fixed position that is not necessarily tied to time. In other words, a sequence is an ordered set of items. A sequence is denoted as follows:

S = <a1, a2, …, an>

where a1, a2, …, an are the item sets in sequence S. Sequence S contains n elements, or ordered item sets. The sequence length is the number of item sets contained in the sequence; it is denoted |S|, and here |S| = n.

Formally, similarity is a nonnegative real-valued function S defined on the Cartesian product X × X of a set X. It is called a metric on X if, for every x, y ∈ X, the following properties are satisfied by S.
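To make the sequence notation concrete, the following minimal Python sketch represents a sequence as a tuple of items and contrasts a purely set-based similarity (Jaccard) with an order-aware one based on the longest common subsequence. It is an illustration only, not the chapter's definition of S3M, which this preview does not show. The function names, the weight `p`, and the choice to blend the two scores linearly are all assumptions made for the example, and each element is a single item rather than an item set.

```python
from typing import Hashable, Sequence


def jaccard(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Set similarity: ignores the order and repetition of items."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # convention chosen here: two empty sequences are identical
    return len(sa & sb) / len(sa | sb)


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence: sensitive to item order."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def combined_similarity(
    a: Sequence[Hashable], b: Sequence[Hashable], p: float = 0.5
) -> float:
    """Hypothetical blend of an order-aware score and a set-based score.

    p weights the order-aware part and (1 - p) the set-based part.
    This weighting scheme is an assumption made for illustration.
    """
    if not a and not b:
        return 1.0
    order_part = lcs_length(a, b) / max(len(a), len(b))
    return p * order_part + (1 - p) * jaccard(a, b)


s1 = ("a", "b", "c", "d")   # |s1| = 4
s2 = ("d", "c", "b", "a")   # same items, reversed order

print(len(s1))                      # sequence length |S|
print(jaccard(s1, s2))              # set view: the sequences look identical
print(lcs_length(s1, s2))           # order view: only one item lines up
print(combined_similarity(s1, s2))  # blended score
```

The two example sequences contain exactly the same items, so a set-based measure cannot tell them apart, while the order-aware score drops sharply because the items appear in reverse order. A measure that accounts for both aspects falls between the two.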
