Generic Subsequence Matching Framework:
Modularity, Flexibility, Eﬃciency
David Novak, Petr Volny, and Pavel Zezula
Masaryk University, Brno, Czech Republic
{david.novak,xvolny1,zezula}@fi.muni.cz
Abstract. Subsequence matching has appeared to be an ideal approach
for solving many problems related to the ﬁelds of data mining and similarity
retrieval. It has been shown that almost any data class (audio,
image, biometrics, signals) is or can be represented by some kind of time
series or string of symbols, which can be seen as an input for various subsequence
matching approaches. The variety of data types, speciﬁc tasks
and their solutions is so wide that their proper comparison and combination
suitable for a particular task might be very complicated and
time-consuming. In this work, we present a new generic Subsequence
Matching Framework (SMF) that tries to overcome the aforementioned
problem by a uniform frame that simpliﬁes and speeds up the design, development
and evaluation of subsequence matching related systems. We
identify several relatively separate subtasks solved diﬀerently over the literature
and SMF enables to combine them in a straightforward manner
achieving new quality and eﬃciency. The strictly modular architecture
and openness of SMF enables also involvement of eﬃcient solutions from
diﬀerent ﬁelds, for instance advanced metric-based indexes.
1 Introduction
A large fraction of the data being produced in current digital era is in the form
of time series or can be transformed into sequences of numbers. This concept
is very natural and ubiquitous: audio signals, various biometric data, image
features, economic data, etc. are often viewed as time series and need to be also
organized and searched in this way.
One of the key research issues drawing a lot of attention during the last two
decades is the subsequence matching problem, which can be basically formulated
as follows: Given a query sequence, ﬁnd the best-matching subsequence from
the sequences in the database. Depending on the speciﬁc data and application,
this general problem has many variants – query sequences of ﬁxed or variable
size, data-speciﬁc deﬁnition of sequence matching, requirement of dynamic time
warping, etc. Therefore, the eﬀort in this research area resulted in many approaches
and techniques – both, very general and those focusing on a speciﬁc
fragment of this complex problem.
The leading authors in this ﬁeld identiﬁed two main problems that limit the
comparability and cooperation potential of various approaches: the data bias (algorithms
are often evaluated on heterogeneous datasets) and the implementation
S.W. Liddle et al. (Eds.): DEXA 2012, Part II, LNCS 7447, pp. 256–265, 2012.
c Springer-Verlag Berlin Heidelberg 2012
General Subsequence Matching Framework 257
bias (the implementation of the speciﬁc technique can strongly inﬂuence experiment
results) [1]. The eﬀort to overcome the data bias is expressed by founding
a common set of data collections [2] which is publicly available and that should
be used by any consequent research in this area. However, the implementation
bias lingers, which also obstructs a straightforward combination of compatible
approaches whose interconnection could be fruitful.
Analysis of this situation brought us to conclusion that there is a need for
a uniﬁed environment for developing, prototyping, testing, and combination of
subsequence matching approaches. After a brief overview and analysis of current
state of the ﬁeld (Section 2), we propose a generic subsequence matching framework
(SMF) in Section 3. Section 4 contains a detailed example of design and
realization of a subsequence matching algorithm with the aid of SMF. The paper
concludes in Section 5 by future research directions that cover possible performance
boost enabled by a straightforward cooperation of SMF with advanced
distance-based indexing and searching techniques [3,4]. Due to space limitations,
an extended version of this work is available as a technical report [5].
2 Subsequence Matching Approaches
The ﬁeld opening paper by Faloutsos et al. [6] introduced a subsequence matching
application model that has been used ever since only with smaller modiﬁcations.
The model can be summarized in the following four steps that should be adopted
by a subsequence matching application:
slicing of the time series sequences (both data and query) into shorter subsequences
(of a ﬁxed length),
transforming each subsequence into lower dimension,
indexing the subsequences in a multi-dimensional index structure,
searching in the index with a distance measure that obeys the lower bounding
lemma on the transformed data.
Originally [6], this approach was demonstrated on a subsequence matching algorithm
that used the sliding window approach to slice the indexed data and
disjoint window for the query. The Discrete Fourier Transformation (DFT) was
used for dimensionality reduction and the data was indexed using the minimum
bounding rectangles in R-Tree [7]. The Euclidean distance was used for searching
since it satisﬁes the lower bounding lemma on data transformed by DFT.
The data representation and the choice of distance function are fundamental
questions for each speciﬁc application. Current approaches regarding these
questions were thoroughly overviewed and analyzed [5] with a conclusion that
the questions of data representation and distance function can be practically
separate from the speciﬁc subsequence matching algorithm; it is important to
pair the data representation and the distance function wisely in order to satisfy
the lower bounding lemma [6].
The work by Faloutsos et al. [6] encouraged many following works. Moon
et al. [8] suggested a dual approach for slicing and indexing sequences. This
258 D. Novak, P. Volny, and P. Zezula
DualMatch uses the sliding windows for queries and disjoint windows for data
sequences to reduce the number of windows that are indexed. DualMatch was
followed by the generalization of windows creation method called GeneralMatch
[9]. Another signiﬁcant leap forward was made by the eﬀort of Keogh et al. in
their work about exact indexing of Dynamic Time Warping [10]. They introduced
a similarity measure that is relatively easy to compute and it lower-bounds the
expensive DTW function. This approach was further enhanced by improving I/O
part of the subsequence matching process using Deferred Group Subsequence
Retrieval introduced in [11].
If we focus on the performance side of the system, we have to employ enhancements
like indexing, lower bounding, window size optimization, reducing I/O
operations or approximate queries. Lots of approaches for building subsequence
matching applications often use the very same techniques for solving common
sub-tasks included in the whole retrieval process and changes only some parts
with some novel approach. As a result, the same parts of the process (like DFT
or DWT) are implemented repeatedly which leads to the phenomenon of the
implementation bias [1]. The modern subsequence matching approaches [11,10]
employ many smaller tasks in the retrieval process that solve sub-problems like
optimizing I/O operations. Implementations of routines that solve such subproblems
should be reusable and employable in similar approaches. This led
us to think about the whole subsequence matching process as a chain of subtasks,
each solving a small part of the problem. We have observed that many of
the published approaches ﬁt into this model and their novelty is often only in
reordering, changing or adding new subtask implementation into the chain.
3 Subsequence Matching Framework
In this section, we describe the general Subsequence Matching Framework (SMF)
that is currently available under GPL license at http://muﬁn.ﬁ.muni.cz/smf/.
The framework can be perceived on the following two levels:
– on the conceptual level, the framework is composed of mutually cooperating
modules, each of which solves a speciﬁc sub-task, and these modules are
cooperating within speciﬁc subsequence matching algorithms;
– on the implementation level, SMF deﬁnes the functionality of individual
module types and their communication interfaces; a subsequence matching
algorithm is implemented as a skeleton that combines module types in a speciﬁc
way and these can be instantiated by actual module implementations.
In Section 3.1, we describe the common sub-problems (sub-tasks) that we identiﬁed
in the ﬁeld and we deﬁne corresponding module types (conceptual level).
Further, in Section 3.2, we justify our approach by describing fundamental subsequence
algorithms in terms of SMF modules and we present a straightforward
implementation of these algorithms within SMF. Section 3.3 is devoted to details
about implementation of the framework.
General Subsequence Matching Framework 259
Table 1. Notation used throughout this paper
Symbol Deﬁnition
S[k] the k-th value of the sequence S
S[i : j] subsequence of S from S[i] to S[j], inclusive
S.len the length of sequence S
S.id the unique identiﬁer of sequence S
S .pid if S is subsequence of S then S .pid = S.id
S .oﬀset if S = S[i : j] then S .oﬀset = i and S .len = j − i + 1
D(Q, S) distance between two sequences Q and S
The key term in the whole framework is, naturally, a sequence. As we want
to keep the framework as general as possible, we do not lay practically any restrictions
on the components of the sequence – it can be integers, real numbers,
vectors of numbers, or any more sophisticated structures. The sequence similarity
functions are deﬁned relatively independently of speciﬁc sequence type (see
Section 3.3). In the following, we will use the notation summarized in Table 1.
3.1 Common Sub-problems: Modules in SMF
Studying the ﬁeld of subsequence matching, we identiﬁed several common subproblems
addressed by a number of approaches in some sense. Speciﬁcally, we
can see the following sub-tasks that correspond to individual modules in SMF.
Data Representation (Module: Data Transformer). The raw data sequences
entering an application are often transformed into other representation
which can be motivated either by simple dimensionality reduction (DFT, DWT,
SVD, PAA) [6,12,13,14] or also by extracting some important characteristics
that should improve the eﬀectiveness of the retrieval [15]. In either case, the
general task can be deﬁned simply as follows: Transform given sequence S into
another sequence S . We will use the symbol in Figure 1 (a) for this data transformer
module. The following table summarizes information about this module
and gives a few examples of speciﬁc approaches implementing this functionality.
data transformer transform sequence S into sequence S
DFT apply the DFT on sequence of real numbers S [6]
PAA apply the PAA on sequence of real numbers S [14]
Landmarks extract landmarks from sequence S [15]
Windows and Subsequences (Module: Slicer). Majority of the subsequence
matching approaches partitions the data and/or query sequences into
subsequences of, typically, ﬁxed length (windows) [6,8,9,11]. Again, this task can
be isolated, well deﬁned, and the implementation can be reused in many variants
of subsequence matching algorithms. Partitioning a sequence S, each resulting
subsequence S = S[i : j] has S .pid = S.id, S .oﬀset = i, and S .len = j − i + 1.
260 D. Novak, P. Volny, and P. Zezula
Fig. 1. Types of SMF modules and their notation
The module will be denoted as in Figure 1 (b) and its description and speciﬁc
examples are as follows:
sequence slicer partition S into list of subsequences S1, . . . , Sn
disjoint slicer partition S disjointly into subsequences of length w [6]
sliding slicer use sliding window of size w to partition S [6]
Sequence Distances (Module: Distance Function). There is a high number
of speciﬁc distance functions D that can be evaluated between two sequences
S and T . The intention of SMF is to partially separate the distance functions
from the data and to use the speciﬁc distance function as a parameter of the
algorithm (see Section 3.3 for details on realization of this independence). Of
course, it is the matter of conﬁguration to use appropriate function for respective
data type, e.g. to preserve the lower bounding property. The distance functions
symbol is in Figure 1 (c) and it can be summarized as follows:
distance function evaluate dissimilarity of sequences S, T
Lp metrics evaluate distance Lp on equally long number sequences
DTW use DTW on any pair of number sequences S, T [16]
ERP calculate Edit distance with Real Penalty on S, T [17]
LB PAA, LB Keogh measures which lower-bound the DTW [10]
Eﬃcient Indexing (Module: Distance Index). An eﬃcient subsequencematching
algorithm typically employs an index to eﬃciently evaluate distancebased
queries on the stored (sub-)sequences using the query-by-example paradigm
(QBE). Again, we see the choice of the speciﬁc index as a relatively separate component
of the whole algorithm and thus as an exchangeable module. Also, we see
a space for improvement in boosting the eﬃciency of this component in future.
We denote this module as in Figure 1 (d):
distance index evaluate eﬃciently distance-based QBE queries
R-Tree family index sequences as n-dimensional spatial data
iSAX tree use a symbolic representation of the sequences [18,19]
metric indexes index and search the data according to mutual distances [3]
General Subsequence Matching Framework 261
Fig. 2. Schema of the fundamental subsequence matching algorithm [6]
Eﬃcient Aligning (Module: Sequence Storage). The approaches that
use sequence slicing typically also need to store the original whole sequences.
The slice index (for window size w) returns a set of candidate subsequences S ,
S .len = w each matching some query subsequence Q such that Q .len = w. If
the query sequence Q is actually longer than w, the subsequent task is to align Q
to corresponding subsequence S[i : (i+Q.len −1)] where i = S .oﬀset −Q .oﬀset
and S.id = S .pid. To do this aligning for each S in the candidate set may be
very demanding. For smaller datasets, this can be done in memory with no special
treatment, but more advanced approaches are proﬁtable on disk [11]. We
will call this module sequence storage (Figure 1 (e)) and it is speciﬁed as follows:
sequence storage store sequences S and return S[i : j] for given S.id
hash map basic hash map evaluating queries one by one
deferred retrieval deferred group sequence retrieval (I/O eﬃcient) [11]
3.2 Subsequence Matching Strategies in SMF
Staying at the conceptual level, let us have a look at the whole subsequence
matching algorithms and their composition from individual modules introduced
above. As an example, we take again the fundamental algorithm [6] for general
subsequence matching of queries Q, Q.len ≥ w for an established window size
w. The schema of a slight modiﬁcation of this algorithm is in Figure 2. The
solid lines correspond to data insertion and the dash lines (with italic labels)
correspond to the query processing.
A data sequence S is ﬁrst partitioned by the sliding window approach (sliding
slicer module) into slices S = S[i : (i + w − 1)], these are transformed
by Discrete Fourier Transformation (data transformer module DFT), and the
Minimum Bounding Rectangles (MBR) of these transformed slices are stored
in an R∗
-tree storage (distance index module); the original sequences S is also
stored (whole sequence storage module). Processing a subsequence query, the
query sequence Q is partitioned using the disjoint slicer module, each slice Q
262 D. Novak, P. Volny, and P. Zezula
(Q .len = w) is transformed by DFT and it is searched within the slice index (using
L2 distance or a simple binary function intersect). For each of the returned
candidate subsequences S , a query-corresponding alignment S[i : (i+Q.len−1)]
is retrieved from the whole storage (see above for details) and the candidate set
is reﬁned using L2 distance D(Q, S[i : (i + Q.len − 1)]).
Preserving the skeleton of an algorithm (module types and their cooperation),
one can substitute individual modules with other compatible modules obtaining
a diﬀerent processing eﬃciency or even a fundamentally diﬀerent algorithm. For
instance, swapping the sliding and disjoint slicer modules practically results in
the DualMatch approach [8].
3.3 Implementation
The SMF was not implemented from scratch but with the aid of framework
MESSIF [20]. The MESSIF is a collection of Java packages supporting mainly
development of metric-based search approaches. SMF uses especially the following
MESSIF functionality:
– encapsulation of the concept of data objects and distances,
– implementation of the queries and query evaluation process,
– distance based indexes (building, querying),
– conﬁguration and management of the algorithm via text conﬁg ﬁles.
The sequence is in SMF handled very generally; it is deﬁned as an interface which
requires that each speciﬁc sequence type (e.g. a ﬂoat sequence) must, among
other, specify the distance between two sequence components d(S[i], S [j]). For
number sequences, this distances could be, naturally, absolute value of diﬀerences
d(S[i], S [j]) = |S[i]−S [j]|, but one can imagine complex sequence components,
for instance vectors where d could be an Lp metric. Implementation of a sequence
distance D(S, S ) (for instance, DTW) then treats S and S only as general
sequences that use the component distance d and, thus, this implementation can
be independent of speciﬁc sequence type.
4 Example of Subsequence Algorithm with SMF
An algorithm is within SMF implemented as a skeleton – module types, their
connections via speciﬁed interfaces, and all algorithm-speciﬁc operations. The
algorithm is then conﬁgured and instantiated by a text conﬁguration ﬁle – module
types required by the algorithm skeleton are ﬁlled by speciﬁc modules. Let
us describe this principle on an example of a simple algorithm for general subsequence
matching with variable query length – see Figure 3 for this skeleton
schema. It uses two slicer modules, one distance index with a distance function,
and a sequence storage for the whole sequences (again, with a distance function).
In order to run this algorithm, we have to instantiate these module types with
speciﬁc modules. Figure 4 shows the key part of a SMF conﬁguration ﬁle that
starts such algorithm. On the ﬁrst two lines, the sliding slicer module is deﬁned
General Subsequence Matching Framework 263
Fig. 3. Skeleton of a simple VariableQueryAlgorithm algorithm
slidingSlicer = namedInstanceAdd
slidingSlicer.param.1 = smf.modules.slicer.SlidingSlicer(<w>)
disjointSlicer = namedInstanceAdd
disjointSlicer.param.1 = smf.modules.slicer.DisjointSlicer(<w>)
index = namedInstanceAdd
index.param.1 = smf.modules.index.ApproxAlgorithmDistanceIndex(mIndex)
seqStorage = namedInstanceAdd
seqStorage.param.1 = smf.modules.seqstorage.MemorySequenceStorage()
startSearchAlg = algorithmStart
startSearchAlg.param.1 = smf.algorithms.VariableQueryAlgorithm
startSearchAlg.param.2 = smf.sequence.impl.SequenceFloatL2
startSearchAlg.param.3 = seqStorage
startSearchAlg.param.4 = index
startSearchAlg.param.5 = slidingSlicer
startSearchAlg.param.6 = disjointSlicer
startSearchAlg.param.7 = <w>
Fig. 4. SMF conﬁguration ﬁle for VariableQueryAlgorithm algorithm
Fig. 5. Demonstration of VariableQueryAlgorithm: http://muﬁn.ﬁ.muni.cz/subseq/
264 D. Novak, P. Volny, and P. Zezula
by an action called namedInstanceAddwhich creates an instance slidingSlicer
of class smf.modules.slicer.SlidingSlicer (with parameter w); the disjoint
slicer is created accordingly. Then, the instance of distance index is created; it
is a self-standing algoritm, namely a metric index M-Index [4] (we assume that
the instance mIndex has been already created). In this example, the sequence
storage is instantiated as a simple memory storage (seqStorage).
Finally, the actual VariableQueryAlgorithm is started passing the created
module instances as parameters to the skeleton. The param.2 of this action speciﬁes
that this particular algorithm instance requires sequences of ﬂoating point
numbers and will compare them by Euclidean distance. Such SMF conﬁguration
ﬁles are processed directly by the MESSIF framework that enables creation of
such algorithm and its eﬃcient management.
This algorithm is demonstrated by a publicly available demo on a set of real
number sequences compared by L2 distance (http://muﬁn.ﬁ.muni.cz/subseq/).
Figure 5 shows screenshot of the GUI of this demo: The user can specify a
subsequence (oﬀset and width) of the query sequence and the most similar subsequences
are located within the indexes set.
5 Conclusions and Future Work
The sequence data is all around us in various forms and extensive volumes. The
research in the area of subsequence matching has been very intensive, resulting
in many full or partial solutions in various sub-areas. In this work, we have
identiﬁed several sub-tasks that circulate over the ﬁeld and are tackled within
various subsequence matching approaches and algorithms.
We present a generic subsequence matching framework (SMF) that brings
the option of choosing freely among the existing partial solutions and combining
them in order to achieve ideal solutions for heterogeneous requirements of
diﬀerent applications. Also, this framework overcomes the often mentioned implementation
bias present in the ﬁeld and it enables a straightforward utilization
of techniques from diﬀerent areas, for instance advanced metric indexes. We describe
SMF on conceptual and implementation levels and present an example
of a design and realization of subsequence algorithm with the aid of SMF. The
SMF is available under GPL license at http://mufin.fi.muni.cz/smf/.
The architecture of the framework is strictly modular and thus one of natural
directions of future development is implementation of other modules. Also, we
will develop SMF according to requirements emerging from continuous research
streams that utilize SMF. Finally and most importantly, we would like to contribute
to the eﬃciency of the subsequence matching systems by involvement of
advanced metric indexes. We believe in a positive impact of such cooperation of
these two research ﬁelds that were so far evolving relatively separately.
Acknowledgments. This work was supported by national research projects
GACR 103/10/0886, and GACR P202/10/P220.
General Subsequence Matching Framework 265
References
1. Keogh, E., Kasetty, S.: On the Need for Time Series Data Mining Benchmarks: A
Survey and Empirical Demonstration. In: Proceedings of ACM SIGKDD 2002, pp.
102–111. ACM Press (2002)
2. Keogh, E., Zhu, Q., Hu, B., Hay, Y., Xi, X., Wei, L., Ratanamahatana, C.A.: The
UCR Time Series Classiﬁcation/Clustering Homepage (2011)
3. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space
Approach. Springer (2006)
4. Novak, D., Batko, M., Zezula, P.: Metric Index: An Eﬃcient and Scalable Solution
for Precise and Approximate Similarity Search. Information Systems 36(4), 721–733
(2011)
5. Novak, D., Volny, P., Zezula, P.: Generic Subsequence Matching Framework: Modularity,
Flexibility, Eﬃciency. Technical report, arXiv:1206.2510v1 (2012)
6. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast Subsequence Matching in
Time-Series Databases. ACM SIGMOD Record 23(2), 419–429 (1994)
7. Guttman, A.: R-Trees: A Dynamic Index Structure for Spacial Searching. ACM
SIGMOD Record 14(2), 47–57 (1984)
8. Moon, Y.S., Whang, K.Y., Loh, W.K.: Duality-Based Subsequence Matching in
Time-Series Databases. In: Proceedings of the 17th International Conference on
Data Engineering, p. 263 (2001)
9. Moon, Y.S., Whang, K.Y., Han, W.S.: General Match: A Subsequence Matching
Method in Time-series Databases Based on Generalized Windows. In: International
Conference on Management of Data, p. 382 (2002)
10. Keogh, E., Ratanamahatana, C.A.: Exact indexing of dynamic time warping.
Knowledge and Information Systems 7(3), 358–386 (2004)
11. Han, W.S., Lee, J., Moon, Y.S., Jiang, H.: Ranked Subsequence Matching in Timeseries
Databases. In: Proceedings VLDB 2007, pp. 423–434. ACM (2007)
12. Chan, K.P., Fu, A.W.C.: Eﬃcient Time Series Matching by Wavelets. In: Proceedings
ICDE 1999, pp. 126–133 (1999)
13. Korn, F., Jagadish, H.V., Faloutsos, C.: Eﬃciently supporting ad hoc queries in
large datasets of time sequences. ACM SIGMOD Record 26(2), 289–300 (1997)
14. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality Reduction
for Fast Similarity Search in Large Time Series Databases. Knowledge and Information
Systems 3(3), 263–286 (2001)
15. Perng, C.S., Wang, H., Zhang, S.R., Parker, D.S.: Landmarks: A New Model for
Similarity-based Pattern Querying in Time Series Databases. In: Proceedings of
ICDE 2000, pp. 33–42. IEEE Computer Society, Washington, DC (2000)
16. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken
word recognition. IEEE Transactions on Acoustics Speech and Signal Processing
26(1), 43–49 (1978)
17. Chen, L., Ng, R.: On the Marriage of Lp-norms and Edit Distance. In: Proceedings
of VLDB 2004, pp. 792–803 (2004)
18. Shieh, J., Keogh, E.: i SAX. In: Proceeding of the 14th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD 2008, p. 623. ACM
Press, New York (2008)
19. Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: Indexing and Mining
One Billion Time Series. In: 2010 IEEE International Conference on Data Mining,
pp. 58–67. IEEE (2010)
20. Batko, M., Novak, D., Zezula, P.: MESSIF: Metric Similarity Search Implementation
Framework. In: Thanos, C., Borri, F., Candela, L. (eds.) Digital Libraries:
R&D. LNCS, vol. 4877, pp. 1–10. Springer, Heidelberg (2007)