Edit Distance for Pushdown Automata
Krishnendu Chatterjee(B)
, Thomas A. Henzinger, Rasmus Ibsen-Jensen,
and Jan Otop
IST Austria, Wien-Umgebung, Austria
krish.chat@gmail.com
Abstract. The edit distance between two words w1, w2 is the minimal
number of word operations (letter insertions, deletions, and substitutions)
necessary to transform w1 to w2. The edit distance generalizes to
languages L1, L2, where the edit distance is the minimal number k such
that for every word from L1 there exists a word in L2 with edit distance
at most k. We study the edit distance computation problem between
pushdown automata and their subclasses. The problem of computing
edit distance to pushdown automata is undecidable, and in practice, the
interesting question is to compute the edit distance from a pushdown
automaton (the implementation, a standard model for programs with
recursion) to a regular language (the speciﬁcation). In this work, we
present a complete picture of decidability and complexity for deciding
whether, for a given threshold k, the edit distance from a pushdown
automaton to a ﬁnite automaton is at most k.
1 Introduction
Edit distance. The edit distance [13] between two words is a well-studied metric,
which is the minimum number of edit operations (insertion, deletion, or substitution
of one letter by another) that transforms one word to another. The edit
distance between a word w to a language L is the minimal edit distance between
w and words in L. The edit distance between two languages L1 and L2 is the
supremum over all words w in L1 of the edit distance between w and L2.
Signiﬁcance of edit distance. The notion of edit distance provides a quantitative
measure of “how far apart” are (a) two words, (b) words from a language, and
(c) two languages. It forms the basis for quantitatively comparing sequences,
a problem that arises in many diﬀerent areas, such as error-correcting codes,
natural language processing, and computational biology. The notion of edit distance
between languages forms the foundations of a quantitative approach to
veriﬁcation. The traditional qualitative veriﬁcation (model checking) question
is the language inclusion problem: given an implementation (source language)
This research was funded in part by the European Research Council (ERC) under
grant agreement 267989 (QUAREM), by the Austrian Science Fund (FWF) projects
S11402-N23 (RiSE) and Z211-N23 (Wittgenstein Award), FWF Grant No P23499N23,
FWF NFN Grant No S11407-N23 (RiSE), ERC Start grant (279307: Graph
Games), and MSR faculty fellows award.
c Springer-Verlag Berlin Heidelberg 2015
M.M. Halld´orsson et al. (Eds.): ICALP 2015, Part II, LNCS 9135, pp. 121–133, 2015.
DOI: 10.1007/978-3-662-47666-6 10
122 K. Chatterjee et al.
deﬁned by an automaton AI and a speciﬁcation (target language) deﬁned by
an automaton AS, decide whether the language L(AI) is included in the language
L(AS) (i.e., L(AI ) ⊆ L(AS)). The threshold edit distance (TED) problem
is a generalization of the language inclusion problem, which for a given integer
threshold k ≥ 0 asks whether every word in the source language L(AI) has
edit distance at most k to the target language L(AS) (with k = 0 we have the
traditional language inclusion problem). For example, in simulation-based veriﬁcation
of an implementation against a speciﬁcation, the measured trace may
diﬀer slightly from the speciﬁcation due to inaccuracies in the implementation.
Thus, a trace of the implementation may not be in the speciﬁcation. However,
instead of rejecting the implementation, one can quantify the distance between
a measured trace and the speciﬁcation. Among all implementations that violate
a speciﬁcation, the closer the implementation traces are to the speciﬁcation, the
better [5,7,10]. The edit distance problem is also the basis for repairing speciﬁcations
[2,3].
Our models. In this work we consider the edit distance computation problem
between two automata A1 and A2, where A1 and A2 can be (non)deterministic
ﬁnite automata or pushdown automata. Pushdown automata are the standard
models for programs with recursion, and regular languages are canonical to
express the basic properties of systems that arise in veriﬁcation. We denote by
DPDA (resp., PDA) deterministic (resp., nondeterministic) pushdown automata,
and DFA (resp., NFA) deterministic (resp., nondeterministic) ﬁnite automata.
We consider source and target languages deﬁned by DFA, NFA, DPDA, and
PDA. We ﬁrst present the known results and then our contributions.
Previous results. The main results for the classical language inclusion problem
are as follows [11]: (i) if the target language is a DFA, then it can be solved in
polynomial time; (ii) if either the target language is a PDA or both source and
target languages are DPDA, then it is undecidable; (iii) if the target language
is an NFA, then (a) if the source language is a DFA or NFA, then it is PSpacecomplete,
and (b) if the source language is a DPDA or PDA, then it is PSpacehard
and can be solved in ExpTime (to the best of our knowledge, there is
a complexity gap where the upper bound is ExpTime and the lower bound is
PSpace). The TED problem was studied for DFA and NFA, and it is PSpacecomplete,
when the source and target languages are given by DFA or NFA [2,3].
Our contributions. Our main contributions are as follows.
1. We show that the TED problem is ExpTime-complete, when the source language
is given by a DPDA or a PDA, and the target language is given by
a DFA or NFA. We present a hardness result which shows that the TED
problem is ExpTime-hard for source languages given as DPDA and target
languages given as DFA. We present a matching upper bound by showing
that for source languages given as PDA and target languages given as NFA
the problem can be solved in ExpTime. As a consequence of our lower bound
we obtain that the language inclusion problem for source languages given by
DPDA (or PDA) and target languages given by NFA is ExpTime-complete.
Thus we present a complete picture of the complexity of the TED problem,
Edit Distance for Pushdown Automata 123
Table 1. Complexity of the language inclusion problem from C1 to C2. Our results are
boldfaced.
C2 = DFA C2 = NFA C2 = DPDA C2 = PDA
C1 ∈ {DFA, NFA}
PTime
PSpace-c PTime
C1 ∈ {DPDA, PDA} ExpTime-c (Th. 2) undecidable
Table 2. Complexity of FED(C1, C2). Our results are boldfaced. See Conjecture 14 for
the open complexity problem of C1 ∈ {DPDA, PDA} and C2 = DFA.
C2 = DFA C2 = NFA C2 = DPDA C2 = PDA
C1 ∈ {DFA, NFA} coNP-c [3] PSpace-c [3] open (Conj. 18)
C1 ∈ {DPDA, PDA}
coNP-hard [3] ExpTime-c
undecidable (Prop. 15)
in ExpTime (Th. 8) (Th. 8)
Table 3. Complexity of TED(C1, C2). Our results are boldfaced.
C2 = DFA C2 = NFA C2 = DPDA C2 = PDA
C1 ∈ {DFA, NFA} PSpace-c [2] undecidable (Prop. 17)
C1 ∈ {DPDA, PDA} ExpTime-c (Th. 2 (1)) undecidable
and in addition we close a complexity gap in the classical language inclusion
problem. In contrast, if the target language is given by a DPDA, then
the TED problem is undecidable even for source languages given as DFA.
Note that the interesting veriﬁcation question is when the implementation
(source language) is a DPDA (or PDA) and the speciﬁcation (target language)
is given as DFA (or NFA), for which we present decidability results
with optimal complexity.
2. We also consider the ﬁnite edit distance (FED) problem, which asks whether
there exists k ≥ 0 such that the answer to the TED problem with threshold
k is YES. For ﬁnite automata, it was shown in [2,3] that if the answer to
the FED problem is YES, then a polynomial bound on k exists. In contrast,
the edit distance can be exponential between DPDA and DFA. We present
a matching exponential upper bound on k for the FED problem from PDA
to NFA. Finally, we show that the FED problem is ExpTime-complete when
the source language is given as a DPDA or PDA, and the target language
as an NFA.
Our results are summarized in Tables 1, 2 and 3. Due to space constraints we
omit some technical proofs, which are presented in the full version [6].
Related work. Algorithms for edit distance have been studied extensively for
words [1,12,13,15–17]. The edit distance between regular languages was studied
in [2,3], between timed automata in [8], and between straight line programs
in [9,14]. A near-linear time algorithm to approximate the edit distance for a
word to a Dyck language has been presented in [18].
124 K. Chatterjee et al.
2 Preliminaries
2.1 Words, Languages and Automata
Words. Given a ﬁnite alphabet Σ of letters, a word w is a ﬁnite sequence of
letters. For a word w, we deﬁne w[i] as the i-th letter of w and |w| as its length.
We denote the set of all words over Σ by Σ∗
. We use to denote the empty
word.
Pushdown Automata. A (non-deterministic) pushdown automaton (PDA) is
a tuple (Σ, Γ, Q, S, δ, F), where Σ is the input alphabet, Γ is a ﬁnite stack
alphabet, Q is a ﬁnite set of states, S ⊆ Q is a set of initial states, δ ⊆ Q ×
Σ × (Γ ∪ {⊥}) × Q × Γ∗
is a ﬁnite transition relation and F ⊆ Q is a set of
ﬁnal (accepting) states. A PDA (Σ, Γ, Q, S, δ, F) is a deterministic pushdown
automaton (DPDA) if |S| = 1 and δ is a function from Q × Σ × (Γ ∪ {⊥}) to
Q × Γ∗
. We denote the class of all PDA (resp., DPDA) by PDA (resp., DPDA).
We deﬁne the size of a PDA A = (Σ, Γ, Q, S, δ, F), denoted by |A|, as |Q| + |δ|.
Runs of Pushdown Automata. Given a PDA A and a word w = w[1] . . . w[k]
over Σ, a run π of A on w is a sequence of elements from Q × Γ∗
of length k + 1
such that π[0] ∈ S × { } and for every i ∈ {1, . . . , k} either (1) π[i − 1] = (q, ),
π[i] = (q , u ) and (q, w[i], ⊥, q , u ) ∈ δ, or (2) π[i − 1] = (q, ua), π[i] = (q , uu )
and (q, w[i], a, q , u ) ∈ δ. A run π of length k+1 is accepting if π[k+1] ∈ F ×{ },
i.e., the automaton is in an accepting state and the stack is empty. The language
recognized (or accepted) by A, denoted L(A), is the set of words that have an
accepting run.
Context Free Grammar (CFG). A context free grammar in Chomsky normal
form (CFG) is a tuple (Σ, V, s, P), where Σ is the alphabet, V is a set of nonterminals,
s ∈ V is a start symbol and P is a set of production rules. A production
rule p has one of the following forms: (1) p : v → zu, where v, z, u ∈ V ; or
(2) p : v → α, where v ∈ V and α ∈ Σ; or (3) p : s → .
Languages Generated by CFGs. Fix a CFG G = (Σ, V, s, P). We deﬁne
derivation →G as a relation on (Σ ∪ V )∗
× (Σ ∪ V )∗
as follows: w →G w iﬀ
w = w1vw2, with v ∈ V , and w = w1uw2 for some u ∈ (Σ ∪ V )∗
such that
v → u is a production from G. We deﬁne →∗
G as the transitive closure of →G.
The language generated by G, denoted by L(G) = {w ∈ Σ∗
| s →∗
G w} is the
set of words that can be derived from s. CFGs and PDAs are language-wise
polynomially equivalent [11] (i.e., there is a polynomial-time procedure that,
given a PDA, outputs a CFG of the same language and vice versa).
Finite Automata. A non-deterministic ﬁnite automaton (NFA) is a PDA with
empty stack alphabet. We will omit Γ while referring to NFA, i.e., we will con-
Edit Distance for Pushdown Automata 125
sider them as tuples (Σ, Q, S, δ, F). We denote the class of all NFA by NFA.
Analogously to DPDA we deﬁne deterministic ﬁnite automata (DFA).
Language Inclusion. Let C1, C2 be subclasses of PDA. The inclusion problem
from C1 in C2 asks, given A1 ∈ C1, A2 ∈ C2, whether L(A1) ⊆ L(A2).
Edit Distance between Words. Given two words w1, w2, the edit distance
between w1, w2, denoted by ed(w1, w2), is the minimal number of single letter
operations: insertions, deletions, and substitutions, necessary to transform w1
into w2.
Edit Distance between Languages. Let L1, L2 be languages. We
deﬁne the edit distance from L1 to L2, denoted ed(L1, L2), as supw1∈L1
infw2∈L2
ed(w1, w2). The edit distance between languages is not a distance function.
In particular, it is not symmetric.
2.2 Problem Statement
In this section we deﬁne the problems of interest. Then, we recall the previous
results and succinctly state our results.
Deﬁnition 1. For C1, C2 ∈ {DFA, NFA, DPDA, PDA} we deﬁne the following
questions:
1. The threshold edit distance problem from C1 to C2 (denoted TED(C1, C2)):
Given automata A1 ∈ C1, A2 ∈ C2 and an integer threshold k ≥ 0, decide
whether ed(L(A1), L(A2)) ≤ k.
2. The ﬁnite edit distance problem from C1 to C2 (denoted FED(C1, C2)): Given
automata A1 ∈ C1, A2 ∈ C2, decide whether ed(L(A1), L(A2)) < ∞.
3. Computation of edit distance from C1 to C2: Given automata A1 ∈ C1, A2 ∈
C2, compute ed(L(A1), L(A2)).
We establish the complete complexity picture for the TED problem for all combinations
of source and target languages given by DFA, NFA, DPDA and PDA:
1. TED for regular languages has been studied in [2], where PSpacecompleteness
of TED(C1, C2) for C1, C2 ∈ {DFA, NFA} has been established.
2. In Section 3, we study the TED problem for source languages given by
pushdown automata and target languages given by ﬁnite automata. We
establish ExpTime-completeness of TED(C1, C2) for C1 ∈ {DPDA, PDA} and
C2 ∈ {DFA, NFA}.
3. In Section 5, we study the TED problem for target languages given by
pushdown automata. We show that TED(C1, C2) is undecidable for C1 ∈
{DFA, NFA, DPDA, PDA} and C2 ∈ {DPDA, PDA}.
We study the FED problem for all combinations of source and target languages
given by DFA, NFA, DPDA and PDA and obtain the following results:
1. FED for regular languages has been studied in [3]. It has been shown that
for C1 ∈ {DFA, NFA}, the problem FED(C1, DFA) is coNP-complete, while the
problem FED(C1, NFA) is PSpace-complete.
126 K. Chatterjee et al.
2. We show in Section 4 that for C1 ∈ {DPDA, PDA}, the problem FED(C1, NFA)
is ExpTime-complete.
3. We show in Section 5 that (1) for C1 ∈ {DFA, NFA, DPDA, PDA}, the problem
FED(C1, PDA) is undecidable, and (2) the problem FED(DPDA, DPDA) is
undecidable.
3 Threshold Edit Distance from Pushdown to Regular
Languages
In this section we establish the complexity of the TED problem from pushdown
to ﬁnite automata.
Theorem 2. (1) For C1 ∈ {DPDA, PDA} and C2 ∈ {DFA, NFA}, the
TED(C1, C2) problem is ExpTime-complete. (2) For C1 ∈ {DPDA, PDA}, the language
inclusion problem from C1 in NFA is ExpTime-complete.
We establish the above theorem as follows: In Section 3.1, we present an
exponential-time algorithm for TED(PDA, NFA) (for the upper bound of (1)).
Then, in Section 3.2 we show (2), in a slightly stronger form, and reduce it (that
stronger problem) to TED(DPDA, DFA), which shows the ExpTime-hardness part
of (1).
3.1 Upper Bound
We present an ExpTime algorithm that, given (1) a PDA AP ; (2) an NFA AN ;
and (3) a threshold t given in binary, decides whether the edit distance from AP
to AN is above t. The algorithm extends a construction for NFA by Benedikt et
al. [2].
Intuition. The construction uses the idea that for a given word w and an NFA
AN the following are equivalent: (i) ed(w, AN ) > t, and (ii) for each accepting
state s of AN and for every word w , if AN can reach s from some initial state
upon reading w , then ed(w, w ) > t. We construct a PDA AI which simulates
the PDA AP and stores in its states all states of the NFA AN reachable with
at most t edits. More precisely, the PDA AI remembers in its states, for every
state s of the NFA AN , the minimal number of edit operations necessary to
transform the currently read preﬁx wp of the input word into a word wp, upon
which AN can reach s from some initial state. If for some state the number
of edit operations exceeds t, then we associate with this state a special symbol
# to denote this. Then, we show that a word w accepted by the PDA AP has
ed(w, AN ) > t iﬀ the automaton AI has a run on w that ends (1) in an accepting
state of simulated AP , (2) with the simulated stack of AP empty, and (3) the
symbol # is associated with every accepting state of AN .
Lemma 3. (1) Given (i) a PDA AP ; (ii) an NFA AN ; and (iii) a threshold
t given in binary, the decision problem of whether ed(AP , AN ) ≤ t can
be reduced to the emptiness problem for a PDA of size O(|AP | · (t + 2)|AN |
).
(2) TED(PDA, NFA) is in ExpTime.
Edit Distance for Pushdown Automata 127
3.2 Lower Bound
Our ExpTime-hardness proof of TED(DPDA, DFA) extends the idea from [2] that
shows PSpace-hardness of the edit distance for DFA. The standard proof of
PSpace-hardness of the universality problem for NFA [11] is by reduction to the
halting problem of a ﬁxed Turing machine M working on a bounded tape. The
Turing machine M is the one that simulates other Turing machines (such a
machine is called universal). The input to that problem is the initial conﬁguration
C1 and the tape is bounded by its size |C1|. In the reduction, the NFA
recognizes the language of all words that do not encode valid computation of
M starting from the initial conﬁguration C1, i.e., it checks the following four
conditions: (1) the given word is a sequence of conﬁgurations, (2) the state of
the Turing machine and the adjunct letters follow from transitions of M, (3) the
ﬁrst conﬁguration is not C1 and (4) the tape’s cells are changed only by M,
i.e., they do not change values spontaneously. While conditions (1), (2) and (3)
can be checked by a DFA of polynomial size, condition (4) can be encoded by a
polynomial-size NFA but not a polynomial-size DFA. However, to check (4) the
automaton has to make only a single non-deterministic choice to pick a position
in the encoding of the computation, which violates (4), i.e., the value at that
position is diﬀerent from the value |C1| + 1 letters further, which corresponds to
the same memory cell in the successive conﬁguration, and the head of M does
not change it. We can transform a non-deterministic automaton AN checking
(4) into a deterministic automaton AD by encoding such a non-deterministic
pick using an external letter. Since we need only one external symbol, we have
L(AN ) = Σ∗
iﬀ ed(Σ∗
, L(AD)) = 1. This suggests the following deﬁnition:
Deﬁnition 4. An NFA A = (Σ, Q, S, δ, F) is nearly-deterministic if |S| = 1
and δ = δ1 ∪δ2, where δ1 is a function and in every accepting run the automaton
takes a transition from δ2 exactly once.
Lemma 5. There exists a DPDA AP such that the problem, given a nearlydeterministic
NFA AN , decide whether L(AP ) ⊆ L(AN ), is ExpTime-hard.
Proof. Consider the linear-space halting problem for a (ﬁxed) alternating Turing
machine (ATM) M: given an input word w over an alphabet Σ, decide whether
M halts on w with the tape bounded by |w|. There exists an ATM MU , such
that the linear-space halting problem for MU is ExpTime-complete [4]. We show
the ExpTime-hardness of the problem from the lemma statement by reduction
from the linear-space halting problem for MU .
We w.l.o.g. assume that existential and universal transitions of MU alternate.
Fix an input of length n. The main idea is to construct the language L of words
that encode valid terminating computation trees of MU on the given input.
Observe that the language L depends on the given input. We encode a single
conﬁguration of MU as a word of length n + 1 of the form Σi
qΣn−i
, where
q is a state of MU . Recall that a computation of an ATM is a tree, where
every node of the tree is a conﬁguration of MU , and it is accepting if every leaf
node is an accepting conﬁguration. We encode computation trees T of MU by
128 K. Chatterjee et al.
traversing T preorder and executing the following: if the current node has only
one successor, then write down the current conﬁguration C, terminate it with #
and move down to the successor node in T. Otherwise, if the current node has
two successors s, t in the tree, then write down in order (1) the reversed current
conﬁguration CR
; and (2) the results of traversals on s and t, each surrounded
by parentheses ( and ), i.e., CR
( us
) ( ut
) , where us
(resp., ut
) is the result of
the traversal of the subtree of T rooted at s (resp., t). Finally, if the current node
is a leaf, write down the corresponding conﬁguration and terminate with $. For
example, consider a computation with the initial conﬁguration C1, from which
an existential transition leads to C2, which in turn has a universal transition to
C3 and C4. Such a computation tree is encoded as follows:
C1 # CR
2 ( C3 . . . $ ) ( C4 . . . $ ) .
We deﬁne automata AN and AP over the alphabet Σ ∪ {#, $, (, )}. The
automaton AN is a nearly deterministic NFA that recognizes only (but not all)
words not encoding valid computation trees of MU . More precisely, AN accepts
in four cases: (1) The word does not encode a tree (except that the parentheses
may not match as the automaton cannot check that) of computation as presented
above. (2) The initial conﬁguration is diﬀerent from the one given as
the input. (3) The successive conﬁgurations, i.e., those that result from existential
transitions or left-branch universal transitions (like C2 to C3), are valid.
The right-branch universal transitions, which are preceded by the word “)(”, are
not checked by AN . For example, the consistency of the transition C2 to C4 is
not checked by AN . Finally, (4) AN accepts words in which at least one ﬁnal
conﬁguration, which is a conﬁguration followed by $, is not ﬁnal for MU .
Next, we deﬁne AP as a DPDA that accepts words in which parentheses
match and right-branch universal transitions are consistent, e.g., it checks consistency
of a transition from C2 to C4. The automaton AP pushes conﬁgurations
on even levels of the computation tree (e.g., CR
2 ), which are reversed, on
the stack and pops these conﬁgurations from the stack to compare them with
the following conﬁguration in the right subtree (e.g., C4). In the example this
means that, while the automaton processes the subword ( C3 . . . $ ), it can use its
stack to check consistency of universal transitions in that subword. We assumed
that MU does not have consecutive universal transitions. This means that, for
example, AP does not need to check the consistency of C4 with its successive
conﬁguration. By construction, we have L = L(AP ) ∩ L(AN )c
(recall that L is
the language of encodings of computations of MU on the given input) and MU
halts on the given input if and only if L(AP ) ⊆ L(AN ) fails. Observe that AP is
ﬁxed for all inputs, since it only depends on the ﬁxed Turing machine MU .
Now, the following lemma, which is (2) of Theorem 2, follows from Lemma 5.
Lemma 6. The inclusion problem of DPDA in NFA is ExpTime-complete.
Proof. The ExpTime upper bound is immediate (basically, an exponential determinization
of the NFA, followed by complementation, product construction with
Edit Distance for Pushdown Automata 129
the PDA, and the emptiness check of the product PDA in polynomial time
in the size of the product). ExpTime-hardness of the problem follows from
Lemma 5.
Now, we show that the inclusion problem of DPDA in nearly-deterministic NFA,
which is ExpTime-complete by Lemma 5, reduces to TED(DPDA, DFA). In the
reduction, we transform a nearly-deterministic NFA AN over the alphabet Σ
into a DFA AD by encoding a single non-deterministic choice by auxiliary letters.
More precisely, for the transition relation δ = δ1 ∪ δ2 of AN , we transform
every transition (q, a, q ) ∈ δ2 into (q, b(q,a,q )
, q ), where b(q,a,q )
is a fresh auxiliary
letter. Now, consider a DPDA AP over the alphabet Σ. As every word
in L(AD) contains a single auxiliary letter ed(L(AP ), L(AD)) ≥ 1. Conversely,
for every word w ∈ Σ∗
we have ed(w, L(AD)) ≤ 1 implies w ∈ AN . Therefore,
ed(L(AP ), L(AD)) ≤ 1 if and only if L(AP ) ⊆ L(AN ).
Lemma 7. TED(DPDA, DFA) is ExpTime-hard.
4 Finite Edit Distance from Pushdown to Regular
Languages
In this section we study the complexity of the FED problem from pushdown
automata to ﬁnite automata.
Theorem 8. (1) For C1 ∈ {DPDA, PDA} and C2 ∈ {DFA, NFA} we have the
following dichotomy: for all A1 ∈ C1, A2 ∈ C2 either ed(L(A1), L(A2)) is exponentially
bounded in |A1| + |A2| or ed(L(A1), L(A2)) is inﬁnite. Conversely, for
every n there exist a DPDA AP and a DFA AD, both of the size O(n), such
that ed(L(AP ), L(AD)) is ﬁnite and exponential in n (i.e., the dichotomy is
asymptotically tight). (2) For C1 ∈ {DPDA, PDA} the FED(C1, NFA) problem is
ExpTime-complete. (3) Given a PDA AP and an NFA AN , we can compute the
edit distance ed(L(AP ), L(AN )) in time exponential in |AP | + |AN |.
First, we show in Section 4.1 the exponential upper bound for (1), which together
with Theorem 2, implies the ExpTime upper bound for (2). Next, in Section 4.2,
we show that FED(DPDA, NFA) is ExpTime-hard. We also present the exponential
lower bound for (1). Finally, (1), (2), and Theorem 2 imply (3) (by iteratively
testing with increasing thresholds up to exponential bounds along with the decision
procedure from Theorem 2).
4.1 Upper Bound
In this section we consider the problem of deciding whether the edit distance
from a PDA to an NFA is ﬁnite. We start with a reduction of the problem. Given
a language L, we deﬁne L = {u : u is a preﬁx of some word from L}. We call an
automaton A safety if every state of A is accepting. Note that an automaton is
not necessarily total, i.e., some states might not have an outgoing transition for
130 K. Chatterjee et al.
some input symbols, and thus a safety automaton does not necessarily accept all
words. Note that for every NFA AN , the language L(AN ) is the language of a
safety NFA. We show that FED(PDA, NFA) reduces to FED from PDA to safety
NFA.
Lemma 9. Let AP be a PDA and AN an NFA. The following inequalities hold:
ed(L(AP ), L(AN )) ≥ ed(L(AP ), L(AN )) ≥ ed(L(AP ), L(AN )) − |AN |
The following deﬁnition and lemma can be seen as a reverse version of the
pumping lemma for context free grammars (in that we ensure that the part
which can not be pumped is small).
Compact G-decomposition. Given a CFG G = (Σ, V, s, P), where T =
|V |, and a word w ∈ L(G) we deﬁne compact G-decomposition of w as w =
(siui)k
i=1sk+1, where si and ui are subwords of w for all i, such that
1. for all , the word w( ) := (siui)k
i=1sk+1 is in L(G); and
2. |w(0)| =
k+1
i=1 |si| ≤ 2T
and k ≤ 2T +1
− 2.
Lemma 10. For every CFG G = (Σ, V, s, P), every word w ∈ L(G) admits a
compact G-decomposition.
Intuition. The proof follows by repeated application of the principle behind the
pumping lemma, until the part which is not pump-able is small.
Reachability Sets. Fix an NFA. Given a state q in the NFA and a word w, let
Qw
q be the set of states reachable upon reading w, starting in q. The set of states
R(w, q) is then the set of states reachable from Qw
q upon reading any word. For
a set Q and word w, the set R(w, Q ) is q∈Q R(w, q).
We have the following property of reachability sets: Fix a word u, a
number , an NFA and a set of states Q of the NFA, where Q is closed under
reachablity, i.e., for all q ∈ Q and a ∈ Σ we have δ(q, a) ⊆ Q . Let u be a word
with non-overlapping occurrences of u (e.g. u ). Consider any word w with edit
distance strictly less than from u . Any run on w, starting in some state of Q ,
reaches a state of R(u, Q ). This is because u must be a sub-word of w.
Lemma 11. Let G be a CFG with a set of non-terminals of size T and let AN be
a safety NFA with state set Q of size n. The following conditions are equivalent:
(i) the edit distance ed(L(G), L(AN )) is inﬁnite,
(ii) the edit distance ed(L(G), L(AN )) exceeds B := (2T +1
− 2) · n + 2T
, and
(iii) there exists a word w ∈ L(G), with compact G-decomposition w =
(siui)k
i=1sk+1, such that R(uk, R(uk−1, R(uk−2, . . . R(u1, Q) . . .))) = ∅.
Intuition behind the proof: Whenever we consider a word w, the compact Grepresentation
of it is w = (siui)k
i=1sk+1. Let
R(w, j) = R(uj, R(uj−1, R(uj−2, . . . R(u1, Q) . . .)))
Edit Distance for Pushdown Automata 131
for all j and words w. Observe that (i) ⇒ (ii) is trivial. Intuitively, the proof for
(ii) ⇒ (iii) is by contradiction: Consider a word w in L(G) with edit distance
above B from L(AN ). Assume towards contradiction that R(w, k) is not empty.
Then there is a word w = (siui)k
i=1 in L(AN ) where each si has length at most
n. But ed(w, w ) ≤ B by deﬁnition of compact G-representation (i.e. edit each
si to si separately), which is a contradiction. To show (iii) ⇒ (i) we consider
a word w where R(w, k) is empty and we show that w( ) (from compact Grepresentation)
requires at least edits to L(AN ). Inductively in j, there must
be either at least edits on (siui)j
i=1 or R(w, j) has been reached, by the property
of reachability sets. Since R(w, k) is empty, there must be edits on w( ).
The equivalence of (i) and (ii) of Lemma 11 gives a bound on the maximum
ﬁnite edit distance. The following lemma follows from Lemmas 9 and 11, and
Theorem 2 for testing given thresholds.
Lemma 12. For all C1 ∈ {DPDA, PDA}, C2 ∈ {DFA, NFA} the FED(C1, C2) problem
is in ExpTime.
4.2 Lower Bound
We have shown the exponential upper bound on the edit distance if it is ﬁnite.
It is easy to deﬁne a family of CFGs only accepting an exponential length word,
using repeated doubling and thus the edit distance can be exponential between
DPDA and DFA. We also show that the inclusion problem reduces to the ﬁnite
edit distance problem FED(DPDA, NFA) and get the following lemma.
Lemma 13. FED(DPDA, NFA) is ExpTime-hard.
We conjecture that, as for the case of language inclusion, for the ﬁnite edit
distance problem the complexity of the DPDA/PDA to DFA problem matches
the one for NFA/DFA to DFA.
Conjecture 14. FED(PDA, DFA) is coNP-complete.
5 Edit Distance to PDA
Observe that the threshold distance problem from DFA to PDA with the ﬁxed
threshold 0 and a ﬁxed DFA recognizing Σ∗
coincides with the universality
problem for PDA. Hence, the universality problem for PDA, which is undecidable,
reduces to TED(DFA, PDA). The universality problem for PDA reduces to
FED(DFA, PDA) as well by the same argument as in Lemma 13. Finally, we can
reduce the inclusion problem from DPDA in DPDA, which is undecidable, to
TED(DPDA, DPDA) (resp., FED(DPDA, DPDA)). Again, we can use the same
construction as in Lemma 13. In conclusion, we have the following proposition.
Proposition 15. (1) For every class C ∈ {DFA, NFA, DPDA, PDA}, the problems
TED(C, PDA) and FED(C, PDA) are undecidable. (2) For every class C ∈
{DPDA, PDA}, the problem FED(C, DPDA) is undecidable.
132 K. Chatterjee et al.
The results in (1) of Proposition 15 are obtained by reduction from the universality
problem for PDA. However, the universality problem for DPDA is decidable.
Still we show that TED(DFA, DPDA) is undecidable. The overall argument is
similar to the one in Section 3.2. First, we deﬁne a pushdown counterpart of
nearly-deterministic NFA. A PDA A = (Σ, Γ, Q, S, δ, F) is nearly-deterministic
if |S| = 1 and δ = δ1 ∪ δ2, where δ1 is a function and for every accepting run,
the automaton takes a transition from δ2 exactly once.
By carefully reviewing the standard reduction of the halting problem for
Turing machines to the universality problem for pushdown automata [11], we
observe that the PDA that appear in the reduction are nearly-deterministic.
Lemma 16. The problem, given a nearly-deterministic PDA AP , decide
whether L(AP ) = Σ∗
, is undecidable.
Using the same construction as in Lemma 7 we show a reduction of the universality
problem for nearly-deterministic PDA to TED(DFA, DPDA).
Proposition 17. For every class C ∈ {DFA, NFA, DPDA, PDA}, the problem
TED(C, DPDA) is undecidable.
We presented the complete decidability picture for the problems TED(C1, C2), for
C1 ∈ {DFA, NFA, DPDA, PDA} and C2 ∈ {DPDA, PDA}. To complete the characterization
of the problems FED(C1, C2), with respect to their decidability, we
still need to settle the decidability (and complexity) status of FED(DFA, DPDA).
We leave it as an open problem, but conjecture that it is coNP-complete.
Conjecture 18. FED(DFA, DPDA) is coNP-complete.
References
1. Aho, A., Peterson, T.: A minimum distance error-correcting parser for context-free
languages. SIAM J. of Computing 1, 305–312 (1972)
2. Benedikt, M., Puppis, G., Riveros, C.: Regular repair of speciﬁcations. In: LICS
2011, pp. 335–344 (2011)
3. Benedikt, M., Puppis, G., Riveros, C.: Bounded repairability of word languages.
J. Comput. Syst. Sci. 79(8), 1302–1321 (2013)
4. Chandra, A.K., Kozen, D.C., Stockmeyer, L.J.: Alternation. J. ACM 28(1),
114–133 (1981). http://doi.acm.org/10.1145/322234.322243
5. Chatterjee, K., Doyen, L., Henzinger, T.A.: Quantitative languages. ACM Trans.
Comput. Log. 11(4) (2010)
6. Chatterjee, K., Henzinger, T.A., Ibsen-Jensen, R., Otop, J.: Edit distance for pushdown
automata. CoRR abs/1504.08259 (2015). http://arxiv.org/abs/1504.08259
7. Chatterjee, K., Henzinger, T.A., Otop, J.: Nested weighted automata. CoRR
abs/1504.06117 (2015). http://arxiv.org/abs/1504.06117 (to appear at LICS 2015)
8. Chatterjee, K., Ibsen-Jensen, R., Majumdar, R.: Edit distance for timed automata.
In: HSCC 2014, pp. 303–312 (2014)
Edit Distance for Pushdown Automata 133
9. Gawrychowski, P.: Faster algorithm for computing the edit distance between SLPcompressed
strings. In: Calder´on-Benavides, L., Gonz´alez-Caro, C., Ch´avez, E.,
Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 229–236. Springer, Heidelberg
(2012)
10. Henzinger, T.A., Otop, J.: From model checking to model measuring. In:
D’Argenio, P.R., Melgratti, H. (eds.) CONCUR 2013 – Concurrency Theory.
LNCS, vol. 8052, pp. 273–287. Springer, Heidelberg (2013)
11. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages, and
Computation. Adison-Wesley Publishing Company, Reading (1979)
12. Karp, R.: Mapping the genome: some combinatorial problems arising in molecular
biology. In: STOC 93, pp. 278–285. ACM (1993)
13. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and
reversals. Soviet physics doklady. 10, 707–710 (1966)
14. Lifshits, Y.: Processing compressed texts: a tractability border. In: Ma, B., Zhang,
K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 228–240. Springer, Heidelberg (2007)
15. Mohri, M.: Edit-distance of weighted automata: general deﬁnitions and algorithms.
Intl. J. of Foundations of Comp. Sci. 14, 957–982 (2003)
16. Okuda, T., Tanaka, E., Kasai, T.: A method for the correction of garbled words
based on the levenshtein metric. IEEE Trans. Comput. 25, 172–178 (1976)
17. Pighizzini, G.: How hard is computing the edit distance? Information and Computation
165, 1–13 (2001)
18. Saha, B.: The dyck language edit distance problem in near-linear time. In: FOCS
2014, pp. 611–620 (2014)