Verification of Markov Decision Processes Using Learning Algorithms*

Tomáš Brázdil1, Krishnendu Chatterjee2, Martin Chmelík2, Vojtěch Forejt3, Jan Křetínský2, Marta Kwiatkowska3, David Parker4, and Mateusz Ujma3

1 Masaryk University, Brno, Czech Republic
2 IST Austria
3 University of Oxford, UK
4 University of Birmingham, UK

Abstract. We present a general framework for applying machine-learning algorithms to the verification of Markov decision processes (MDPs). The primary goal of these techniques is to improve performance by avoiding an exhaustive exploration of the state space. Our framework focuses on probabilistic reachability, which is a core property for verification, and is illustrated through two distinct instantiations. The first assumes that full knowledge of the MDP is available, and performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the case where we may only sample the MDP, and yields probabilistic guarantees, again in terms of both lower and upper bounds, which provide efficient stopping criteria for the approximation. The latter is the first extension of statistical model checking for unbounded properties in MDPs. In contrast with other related techniques, our approach is not restricted to time-bounded (finite-horizon) or discounted properties, nor does it assume any particular properties of the MDP. We also show how our methods extend to LTL objectives. We present experimental results showing the performance of our framework on several examples.

* This research was funded in part by the European Research Council (ERC) under grant agreements 267989 (QUAREM), 246967 (VERIWARE) and 279307 (Graph Games), by the EU FP7 project HIERATIC, by the Austrian Science Fund (FWF) projects S11402-N23 (RiSE), S11407-N23 (RiSE) and P23499-N23, by the Czech Science Foundation grant No. P202/12/P612, by EPSRC project EP/K038575/1 and by the Microsoft Faculty Fellows Award.

1 Introduction

Markov decision processes (MDPs) are a widely used model for the formal verification of systems that exhibit stochastic behaviour. This may arise due to the possibility of failures (e.g. of physical system components), unpredictable events (e.g. messages sent across a lossy medium), or uncertainty about the environment (e.g. unreliable sensors in a robot). It may also stem from the explicit use of randomisation, such as probabilistic routing in gossip protocols or random back-off in wireless communication protocols.

Verification of MDPs against temporal logics such as PCTL and LTL typically reduces to the computation of optimal (minimum or maximum) reachability probabilities, either on the MDP itself or on its product with some deterministic ω-automaton. Optimal reachability probabilities (and a corresponding optimal strategy for the MDP) can be computed in polynomial time through a reduction to linear programming, although in practice verification tools often use dynamic programming techniques, such as value iteration, which approximates the values up to some pre-specified convergence criterion. The efficiency or feasibility of verification is often limited by excessive time or space requirements, caused by the need to store a full model in memory.
Common approaches to tackling this include: symbolic model checking, which uses efficient data structures to construct and manipulate a compact representation of the model; abstraction refinement, which constructs a sequence of increasingly precise approximations, bypassing construction of the full model using decision procedures such as SAT or SMT; and statistical model checking, which uses Monte Carlo simulation to generate approximate results of verification that hold with high probability.

In this paper, we explore the opportunities offered by learning-based methods, as used in fields such as planning or reinforcement learning [36]. In particular, we focus on algorithms that explore an MDP by generating trajectories through it and, whilst doing so, produce increasingly precise approximations for some property of interest (in this case, reachability probabilities). The approximate values, along with other information, are used as heuristics to guide the model exploration so as to minimise the solution time and the portion of the model that needs to be considered.

We present a general framework for applying such algorithms to the verification of MDPs. Then, we consider two distinct instantiations that operate under different assumptions concerning the availability of knowledge about the MDP, and produce different classes of results. We distinguish between complete information, where full knowledge of the MDP is available (but not necessarily generated and stored), and limited information, where (in simple terms) we can only sample trajectories of the MDP.

The first algorithm assumes complete information and is based on real-time dynamic programming (RTDP) [3]. In its basic form, this only generates approximations in the form of lower bounds (on maximum reachability probabilities). While this may suffice in some scenarios (e.g. planning), in the context of verification we typically require more precise guarantees. So we consider bounded RTDP (BRTDP) [30], which supplements this with an additional upper bound. The second algorithm assumes limited information and is based on delayed Q-learning (DQL) [35]. Again, we produce both lower and upper bounds but, in contrast to BRTDP, where these are guaranteed to be correct, DQL offers probably approximately correct (PAC) results, i.e., there is a non-zero probability that the bounds are incorrect.

Typically, MDP solution methods based on learning or heuristics make assumptions about the structure of the model. For example, the presence of end components (subsets of states in which it is possible to remain indefinitely with probability 1) can result in convergence to incorrect values. Our techniques are applicable to arbitrary MDPs. We first handle the case of MDPs that contain no end components (except for trivial designated goal or sink states). Then, we adapt this to the general case by means of on-the-fly detection of end components, which is one of the main technical contributions of the paper. We also show how our techniques extend to LTL objectives and thus also to minimum reachability probabilities.

Our DQL-based method, which yields PAC results, can be seen as an instance of statistical model checking [37,19], a technique that has received considerable attention. Until recently, most work in this area focused on purely probabilistic models, without nondeterminism, but several approaches have now been presented for statistical model checking of nondeterministic models [13,14,27,4,28,18,29].
However, these methods all consider either time-bounded properties or use discounting to ensure convergence (see below for a summary). The techniques in this paper are the first for statistical model checking of unbounded properties on MDPs.

We have implemented our framework within the PRISM tool [25]. This paper concludes with experimental results for an implementation of our BRTDP-based approach that demonstrate considerable speed-ups over the fastest methods in PRISM. Detailed proofs omitted due to lack of space are available in [7].

1.1 Related Work

In fields such as planning and artificial intelligence, many learning-based and heuristic-driven solution methods for MDPs have been developed. In the complete information setting, examples include RTDP [3] and BRTDP [30], as discussed above, which generate lower and lower/upper bounds on values, respectively. Most algorithms make certain assumptions in order to ensure convergence, for example through the use of a discount factor or by restricting to so-called stochastic shortest path (SSP) problems, whereas we target arbitrary MDPs without discounting. More recently, an approach called FRET was proposed for a generalisation of SSP, but it gives only a one-sided (lower) bound. We are not aware of any attempts to apply or adapt such methods in the context of probabilistic verification. A related paper applies heuristic search methods to MDPs, but for the purpose of generating probabilistic counterexamples.

As mentioned above, in the limited information setting, our algorithm based on delayed Q-learning (DQL) yields PAC results, similar to those obtained from statistical model checking. This is an active area of research with a variety of tools [21,8,6,5]. In contrast with our work, most techniques focus on time-bounded properties, e.g., using bounded LTL, rather than unbounded properties. Several approaches have been proposed to transform checking of unbounded properties into testing of bounded properties, for example [38,17,33,32]. However, these focus on purely probabilistic models, without nondeterminism, and do not apply to MDPs. Unbounded properties have also been analysed for MDPs with spurious nondeterminism, where the way it is resolved does not affect the desired property.

More generally, the development of statistical model checking techniques for probabilistic models with nondeterminism, such as MDPs, is an important topic, treated in several recent papers. One approach is to give the nondeterminism a probabilistic semantics, e.g., using a uniform distribution instead, as for timed automata in [13,14,27]. Others [28,18], like this paper, aim to quantify over all strategies and produce an ε-optimal strategy. The works [28] and [18] deal with the problem in the setting of discounted (and thus, for the purposes of approximation, bounded) properties and bounded properties, respectively. In the latter work, candidates for optimal schedulers are generated and gradually improved, but "at any given point we cannot quantify how close to optimal the candidate scheduler is" and "the algorithm does not estimate the maximum probability of the property" (cited from [29]). Further, [29] considers compact representation of schedulers, but again focuses only on (time-)bounded properties.

Since statistical model checking is simulation-based, one of the most important difficulties is the analysis of rare events.
This issue is, of course, also relevant for our approach; see the section on experimental results. Rare events have been addressed using methods such as importance sampling and importance splitting [22].

End components in MDPs can be collapsed either for algorithmic correctness [15] or for efficiency (where only lower bounds on maximum reachability probabilities are considered). Asymptotically efficient ways to detect them are given in [9,10].

2 Basics about MDPs and Learning Algorithms

We begin with basic background material on MDPs and some fundamental definitions for our learning framework. We use $\mathbb{N}$, $\mathbb{Q}$ and $\mathbb{R}$ to denote the sets of all non-negative integers, rational numbers and real numbers, respectively. $Dist(X)$ is the set of all rational probability distributions over a finite or countable set $X$, i.e., the functions $f : X \to [0,1] \cap \mathbb{Q}$ such that $\sum_{x \in X} f(x) = 1$, and $supp(f)$ denotes the support of $f$.

2.1 Markov Decision Processes

We work with Markov decision processes (MDPs), a widely used model to capture both nondeterminism (e.g., for control or concurrency) and probability.

Definition 1. An MDP is a tuple $M = (S, \bar{s}, A, E, \Delta)$, where $S$ is a finite set of states, $\bar{s} \in S$ is an initial state, $A$ is a finite set of actions, $E : S \to 2^A$ assigns non-empty sets of enabled actions to all states, and $\Delta : S \times A \to Dist(S)$ is a (partial) probabilistic transition function defined for all $s$ and $a$ where $a \in E(s)$.

Remark 1. For simplicity of presentation we assume w.l.o.g. that, for every action $a \in A$, there is at most one state $s$ such that $a \in E(s)$, i.e., $E(s) \cap E(s') = \emptyset$ for $s \neq s'$. If there are states $s, s'$ such that $a \in E(s) \cap E(s')$, we can always rename the actions as $(s,a) \in E(s)$ and $(s',a) \in E(s')$, so that the MDP satisfies our assumption.

An infinite path of an MDP $M$ is an infinite sequence $\omega = s_0 a_0 s_1 a_1 \cdots$ such that $a_i \in E(s_i)$ and $\Delta(s_i, a_i)(s_{i+1}) > 0$ for every $i \in \mathbb{N}$. A finite path is a finite prefix of an infinite path ending in a state. We use $last(\omega)$ to denote the last state of a finite path $\omega$. We denote by $IPath$ (resp. $FPath$) the set of all infinite (resp. finite) paths, and by $IPath_s$ (resp. $FPath_s$) the set of infinite (resp. finite) paths starting in a state $s$.

A state $s$ is terminal if all actions $a \in E(s)$ satisfy $\Delta(s,a)(s) = 1$. An end component (EC) of $M$ is a pair $(S', A')$ where $S' \subseteq S$ and $A' \subseteq \bigcup_{s \in S'} E(s)$ such that: (1) if $\Delta(s,a)(s') > 0$ for some $s \in S'$ and $a \in A'$, then $s' \in S'$; and (2) for all $s, s' \in S'$ there is a path $\omega = s_0 a_0 \cdots s_n$ such that $s_0 = s$, $s_n = s'$ and for all $0 \leq i < n$ we have $a_i \in A'$. A maximal end component (MEC) is an EC that is maximal with respect to the point-wise subset ordering.

Strategies. A strategy of MDP $M$ is a function $\sigma : FPath \to Dist(A)$ satisfying $supp(\sigma(\omega)) \subseteq E(last(\omega))$ for every $\omega \in FPath$. Intuitively, the strategy resolves the choices of actions in each finite path by choosing (possibly at random) an action enabled in the last state of the path. We write $\Sigma_M$ for the set of all strategies in $M$. In standard fashion [23], a strategy $\sigma$ induces, for any initial state $s$, a probability measure $\Pr^{\sigma}_{s}$ over $IPath_s$. A strategy $\sigma$ is memoryless if $\sigma(\omega)$ depends only on $last(\omega)$.
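To make these definitions concrete, and for reuse in the later sketches, the following is one possible plain-Python encoding of Definition 1. The class and field names, and the small coin-flip MDP, are our own illustration and are not prescribed by the paper.

from dataclasses import dataclass

@dataclass
class MDP:
    states: set     # S
    init: str       # the initial state (s-bar in the paper)
    enabled: dict   # E : state -> set of enabled actions
    delta: dict     # (state, action) -> {successor: probability}

    def check(self):
        # Sanity checks corresponding to Definition 1 and Remark 1.
        assert self.init in self.states
        for s in self.states:
            assert self.enabled[s], "E(s) must be non-empty"
        sources = {}
        for (s, a), dist in self.delta.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9, "Delta(s,a) must be a distribution"
            assert sources.setdefault(a, s) == s, "each action is enabled in at most one state (Remark 1)"

# Hypothetical example: a state u with a coin-flip action reaching the target "1"
# or the sink "0" with probability 1/2 each; "1" and "0" only loop on themselves.
example = MDP(
    states={"u", "1", "0"},
    init="u",
    enabled={"u": {"flip"}, "1": {"stay1"}, "0": {"stay0"}},
    delta={("u", "flip"): {"1": 0.5, "0": 0.5},
           ("1", "stay1"): {"1": 1.0},
           ("0", "stay0"): {"0": 1.0}},
)
example.check()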
Objectives and values. Given a set $F \subseteq S$ of target states, bounded reachability for step $k$, denoted by $\Diamond^{\leq k} F$, refers to the set of all infinite paths that reach a state in $F$ within $k$ steps, and unbounded reachability, denoted by $\Diamond F$, refers to the set of all infinite paths that reach a state in $F$. Note that $\Diamond F = \bigcup_{k \geq 0} \Diamond^{\leq k} F$. We consider the reachability probability $\Pr^{\sigma}_{s}(\Diamond F)$, and strategies that maximise this probability. We denote by $V(s)$ the value in $s$, defined by $\sup_{\sigma \in \Sigma_M} \Pr^{\sigma}_{s}(\Diamond F)$. Given ε > 0, we say that a strategy σ is ε-optimal in $s$ if $\Pr^{\sigma}_{s}(\Diamond F) + \varepsilon \geq V(s)$, and we call a 0-optimal strategy optimal. It is known [31] that, for every MDP and set $F$, there is a memoryless optimal strategy for $\Diamond F$. We are interested in strategies that approximate the value function, i.e., ε-optimal strategies for some ε > 0.

2.2 Learning Algorithms for MDPs

In this paper, we study a class of learning-based algorithms that stochastically approximate the value function of an MDP. Let us fix, for this section, an MDP $M = (S, \bar{s}, A, E, \Delta)$ and target states $F \subseteq S$. We denote by $V : S \times A \to [0,1]$ the value function for state-action pairs of $M$, defined for all $(s,a)$ with $s \in S$ and $a \in E(s)$ by

$V(s,a) = \sum_{s' \in S} \Delta(s,a)(s') \cdot V(s').$

Intuitively, $V(s,a)$ is the value in $s$ assuming that the first action performed is $a$.

A learning algorithm $\mathcal{A}$ simulates executions of $M$, and iteratively updates upper and lower approximations $U : S \times A \to [0,1]$ and $L : S \times A \to [0,1]$, respectively, of the value function $V : S \times A \to [0,1]$. The functions $U$ and $L$ are initialised to appropriate values so that $L(s,a) \leq V(s,a) \leq U(s,a)$ for all $s \in S$ and $a \in A$. During the computation of $\mathcal{A}$, simulated executions start in the initial state $\bar{s}$ and move from state to state according to choices made by the algorithm. The values of $U(s,a)$ and $L(s,a)$ are updated for the states $s$ visited by the simulated execution. Since $\max_{a \in E(\bar{s})} U(\bar{s},a)$ and $\max_{a \in E(\bar{s})} L(\bar{s},a)$ represent upper and lower bounds on $V(\bar{s})$, a learning algorithm $\mathcal{A}$ terminates when $\max_{a \in E(\bar{s})} U(\bar{s},a) - \max_{a \in E(\bar{s})} L(\bar{s},a) < \varepsilon$, where the precision ε > 0 is given to the algorithm as an argument. Note that, because $U$ and $L$ are possibly updated based on the simulations, the computation of the learning algorithm may be randomised and even give incorrect results with some probability.

Definition 2. Denote by $\mathcal{A}(\varepsilon)$ the instance of learning algorithm $\mathcal{A}$ with precision ε. We say that $\mathcal{A}$ converges surely (resp. almost surely) if, for every ε > 0, the computation of $\mathcal{A}(\varepsilon)$ surely (resp. almost surely) terminates, and $L(\bar{s},a) \leq V(\bar{s},a) \leq U(\bar{s},a)$ holds upon termination.

In some cases, almost-sure convergence cannot be guaranteed, so we demand that the computation terminates correctly with sufficiently high probability. In such cases, we assume the algorithm is also given a confidence δ > 0 as an argument.

Definition 3. Denote by $\mathcal{A}(\varepsilon, \delta)$ the instance of learning algorithm $\mathcal{A}$ with precision ε and confidence δ. We say that $\mathcal{A}$ is probably approximately correct (PAC) if, for every ε > 0 and every δ > 0, with probability at least 1 − δ, the computation of $\mathcal{A}(\varepsilon, \delta)$ terminates with $L(\bar{s},a) \leq V(\bar{s},a) \leq U(\bar{s},a)$.

The function $U$ defines a memoryless strategy $\sigma_U$, which in every state $s$ chooses, uniformly at random, an action from $\arg\max_{a \in E(s)} U(s,a)$.

Definition 4. A learning algorithm has limited information about $M$ if it knows only the initial state $\bar{s}$, a number $K \geq |S|$, a number $E_m \geq \max_{s \in S} |E(s)|$, a number $0 < q \leq p_{\min}$, where $p_{\min} = \min\{\Delta(s,a)(s') \mid s \in S, a \in E(s), s' \in supp(\Delta(s,a))\}$, and the function $E$ (more precisely, given a state $s$, the learning procedure can ask an oracle for $E(s)$). We assume that the algorithm may simulate an execution of $M$ starting from $\bar{s}$ and choosing enabled actions in individual steps.

Definition 5. A learning algorithm has complete information about $M$ if it knows the complete MDP $M$.

Note that the MDPs we consider are "fully observable", so even in the limited information case strategies can make decisions based on the precise state of the system.
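As a point of reference for what the learning algorithms approximate, the following is a standard value-iteration baseline for maximum reachability under complete information; it is our own sketch on top of the MDP class above, and the tolerance and iteration cap are arbitrary choices.

# Sketch: value iteration for maximum reachability probabilities (complete information).
# Uses the MDP class from the previous listing; `targets` is the set F of goal states.
def value_iteration(mdp, targets, tol=1e-8, max_iters=100_000):
    V = {s: (1.0 if s in targets else 0.0) for s in mdp.states}
    for _ in range(max_iters):
        delta_max = 0.0
        for s in mdp.states:
            if s in targets:
                continue
            # V(s) = max_a sum_{s'} Delta(s,a)(s') * V(s')
            new = max(sum(p * V[t] for t, p in mdp.delta[(s, a)].items())
                      for a in mdp.enabled[s])
            delta_max = max(delta_max, abs(new - V[s]))
            V[s] = new
        if delta_max < tol:
            break
    return V

print(value_iteration(example, targets={"1"}))   # values: u -> 0.5, 1 -> 1.0, 0 -> 0.0

Such whole-state-space iteration is the baseline that the learning algorithms below aim to avoid: they instead maintain the two bounds $U$ and $L$ and only update state-action pairs visited by simulations.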
3 MDPs without End Components

We first present algorithms for MDPs without ECs, which considerably simplifies the adaptation of BRTDP and DQL to unbounded reachability objectives. Later, in Section 4, we extend our methods to deal with arbitrary MDPs (with ECs). Let us fix an MDP $M = (S, \bar{s}, A, E, \Delta)$ and a target set $F$. Formally, we assume the following.

Assumption-EC. MDP $M$ has no ECs, except two trivial ones containing distinguished terminal states 1 and 0, respectively, with $F = \{1\}$, $V(1) = 1$ and $V(0) = 0$.

3.1 Our Framework

We start by formalising a general framework for learning algorithms, as outlined in the previous section. We then instantiate this and obtain two learning algorithms: BRTDP and DQL. Our framework is presented as Algorithm 1 and works as follows. Recall that the functions $U$ and $L$ store the current upper and lower bounds on the value function $V$. Each iteration of the outer loop is divided into two phases: Explore and Update. In the Explore phase (lines 5-10), the algorithm samples a finite path ω in $M$ from $\bar{s}$ to a state in $\{1, 0\}$ by always randomly choosing one of the enabled actions that maximise the $U$ value, and sampling the successor state using the probabilistic transition function. In the Update phase (lines 11-16), the algorithm updates $U$ and $L$ on the state-action pairs along the path in a backward manner. Here, the function pop pops and returns the last letter of the given sequence.

Algorithm 1. Learning algorithm (for MDPs with no ECs)
1:  Inputs: An EC-free MDP $M$
2:  $U(\cdot,\cdot) \leftarrow 1$, $L(\cdot,\cdot) \leftarrow 0$
3:  $L(1,\cdot) \leftarrow 1$, $U(0,\cdot) \leftarrow 0$                ▷ Initialise
4:  repeat
5:      $\omega \leftarrow \bar{s}$                                      /* Explore phase */
6:      repeat
7:          $a \leftarrow$ sampled uniformly from $\arg\max_{a \in E(last(\omega))} U(last(\omega), a)$
8:          $s \leftarrow$ sampled according to $\Delta(last(\omega), a)$    ▷ GetSucc(ω, a)
9:          $\omega \leftarrow \omega\, a\, s$
10:     until $s \in \{1, 0\}$                                           ▷ TerminatePath(ω)
11:     repeat                                                           /* Update phase */
12:         $s' \leftarrow pop(\omega)$
13:         $a \leftarrow pop(\omega)$
14:         $s \leftarrow last(\omega)$
15:         Update($(s,a), s'$)
16:     until $\omega = \bar{s}$
17: until $\max_{a \in E(\bar{s})} U(\bar{s}, a) - \max_{a \in E(\bar{s})} L(\bar{s}, a) < \varepsilon$    ▷ terminate

3.2 Instantiations: BRTDP and DQL

Our two algorithm instantiations, BRTDP and DQL, differ in the definition of Update.

Unbounded Reachability with BRTDP. We obtain BRTDP by instantiating Update with Algorithm 2, which requires complete information about the MDP. Intuitively, Update computes new values of $U(s,a)$ and $L(s,a)$ by taking the weighted average of the corresponding $U$ and $L$ values, respectively, over all successors of $s$ via action $a$. Formally, denote $U(s) = \max_{a \in E(s)} U(s,a)$ and $L(s) = \max_{a \in E(s)} L(s,a)$.

Algorithm 2. BRTDP instantiation of Algorithm 1
1: procedure Update($(s,a)$, ·)
2:     $U(s,a) := \sum_{s' \in S} \Delta(s,a)(s')\, U(s')$
3:     $L(s,a) := \sum_{s' \in S} \Delta(s,a)(s')\, L(s')$

The following theorem says that BRTDP satisfies the conditions of Definition 2 and never returns incorrect results.

Theorem 1. The algorithm BRTDP converges almost surely under Assumption-EC.
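To illustrate how Algorithms 1 and 2 fit together, here is a small executable rendering of BRTDP in Python; it is our own sketch, reusing the MDP class above, and it assumes Assumption-EC with the terminal states named "1" and "0".

import random

# Sketch of Algorithm 1 instantiated with the BRTDP update (Algorithm 2).
# Assumes Assumption-EC: terminal target state "1" and terminal sink state "0".
def brtdp(mdp, eps=1e-6, seed=0):
    rng = random.Random(seed)
    U = {(s, a): 1.0 for s in mdp.states for a in mdp.enabled[s]}
    L = {(s, a): 0.0 for s in mdp.states for a in mdp.enabled[s]}
    for a in mdp.enabled["1"]:
        L[("1", a)] = 1.0
    for a in mdp.enabled["0"]:
        U[("0", a)] = 0.0

    def up(s): return max(U[(s, a)] for a in mdp.enabled[s])   # U(s) = max_a U(s,a)
    def low(s): return max(L[(s, a)] for a in mdp.enabled[s])  # L(s) = max_a L(s,a)

    while up(mdp.init) - low(mdp.init) >= eps:
        # Explore phase: follow actions maximising U, sample successors, stop in {1, 0}.
        path, s = [], mdp.init
        while s not in ("1", "0"):
            best = up(s)
            a = rng.choice([a for a in mdp.enabled[s] if U[(s, a)] == best])
            dist = mdp.delta[(s, a)]
            path.append((s, a))
            s = rng.choices(list(dist), weights=list(dist.values()))[0]
        # Update phase: Bellman backups along the path, backwards (Algorithm 2).
        for (s, a) in reversed(path):
            U[(s, a)] = sum(p * up(t) for t, p in mdp.delta[(s, a)].items())
            L[(s, a)] = sum(p * low(t) for t, p in mdp.delta[(s, a)].items())
    return low(mdp.init), up(mdp.init)

print(brtdp(example))   # roughly (0.5, 0.5) on the coin-flip example above

On MDPs containing non-trivial ECs this sketch would fail to terminate, which is exactly the issue addressed in Section 4.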
Remark 3. Note that, in the Explore phase, an action maximising the value of $U$ is chosen and the successor is sampled according to the probabilistic transition function of $M$. However, we can consider various modifications. Actions and successors may be chosen in different ways (e.g., for GetSucc), for instance, uniformly at random, in a round-robin fashion, or by assigning various probabilities (bounded from below by some fixed $p > 0$) to all possibilities in any biased way. In order to guarantee almost-sure convergence, some conditions have to be satisfied: intuitively, we require that the state-action pairs used by ε-optimal strategies are chosen sufficiently often. If this condition is satisfied, then almost-sure convergence is preserved and the practical running times may improve significantly.

Remark 4. The BRTDP algorithm above is only applicable if the transition probabilities are known. However, if complete information is not available but $\Delta(s,a)$ can be repeatedly sampled for any $s$ and $a$, then a variant of BRTDP can be shown to be probably approximately correct.

Unbounded Reachability with DQL. Often, complete information about the MDP is unavailable, repeated sampling is not possible, and we have to deal with only limited information about $M$ (see Definition 4). For this scenario, we use DQL, which can be obtained by instantiating Update with Algorithm 3.

Algorithm 3. DQL (delay $m$, estimator precision $\bar{\varepsilon}$) instantiation of Algorithm 1
1: procedure Update($(s,a)$, $s'$)
2:     if $c(s,a) = m$ and LEARN($s,a$) then
3:         if $accum^U_m(s,a)/m < U(s,a) - 2\bar{\varepsilon}$ then
4:             $U(s,a) \leftarrow accum^U_m(s,a)/m + \bar{\varepsilon}$
5:         $accum^U_m(s,a) \leftarrow 0$
6:         if $accum^L_m(s,a)/m > L(s,a) + 2\bar{\varepsilon}$ then
7:             $L(s,a) \leftarrow accum^L_m(s,a)/m - \bar{\varepsilon}$
8:         $accum^L_m(s,a) \leftarrow 0$
9:         $c(s,a) \leftarrow 0$
10:    else
11:        $accum^U_m(s,a) \leftarrow accum^U_m(s,a) + U(s')$
12:        $accum^L_m(s,a) \leftarrow accum^L_m(s,a) + L(s')$
13:        $c(s,a) \leftarrow c(s,a) + 1$

Macro LEARN($s,a$) is true in the $k$-th call of Update($(s,a)$, ·) if, since the $(k-2m)$-th call of Update($(s,a)$, ·), line 4 was not executed in any call of Update(·, ·).

The main idea behind DQL is as follows. As the probabilistic transition function is not known, we cannot update $U(s,a)$ and $L(s,a)$ with the actual values $\sum_{s' \in S} \Delta(s,a)(s')U(s')$ and $\sum_{s' \in S} \Delta(s,a)(s')L(s')$, respectively. However, we can instead use the simulations executed in the Explore phase of Algorithm 1 to estimate these values. Namely, we use $accum^U_m(s,a)/m$ to estimate $\sum_{s' \in S} \Delta(s,a)(s')U(s')$, where $accum^U_m(s,a)$ is the sum of the $U$ values of the last $m$ immediate successors of $(s,a)$ seen during the Explore phase. Note that the delay $m$ must be chosen large enough for the estimates to be sufficiently close, i.e., $\bar{\varepsilon}$-close, to the real values.

So, in addition to $U(s,a)$ and $L(s,a)$, the algorithm uses new variables $accum^U_m(s,a)$ and $accum^L_m(s,a)$ to accumulate the $U$ and $L$ values of successors, respectively, and a counter $c(s,a)$ recording the number of invocations of $a$ in $s$ since the last update (all these variables are initialised to 0 at the beginning of the computation). Assume that $a$ has been invoked in $s$ during the Explore phase of Algorithm 1, which means that Update($(s,a)$, $s'$) is eventually called in the Update phase of Algorithm 1 with the corresponding successor $s'$ of $(s,a)$. If $c(s,a) = m$ at that time, then $a$ has been invoked in $s$ precisely $m$ times since the last update concerning $(s,a)$, and the procedure Update($(s,a)$, $s'$) updates $U(s,a)$ with $accum^U_m(s,a)/m$ plus an appropriate constant $\bar{\varepsilon}$ (unless LEARN is false). Here, the purpose of adding $\bar{\varepsilon}$ is to make $U(s,a)$ stay above the real value $V(s,a)$ with high probability. If $c(s,a) < m$, then Update($(s,a)$, $s'$) simply accumulates $U(s')$ into $accum^U_m(s,a)$ and increases the counter $c(s,a)$. The $L(s,a)$ values are estimated by $accum^L_m(s,a)/m$ in a similar way, just subtracting $\bar{\varepsilon}$.
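The following sketch renders the Update procedure of Algorithm 3 in Python; it is our own illustration, and for simplicity the LEARN flag is treated as always true, whereas the paper's macro additionally suspends attempted updates that recently failed.

from collections import defaultdict

# Sketch of the DQL Update procedure (Algorithm 3), with its bookkeeping kept in dictionaries.
# Assumes the terminal states are named "1" and "0", as in the BRTDP sketch above.
class DQLUpdater:
    def __init__(self, mdp, m, eps_bar):
        self.mdp, self.m, self.eps_bar = mdp, m, eps_bar
        self.U = defaultdict(lambda: 1.0)   # U(s, a), initialised to 1
        self.L = defaultdict(lambda: 0.0)   # L(s, a), initialised to 0
        self.accU = defaultdict(float)      # accum^U_m(s, a)
        self.accL = defaultdict(float)      # accum^L_m(s, a)
        self.c = defaultdict(int)           # c(s, a)
        for a in mdp.enabled["1"]: self.L[("1", a)] = 1.0
        for a in mdp.enabled["0"]: self.U[("0", a)] = 0.0

    def u(self, s): return max(self.U[(s, a)] for a in self.mdp.enabled[s])
    def low(self, s): return max(self.L[(s, a)] for a in self.mdp.enabled[s])

    def update(self, s, a, succ):
        key, m, eb = (s, a), self.m, self.eps_bar
        if self.c[key] == m:                # attempted update (LEARN assumed true here)
            if self.accU[key] / m < self.U[key] - 2 * eb:
                self.U[key] = self.accU[key] / m + eb
            self.accU[key] = 0.0
            if self.accL[key] / m > self.L[key] + 2 * eb:
                self.L[key] = self.accL[key] / m - eb
            self.accL[key] = 0.0
            self.c[key] = 0
        else:                               # accumulate successor U/L values
            self.accU[key] += self.u(succ)
            self.accL[key] += self.low(succ)
            self.c[key] += 1

A driver would call update() from the Update phase of Algorithm 1 exactly as BRTDP does, passing the sampled successor $s'$ of each state-action pair on the explored path.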
The procedure requires $m$ and $\bar{\varepsilon}$ as inputs, which are chosen depending on ε and δ; with a suitable choice of these parameters we establish that DQL is probably approximately correct. The parameters $m$ and $\bar{\varepsilon}$ can be conservatively approximated using only the limited information about the MDP (i.e., using $K$, $E_m$ and $q$). Even though the algorithm has limited information about $M$, we still establish the following theorem.

Theorem 2. DQL is probably approximately correct under Assumption-EC.

Bounded Reachability. Algorithm 1 can be trivially adapted to handle bounded reachability properties by preprocessing the input MDP in standard fashion. Namely, every state is equipped with a bounded counter with values ranging from 0 to $k$, where $k$ is the step bound, the current value denoting the number of steps taken so far. All target states remain targets for all counter values, and every non-target state with counter value $k$ becomes rejecting. Then, to determine the $k$-step reachability in the original MDP, we compute the (unbounded) reachability in the new MDP. Although this means that the number of states is multiplied by $k+1$, in practice the size of the explored part of the model can be small.

4 Unrestricted MDPs

We first illustrate with an example that the algorithms BRTDP and DQL as presented in Section 3 may not converge when there are ECs in the MDP.

Fig. 1. MDP $M$ with an EC (left), MDP $M^{\{m_1,m_2\}}$ constructed from $M$ in on-the-fly BRTDP (centre), and MDP $M'$ obtained from $M$ by collapsing $C = (\{m_1,m_2\}, \{a,b\})$ (right).

Example 1. Consider the MDP $M$ in Fig. 1 (left) with EC $(\{m_1,m_2\}, \{a,b\})$. The values in states $m_1, m_2$ are $V(m_1) = V(m_2) = 0.5$, but the upper bounds are $U(m_1) = U(m_2) = 1$ in every iteration. This is because $U(m_1,a) = U(m_2,b) = 1$ and both algorithms greedily choose the action with the highest upper bound. Thus, in every iteration of the algorithm, the error for the initial state $m_1$ is $U(m_1) - V(m_1) = \frac{1}{2}$ and the algorithm does not converge. In general, any state in an EC has upper bound 1 since, by definition, there are actions that guarantee the next state is in the EC, i.e., is a state with upper bound 1. This argument holds even for standard value iteration with values initialised to 1.

One way of dealing with general MDPs is to preprocess them to identify all MECs [10,9] and "collapse" them into single states. These algorithms require that the graph model is known and explore the whole state space, but this may not be possible, either due to limited information (see Definition 4) or because the model is too large. Hence, we propose a modification to the algorithms from the previous sections that allows us to deal with ECs "on-the-fly". We first describe the collapsing of a set of states and then present a crucial lemma that allows us to identify ECs to collapse.

Collapsing States. In the following, we say that an MDP $M' = (S', \bar{s}', A', E', \Delta')$ is obtained from $M = (S, \bar{s}, A, E, \Delta)$ by collapsing a tuple $(R, B)$, where $R \subseteq S$ and $B \subseteq A$ with $B \subseteq \bigcup_{s \in R} E(s)$, if (a code sketch of this construction follows the definition):
- $S' = (S \setminus R) \cup \{s_{(R,B)}\}$,
- $\bar{s}'$ is either $s_{(R,B)}$ or $\bar{s}$, depending on whether $\bar{s} \in R$ or not,
- $A' = A \setminus B$,
- $E'(s) = E(s)$ for $s \in S \setminus R$; $E'(s_{(R,B)}) = \bigcup_{s \in R} E(s) \setminus B$,
- $\Delta'$ is defined for all $s \in S'$ and $a \in E'(s)$ by:
  - $\Delta'(s,a)(s') = \Delta(s,a)(s')$ for $s, s' \neq s_{(R,B)}$,
  - $\Delta'(s,a)(s_{(R,B)}) = \sum_{s' \in R} \Delta(s,a)(s')$ for $s \neq s_{(R,B)}$,
  - $\Delta'(s_{(R,B)},a)(s') = \Delta(s,a)(s')$ for $s' \neq s_{(R,B)}$ and $s$ the unique state with $a \in E(s)$ (see Remark 1),
  - $\Delta'(s_{(R,B)},a)(s_{(R,B)}) = \sum_{s' \in R} \Delta(s,a)(s')$ where $s$ is the unique state with $a \in E(s)$.
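Under the assumption of Remark 1, the collapse operation can be sketched directly on the MDP representation used earlier; this is our own rendering, and the naming scheme for the new state $s_{(R,B)}$ is arbitrary.

# Sketch of Collapse(R, B) on the MDP class above (relies on Remark 1:
# every action is enabled in a unique state).
def collapse(mdp, R, B, new_name=None):
    new = new_name or ("c_" + "_".join(sorted(R)))           # the fresh state s_(R,B)
    states = (mdp.states - set(R)) | {new}
    init = new if mdp.init in R else mdp.init
    enabled = {s: set(mdp.enabled[s]) for s in mdp.states if s not in R}
    enabled[new] = set().union(*(mdp.enabled[s] for s in R)) - set(B)
    delta = {}
    for (s, a), dist in mdp.delta.items():
        if a in B:
            continue                                          # actions of B are removed
        src = new if s in R else s                            # redirect sources inside R to s_(R,B)
        merged = {}
        for t, p in dist.items():
            tgt = new if t in R else t                        # successors inside R are merged
            merged[tgt] = merged.get(tgt, 0.0) + p
        delta[(src, a)] = merged
    return MDP(states=states, init=init, enabled=enabled, delta=delta)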
We denote this transformation, which creates $M'$ from $M$, as the collapse function, i.e., Collapse($R, B$). As a special case, given a state $s$ and a terminal state $s' \in \{0,1\}$, we use MakeTerminal($s, s'$) as shorthand for Collapse($\{s, s'\}, E(s)$), where the new state is renamed to $s'$. Intuitively, after MakeTerminal($s, s'$), every transition previously leading to state $s$ will now lead to the terminal state $s'$. For practical purposes, it is important to note that the collapsing does not need to be implemented explicitly, but can be done by keeping a separate data structure which stores information about the collapsed states.

Identifying ECs from Simulations. Our modifications will identify ECs "on-the-fly" through simulations that get stuck in them. The next lemma establishes the identification principle. To this end, for a path ω, let us denote by Appear(ω, $i$) the pair $(S_i, A_i)$ of $M$ such that $s \in S_i$ and $a \in A_i$ if and only if $(s,a)$ occurs in ω more than $i$ times.

Lemma 1. Let $c = \exp(-(p_{\min}/E_m)^{k}/k)$, where $k = K E_m + 1$, and let $i > k$. Assume that the Explore phase in Algorithm 1 terminates with probability less than 1. Then, provided the Explore phase does not terminate within $3i^3$ iterations, the conditional probability that Appear(ω, $i$) is an EC is at least $1 - 2 c^{i} i^3 (p_{\min}/E_m)^{-k}$.

The above lemma allows us to modify the Explore phase of Algorithm 1 in such a way that simulations will be used to identify ECs. The ECs discovered will subsequently be collapsed. We first present the overall skeleton (Algorithm 4) for treating ECs "on-the-fly", which consists of two parts: (i) identification of ECs; and (ii) processing them. The instantiations for BRTDP and DQL will differ in the identification phase. Hence, before proceeding to the individual identification algorithms, we first establish the correctness of the processing phase.

Algorithm 4. Extension for general MDPs
1: function On-the-fly-EC
2:     $\mathcal{M} \leftarrow$ IdentifyECs                          ▷ Identification of ECs
3:     for all $(R, B) \in \mathcal{M}$ do                           ▷ Process ECs
4:         Collapse($R, B$)
5:         for all $s \in R$ and $a \in E(s) \setminus B$ do
6:             $U(s_{(R,B)}, a) \leftarrow U(s, a)$
7:             $L(s_{(R,B)}, a) \leftarrow L(s, a)$
8:         if $R \cap F \neq \emptyset$ then
9:             MakeTerminal($s_{(R,B)}$, 1)
10:        else if no actions are enabled in $s_{(R,B)}$ then
11:            MakeTerminal($s_{(R,B)}$, 0)

Lemma 2. Assume $(R, B)$ is an EC in MDP $M$, let $V_M$ denote the value before the Process-ECs part of Algorithm 4 and $V_{M'}$ the value after it. Then:
- for $i \in \{0,1\}$, if MakeTerminal($s_{(R,B)}$, $i$) is called, then $\forall s \in R\colon V_M(s) = i$,
- $\forall s \in S \setminus R\colon V_M(s) = V_{M'}(s)$,
- $\forall s \in R\colon V_M(s) = V_{M'}(s_{(R,B)})$.

Interpretation of Collapsing. Intuitively, once an EC $(R, B)$ is collapsed, the algorithm in the Explore phase can choose a state $s \in R$ and an action $a \in E(s) \setminus B$ to leave the EC. This is simulated in the Explore phase by considering all actions of the EC uniformly at random until $s$ is reached, and then action $a$ is chosen. Since $(R, B)$ is an EC, playing all actions of $B$ uniformly at random ensures $s$ is almost surely reached. Note that the steps made inside a collapsed EC do not count towards the length of the explored path.
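The processing part of Algorithm 4 can be sketched as follows; this is our own rendering, reusing collapse() above. How the list of ECs is obtained is exactly the identification step that differs between BRTDP and DQL, described next.

# Sketch of the "process ECs" part of Algorithm 4, reusing collapse() above.
# U and L are dicts over (state, action) pairs; `targets` is F; `ecs` is the list of
# identified ECs (R, B), produced by the instantiation-specific identification step.
def process_ecs(mdp, U, L, targets, ecs):
    for R, B in ecs:
        if set(R) <= {"0", "1"}:            # trivial ECs at the terminal states need no processing
            continue
        new = "c_" + "_".join(sorted(R))    # name for s_(R,B), matching collapse()'s default
        mdp = collapse(mdp, R, B, new_name=new)
        for s in R:                         # carry U/L over to the new state (lines 5-7)
            for a in mdp.enabled[new]:
                if (s, a) in U:
                    U[(new, a)], L[(new, a)] = U[(s, a)], L[(s, a)]
        if set(R) & set(targets):           # EC containing a target state: MakeTerminal(., 1)
            mdp = collapse(mdp, {new, "1"}, mdp.enabled[new], new_name="1")
        elif not mdp.enabled[new]:          # EC with no outgoing actions: MakeTerminal(., 0)
            mdp = collapse(mdp, {new, "0"}, mdp.enabled[new], new_name="0")
    return mdp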
Now, we present the on-the-fly versions of BRTDP and DQL. For each case, we describe: (i) the modification of Algorithm 1; (ii) the identification of ECs; and (iii) correctness.

4.1 Complete Information (BRTDP)

Modification of Algorithm 1. To obtain BRTDP working with unrestricted MDPs, we modify Algorithm 1 as follows: for iteration $i$ of the Explore phase, we insert a check after line 9 such that, if the length of the path ω explored (i.e., the number of states) is $k_i$ (see below), then we invoke the On-the-fly-EC function for BRTDP. The On-the-fly-EC function possibly modifies the MDP by processing (collapsing) some ECs, as described in Algorithm 4. After the On-the-fly-EC function terminates, we interrupt the current Explore phase and start the Explore phase of the $(i{+}1)$-th iteration (i.e., generating a new path again, starting from $\bar{s}$ in the modified MDP). To complete the description, we describe the choice of $k_i$ and the identification of ECs.

Choice of $k_i$. Because computing ECs can be expensive, we do not call On-the-fly-EC every time a new state is explored, but only after every $k_i$ steps of the repeat-until loop at lines 6-10 in iteration $i$. The specific value of $k_i$ can be decided experimentally and can change as the computation progresses. A theoretical bound for $k_i$ ensuring that there is an EC with high probability can be obtained from Lemma 1.

Identification of ECs. Given the current explored path ω, let $(T, G)$ be Appear(ω, 0), that is, the set of states and actions explored in ω. To obtain the ECs from the set $T$ of explored states, we use Algorithm 5. This computes an auxiliary MDP $M^T = (T', \bar{s}, A', E', \Delta')$ defined as follows:
- $T' = T \cup \{t \mid \exists s \in T, a \in E(s)$ such that $\Delta(s,a)(t) > 0\}$,
- $A' = \bigcup_{s \in T} E(s) \cup \{\bot\}$,
- $E'(s) = E(s)$ if $s \in T$, and $E'(s) = \{\bot\}$ otherwise,
- $\Delta'(s,a) = \Delta(s,a)$ if $s \in T$, and $\Delta'(s,\bot)(s) = 1$ otherwise.
It then computes all MECs of $M^T$ that are contained in $T$ and identifies them as ECs. The following lemma states that each of these is indeed an EC in the original MDP.

Algorithm 5. Identification of ECs for BRTDP
1: function IdentifyECs($M, T$)
2:     compute $M^T$
3:     $\mathcal{M}' \leftarrow$ MECs of $M^T$
4:     $\mathcal{M} \leftarrow \{(R, B) \in \mathcal{M}' \mid R \subseteq T\}$

Lemma 3. Let $M$, $M^T$ be the MDPs from the construction above and $T$ be the set of explored states. Then every MEC $(R, B)$ in $M^T$ such that $R \subseteq T$ is an EC in $M$.

Finally, we establish that the modified algorithm, which we refer to as on-the-fly BRTDP, converges almost surely; the proof is an extension of that of Theorem 1.

Theorem 3. On-the-fly BRTDP converges almost surely for all MDPs.

Example 2. Let us describe the execution of on-the-fly BRTDP on the MDP $M$ from Fig. 1 (left). Choose $k_i > 6$ for all $i$. The loop at lines 5-10 of Algorithm 1 generates a path ω that contains some (possibly zero) number of loops $m_1\, a\, m_2\, b$ followed by $m_1\, a\, m_2\, c\, m_3\, d\, t$, where $t \in \{0,1\}$. In the subsequent Update phase, we set $U(m_3,d) = L(m_3,d) = 0.5$ and then $U(m_2,c) = L(m_2,c) = 0.5$; none of the other values change. In the second iteration, the loop at lines 5-10 generates the path $\omega' = m_1\, a\, m_2\, b\, m_1\, a\, m_2\, b \cdots$, and the newly inserted check for On-the-fly-EC is triggered once ω reaches length $k_i$. The algorithm now aims to identify ECs in the MDP based on the part of the MDP explored so far. To do so, the MDP $M^T$ for the set $T = \{m_1, m_2\}$ is constructed; we depict it in Fig. 1 (centre). We then run MEC detection on $M^T$, finding that $(\{m_1, m_2\}, \{a, b\})$ is an EC, and so it gets collapsed according to the Collapse procedure. This gives the MDP $M'$ from Fig. 1 (right).

The execution then continues with $M'$. A new path is generated at lines 5-10 of Algorithm 1; suppose it is $\omega'' = s_C\, c\, m_3\, d\, 0$. In the Update phase we then update the value $U(s_C,c) = L(s_C,c) = 0.5$, which makes the condition at the last line of Algorithm 1 satisfied, and the algorithm finishes, having computed the correct value.
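Returning to the identification step, a possible rendering of Algorithm 5 is sketched below. The code is our own: MECs of $M^T$ are computed by the standard refinement over strongly connected components (here using the networkx library for SCCs), and the fresh self-loop actions of unexplored frontier states are represented by the hypothetical tuples ("bot", t).

import networkx as nx

# Standard MEC decomposition by SCC refinement: repeatedly drop actions that may
# leave their SCC (or the candidate state set) and states left with no actions.
def maximal_end_components(states, enabled, delta):
    states = set(states)
    act = {s: set(enabled[s]) for s in states}
    changed = True
    while changed:
        changed = False
        g = nx.DiGraph()
        g.add_nodes_from(states)
        for s in states:
            for a in act[s]:
                for t in delta[(s, a)]:
                    if t in states:
                        g.add_edge(s, t)
        comp = {s: i for i, c in enumerate(nx.strongly_connected_components(g)) for s in c}
        for s in list(states):
            for a in list(act[s]):
                if any(t not in states or comp[t] != comp[s] for t in delta[(s, a)]):
                    act[s].discard(a)
                    changed = True
            if not act[s]:
                states.discard(s)
                changed = True
    # At the fixed point, each remaining SCC with its surviving actions is a MEC.
    mecs = {}
    for s in states:
        R, B = mecs.setdefault(comp[s], (set(), set()))
        R.add(s)
        B.update(act[s])
    return list(mecs.values())

# Sketch of Algorithm 5: build M^T from the explored states T and keep MECs contained in T.
def identify_ecs(mdp, T):
    frontier = {t for s in T for a in mdp.enabled[s] for t in mdp.delta[(s, a)]} - set(T)
    states = set(T) | frontier
    enabled = {s: set(mdp.enabled[s]) for s in T}
    delta = {(s, a): dict(mdp.delta[(s, a)]) for s in T for a in mdp.enabled[s]}
    for t in frontier:
        enabled[t] = {("bot", t)}           # the fresh self-loop action, standing in for ⊥
        delta[(t, ("bot", t))] = {t: 1.0}
    mecs = maximal_end_components(states, enabled, delta)
    return [(R, B) for (R, B) in mecs if R <= set(T)]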
4.2 Limited Information (DQL)

Modification of Algorithm 1 and Identification of ECs. The modification of Algorithm 1 is done exactly as for the modification of BRTDP (i.e., we insert a check after line 9 of Explore, which invokes the On-the-fly-EC function if the length of path ω exceeds $k_i$). In iteration $i$, we set $k_i = 3\ell_i^3$ for some $\ell_i$ (to be described below). The identification of the EC is as follows: we consider Appear(ω, $\ell_i$), the set of states and actions that have appeared more than $\ell_i$ times in the explored path ω, which is of length $3\ell_i^3$, and identify this set as an EC; i.e., the set $\mathcal{M}$ in line 2 of Algorithm 4 is defined as the set containing the single tuple Appear(ω, $\ell_i$). We refer to the algorithm as on-the-fly DQL.

Choice of $\ell_i$ and Correctness. The choice of $\ell_i$ is as follows. Note that, in iteration $i$, the error probability obtained from Lemma 1 is at most $2 c^{\ell_i} \ell_i^3 (p_{\min}/E_m)^{-k}$, and we choose $\ell_i$ such that $2 c^{\ell_i} \ell_i^3 (p_{\min}/E_m)^{-k} \leq \delta/2^{i+2}$, where δ is the confidence. Note that, since $c < 1$, the term $c^{\ell_i}$ decreases exponentially, and hence for every $i$ such an $\ell_i$ exists. It follows that the total error of the algorithm due to the on-the-fly EC collapsing is at most δ/2. It follows from the proof of Theorem 2 that, for on-the-fly DQL, the error is at most δ if we use the same $\bar{\varepsilon}$ as for DQL, but now with DQL confidence δ/4, i.e., with a correspondingly larger delay $m$. As before, these numbers can be conservatively approximated using the limited information.

Theorem 4. On-the-fly DQL is probably approximately correct for all MDPs.

Example 3. Let us now briefly explain the execution of on-the-fly DQL on the MDP $M$ from Fig. 1 (left). At first, paths of the same form as ω in Example 2 will be generated and there will be no change to $U$ and $L$, because in any call to Update (see Algorithm 3) for states $s \in \{m_1, m_2\}$ with $c(s,a) = m$, the values accumulated in $accum^U_m(s,a)/m$ and $accum^L_m(s,a)/m$ are the same as the values already held, namely 1 and 0, respectively. At some point, we call Update for the tuple $(m_3, d)$ with $c(m_3,d) = m$, which results in a change of $U(m_3,d)$ and $L(m_3,d)$. Note that, at this point, the numbers $accum^U_m(m_3,d)/m$ and $accum^L_m(m_3,d)/m$ are both equal to the proportion of the generated paths that visited the state 1. This number will, with high probability, be very close to 0.5, say 0.499. We thus set $U(m_3,d) = 0.499 + \bar{\varepsilon}$ and $L(m_3,d) = 0.499 - \bar{\varepsilon}$. We then keep generating paths of the same form and at some point also update $U(m_2,c)$ and $L(m_2,c)$ to precisely $0.499 + \bar{\varepsilon}$ and $0.499 - \bar{\varepsilon}$, respectively. The subsequently generated path will be looping on $m_1$ and $m_2$, and once it reaches length $k_i$, we identify $(\{m_1,m_2\}, \{a,b\})$ as an EC due to the definition of Appear(ω, $\ell_i$). We then obtain the MDP from Fig. 1 (right), which we use to generate new paths, until the upper and lower bounds on the value in the new initial state are within the required bound.

4.3 Extension to LTL

So far we have focused on reachability, but our techniques also extend to linear temporal logic (LTL) objectives. By translating an LTL formula to an equivalent deterministic ω-automaton, verifying MDPs with LTL objectives reduces to the analysis of MDPs with ω-regular conditions such as Rabin acceptance conditions. A Rabin acceptance condition consists of a set $\{(M_1,N_1), \ldots, (M_d,N_d)\}$ of $d$ pairs $(M_i, N_i)$, where each $M_i \subseteq S$ and $N_i \subseteq S$. The acceptance condition requires that, for some $1 \leq i \leq d$, states in $M_i$ are visited infinitely often and states in $N_i$ are visited finitely often.
Value computation for MDPs with Rabin objectives reduces to optimal reachability of winning ECs, where an EC $(R, B)$ is winning if $R \cap M_i \neq \emptyset$ and $R \cap N_i = \emptyset$ for some $1 \leq i \leq d$.