2017
Value Iteration for Long-Run Average Reward in Markov Decision Processes
ASHOK, Pranav; Krishnendu CHATTERJEE; Przemyslaw DACA; Jan KŘETÍNSKÝ; Tobias MEGGENDORFER et al.
Basic information
Original title
Value Iteration for Long-Run Average Reward in Markov Decision Processes
Authors
ASHOK, Pranav; Krishnendu CHATTERJEE; Przemyslaw DACA; Jan KŘETÍNSKÝ and Tobias MEGGENDORFER
Edition
Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I, pp. 201-221, 21 pp., 2017
Publisher
Springer
Other information
Result type
Proceedings paper
Confidentiality
Not subject to state or trade secrecy
Impact factor
Impact factor: 0.402 in 2005
Marked for transfer to RIV
No
Organizational unit
Faculty of Informatics
ISBN
978-3-319-63386-2
ISSN
Changed: 17. 3. 2025 15:17, RNDr. Pavel Šmerk, Ph.D.
Abstract
In the original
Markov decision processes (MDPs) are standard models for probabilistic systems with non-deterministic behaviours. Long-run average rewards provide a mathematically elegant formalism for expressing long term performance. Value iteration (VI) is one of the simplest and most efficient algorithmic approaches to MDPs with other properties, such as reachability objectives. Unfortunately, a naive extension of VI does not work for MDPs with long-run average rewards, as there is no known stopping criterion. In this work our contributions are threefold. (1) We refute a conjecture related to stopping criteria for MDPs with long-run average rewards. (2) We present two practical algorithms for MDPs with long-run average rewards based on VI. First, we show that a combination of applying VI locally for each maximal end-component (MEC) and VI for reachability objectives can provide approximation guarantees. Second, extending the above approach with a simulation-guided on-demand variant of VI, we present an anytime algorithm that is able to deal with very large models. (3) Finally, we present experimental results showing that our methods significantly outperform the standard approaches on several benchmarks.
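To make the local step mentioned in the abstract more concrete, the following is a minimal sketch of value iteration for long-run average reward on a single communicating MDP, which is the situation inside one MEC. It is not the authors' implementation. It assumes the textbook stopping test based on the spread of successive value differences, combined with a standard aperiodicity transformation (staying in the current state with probability `lazy`). The dictionary encoding, the function name `gain_bounds`, and the parameters `eps`, `lazy`, and `max_iter` are hypothetical choices made for illustration. The reachability part that combines the per-MEC values, and the simulation-guided on-demand variant, are not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: mdp[state][action] = (reward, [(successor, probability), ...])
MDP = Dict[int, Dict[str, Tuple[float, List[Tuple[int, float]]]]]


def gain_bounds(mdp: MDP, eps: float = 1e-6, lazy: float = 0.5,
                max_iter: int = 100_000) -> Tuple[float, float]:
    """Return (lower, upper) bounds on the optimal long-run average reward.

    Assumes the MDP is communicating, so the optimal average reward is the
    same in every state. The bounds are the last ones computed, even if
    max_iter is reached before their gap drops below eps.
    """
    values = {s: 0.0 for s in mdp}
    lower, upper = float("-inf"), float("inf")
    for _ in range(max_iter):
        new_values = {}
        for s, actions in mdp.items():
            # Standard Bellman backup for the total reward after one more step.
            best = max(
                reward + sum(p * values[t] for t, p in successors)
                for reward, successors in actions.values()
            )
            # Aperiodicity transformation: stay put with probability `lazy`.
            # This scales the average reward by (1 - lazy), undone below.
            new_values[s] = lazy * values[s] + (1.0 - lazy) * best
        deltas = [new_values[s] - values[s] for s in mdp]
        lower = min(deltas) / (1.0 - lazy)
        upper = max(deltas) / (1.0 - lazy)
        values = new_values
        if upper - lower < eps:
            break
    return lower, upper


# Two-state example: looping in state 1 earns reward 2 per step, which is
# the best achievable average reward from either state.
example: MDP = {
    0: {"stay": (1.0, [(0, 1.0)]), "go": (0.0, [(1, 1.0)])},
    1: {"stay": (2.0, [(1, 1.0)]), "back": (0.0, [(0, 1.0)])},
}
lower, upper = gain_bounds(example)
print(lower, upper)
```

Because the function returns an interval rather than a point estimate, a caller can stop early and still know how far the current estimate may be from the optimum, which is the kind of guarantee the abstract refers to.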