KŘETÍNSKÝ, Jan, Guillermo PÉREZ and Jean-François RASKIN. Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints. Online. In 29th International Conference on Concurrency Theory (CONCUR 2018). Dagstuhl: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018, p. 1-18. ISBN 978-3-95977-087-3. Available from: https://dx.doi.org/10.4230/LIPIcs.CONCUR.2018.8.
Basic information
Original name Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints
Authors KŘETÍNSKÝ, Jan (203 Czech Republic, guarantor, belonging to the institution), Guillermo PÉREZ (188 Costa Rica) and Jean-François RASKIN (56 Belgium).
Edition Dagstuhl, 29th International Conference on Concurrency Theory (CONCUR 2018), p. 1-18, 18 pp. 2018.
Publisher Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Germany
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/18:00108291
Organization unit Faculty of Informatics
ISBN 978-3-95977-087-3
ISSN 1868-8969
DOI http://dx.doi.org/10.4230/LIPIcs.CONCUR.2018.8
Keywords in English Learning; Mean-Payoff; Markov decision process; Omega-Regular Specification
Tags core_A, firank_A
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 27/4/2020 23:49.
Abstract
We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition function and unknown reward function. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can be achieved depending on how much memory one is willing to use. (i) For all epsilon and gamma we can construct an online-learning finite-memory strategy that almost-surely satisfies the parity objective and which achieves an epsilon-optimal mean payoff with probability at least 1 - gamma. (ii) Alternatively, for all epsilon and gamma there exists an online-learning infinite-memory strategy that satisfies the parity objective surely and which achieves an epsilon-optimal mean payoff with probability at least 1 - gamma. We extend the above results to MDPs consisting of more than one end component in a natural way. Finally, we show that the aforementioned guarantees are tight, i.e. there are MDPs for which stronger combinations of the guarantees cannot be ensured.
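The abstract assumes the learner knows the support of the transition function and a lower bound on the minimal transition probability, and learns the actual probabilities and rewards online. The sketch below illustrates only that model-learning ingredient on a hypothetical two-state, single-end-component MDP (the states, actions, probabilities, and rewards are invented for illustration and are not from the paper; this is not the authors' algorithm, just the empirical-estimation step such a strategy would rely on):

```python
import random
from collections import defaultdict

# Hypothetical 2-state single-end-component MDP (illustrative only).
# The learner knows SUPPORT and P_MIN in advance; TRUE_P and TRUE_R are
# hidden and observed only through sampling.
TRUE_P = {                     # (state, action) -> {successor: probability}
    (0, "a"): {0: 0.3, 1: 0.7},
    (0, "b"): {0: 0.9, 1: 0.1},
    (1, "a"): {0: 0.5, 1: 0.5},
}
TRUE_R = {(0, "a"): 1.0, (0, "b"): 4.0, (1, "a"): 0.0}
SUPPORT = {sa: set(d) for sa, d in TRUE_P.items()}   # known in advance
P_MIN = 0.1                                          # known lower bound

ACTIONS = {0: ["a", "b"], 1: ["a"]}

def step(state, action, rng):
    """Sample a successor and observe the reward (the unknown environment)."""
    succs, probs = zip(*TRUE_P[(state, action)].items())
    nxt = rng.choices(succs, probs)[0]
    return nxt, TRUE_R[(state, action)]

def learn_model(steps, rng):
    """Estimate transition probabilities and rewards from one long run."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = {}
    state = 0
    for _ in range(steps):
        action = rng.choice(ACTIONS[state])   # uniform exploration
        nxt, r = step(state, action, rng)
        counts[(state, action)][nxt] += 1
        rewards[(state, action)] = r
        state = nxt
    est = {sa: {s2: c / sum(cs.values()) for s2, c in cs.items()}
           for sa, cs in counts.items()}
    return est, rewards

rng = random.Random(0)
est, rewards = learn_model(10_000, rng)
for sa, dist in est.items():
    assert set(dist) <= SUPPORT[sa]   # estimates respect the known support
    print(sa, {s: round(p, 2) for s, p in dist.items()})
```

In the paper's setting, such estimates (with confidence bounds driven by P_MIN) would feed a planning step that trades off the almost-sure/sure parity guarantee against the epsilon-optimal mean payoff, using finite or infinite memory as in guarantees (i) and (ii).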
Links
GA18-11193S, research and development project. Name: Algorithms for discrete systems and games with infinitely many states (Algoritmy pro diskrétní systémy a hry s nekonečně mnoha stavy)
Investor: Czech Science Foundation