D 2020

Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

BRÁZDIL, Tomáš, Krishnendu CHATTERJEE, Petr NOVOTNÝ and Jiří VAHALA

Basic information

Original name

Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

Authors

BRÁZDIL, Tomáš (203 Czech Republic, belonging to the institution), Krishnendu CHATTERJEE (356 India), Petr NOVOTNÝ (203 Czech Republic, guarantor, belonging to the institution) and Jiří VAHALA (203 Czech Republic, belonging to the institution)

Edition

Palo Alto, California, USA. The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), pp. 9794-9801, 8 pp., 2020

Publisher

AAAI Press

Other information

Language

English

Type of outcome

Article in conference proceedings

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

United States of America

Confidentiality degree

is not subject to state or trade secrecy

Publication form

electronic version available online


RIV identification code

RIV/00216224:14330/20:00114279

Organization unit

Faculty of Informatics

ISBN

978-1-57735-823-7

UT WoS

000668126802030

Keywords in English

reinforcement learning; Markov decision processes; Monte Carlo tree search; risk aversion

Tags

International impact, Reviewed
Changed: 15/5/2024 01:27, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Markov decision processes (MDPs) are the de facto framework for sequential decision making in the presence of stochastic uncertainty. A classical optimization criterion for MDPs is to maximize the expected discounted-sum payoff, which ignores low-probability catastrophic events with a highly negative impact on the system. On the other hand, risk-averse policies require the probability of undesirable events to be below a given threshold, but they do not account for optimization of the expected payoff. We consider MDPs with discounted-sum payoff and failure states which represent catastrophic outcomes. The objective of risk-constrained planning is to maximize the expected discounted-sum payoff among risk-averse policies that ensure the probability of encountering a failure state is below a desired threshold. Our main contribution is an efficient risk-constrained planning algorithm that combines UCT-like search with a predictor learned through interaction with the MDP (in the style of AlphaZero) and with risk-constrained action selection via linear programming. We demonstrate the effectiveness of our approach with experiments on classical MDPs from the literature, including benchmarks with on the order of 10^6 states.

Links

GA18-11193S, research and development project
Name: Algorithms for Discrete Systems and Games with Infinitely Many States
Investor: Czech Science Foundation
GA19-15134Y, internal MU code
Name: Verification and Analysis of Probabilistic Programs
Investor: Czech Science Foundation
GJ19-15134Y, research and development project
Name: Verification and Analysis of Probabilistic Programs
MUNI/A/1050/2019, internal MU code
Name: Large-Scale Computing Systems: Models, Applications and Verification IX (Acronym: SV-FI MAV IX)
Investor: Masaryk University, Category A
MUNI/G/0739/2017, internal MU code
Name: Pushing the limits in automated NMR structure determination using a single 4D NOESY spectrum and machine learning methods
Investor: Masaryk University, INTERDISCIPLINARY - Interdisciplinary research projects