Policy Learning for Time-Bounded Reachability in
Continuous-Time Markov Decision Processes via Doubly-Stochastic
Gradient Ascent

D 2016

BRÁZDIL, Tomáš, Ezio BARTOCCI, Dimitrios MILIOS, Guido SANGUINETTI and Luca BORTOLUSSI

Basic information

Original name

Policy Learning for Time-Bounded Reachability in Continuous-Time Markov Decision Processes via Doubly-Stochastic Gradient Ascent

Authors

BRÁZDIL, Tomáš (203 Czech Republic, guarantor, belonging to the institution), Ezio BARTOCCI (380 Italy), Dimitrios MILIOS (300 Greece), Guido SANGUINETTI (380 Italy) and Luca BORTOLUSSI (380 Italy)

Edition

Quebec City, Proceedings of QEST 2016, p. 244-259, 16 pp. 2016

Publisher

Springer

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Canada

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

Impact factor

Impact factor: 0.402 in 2005

RIV identification code

RIV/00216224:14330/16:00088513

Organization unit

Faculty of Informatics

ISBN

978-3-319-43424-7

ISSN

DOI

http://dx.doi.org/10.1007/978-3-319-43425-4_17

UT WoS

000389063800017

Keywords in English

continuous-time Markov decision processes; reachability; gradient descent

Abstract

In the original

Continuous-time Markov decision processes are an important class of models in a wide range of applications, ranging from cyber-physical systems to synthetic biology. A central problem is how to devise a policy to control the system in order to maximise the probability of satisfying a set of temporal logic specifications. Here we present a novel approach based on statistical model checking and an unbiased estimation of a functional gradient in the space of possible policies. The statistical approach has several advantages over conventional approaches based on uniformisation, as it can also be applied when the model is replaced by a black box, and does not suffer from state-space explosion. The use of a stochastic gradient to guide our search considerably improves the efficiency of learning policies. We demonstrate the method on a proof-of-principle non-linear population model, showing strong performance in a non-trivial task.
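As an illustration of the statistical ingredient mentioned in the abstract, the following is a minimal sketch of estimating a time-bounded reachability probability by simulation. It is not the authors' algorithm: the three-state model, the rate table `RATES`, the action names, and the restriction to a fixed, time-independent policy are all assumptions made for this example, and the gradient-based policy search itself is not shown.

```python
import random

# Hypothetical toy model (not from the paper): states 0 and 1 are
# controllable, state 2 is the goal. RATES[state][action] maps each
# successor state to its transition rate.
RATES = {
    0: {"slow": {1: 0.5}, "fast": {1: 2.0}},
    1: {"slow": {2: 0.5, 0: 0.5}, "fast": {2: 2.0, 0: 1.0}},
}
GOAL = 2


def simulate(policy, horizon, rng):
    """Run one trajectory from state 0; True if GOAL is hit by `horizon`."""
    state, t = 0, 0.0
    while state != GOAL:
        rates = RATES[state][policy[state]]
        total = sum(rates.values())
        t += rng.expovariate(total)  # exponentially distributed sojourn time
        if t > horizon:
            return False
        # Choose the successor with probability proportional to its rate.
        successors = list(rates)
        weights = [rates[s] for s in successors]
        state = rng.choices(successors, weights=weights)[0]
    return True


def estimate_reachability(policy, horizon, runs=10000, seed=0):
    """Monte Carlo estimate of P(reach GOAL within `horizon`) under `policy`."""
    rng = random.Random(seed)
    hits = sum(simulate(policy, horizon, rng) for _ in range(runs))
    return hits / runs


if __name__ == "__main__":
    for name in ("slow", "fast"):
        policy = {0: name, 1: name}
        print(name, estimate_reachability(policy, horizon=2.0))
```

The estimator only needs the ability to draw trajectories, so the `simulate` function could in principle be replaced by calls to a black-box simulator, which is the setting the abstract refers to.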

Links

GA15-17564S, research and development project

Name: Game Theory as a Tool for Formal Analysis and Verification of Computer Systems (Teorie her jako prostředek pro formální analýzu a verifikaci počítačových systémů)

Investor: Czech Science Foundation
