Long-Term Values in Markov Decision Processes and Repeated Games, and a New Distance for Probability Spaces

Artigo Acesso aberto Revisado por pares

Long-Term Values in Markov Decision Processes and Repeated Games, and a New Distance for Probability Spaces

2016; Institute for Operations Research and the Management Sciences; Volume: 42; Issue: 2 Linguagem: Inglês

10.1287/moor.2016.0814

ISSN

1526-5471

Autores

Jérôme Renault, Xavier Venel,

Tópico(s)

Risk and Portfolio Optimization

Resumo

We study long-term Markov decision processes (MDPs) and gambling houses, with applications to any partial observation MDPs with finitely many states and zero-sum repeated games with an informed controller. We consider a decision maker who is maximizing the weighted sum ∑ t ≥ 1 𝜃 t r t , where r t is the expected reward of the t-th stage. We prove the existence of a very strong notion of long-term value called general uniform value, representing the fact that the decision maker can play well independently of the evaluations (𝜃 t ) t ≥ 1 over stages, provided the total variation (or impatience) [Formula: see text] is small enough. This result generalizes previous results of the literature that focus on arithmetic means and discounted evaluations. Moreover, we give a variational characterization of the general uniform value via the introduction of appropriate invariant measures for the decision problems, generalizing the fundamental theorem of gambling or the Aumann–Maschler cav(u) formula for repeated games with incomplete information. Apart the introduction of appropriate invariant measures, the main innovation in our proofs is the introduction of a new metric, d * , such that partial observation MDPs and repeated games with an informed controller may be associated with auxiliary problems that are nonexpansive with respect to d * . Given two Borel probabilities over a compact subset X of a normed vector space, we define [Formula: see text], where D 1 is the set of functions satisfying ∀ x, y ∈ X, ∀ a, b ≥ 0, af(x) − bf(y) ≤ ‖ax − by‖. The particular case where X is a simplex endowed with the L 1 -norm is particularly interesting: d * is the largest distance over the probabilities with finite support over X, which makes every disintegration nonexpansive. Moreover, we obtain a Kantorovich–Rubinstein-type duality formula for d * (u, v), involving couples of measures (α, β) over X × X such that the first marginal of α is u and the second marginal of β is v.

Ver no editor

Altmetric

PlumX

Entrar

Lembrar minha senha

Receber meu e-mail de confirmação

Long-Term Values in Markov Decision Processes and Repeated Games, and a New Distance for Probability Spaces