
This is a recap of the regret analysis in SCAL.

1. Motivation

Given a finite Markov decision process (MDP) \(M\) with finite state space \(\mathcal{S}\) and finite action space \(\mathcal{A}\), the ability of the agent to recover from a bad state to a good one is quantified by the diameter \(D\): the maximum, over pairs of states, of the expected number of steps needed to reach one state from the other under the best policy. The diameter serves as a proxy for the regret incurred while recovering, since the rewards lie in \([0,1]\) and thus each step contributes at most \(1\). This proxy can be very loose, however, because in expectation each step typically contributes much less than \(1\) to the total regret. Worse, in weakly-communicating MDPs the diameter can be infinite, so the regret bound of UCRL2 becomes vacuous when \(D = \infty\). To mitigate this problem, SCAL proposes the first computationally tractable algorithm that handles weakly-communicating MDPs and improves the regret upper bound to \(\tilde{O}(\min \{ D, c\} S \sqrt{AT})\) (more precisely, \(\tilde{O}(\min \{ D, c\} \sqrt{\Gamma S AT})\)).

Notice that SCAL is not the first algorithm to achieve a regret upper bound of \(\tilde{O}(\min \{ D, c\} S \sqrt{AT})\). Bartlett & Tewari, 2009 propose three algorithms achieving such a bound, but none of them is computationally tractable. SCAL builds on one of those three (REGAL.C) and only needs to run a computationally efficient planning procedure called ScOpt.

2. Problem Setting

Weakly-communicating MDP: \(M = \langle \mathcal{S}, \mathcal{A}, r, p\rangle\) with a finite state set \(\mathcal{S}\) and a finite action set \(\mathcal{A}\). The reward function \(r(s,a)\) takes values in \([0, r_{\max}]\). The transition dynamics \(p(\cdot \mid s,a)\) are stationary. Denote \(A = \max_{s\in\mathcal{S}} \vert \mathcal{A}_s \vert\) and \(S=\vert \mathcal{S} \vert\). We also introduce a symbol \(\Gamma\) for the maximum support size of the transition probabilities, i.e., the largest number of next states reachable in one step from any state-action pair.
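To make the notation concrete, here is a minimal NumPy sketch (the tiny MDP and the variable names are my own, purely for illustration) that reads off \(S\), \(A\), and \(\Gamma\) from a tabular transition tensor.

```python
import numpy as np

# A hypothetical 3-state, 2-action tabular MDP (not taken from the paper).
# p[s, a, s'] = p(s' | s, a)
p = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]],
    [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],
    [[0.3, 0.0, 0.7], [0.0, 0.0, 1.0]],
])

S, A, _ = p.shape
# Gamma: the largest number of next states reachable from any (s, a) pair.
Gamma = int((p > 0).sum(axis=-1).max())
print(S, A, Gamma)  # -> 3 2 2
```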

Markov randomized decision rule and policy: A Markov randomized decision rule is a mapping from the state space to distributions over actions, \(d: \mathcal{S} \rightarrow P(\mathcal{A})\). The set of such rules is \(D^{MR}\), and the subset of Markov deterministic decision rules is \(D^{MD}\). A policy is a sequential decision-making process that chooses one decision rule to apply at each time step. A stationary policy applies a single Markov randomized decision rule \(d\) at every step, so we denote it by \(\pi = (d, d, \dots) =: d^\infty\). The collection of stationary policies is \(\Pi^{SR}(M)\) and that of stationary deterministic policies is \(\Pi^{SD}(M)\).

Reward relations: The long-term average reward (or gain), denoted by \(g^\pi_M(s)\), is the asymptotic average reward collected by a policy \(\pi\in \Pi^{SR}(M)\) starting from \(s\in\mathcal{S}\):

\[g^\pi_M(s) := \lim_{T\rightarrow +\infty} \mathbb{E}_{\mathbb{Q}}[\frac{1}{T} \sum_{t=1}^T r(s_t, a_t)]\]

The bias function is the expected total deviation of the collected rewards from the gain when executing policy \(\pi\):

\[h^\pi_M(s):= \underset{T\rightarrow +\infty}{C\text{-}\lim}\, \mathbb{E}_{\mathbb{Q}}\Big[\sum_{t=1}^T\big(r(s_t,a_t) - g^\pi_M(s_t)\big)\Big]\]

The expectation is taken over \(\mathbb{Q}:=\mathbb{P}(\cdot \mid a_t \sim \pi(s_t); s_0 = s; M)\), the distribution over trajectories induced by starting in \(s\) and selecting actions according to \(\pi\) in \(M\).
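As a sanity check on the definition of the gain, here is a hedged Monte Carlo sketch (function and variable names are mine) that estimates \(g^\pi(s_0)\) for a fixed stationary policy by a single long rollout:

```python
import numpy as np

def estimate_gain(r, p, d, s0=0, T=200_000, seed=0):
    """Monte Carlo estimate of the gain g^pi(s0) from one long rollout.

    r: (S, A) rewards, p: (S, A, S) transitions, d: (S, A) randomized decision rule.
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    s, total = s0, 0.0
    for _ in range(T):
        a = rng.choice(A, p=d[s])       # sample an action from d(.|s)
        total += r[s, a]
        s = rng.choice(S, p=p[s, a])    # sample the next state from p(.|s, a)
    return total / T
```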

Notice that different starting states \(s\) may yield different bias values \(h^\pi_M(s)\); the difference of bias values, \(h^\pi_M(s_1) - h^\pi_M(s_2)\), quantifies the advantage (or disadvantage) of starting in state \(s_1\) rather than \(s_2\). We can therefore define the span as the maximum difference of bias values, since we want to quantify the largest possible contribution to the total regret from recovering from a bad state.

\[sp(h^\pi) := \max_s h^\pi(s) - \min_s h^\pi(s)\]

Bellman operators and evaluation equations: The Bellman operator \(L_d\) associated with a Markov randomized decision rule \(d\) is defined, for any value function \(v \in \mathbb{R}^S\), by

\[L_d v:=r_d + P_d v\]

and the optimal Bellman operator \(L\) is defined, for any value function \(v \in \mathbb{R}^S\), by

\[L v := \max_{d\in D^{MR}} \{ r_d + P_d v\}\]
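Both operators are straightforward to implement for tabular MDPs. Below is a minimal NumPy sketch (the shapes and names are my own convention: \(r\) is \(S\times A\), \(p\) is \(S\times A\times S\), \(d\) is \(S\times A\), and \(v\) is a length-\(S\) vector):

```python
import numpy as np

def bellman_d(r, p, d, v):
    """Apply L_d v = r_d + P_d v for a randomized decision rule d."""
    r_d = (d * r).sum(axis=1)             # r_d(s) = sum_a d(a|s) r(s, a)
    P_d = np.einsum("sa,sat->st", d, p)   # P_d(s, s') = sum_a d(a|s) p(s'|s, a)
    return r_d + P_d @ v

def bellman_opt(r, p, v):
    """Apply (L v)(s) = max_a { r(s, a) + sum_s' p(s'|s, a) v(s') }."""
    q = r + p @ v                         # q[s, a] = r(s, a) + sum_s' p(s'|s, a) v(s')
    return q.max(axis=1)
```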

Then we can derive two evaluation equations for any stationary policy \(\pi=d^\infty \in \Pi^{SD}\)

\[g^\pi = P_d g^\pi; h^\pi = L_d h^\pi - g^\pi\]

Moreover, there exists an optimal policy \(\pi^\star \in \arg\max_\pi g^\pi_M\) for which \((g^\star, h^\star) = (g^{\pi^\star}, h^{\pi^\star})\) satisfies the optimality equation

\[h^\star = L h^\star - g^\star e, \quad \text{where } e=(1,\dots, 1)^\top\]
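For intuition, the optimality equation can be solved approximately by relative value iteration. The sketch below is my own (it assumes the usual unichain/aperiodicity conditions under which the iteration converges): it repeatedly applies the optimal Bellman operator and re-centers at a reference state.

```python
import numpy as np

def relative_value_iteration(r, p, tol=1e-8, max_iter=100_000):
    """Approximately solve h = L h - g e by iterating v <- L v and re-centering.

    r: (S, A) rewards, p: (S, A, S) transitions.
    Returns an estimate of (g*, h*) with the convention h*[0] = 0.
    """
    v = np.zeros(r.shape[0])
    g = 0.0
    for _ in range(max_iter):
        lv = (r + p @ v).max(axis=1)   # apply the optimal Bellman operator L
        g = lv[0] - v[0]               # gain estimate from reference state 0
        new_v = lv - lv[0]             # re-center so the reference state stays at 0
        if np.max(np.abs(new_v - v)) < tol:
            return g, new_v
        v = new_v
    return g, v
```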

3. Algorithm

3.1 Recap of UCRL

UCRL computes a set of plausible MDPs defined as \(\mathcal{M}_k = \{ M= \langle\mathcal{S}, \mathcal{A}, \tilde{r}, \tilde{p}\rangle: \tilde{r} \in B^k_r(s,a), \tilde{p}(s'\mid s,a) \in B^k_p(s,a,s'), \sum_{s'\in\mathcal{S}} \tilde{p}(s'\mid s,a) = 1 \}\).

The confidence intervals \(B^k_r(s,a)\) and \(B^k_p(s,a,s')\) are built using Hoeffding's inequality (UCRL, UCRL2) or Bernstein's inequality (SCAL). The optimization problem solved by UCRL2 is to find \((\tilde{M}^\star_k, \tilde{\pi}^\star_k)\) such that

\[\begin{align*} \text{UCRL2 Optimization problem : }&\\ (\tilde{M}^\star_k, \tilde{\pi}^\star_k) \in& \arg\max_{M \in \mathcal{M}_k, \pi\in\Pi^{SD}(M)} g^\pi_M \end{align*}\]

From UCRL2 we know that solving the above maximization problem is equivalent to finding the optimal policy \(\tilde{\pi}^\star\) in the extended MDP (bounded-parameter MDP),

\[\tilde{\pi}^\star = \arg\max_{\pi\in\Pi^{SD}(\tilde{\mathcal{M}}_k)} \{ g^{\pi}_{\tilde{\mathcal{M}}_k} \}\]

An extended value iteration (EVI) has been used in UCRL2 to solve the above optimization problem.
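EVI repeatedly applies the optimal Bellman operator of the extended MDP; its inner step maximizes the expected next-state value over the transition confidence ball. Below is a hedged sketch of that inner step for an \(L_1\) ball (in the spirit of UCRL2's extended value iteration; the names and tolerance are mine):

```python
import numpy as np

def optimistic_transition(p_hat, radius, v):
    """Pick the transition vector in the L1 ball around p_hat that maximizes p . v.

    Classic inner step of extended value iteration: move probability mass to the
    highest-value state, removing it from the lowest-value states first.
    """
    p = p_hat.copy()
    best = np.argmax(v)
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    order = np.argsort(v)                        # states from lowest to highest value
    i = 0
    while p.sum() > 1.0 + 1e-12:
        s = order[i]
        p[s] = max(0.0, p[s] - (p.sum() - 1.0))  # remove the current excess mass
        i += 1
    return p
```

EVI then performs the optimistic update \(v(s) \leftarrow \max_a \{\tilde{r}(s,a) + \texttt{optimistic\_transition}(\hat{p}(\cdot\mid s,a), \text{radius}_{s,a}, v)\cdot v\}\) until the span of the increments is small.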

3.2 REGAL.C

REGAL.C follows the same procedure as UCRL but constrains the plausible set of MDPs to those with bounded bias span. Formally,

\[\begin{align*} \text{REGAL.C Optimization problem : }&\\ (\tilde{M}^\star_{RC}, \tilde{\pi}^\star_{RC}) \in& \arg\max_{M \in \mathcal{M}_{RC}, \pi\in\Pi^{SD}(M)} g^\pi_M \end{align*}\]

where \(\mathcal{M}_{RC}:= \{ M\in\mathcal{M}_k : sp(h^\star_M) \leq c \}\) is the set of plausible MDPs whose optimal bias span is bounded by \(c\).

3.3 Relaxation of REGAL.C

This paper relaxes the optimization problem defined in REGAL.C as follows

\[\begin{align*} \text{SCAL Optimization problem : }&\\ (\tilde{M}^\star_{c}, \tilde{\pi}^\star_{c}) \in& \arg\max_{M \in \mathcal{M}_{k}, \pi\in\Pi_c(M)} g^\pi_M \end{align*}\]

where \(\Pi_c(M) := \{ \pi\in\Pi^{SD}: sp(h^\pi_M) \leq c \wedge sp(g^\pi_M)=0 \}\). Notice that although the set of plausible MDPs no longer carries a bias-span condition, the resulting policy is still span-constrained, because the policy space over which we search imposes the constraint. REGAL.C constrains each MDP through its optimal policy and therefore excludes more MDPs than the relaxed version. In the relaxation we may keep every MDP in \(\mathcal{M}_k\) and only need to find the best policy satisfying the span condition: even an MDP whose unconstrained optimal policy has a large bias span can still admit a policy in the constrained policy space whose bias span is at most \(c\).
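To make membership in \(\Pi_c(M)\) concrete, here is a sketch of my own (it assumes the chain induced by the policy is unichain and aperiodic so the simple iteration converges) that evaluates a stationary policy via the evaluation equations and checks whether its bias span is at most \(c\):

```python
import numpy as np

def in_Pi_c(r, p, d, c, tol=1e-9, max_iter=200_000):
    """Return (gain, bias span, span <= c) for the stationary policy d^infty.

    r: (S, A) rewards, p: (S, A, S) transitions, d: (S, A) randomized decision rule.
    """
    r_d = (d * r).sum(axis=1)               # r_d(s) = sum_a d(a|s) r(s, a)
    P_d = np.einsum("sa,sat->st", d, p)     # P_d(s, s') = sum_a d(a|s) p(s'|s, a)
    v = np.zeros(r_d.shape[0])
    g = 0.0
    for _ in range(max_iter):
        lv = r_d + P_d @ v                  # apply L_d
        g = lv[0] - v[0]
        new_v = lv - lv[0]                  # re-center (the span is shift-invariant)
        done = np.max(np.abs(new_v - v)) < tol
        v = new_v
        if done:
            break
    span = v.max() - v.min()
    return g, span, bool(span <= c)
```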

Moreover, the following property guarantees that the optimal gain over the constrained MDP set is no larger than the optimal gain over the constrained policy space.

Proposition 1 Define the restricted set of MDPs \(\mathcal{E}_k = \mathcal{M}_{RC} \cap \{ M \in \mathcal{M}_k : sp(g^\star_M)=0 \}\). Then

\[\sup_{M\in\mathcal{E}_k, \pi\in\Pi^{SD}} g^\pi_M \leq \sup_{M\in\mathcal{M}_k, \pi\in\Pi_c(M)} g^\pi_M\]

Proposition 1 guarantees that the solution of the modified (SCAL) optimization problem achieves a gain at least as large as that of the REGAL.C problem restricted to MDPs with constant optimal gain.

3.4 Span-constrained value and policy operators
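The key ingredient of ScOpt is a value operator that keeps the span of the iterates bounded by \(c\). As a rough, hedged sketch of the idea (my own simplification: it only clips the optimal Bellman update from above and omits the associated policy operator defined in the paper):

```python
import numpy as np

def truncated_bellman(r, p, v, c):
    """One ScOpt-style value step: apply L, then clip so the span stays <= c.

    This is only a sketch of the span-truncation idea; the full operator also
    induces a (possibly randomized) policy, which is omitted here.
    r: (S, A) rewards, p: (S, A, S) transitions, v: length-S value vector.
    """
    lv = (r + p @ v).max(axis=1)          # (L v)(s)
    return np.minimum(lv, lv.min() + c)   # clip from above so sp(.) <= c
```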

4. Main Result

Theorem 1 For any weakly communicating MDP \(M\) such that \(sp(h^\star_M) \leq c\), with probability at least \(1-\delta\) it holds that for any \(T \geq 1\), the regret of SCAL is bounded as

\[\text{Regret}(\text{SCAL}, S, T) = O(\max\{ r_{\text{max}}, c \} \sqrt{\Gamma S A T \ln(\frac{T}{\delta})})\]

5. Regret Analysis

5.1 Proof of Theorem 1

5.1.1 Removing the Randomness of the Output Policy

The proof idea follows the standard scheme used by UCRL2: the total regret is decomposed into a sum of smaller terms, each of which is bounded individually. Here the decomposition is per time step:

\[\text{Regret}(M, \text{SCAL}, S, T)= \sum_{t=1}^T \Delta_t.\]

One difference between SCAL and UCRL2 is that the policy output by SCAL in each episode is stochastic, so the definition of \(\Delta_t\) needs to change: now \(\Delta_t = g^\star - r_t(s_t, a_t)\) is a per-step term instead of the cumulative regret of one episode. Next, consider time step \(t\), let \(k_t\) be the corresponding episode and \(\tilde{\pi}_{k_t}\) the policy executed in episode \(k_t\). Define \(X_t:= r_t(s_t, a_t) - \sum_{a\in\mathcal{A}_{s_t}} r(s_t,a)\, \tilde{\pi}_{k_t}(s_t, a)\); then \(\{ X_t \}_{t=1}^T\) is a martingale difference sequence, and we can use the Azuma-Hoeffding inequality to remove the randomness of \(r_t(s_t, a_t)\):

\[\mathbf{P} \left( \sum_{t=1}^T r_t(s_t, a_t) \leq \sum_{t=1}^T \mathbb{E}_{a_t\sim\tilde{\pi}_{k_t}}[r(s_t, a_t)] - r_{\text{max}} \sqrt{\frac{5}{2} T \ln\Big(\frac{11T}{\delta}\Big)} \right) < \frac{\delta}{20T^{5/4}}\]
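As a toy numerical sanity check of this step (entirely synthetic: the means \(\mu_t\) stand in for \(\mathbb{E}_{a\sim\tilde{\pi}_{k_t}}[r(s_t,a)]\) and are my own illustration, not the paper's setup), the realized deviation is typically far below the Azuma-Hoeffding bound above:

```python
import numpy as np

rng = np.random.default_rng(0)
T, r_max, delta = 10_000, 1.0, 0.05

# Hypothetical per-step conditional means mu_t in [0, r_max].
mu = rng.uniform(0.0, r_max, size=T)
# Realized rewards r_t ~ r_max * Bernoulli(mu_t), so X_t = mu_t - r_t is a
# bounded martingale difference sequence (|X_t| <= r_max).
rewards = r_max * (rng.random(T) < mu)

deviation = mu.sum() - rewards.sum()
bound = r_max * np.sqrt(2.5 * T * np.log(11 * T / delta))
print(f"deviation = {deviation:.1f}, Azuma-Hoeffding bound = {bound:.1f}")
```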

5.1.2 Dealing with Failing Confidence Regions

This part is very similar to the corresponding section in UCRL2: Bernstein's inequality is used to bound the regret contributed by the low-probability event that the true MDP falls outside the confidence region.

5.1.3 Dealing with Following Confidence Regions
