
This is a recap of the regret analysis of UCRL2 (Jaksch, Ortner, and Auer, 2010).

1. Problem setting

We consider a Markov decision process (MDP) \(M\) with finite state space \(\mathcal{S}\) and finite action space \(\mathcal{A}\). The learner, in some state \(s\in\mathcal{S}\), chooses an action \(a\in\mathcal{A}\) and receives a random reward \(r\) drawn independently from a distribution on \([0,1]\) with mean \(\bar{r}(s,a)\); the environment then moves to a new state according to the transition probabilities \(p(\cdot\mid s,a)\).

The MDP considered in UCRL2 is communicating, which means that from any state \(s\) any other state \(s'\) can be reached within a finite expected number of steps under an appropriate policy. Formally, define

\[D(M) := \max_{s\neq s'\in\mathcal{S}} \min_{\pi : \mathcal{S}\rightarrow \mathcal{A}} \mathbf{E}[T(s'\mid M, \pi, s)]\]

\(D(M)\) is called the “diameter” of the MDP \(M\), where \(T(s'\mid M, \pi, s)\) is the (random) number of steps needed to reach \(s'\) from \(s\) when following policy \(\pi\) in \(M\). Saying that \(M\) is communicating is the same as saying that its diameter \(D(M)\) is finite.
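As a toy illustration (my example, not from the paper): consider a two-state MDP with a single action that moves to the other state with probability \(p\) and stays put with probability \(1-p\). The time to reach the other state is geometric with mean \(1/p\), so

\[D(M) = \frac{1}{p}.\]

A small \(p\) (a hard-to-leave state) makes the diameter, and hence the regret bounds below, large.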

2. Algorithm

First, I give a short summary of UCRL2. In each episode \(k\):

  1. UCRL2 first collects the history \((\{s_t, a_t, r_t\}_{t=1}^{t_k-1})\) gathered before the episode starts at time \(t_k\). Based on it, we estimate the rewards and transition dynamics by \(\hat{r}_k(s,a)\) and \(\hat{p}_k(s'\mid s,a)\): \(\begin{align*} & N^+_k(s,a) = \max\Big\{\sum_{t=1}^{t_k-1} 1[(s_t, a_t)=(s,a)],\, 1\Big\} \\ & \hat{r}_k(s,a) = \frac{\sum_{t=1}^{t_k-1} r_t\, 1[(s_t,a_t)=(s,a)]}{N^+_k(s,a)} \\ & \hat{p}_k(s'\mid s,a) = \frac{\sum_{t=1}^{t_k-1} 1[(s_{t+1}, s_t,a_t)=(s',s,a)]}{N^+_k(s,a)} \end{align*}\)

  2. Construct a confidence set \(\mathcal{M}_k\) that contains the true MDP \(M\) with high probability. Every MDP \(\tilde{M}\) in \(\mathcal{M}_k\), with rewards \(\tilde{r}\) and transitions \(\tilde{p}\), must satisfy the following conditions:

\[\begin{align} |\tilde{r}(s,a) - \hat{r}_k(s,a)| \lesssim& \sqrt {\frac{\log(S A t_k / \delta)}{N^+_k(s,a)}} \label{cond:reward} \\ \|\tilde{p}(\cdot\mid s,a) - \hat{p}_k(\cdot\mid s,a)\|_1 \lesssim& \sqrt {\frac{S\log(A t_k / \delta)}{N^+_k(s,a)}} \label{cond:prob} \end{align}\]

Here \(\lesssim\) means that the LHS is upper-bounded by the RHS up to constant and logarithmic factors.

  3. Use extended value iteration (EVI) to find an optimistic MDP \(\tilde{M}_k\) in the confidence set \(\mathcal{M}_k\) at the start of episode \(k\), together with a (near) optimal policy \(\tilde{\pi}_k\) on \(\tilde{M}_k\), such that

    \[\tilde{\rho}_k := \min_s \rho(\tilde{M}_k, \tilde{\pi}_k, s) \geq \max_{M' \in \mathcal{M}_k, \pi, s'} \rho(M', \pi, s') - \frac{1}{\sqrt{t_k}}\]
  4. Execute policy \(\tilde{\pi}_k\) and collect the new statistics \((\{s_t, a_t, r_t\}_{t=t_k}^{t_{k+1}-1})\).

The UCRL2 algorithm repeats the above steps, starting a new episode whenever the stopping criterion is met (in UCRL2, when the in-episode visit count of some pair \((s,a)\) reaches its count \(N_k(s,a)\) from before the episode). In step 1, the estimates are based only on the history collected before episode \(k\) and are not updated during the episode. In step 2, the confidence set is built around the estimates derived in step 1; its radius for a pair \((s,a)\) is proportional to \(\sqrt{\frac{\log(t_k/\delta)}{N^+_k(s,a)}}.\) The exploration-exploitation trade-off is handled implicitly: the radius grows slowly with \(t_k\) as more episodes pass, and shrinks as the agent visits \((s,a)\) more often. In step 3, once the confidence set \(\mathcal{M}_k\) is fixed, EVI is run to find the (near) optimal policy \(\tilde{\pi}_k\). In step 4, the agent executes the policy \(\tilde{\pi}_k\) and collects feedback from the environment.
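Here is a minimal NumPy sketch of steps 1 and 2 (my own illustration, assuming integer-indexed states and actions; the function names are not from the paper). Only the \(\sqrt{\log(t_k/\delta)/N^+_k(s,a)}\) scaling of the radii matters for the discussion above.

```python
import numpy as np

def empirical_model(history, S, A):
    """Step 1: empirical estimates from the history observed before episode k.

    history: list of (s, a, r, s_next) tuples with integer states/actions.
    Returns N_plus (S,A), r_hat (S,A), p_hat (S,A,S), mirroring N^+_k, r-hat_k, p-hat_k.
    """
    N = np.zeros((S, A))
    R = np.zeros((S, A))
    P = np.zeros((S, A, S))
    for s, a, r, s_next in history:
        N[s, a] += 1
        R[s, a] += r
        P[s, a, s_next] += 1
    N_plus = np.maximum(N, 1)                # N^+_k(s,a) = max{N_k(s,a), 1}
    return N_plus, R / N_plus, P / N_plus[:, :, None]

def confidence_radii(N_plus, t_k, S, A, delta):
    """Step 2: per-(s,a) radii of the confidence set M_k.

    Scaling as in the text; the exact constants are not important here.
    """
    d_r = np.sqrt(7 * np.log(2 * S * A * t_k / delta) / (2 * N_plus))  # reward radius
    d_p = np.sqrt(14 * S * np.log(2 * A * t_k / delta) / N_plus)       # L1 transition radius
    return d_r, d_p
```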

2.1 Extended Value Iteration (EVI)

At the beginning of each episode \(k\), we need to find the (near) optimal policy \(\tilde{\pi}_k\) and an optimistic MDP \(\tilde{M}_k\in \mathcal{M}_k\). Ordinary value iteration finds the optimal policy for a single fixed MDP \(M\). However, in UCRL2 we only have a set of plausible MDPs \(\mathcal{M}_k\).

The main idea of EVI is that finding an MDP \(\tilde{M}\in\mathcal{M}\) and a policy \(\tilde{\pi}\) such that \(\rho(\tilde{M}, \tilde{\pi}, s) = \max_{M'\in\mathcal{M}, \pi, s'}\rho(M', \pi, s')\) for all initial states \(s\) is equivalent to finding an average-reward optimal policy on an augmented MDP \(\tilde{M}^+\), whose (continuous) action space also selects the rewards and transition probabilities within the confidence set.

In EVI, we denote the value function of state \(s\) at the \(i\)-th iteration by \(V_i(s)\), since EVI runs for multiple iterations. \(\begin{align*} \mathrm{Initialize:}\ & V_0(s) = 0 \\ \mathrm{Update:}\ & V_{i+1}(s) = \max_{a\in\mathcal{A}}\Big\{ \tilde{r}(s,a) + \max_{p(\cdot)\in\mathcal{P}(s,a)} \sum_{s'\in\mathcal{S}} p(s'\mid s,a) \cdot V_i(s') \Big\} \end{align*}\) Following the principle of optimism in the face of uncertainty (OFU), in each iteration we compute \(V_{i+1}(s)\) using the optimistic reward \(\tilde{r}(s,a) = \hat{r}_k(s,a) + O\big(\sqrt{\frac{\log(t_k/\delta)}{N^+_k(s,a)}}\big)\) plus the optimistic future value \(\max_{p\in\mathcal{P}(s,a)}\sum_{s'\in\mathcal{S}} p(s'\mid s,a) \cdot V_i(s')\), where \(\mathcal{P}(s,a)\) is the set of transition vectors allowed by the confidence set.

Notice that we still need to solve the inner maximization over the transition dynamics \(p(\cdot\mid s,a)\), which is not fixed. The idea is simple: assign as much probability as possible to the state with the largest \(V_i(s')\), at the expense of the transition probabilities to the states with the smallest \(V_i(s')\). For example, if \(V_i(s_1)\) is the largest value and \(V_i(s_n)\) the smallest, we set \(p(s_1\mid s,a) = \min\{ 1, \hat{p}(s_1 \mid s,a) + \frac{1}{2} O(\sqrt{\frac{\log(t_k/\delta)}{N^+_k(s,a)}})\}\) and reduce \(p(s_n\mid s,a)\) toward \(\max\{ 0, \hat{p}(s_n \mid s,a) - \frac{1}{2} O(\sqrt{\frac{\log(t_k/\delta)}{N^+_k(s,a)}})\}\), continuing with the next-smallest states until the probabilities sum to one. Many details of the updating rule are omitted, but the basic idea is as above.
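Below is a Python sketch of EVI with this inner maximization (my own simplification, reusing the hypothetical `empirical_model`/`confidence_radii` helpers above; it is not the paper's exact pseudocode).

```python
import numpy as np

def inner_max(p_hat_sa, d_p_sa, V):
    """Inner maximization for one (s,a): shift as much probability as allowed
    onto the state with the largest V, taking it from the smallest-V states.
    Assumes p_hat_sa sums to (at most) 1."""
    order = np.argsort(-V)                         # states sorted by V, descending
    p = p_hat_sa.copy()
    best = order[0]
    p[best] = min(1.0, p_hat_sa[best] + d_p_sa / 2.0)
    for s in order[1:][::-1]:                      # smallest-V states first
        surplus = p.sum() - 1.0
        if surplus <= 0:
            break
        p[s] = max(0.0, p[s] - surplus)            # remove the surplus probability
    return p

def extended_value_iteration(r_hat, d_r, p_hat, d_p, eps):
    """EVI: returns optimistic values V and a greedy policy for the episode."""
    S, A = r_hat.shape
    V = np.zeros(S)
    r_tilde = r_hat + d_r                          # optimistic rewards
    while True:
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                p = inner_max(p_hat[s, a], d_p[s, a], V)
                Q[s, a] = r_tilde[s, a] + p @ V
        V_new = Q.max(axis=1)
        diff = V_new - V
        if diff.max() - diff.min() < eps:          # span-based stopping criterion
            return V_new, Q.argmax(axis=1)
        V = V_new
```

In UCRL2 the stopping accuracy would be \(\texttt{eps} = 1/\sqrt{t_k}\), matching the \(1/\sqrt{t_k}\) slack in the definition of \(\tilde{\rho}_k\).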

3. Theoretical Results

Theoretically, UCRL2 is compared against the optimal policy \(\pi^*(M)\) of the MDP \(M\), which achieves the optimal average reward \(\rho^*(M)\). The regret of UCRL2 after \(T\) steps starting from state \(s\) is defined as

\[\text{Regret}(M, \text{UCRL2}, s, T) := T \rho^*(M) - R(M, \text{UCRL2}, s, T),\]

where \(\rho^*(M)\) is the average reward obtained by running the optimal policy on MDP \(M\), and \(R(M, \text{UCRL2}, s, T)\) is the total reward accumulated by UCRL2 up to time \(T\).

Theorem 2 (Problem-independent bound) With probability at least \(1-\delta\), for any initial state \(s\in\mathcal{S}\) and any \(T>1\), the regret of UCRL2 is bounded by

\[\text{Regret}(M, \text{UCRL2}, s, T) \leq 34 \cdot DS \sqrt{AT \log(\frac{T}{\delta})}\]

where \(S=|\mathcal{S}|\), \(A=|\mathcal{A}|\) and \(D\) is the diameter of \(M\).

Theorem 3 (Problem-dependent bound) With probability at least \(1-\delta\), for any \(\varepsilon > 0\), any initial state \(s\in\mathcal{S}\) and any \(T>1\), the regret of UCRL2 is bounded by

\[\text{Regret}(M, \text{UCRL2}, s, T) \leq 34^2 \cdot \frac{D^2 S^2 A \log(\frac{T}{\delta})}{\varepsilon} + \varepsilon T,\]

where \(S=|\mathcal{S}|\), \(A=|\mathcal{A}|\) and \(D\) is the diameter of \(M\).
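One way to see how Theorem 3 follows from Theorem 2 (my derivation; the paper's route may differ in constants) is the elementary inequality \(ab \leq \frac{a^2+b^2}{2}\) applied with \(a = \sqrt{\varepsilon T}\):

\[34\, DS \sqrt{AT \log(\tfrac{T}{\delta})} = \sqrt{\varepsilon T} \cdot \frac{34\, DS\sqrt{A\log(T/\delta)}}{\sqrt{\varepsilon}} \leq \frac{\varepsilon T}{2} + \frac{34^2 D^2 S^2 A \log(\frac{T}{\delta})}{2\varepsilon} \leq \varepsilon T + 34^2 \cdot \frac{D^2 S^2 A \log(\frac{T}{\delta})}{\varepsilon}.\]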

4. Regret Analysis

4.1 The proof of Theorem 2

4.1.1 Splitting into episodes

The total regret can be decomposed as the sum of the regret accumulated in the individual episodes. That is

\[\text{Regret}(M, \text{UCRL2}, s, T)= \sum_{k=1}^K \Delta_k.\]

For each individual episode \(k\), its regret is

\[\Delta_k := \sum_{s,a} v_k(s,a) (\rho^* - \bar{r}(s,a))\]

where \(v_k(s,a)\) is the number of visits to the state-action pair \((s,a)\) during episode \(k\). We can regard \(\Delta_k\) as the accumulated regret generated by the decisions made in that episode: each decision that takes action \(a\) in state \(s\) contributes \(\rho^* - \bar{r}(s,a)\), the gap between the optimal average reward and the mean reward obtained by taking action \(a\) in state \(s\).
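Strictly speaking, the regret involves the realized rewards \(\sum_{t=1}^T r_t\) rather than the means \(\bar{r}(s,a)\). The paper first replaces the former by the latter via Hoeffding's inequality, which only costs an additional lower-order term with high probability:

\[\text{Regret}(M, \text{UCRL2}, s, T) \lesssim \sum_{k=1}^K \sum_{s,a} v_k(s,a)\big(\rho^* - \bar{r}(s,a)\big) + \sqrt{T \log(\tfrac{T}{\delta})} = \sum_{k=1}^K \Delta_k + \sqrt{T \log(\tfrac{T}{\delta})}.\]

We ignore this lower-order term in what follows.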

4.1.2 Dealing with Failing Confidence Regions

We can further decompose the total regret into two parts: episodes in which the confidence set contains the true MDP \(M\), and episodes in which it does not.

\[\begin{align} \sum_{k=1}^K \Delta_k = \underbrace{\sum_{k=1}^K \Delta_k 1[ M\in\mathcal{M}_k ]}_{\text{Following Confidence Regions}} + \underbrace{\sum_{k=1}^K \Delta_k 1[ M\notin\mathcal{M}_k ]}_{\text{Failing Confidence Regions}} \end{align}\]

In the first case, the confidence set simultaneously constrains all reward functions \(\tilde{r}(s,a)\) and transition probabilities \(\tilde{p}(\cdot\mid s,a)\), so concentration arguments apply. In the second case, where \(M\) is not contained in \(\mathcal{M}_k\), the contribution to the regret is of lower order than the first.

\[\begin{align*} \sum_{k=1}^K \Delta_k 1[M\notin\mathcal{M}_k] \leq& \sum_{k=1}^K \sum_{s,a} v_k(s,a)1[M\notin\mathcal{M}_k] \\ \leq& \sum_{k=1}^K t_k \cdot1[M\notin\mathcal{M}_k] \\ =& \sum_{k=1}^K \sum_{i=1}^T t_k \cdot1[t_k=i,\, M\notin\mathcal{M}_k] \\ =& \sum_{i=1}^T i \sum_{k=1}^K 1[t_k=i,\, M\notin\mathcal{M}_k] \\ \leq& \sum_{i=1}^T i\, 1[M\notin\mathcal{M}(i)] \\ =& \sum_{i=1}^{\lfloor T^{1/4} \rfloor} i\, 1[M\notin\mathcal{M}(i)] + \sum_{i=\lfloor T^{1/4} \rfloor+1}^{T} i\, 1[M\notin\mathcal{M}(i)] \\ \leq& \sqrt{T} + \sqrt{T}, \end{align*}\]

where \(\mathcal{M}(i)\) denotes the confidence set of the episode that starts at step \(i\) (at most one episode starts at each step).

One step not yet justified is the last one: that \(\sum_{i=\lfloor T^{1/4}\rfloor+1}^{T} i\, 1[M\notin\mathcal{M}(i)] \leq \sqrt{T}\). The confidence sets are constructed so that \(\Pr\{M\notin\mathcal{M}(i)\}\lesssim \frac{\delta}{i^6}\); a union bound over all \(i > T^{1/4}\) then shows that, with high probability (the paper bounds the failure probability here by \(\frac{\delta}{12T^{5/4}}\)), the second sum is actually \(0\). The first sum is bounded deterministically: \(\sum_{i=1}^{\lfloor T^{1/4}\rfloor} i \leq \frac{T^{1/4}(T^{1/4}+1)}{2} \leq \sqrt{T}\).

4.1.3 Dealing with Following Confidence Regions

In the case where \(M \in \mathcal{M}_k\), the optimistic average reward \(\tilde{\rho}_k\) achieved by the policy \(\tilde{\pi}_k\) is, up to \(\frac{1}{\sqrt{t_k}}\), at least as large as the true optimal average reward \(\rho^*\). Moreover, the parameters of the true MDP are constrained by the confidence set, so we can use concentration inequalities to bound this case. For simplicity, we assume \(M \in \mathcal{M}_k\) throughout Section 4.1.3 and omit the indicator \(1[M \in \mathcal{M}_k]\).

For the accumulated regret in the \(k\)-th episode, we have

\[\begin{align*} \Delta_k =& \sum_{s,a} v_k(s,a)(\rho^* - \bar{r}(s,a)) \\ \leq& \sum_{s,a} v_k(s,a)(\tilde{\rho}_k+\frac{1}{\sqrt{t_k}} - \bar{r}(s,a)) \\ =& \sum_{s,a} v_k(s,a)(\tilde{\rho}_k - \bar{r}(s,a)) + \sum_{s,a} \frac{v_k(s,a)}{\sqrt{t_k}} \end{align*}\]

The inequality uses the condition \(M \in \mathcal{M}_k\), which together with the optimism of EVI implies \(\rho^* \leq \tilde{\rho}_k + \frac{1}{\sqrt{t_k}}\). We can further subtract and add \(\tilde{r}_k(s, \tilde{\pi}_k(s))\) in the above expression:

\[\begin{align*} & \Delta_k \\ \leq& \sum_{s,a} v_k(s,a)(\tilde{\rho}_k - \bar{r}(s,a)) + \sum_{s,a} \frac{v_k(s,a)}{\sqrt{t_k}} \\ =& \sum_{s,a} v_k(s,a)(\tilde{\rho}_k - \tilde{r}_k(s,a) + \tilde{r}_k(s,a) - \bar{r}(s,a)) + \sum_{s,a} \frac{v_k(s,a)}{\sqrt{t_k}} \\ =& \sum_{s,a} v_k(s,a)(\tilde{\rho}_k - \tilde{r}_k(s,a)) + \sum_{s,a} v_k(s,a) (\tilde{r}_k(s,a) - \bar{r}(s,a)) + \sum_{s,a} \frac{v_k(s,a)}{\sqrt{t_k}} \\ \leq& \underbrace{ \mathbf{v}_k(\tilde{P}_k - I) \mathbf{u}_i}_{\mathrm{I}} + \underbrace{\sum_{s,a} v_k(s,a) (\tilde{r}_k(s,a) - \bar{r}(s,a))}_{\mathrm{II}} + 2 \sum_{s,a} \frac{v_k(s,a)}{\sqrt{t_k}} \end{align*}\]

In the last inequality, we replaced the first sum \(\sum_{s,a} v_k(s,a)(\tilde{\rho}_k - \tilde{r}_k(s,a))\) by term I plus another lower-order term; the next few steps justify this. Here \(\mathbf{v}_k\) is the row vector of visit counts \(v_k(s,\tilde{\pi}_k(s))\), \(\mathbf{u}_i\) is the vector of EVI values \(V_i(s)\) at the final EVI iteration of episode \(k\), and \(\tilde{P}_k\) is the transition matrix of \(\tilde{M}_k\) under \(\tilde{\pi}_k\). First, the span of the EVI value function is bounded by the diameter:

\[\max_s V_i(s) - \min_s V_i(s) \leq D\]

By Theorem 8.5.6 of Puterman (1994) and the stopping criterion of EVI, when EVI terminates the per-step increment of the value function is within \(\frac{1}{\sqrt{t_k}}\) of the average reward \(\tilde{\rho}_k\) of the policy \(\tilde{\pi}_k\) that we execute in episode \(k\). That is, for \(a = \tilde{\pi}_k(s)\) we have

\(\begin{align*} \big| V_{i+1}(s) - V_i(s) - \tilde{\rho}_k \big| \leq& \frac{1}{\sqrt{t_k}} \\ \Leftrightarrow \Big| \tilde{r}_k(s,a) + \sum_{s'} \tilde{p}_k(s'\mid s,a) \cdot V_i(s') - V_i(s) - \tilde{\rho}_k \Big| \leq& \frac{1}{\sqrt{t_k}} \\ \Leftrightarrow \Big| (\tilde{r}_k(s,a) - \tilde{\rho}_k) + \Big(\sum_{s'} \tilde{p}_k(s'\mid s,a) \cdot V_i(s') - V_i(s)\Big) \Big| \leq& \frac{1}{\sqrt{t_k}} \\ \Rightarrow (\tilde{\rho}_k - \tilde{r}_k(s,a)) - \Big(\sum_{s'} \tilde{p}_k(s'\mid s,a) \cdot V_i(s') - V_i(s)\Big) \leq& \frac{1}{\sqrt{t_k}} \\ \end{align*}\) The first parenthesis of the last line matches the summand \(\tilde{\rho}_k - \tilde{r}_k(s,a)\) of the first sum; multiplying by \(v_k(s,a)\) and summing over \((s,a)\) turns the second parenthesis into term I and the right-hand side into the lower-order term \(\sum_{s,a} \frac{v_k(s,a)}{\sqrt{t_k}}\).
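Explicitly, multiplying the last inequality by \(v_k(s,a)\) and summing over \((s,a)\) (only pairs with \(a = \tilde{\pi}_k(s)\) have \(v_k(s,a) > 0\)) gives

\[\sum_{s,a} v_k(s,a)(\tilde{\rho}_k - \tilde{r}_k(s,a)) \leq \sum_{s,a} v_k(s,a)\Big(\sum_{s'} \tilde{p}_k(s'\mid s,a) V_i(s') - V_i(s)\Big) + \sum_{s,a} \frac{v_k(s,a)}{\sqrt{t_k}} = \mathbf{v}_k(\tilde{P}_k - I)\mathbf{u}_i + \sum_{s,a} \frac{v_k(s,a)}{\sqrt{t_k}},\]

which is exactly the bound used in the decomposition of \(\Delta_k\) above.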

The first term I: according to the condition on the transition probabilities in \(\mathcal{M}_k\), together with a martingale argument, term I can be controlled as follows.

\[\begin{align} & \mathbf{v}_k(\tilde{P}_k - I)\mathbf{u}_i \\ =& \mathbf{v}_k(\tilde{P}_k - I) \mathbf{w}_k + \mathbf{v}_k(\tilde{P}_k - I)(\mathbf{u}_i-\mathbf{w}_k) \\ =& \mathbf{v}_k(\tilde{P}_k - P_k)\mathbf{w}_k + \mathbf{v}_k(P_k - I)\mathbf{w}_k + \mathbf{v}_k(\tilde{P}_k - I)(\mathbf{u}_i-\mathbf{w}_k) \end{align}\]

where \(w_k(s) := u_i(s) - \frac{\min_s u_i(s) + \max_s u_i(s)}{2}\), so that \(\|w_k\|_\infty \leq \frac{D}{2}\), and \(P_k\) denotes the transition matrix of the true MDP \(M\) under \(\tilde{\pi}_k\). The first term can be controlled by the definition of \(\mathcal{M}_k\), which upper-bounds the \(\ell_1\)-deviation of the transition probabilities (both \(\tilde{p}_k\) and the true \(p\) are close to \(\hat{p}_k\)). The second term can be regarded as a sum of a martingale difference sequence and controlled by the Azuma-Hoeffding inequality, treating each increment as \(X_t := \big(p(\cdot \mid s_t,a_t) - e_{s_{t+1}}\big) \mathbf{w}_{k(t)} 1[M\in\mathcal{M}_{k(t)}]\), where \(k(t)\) is the episode containing step \(t\) (plus a telescoping boundary term of at most \(D\) per episode). The third term is equal to \(0\) because \(\mathbf{u}_i - \mathbf{w}_k\) is a constant vector and every row of \(\tilde{P}_k\) sums to \(1\), so \((\tilde{P}_k - I)\) annihilates constant vectors.
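To make the martingale step concrete (with constants hedged by \(\lesssim\)): by Hölder's inequality each increment satisfies \(|X_t| \leq \|p(\cdot\mid s_t,a_t) - e_{s_{t+1}}\|_1 \, \|\mathbf{w}_{k(t)}\|_\infty \leq 2 \cdot \frac{D}{2} = D\), so the Azuma-Hoeffding inequality yields, with probability at least \(1-\delta\),

\[\Big| \sum_{t=1}^T X_t \Big| \lesssim D \sqrt{T \log(\tfrac{T}{\delta})}.\]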

Therefore, summing over all episodes, the first sum (term I together with its lower-order companion) satisfies

\[\begin{align} &\sum_{k=1}^{K}\sum_{s,a} v_k(s,a)(\tilde{\rho}_k - \tilde{r}_k(s,a)) \\ \leq& D\sqrt{14S \log\Big(\frac{2AT}{\delta}\Big)} \cdot \sum_{k=1}^{K}\sum_{s,a} \frac{v_k(s,a)}{\sqrt{N^+_k(s,a)}} + \sum_{t=1}^T X_t + KD + \sum_{k=1}^{K}\sum_{s,a}\frac{v_k(s,a)}{\sqrt{t_k}} \end{align}\]

The second term II: this is the weighted sum of the differences between the optimistic rewards \(\tilde{r}_k(s,a)\) of the MDP chosen by EVI and the true mean rewards \(\bar{r}(s,a)\). We can use \(\hat{r}_k(s,a)\) as a bridge, since the confidence set \(\mathcal{M}_k\) is centered around \(\hat{r}_k(s,a)\): \(\tilde{r}_k\) satisfies the reward condition because \(\tilde{M}_k\in\mathcal{M}_k\), and \(\bar{r}\) satisfies it because \(M\in\mathcal{M}_k\).

\[\begin{align} &\sum_{s,a} v_k(s,a) (\tilde{r}_k(s,a) - \bar{r}(s,a)) \\ =& \sum_{s,a} v_k(s,a) (\tilde{r}_k(s,a) - \hat{r}_k(s,a) + \hat{r}_k(s,a) - \bar{r}(s,a)) \\ \leq& \sum_{s,a} v_k(s,a) |\tilde{r}_k(s,a) - \hat{r}_k(s,a)| + \sum_{s,a} v_k(s,a) |\hat{r}_k(s,a) - \bar{r}(s,a)| \\ \lesssim& 2\sum_{s,a} v_k(s,a) \sqrt{\frac{\log(2S A t_k/\delta)}{N^+_k(s,a)}} \end{align}\]
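Both term I and term II therefore reduce to sums of the form \(\sum_{k}\sum_{s,a} \frac{v_k(s,a)}{\sqrt{N^+_k(s,a)}}\). A standard counting argument (used in the paper) handles these: since the episode-stopping rule guarantees \(v_k(s,a) \leq N^+_k(s,a)\), one can show

\[\sum_{k=1}^{K} \frac{v_k(s,a)}{\sqrt{N^+_k(s,a)}} \leq (\sqrt{2}+1)\sqrt{N(s,a)}, \qquad \sum_{s,a} \sqrt{N(s,a)} \leq \sqrt{SA \sum_{s,a} N(s,a)} = \sqrt{SAT},\]

where \(N(s,a)\) is the total number of visits to \((s,a)\) after \(T\) steps and the second step uses the Cauchy-Schwarz inequality. This is where the \(\sqrt{SAT}\) factor in the final bound comes from.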

Overall, adding up these upper bounds, and using that the number of episodes satisfies \(K \lesssim SA\log(\frac{T}{SA})\), we get

\[\begin{align} & \sum_{k=1}^K \Delta_k 1[M\in\mathcal{M}_k] \\ \lesssim& D\sqrt{T\log(\frac{8T}{\delta})} + DSA \log(\frac{8T}{SA}) + D\sqrt{S\log(\frac{2AT}{\delta})}\sqrt{SAT} \\ \lesssim& DS\sqrt{AT \log(\frac{AT}{\delta})} \end{align}\]
