
This is a recap of the “inverse bandit” problem first proposed in this paper.

1. Motivation

In the reinforcement learning setting, most of the time we assume that reward feedback is accessible, so we can run Value Iteration or Policy Iteration to find the optimal value function. In another setting, the agent (or the designer of the RL algorithm) is not able to obtain precise reward feedback from the environment; this is the setting of Inverse Reinforcement Learning. Instead of learning from the environment, we have an expert demonstrate how to finish the task with a sequence of near-optimal/optimal actions and let the agent imitate the expert’s behavior, which is called Imitation Learning (IL). The advantage of IL is obvious: we do not have to engineer the reward function, and we avoid the misspecification problem that arises when the reward function is not aligned with our expectations.

However, the policy learned from IL can hardly surpass the expert’s demonstration. Also, by observing only the expert’s policy we may not be able to uniquely identify the reward function, since there might be infinitely many reward functions that explain the expert’s policy. The optimal policy leaks little information about the suboptimal actions, so the learning agent cannot observe those actions often enough to rebuild the reward function. Motivated by this, guo2022learning introduces the “inverse bandit” problem, which forces the agent to learn from the learning trajectory of another bandit algorithm.

1.1 The “Inverse Bandit” Problem

Suppose that the agent can observe a demonstration, i.e., a sequence of decisions \(\{ I_t \}_{t=1}^T\) made by another bandit algorithm. From this demonstration the agent wants to estimate the expected rewards \(\{ \mu_i \}_{i\in[K]}\) in a \(K\)-armed bandit setting. Basically, reward estimation can be viewed as a mapping

\[\text{Reward estimation}: \{ I_t \}_{t=1}^T \to \{ \hat{\mu}_i \}_{i\in[K]}\]

A good estimation procedure minimizes the estimation error \(\mathbb{E}[|\hat{\mu}_i - \mu_i|]\) for each arm \(i\).
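
To make the setup concrete, here is a minimal sketch (the demonstrator, the estimator, and the 3-armed instance below are hypothetical illustrations, not the construction from the paper). A UCB1-style demonstrator plays a Bernoulli bandit, only its action sequence \(\{ I_t \}_{t=1}^T\) is recorded, and a toy gap-inversion estimator maps those actions to \(\{ \hat{\mu}_i \}_{i\in[K]}\): it assumes \(\mu^\star\) is known (as in Theorem 1 below) and that the demonstrator’s pull counts roughly satisfy \(N_i \approx 2 \log T / \Delta_i^2\), the rate of UCB-style algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb_demonstration(mu, T):
    """Run a UCB1-style demonstrator on a Bernoulli bandit and return only
    the action sequence {I_t}; the realized rewards are never exposed."""
    K = len(mu)
    counts = np.zeros(K)
    sums = np.zeros(K)
    actions = []
    for t in range(T):
        if t < K:                          # pull each arm once to initialize
            i = t
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            i = int(np.argmax(ucb))
        r = rng.binomial(1, mu[i])         # reward seen by the demonstrator only
        counts[i] += 1
        sums[i] += r
        actions.append(i)
    return actions

def estimate_rewards(actions, K, mu_star):
    """Toy gap-inversion estimator: assume the demonstrator's pull counts
    roughly satisfy N_i ~ 2 log T / gap_i^2 (as UCB-style algorithms do)
    and invert that relation. Illustrative only; not the paper's estimator."""
    T = len(actions)
    N = np.bincount(actions, minlength=K).astype(float)
    gaps = np.sqrt(2.0 * np.log(T) / np.maximum(N, 1.0))
    return mu_star - gaps                  # \hat{mu}_i for every arm i

mu = np.array([0.6, 0.5, 0.4])             # hypothetical 3-armed instance
actions = ucb_demonstration(mu, T=20_000)
print(estimate_rewards(actions, K=3, mu_star=mu.max()))
```

The point is only the interface: the estimator never sees a reward, so all of its information about \(\mu_i\) comes from how often the demonstrator chose arm \(i\).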

1.2 The Baseline - Information-theoretic Lower Bound

Theorem 1 For every \(K\)-armed Bernoulli bandit instance \(\mathcal{M}\) satisfying \(\max_{i\in[K]} | \mu_i - \frac{1}{2} | \leq \frac{1}{4}\) and for each suboptimal arm \(i\neq i^\star\), the following is true. Suppose that the demonstrator employs algorithm \(\mathbb{A}\), and let \(\mathbb{E}[N^{\mathbb{A}}_{i, T}]\) denote the expected number of times arm \(i\) is pulled by \(\mathbb{A}\) when presented with the instance \(\mathcal{M}\). Then there exists an instance \(\mathcal{M}'\) such that for any reward estimation procedure having knowledge of \(\mu^\star\) and mapping \(\{I_t\}_{t=1}^T \to \{ \hat{\mu}_i \}_{i\in[K], i \neq i^\star}\),

\[\max_{\tilde{\mathcal{M}} \in \{ \mathcal{M}, \mathcal{M}' \}} \mathbb{E}[|\hat{\mu}_i - \mu_i(\tilde{\mathcal{M}})|] \geq \frac{1}{16} \cdot \left( \frac{1}{\mathbb{E}[N^{\mathbb{A}}_{i, T}]} \wedge 1 \right)\]
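
To see what the bound implies, consider a demonstrator with the usual logarithmic regret guarantee, for which (under a standard assumption, not stated in the excerpt above) the expected pull count of a suboptimal arm scales as \(\mathbb{E}[N^{\mathbb{A}}_{i, T}] \approx c \log T / \Delta_i^2\) with \(\Delta_i = \mu^\star - \mu_i\) and some constant \(c\). Plugging this rate into Theorem 1 gives

\[\max_{\tilde{\mathcal{M}} \in \{ \mathcal{M}, \mathcal{M}' \}} \mathbb{E}[|\hat{\mu}_i - \mu_i(\tilde{\mathcal{M}})|] \gtrsim \frac{1}{16} \cdot \left( \frac{\Delta_i^2}{c \log T} \wedge 1 \right),\]

so no estimator can recover the mean of a suboptimal arm faster than a \(1/\log T\) rate when the demonstrator is regret-efficient. This makes precise the earlier intuition that a good policy leaks little information about its suboptimal arms.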
