Retirement, Stopping times and Bandits: The Gittins index

“A colleague of high repute asked an equally well-known colleague:
— What would you say if you were told that the multi-armed bandit problem had been solved?
— Sir, the multi-armed bandit problem is not of such a nature that it can be solved.”

Peter Whittle

In our busy daily life, while multi-tasking we are constantly faced with the following dilemma:

Q1: Which project should we prioritize?

Now, suppose that we are rather single-tasking on our favorite project, and someone makes a buyout offer. If we accept it, we receive a lump-sum payment under the condition that we must drop the project.

The following question naturally arises:

Q2: What is the minimum offer we should accept to drop everything?

It turns out that the two questions are intimately related. In fact, under some technical assumptions that we discuss in this post, the answer to question Q1 is the following:

A1: “We should prioritize the project for which the offer amount we would be willing to accept to drop it is highest.”

On the other hand, the answer to Q2 is the Gittins index, an unexpectedly profound concept that lends itself to multiple interpretations, which we discuss next.

Formal scenario. Suppose that we have {N} different parallel projects, and at each discrete step {t=0,1,\dots} we must decide which project to work on. Projects go through ups and downs, modeled via the notion of state. Let {s^n_t} be the state of project {n} at time {t}. The state of a project evolves according to a Markov chain only when the project is selected; otherwise it remains unchanged. When working on project {n} at time {t}, a corresponding reward {r^n(s^n_t)} is collected.

Goal. Assume that the Markov laws governing the state evolution of each project are known. Then, we aim at maximizing the total expected discounted reward:

\displaystyle \max_{\pi} R(\pi) := \mathbb E\Big[\sum_{t\ge 0} \beta^t r^{\pi_t}(s_t^{\pi_t})\Big] \ \ \ \ \ (1)

where {\pi} decides at each step {t} the project {\pi_t} we should work on, and {\beta\in[0,1)} is a pre-defined geometric discount factor.

The model above was the first form of the Multi-Armed Bandit (MAB) problem formulated in the literature, where each project is an arm. It was deemed intractable (at least, not solvable with complexity linear in the number of projects) until the 1970s, when John C. Gittins and his collaborators made the important contributions that we summarize in this post.

Full-blown MDP formulation. We can model the problem as a Markov Decision Process (MDP) [1], where the MDP (meta-)state is the collection of states of each project, the action is the project selection, the MDP (meta-)reward is the reward collected by working on the selected project, and the (meta-)state evolves according to the Markov chain underlying the state evolution of the selected project.
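To make the formulation concrete, here is a minimal Python sketch of the meta-MDP dynamics. The function name meta_step and the representation of each project as a pair (P, r) of transition matrix and reward vector are illustrative choices, not part of the original model; the point is that only the selected project's state component moves.

import numpy as np

def meta_step(meta_state, action, projects, rng):
    """One transition of the meta-MDP.

    meta_state : tuple with the current state of every project
    action     : index of the project we decide to work on
    projects   : list of (P, r) pairs, with P[s, s'] = p(s'|s) and r[s] = r(s)
    rng        : a numpy random Generator, e.g. np.random.default_rng(0)
    """
    P, r = projects[action]
    s = meta_state[action]
    reward = r[s]
    next_s = rng.choice(len(r), p=P[s])            # only this project's chain moves
    next_meta = tuple(next_s if n == action else x
                      for n, x in enumerate(meta_state))
    return next_meta, reward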

In the quest for a low-complexity solution. The MDP formulation above suffers from the usual curse of dimensionality: the meta-state space is the Cartesian product of the project state spaces, hence its size grows exponentially with the number of projects.

Thus, we seek a cheap (yet optimal) solution, whose complexity ideally grows linearly with the number of projects. And such a solution exists! Please buckle up and enjoy the ride towards the profound concept of the Gittins index [2].

1. The single-project case: When should I retire?

To solve the general problem (1), we first tackle an auxiliary one that reads as follows. Suppose that there exists only one project (hence, for notational simplicity, we drop the project index). Moreover, at each time we are offered the possibility to retire and gain a fixed reward {\gamma} at every subsequent step. Otherwise, if we continue to work on the project, the state evolves according to a Markov chain and we receive rewards as described at the beginning of the post.

Our goal is to maximize the total discounted reward, also called value. The optimal value {v} satisfies the following (recursive) Bellman equation:

\displaystyle v(s;\gamma) = \max\left\{ \frac{\gamma}{1-\beta}, \, r(s) + \beta \sum_{s'} p(s'|s) v(s';\gamma) \right\}, \quad \forall\, s \ \ \ \ \ (2)

where the first term in the {\max} corresponds to the decision of retiring and gaining {\sum_{t\ge 0} \beta^t \gamma=\frac{\gamma}{1-\beta}} for the rest of our life, while the second term is the long-term (optimal) reward if we decide to work for one more step.
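For a finite state space, the fixed point of (2) can be computed by plain value iteration; the iterates coincide with the truncated values {v_k} used in the proof of Theorem 1 below. A minimal Python sketch, assuming the project is given as a transition matrix P and a reward vector r (illustrative names):

import numpy as np

def retirement_value(P, r, beta, gamma, tol=1e-10):
    """Value iteration for the retirement problem, eq. (2).

    P[s, s'] = p(s'|s), r[s] = r(s), beta = discount factor, gamma = retirement subsidy.
    Returns the vector v(.; gamma).
    """
    v = np.full(len(r), gamma / (1.0 - beta))            # v_0, cf. eq. (3)
    while True:
        v_new = np.maximum(gamma / (1.0 - beta), r + beta * P @ v)   # cf. eq. (2)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new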

Intuition says that, as the subsidy {\gamma} increases, we are less and less enticed to work, until a certain subsidy value is reached at which retiring and working for at least one more step are equally appealing. Such a value should depend on the current state of the project. Next we validate and formalize these intuitions; see also [3].

Theorem 1 The optimal retirement policy is threshold-based. More specifically, for each state {s}, there exists {\overline{\gamma}(s)}, called Gittins index, such that an optimal policy prescribes:

  • if {\gamma \le \overline{\gamma}(s)}, then continue working;
  • if {\gamma > \overline{\gamma}(s)}, then retire.

Proof: We start by proving that {v(s;\gamma)-\frac{\gamma}{1-\beta}} is non-increasing in {\gamma}. Let {v_k(s;\gamma)} be the optimal value when it is permitted to work for at most {k} steps before retiring. Note that {v(s;\gamma)=\lim_{k\rightarrow \infty} v_k(s;\gamma)}. We prove the claim by induction on {k}. For {k=0},

\displaystyle v_0(s;\gamma) = \tfrac{\gamma}{1-\beta}. \ \ \ \ \ (3)
Clearly, {v_0(s;\gamma)-\frac{\gamma}{1-\beta}} is non-increasing in {\gamma}. For any {k\ge 1},

\displaystyle v_{k}(s;\gamma)-\tfrac{\gamma}{1-\beta} = \max\left\{ 0, \, r(s) + \beta \sum_{s'} p(s'|s) \left[v_{k-1}(s';\gamma) - \tfrac{\gamma}{1-\beta}\right] -\gamma \right\}, \quad \forall\, s \ \ \ \ \ (4)

which, by the induction hypothesis, is non-increasing in {\gamma} (the inner expression is the sum of a non-increasing term and {-\gamma}). Thus, {v(s;\gamma)-\frac{\gamma}{1-\beta}} is non-increasing in {\gamma}. Now, suppose that for a certain {\gamma'} it is optimal to stop and retire. Then, for any {\gamma''>\gamma'},

\displaystyle 0\le v(s;\gamma'')-\frac{\gamma''}{1-\beta} \le v(s;\gamma') - \frac{\gamma'}{1-\beta}=0 \ \ \ \ \ (5)

where the first inequality comes from the fact that retiring is always an available option (hence {v(s;\gamma'')\ge \frac{\gamma''}{1-\beta}}); the second stems from the non-increasing property just proven; and the last equality holds since retiring is optimal for {\gamma'}. Then, {v(s;\gamma'')=\frac{\gamma''}{1-\beta}}, hence it is optimal to retire for {\gamma''} as well. The claim follows. \Box

Optimal stopping time. It is convenient to define the optimal stopping time {T^{\gamma}} as the random variable defining the number of steps before retirement under the optimal policy defined by Theorem 1 and when the subsidy equals {\gamma}. Thus, {T^{\gamma}} coincides with the first time that a state with Gittins index smaller than the subsidy {\gamma} is met.

1.1. Five different interpretations for the Gittins index

The seemingly innocent concept of Gittins index defined in Theorem 1 is unexpectedly profound. In this section we show its multiple interpretations, while in the next section we discuss its crucial role for the design of a low-complexity solution for our initial multi-project problem.

First, we analyze the behavior of the optimal policy in state {s} in a neighborhood of the Gittins index {\overline{\gamma}(s)}. For {\gamma <\overline{\gamma}(s)}, it is strictly preferable to continue working. For {\gamma=\overline{\gamma}(s)}, the two options (work / retire) are equally good. For {\gamma} in a small right neighborhood of {\overline{\gamma}(s)}, the two options may, in general, still be equally good. Then, for {\gamma} sufficiently larger than {\overline{\gamma}(s)}, retiring is strictly preferable to working. This leads to the first two interpretations of the Gittins index.

Interpretation 1. The Gittins index {\overline{\gamma}(s)} in state {s} is the supremum of all those subsidy values such that working for at least one more step is strictly preferable to retiring in state {s}, i.e.,

\displaystyle \overline{\gamma}(s) = \sup\left\{\gamma: \, \sum_{t\ge 0} \beta^t \gamma < \mathbb E \Big[ \sum_{t\le T^{\gamma}} \beta^t r(s_t) + \sum_{t>T^{\gamma}} \beta^t \gamma \, \Big | \, s_0=s \Big] \right\}. \ \ \ \ \ (6)

Interpretation 2. The Gittins index {\overline{\gamma}(s)} in state {s} is the infimum of all those subsidy values such that working and retiring are equally good options in state {s}, i.e.,

\displaystyle \overline{\gamma}(s) = \inf \left\{ \gamma: \, \sum_{t\ge 0} \beta^t \gamma = \mathbb E \Big[ \sum_{t\le T^{\gamma}} \beta^t r(s_t) + \sum_{t>T^{\gamma}} \beta^t \gamma \, \Big | \, s_0=s \Big] \right\}. \ \ \ \ \ (7)

We remark that both definitions only hold under the premise that the optimal strategy in Theorem 1 is followed all along.
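Theorem 1 and Interpretations 1-2 suggest a simple numerical recipe: the Gittins index of {s} is the smallest subsidy for which retiring immediately in {s} is optimal, and the threshold structure makes it possible to locate it by bisection. A hedged Python sketch, reusing the retirement_value function from the previous snippet; the initial bracket {[\min_s r(s), \max_s r(s)]} is valid because the index is a discounted average of rewards.

import numpy as np

def gittins_index_bisect(P, r, beta, s, iters=60):
    """Gittins index of state s via bisection on the subsidy gamma, cf. (6)-(7)."""
    lo, hi = float(np.min(r)), float(np.max(r))   # the index lies between min and max reward
    for _ in range(iters):
        gamma = 0.5 * (lo + hi)
        v = retirement_value(P, r, beta, gamma)
        work = r[s] + beta * P[s] @ v             # value of working for one more step
        if work > gamma / (1.0 - beta):           # working still strictly preferable
            lo = gamma
        else:
            hi = gamma
    return 0.5 * (lo + hi)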

By simply subtracting {\sum_{t\ge 0} \beta^t \gamma} from both sides of the equality in (7), we obtain a third interpretation of the Gittins index.

Interpretation 3. The Gittins index {\overline{\gamma}(s)} is the fair charge that one pays at each step such that continuing to work until the optimal stopping time is neither profitable nor loss-making, when the subsidy coincides with {\overline{\gamma}(s)} and the optimal policy is followed, i.e.,

\displaystyle \overline{\gamma}(s) = \inf\left\{\gamma:\, \mathbb E \Big[ \sum_{t\le T^{\gamma}} \beta^t \big(r(s_t) - \gamma \big) \, \Big | \, s_0=s \Big] = 0 \right\}. \ \ \ \ \ (8)

The following interpretation follows trivially from the previous one, but it is still worth presenting.

Interpretation 4. The Gittins index {\overline{\gamma}(s)} is the fixed per-step reward that yields an equivalent gain until the optimal stopping time, when the subsidy coincides with {\overline{\gamma}(s)} and the optimal policy is followed, i.e.,

\displaystyle \overline{\gamma}(s) = \inf\left\{\gamma:\, \mathbb E \Big[ \sum_{t\le T^{\gamma}} \beta^t r(s_t) \, \Big | \, s_0=s \Big] = \mathbb E \Big[ \sum_{t\le T^{\gamma}} \beta^t \gamma \, \Big | \, s_0=s \Big] \right\}. \ \ \ \ \ (9)

A related property, that will turn out to be crucial for the multi-project case, follows.

Property. Let {T} be any stopping time, i.e., any rule deciding whether to retire in a certain state given the past history of states, actions and rewards. Then, the discounted reward accumulated up to {T} cannot exceed the discounted sum of the Gittins index of the initial state up to {T}:

\displaystyle \mathbb{E}\left[\sum_{t\le T} \beta^t r(s_t) \, \Big| \, s_0=s \right] \le \mathbb{E}\left[\sum_{t\le T} \beta^t \overline{\gamma}(s) \, \Big| \, s_0=s \right] \ \ \ \ \ (10)

where equality holds for the stopping time {T=T^{\overline{\gamma}(s)}}, as follows from (9). In fact, if we assume by contradiction that (10) holds with the {>} sign for some stopping time {T}, then under subsidy {\overline{\gamma}(s)} it would be strictly profitable to continue working in state {s} (by following rule {T}), which contradicts (7).

Finally, we turn to arguably the most common characterization of the Gittins index. Let us rewrite (7) as follows:

\displaystyle \overline{\gamma}(s) = \inf \left\{ \gamma: \, \sum_{t\ge 0} \beta^t \gamma = \sup_{T}\left\{\mathbb E \Big[ \sum_{t\le T} \beta^t r(s_t) + \sum_{t>T} \beta^t \gamma \, \Big | \, s_0=s \Big] \right\} \right\} \ \ \ \ \ (11)

\displaystyle \quad = \inf\left\{\gamma:\, \sup_T \left\{\mathbb E \Big[ \sum_{t\le T} \beta^t r(s_t) - \sum_{t\le T} \beta^t \gamma \, \Big | \, s_0=s \Big] \right\}=0 \right\} \ \ \ \ \ (12)

\displaystyle \quad = \sup_{T}\left\{ \inf \left\{\gamma: \, \mathbb E \Big[ \sum_{t\le T} \beta^t r(s_t) - \sum_{t\le T} \beta^t \gamma \, \Big | \, s_0=s \Big] =0 \right\} \right\}. \ \ \ \ \ (13)

where {T} ranges over all stopping times; the exchange of {\inf} and {\sup} between (12) and (13) is legitimate essentially because, for each {T}, the bracketed expectation is strictly decreasing in {\gamma}. The last interpretation then stems directly from (13).

Interpretation 5. The Gittins index {\overline{\gamma}(s)} is the best achievable discounted reward per unit of discounted time over all stopping rules from state {s}, i.e.,

\displaystyle \overline{\gamma}(s) = \sup_{T} \frac{\mathbb E \big[\sum_{t\le T} \beta^t r(s_t)\, \big| \, s_0=s\big]}{\mathbb E \big[\sum_{t\le T} \beta^t \, \big| \, s_0=s\big]}. \ \ \ \ \ (14)
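For small chains, (14) can also be evaluated by brute force: the supremum may be restricted to stopping rules of the form “work at {t=0}, then keep working while the state stays in a continuation set {C}”, so one can simply enumerate all sets {C} and solve a small linear system for each. A sketch, assuming numpy arrays P and r as in the previous snippets; the result should match the bisection-based computation above up to numerical tolerance.

from itertools import combinations
import numpy as np

def gittins_index_bruteforce(P, r, beta, s):
    """Interpretation 5, eq. (14), by enumerating continuation sets (small state spaces only)."""
    S = len(r)
    best = -np.inf
    for k in range(S + 1):
        for C in combinations(range(S), k):
            mask = np.zeros(S)
            mask[list(C)] = 1.0
            A = np.eye(S) - beta * (P * mask)          # leaving C triggers retirement
            num = np.linalg.solve(A, r)[s]             # E[ sum_{t<=T} beta^t r(s_t) | s_0 = s ]
            den = np.linalg.solve(A, np.ones(S))[s]    # E[ sum_{t<=T} beta^t        | s_0 = s ]
            best = max(best, num / den)
    return best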

1.2. Gittins index computation

Among the algorithms for computing the Gittins index available in the literature [4], we discuss here the largest remaining index algorithm [5].

Assume a finite set of states {S}. Call {s^{1}} the state with the highest associated reward, i.e., {s^{1}=\mathrm{argmax}_s r(s)}. Then it is clear that, when starting from {s^{1}}, it is optimal to work for a single step and then retire, so {s^{1}} has the highest Gittins index, computed as

\displaystyle \overline{\gamma}(s^{1}) = \max_{s\in S} r(s). \ \ \ \ \ (15)

We relabel the states {s^{1}, s^{2},\dots} in decreasing order of Gittins index; ties are broken arbitrarily. Proceeding by induction, assume that we have already computed the indices {\overline{\gamma}(s^{1}),\dots,\overline{\gamma}(s^{k-1})} and we want to compute {\overline{\gamma}(s^{k})}. Although we do not yet know the identity of the state {s^{k}} nor the value of its index {\overline{\gamma}(s^{k})}, we know that {\overline{\gamma}(s^{i})\ge \overline{\gamma}(s^{k})} for all {1\le i< k} and {\overline{\gamma}(s^{i})\le \overline{\gamma}(s^{k})} for {i>k}. Hence, if we set the subsidy {\gamma:=\overline{\gamma}(s^{k})} and the starting state is {s^{k}}, then the optimal policy prescribes to work in states {s^{1},\dots,s^{k-1}} and stop as soon as one of the remaining states {S\setminus \{s^{1},\dots,s^{k-1}\}} is hit. Let {T^{k}} be the associated stopping time. Then we compute:

\displaystyle \overline{\gamma}(s^{k}) = \max_{s \in S\setminus \{s^{1},\dots,s^{k-1}\}} \frac{\mathbb E \big[\sum_{t\le T^{k}} \beta^t r(s_t)\, \big| \, s_0=s\big]}{\mathbb E \big[\sum_{t\le T^{k}} \beta^t \, \big| \, s_0=s\big]} \ \ \ \ \ (16)

and {s^{k}} as the argmax of the expression above. We now prove that (16) is legitimate. In fact, for {s^{k}}, the formula in (16) coincides with (14). Moreover, for any other state {s\in S\setminus \{s^{1},\dots,s^{k}\}}, we have that:

\displaystyle \frac{\mathbb E \big[\sum_{t\le T^{k}} \beta^t r(s_t) \, \big| \, s_0=s\big]}{\mathbb E \big[\sum_{t\le T^{k}} \beta^t \, \big| \, s_0=s \big]} \le \overline{\gamma}(s) \le \overline{\gamma}(s^{k}) \ \ \ \ \ (17)

where the first inequality stems from the fact that {T^{k}} is, in general, not the optimal stopping time when starting in state {s} (cf. the supremum in (14)), and the second holds since {s} does not belong to {\{s^{1},\dots,s^{k}\}}, hence {\overline{\gamma}(s)\le \overline{\gamma}(s^{k})}. We recap the procedure below.

Largest remaining index algorithm [5].

  1. Compute the largest Gittins index {\overline{\gamma}(s^{1}) = \max_{s\in S} r(s)} and the corresponding state {s^{1}} as its argmax.
  2. For {k=2,3,\dots}, define {T^{k}} as the stopping time obtained by retiring as soon as a state in {S\setminus \{s^1,\dots,s^{k-1}\}} is encountered.
  3. Compute the {k}-th largest Gittins index {\overline{\gamma}(s^{k})} as in (16) and the corresponding state {s^{k}} as its argmax. Return to step 2.
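Here is a minimal Python sketch of the whole procedure (the function name gittins_indices is ours). For each {k}, the two expectations in (16) are obtained by solving a linear system: zeroing the transitions that leave the continuation set {\{s^1,\dots,s^{k-1}\}} and inverting {(I-\beta P_C)} yields, for every candidate start state, the expected discounted reward and the expected discounted time up to {T^{k}}.

import numpy as np

def gittins_indices(P, r, beta):
    """Largest remaining index algorithm [5] (sketch).

    P[s, s'] = p(s'|s), r[s] = r(s), beta = discount factor in [0, 1).
    Returns an array with the Gittins index of every state.
    """
    S = len(r)
    gamma = np.empty(S)
    continuation = np.zeros(S, dtype=bool)    # states s^1, ..., s^{k-1} found so far

    for _ in range(S):
        # Rule T^k: work now, then keep working while the state stays in the
        # continuation set; retire as soon as a state outside it is hit.
        Pc = P * continuation                 # zero out transitions that trigger retirement
        A = np.eye(S) - beta * Pc
        num = np.linalg.solve(A, r)           # E[ sum_{t<=T^k} beta^t r(s_t) | s_0 = s ]
        den = np.linalg.solve(A, np.ones(S))  # E[ sum_{t<=T^k} beta^t        | s_0 = s ]

        ratio = num / den                     # the quantity maximized in (16)
        ratio[continuation] = -np.inf         # only the remaining states compete
        k = int(np.argmax(ratio))             # next state s^k (ties broken arbitrarily)
        gamma[k] = ratio[k]
        continuation[k] = True

    return gamma

The first pass returns {\max_{s} r(s)}, matching (15); the subsequent passes implement (16).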

2. Solving the multi-project problem with linear complexity in the number of projects

After dissecting the notion of Gittins index, we now turn our attention to the original problem, consisting in selecting one out of {N} projects at each time with the aim of maximizing the long-term discounted reward {R}, as in (1).

As mentioned at the beginning of the post, the problem can be formulated as an MDP, but the curse of dimensionality prevents us from solving it even for a moderate number of projects.

Let us reveal the main result [2] right away: it is extremely elegant, and it yields an algorithm that computes an optimal project selection strategy with complexity linear in the number of projects.

Theorem 2 An optimal strategy for the multi-project problem always selects the project with the highest Gittins index.

In other words, at each time {t} we select the project {n^*=\mathrm{argmax}_n \overline{\gamma}^n(s_t^n)}. The state of project {n^*} evolves according to a known Markov law, while the other states stay unchanged. Hence, at each time we only need to recompute the Gittins index of the selected project.
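A minimal Python sketch of the resulting index policy follows; it reuses the gittins_indices function from Section 1.2, and the projects are again given as illustrative (P, r) pairs. Note that the indices of each project can be tabulated offline, since a project's index depends only on that project's own state and dynamics.

import numpy as np

def run_gittins_policy(projects, beta, horizon=200, seed=0):
    """Simulate the index policy of Theorem 2 on a list of projects."""
    rng = np.random.default_rng(seed)
    indices = [gittins_indices(P, r, beta) for P, r in projects]   # one table per project
    states = [0] * len(projects)                  # arbitrary initial states
    total, discount = 0.0, 1.0

    for _ in range(horizon):
        # Select the project whose current state has the largest Gittins index.
        n = max(range(len(projects)), key=lambda j: indices[j][states[j]])
        P, r = projects[n]
        total += discount * r[states[n]]
        states[n] = rng.choice(len(r), p=P[states[n]])   # only the selected project moves
        discount *= beta

    return total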

Formal proofs of Theorem 2 can be found in [6]. Next we provide the main intuitions.

Decreasing rewards. Suppose that the rewards provided by each project are non-increasing over time. Then, as intuition suggests, it is optimal to select at each step the project currently offering the highest reward (a simple exchange argument shows that swapping the order of any two selections cannot increase the discounted sum).

Prevailing charge. Let us perform a thought experiment. For a given project, imagine that a charge {\delta_t}, which we call the prevailing charge, is paid each time we work on the project. The prevailing charge is defined as the minimum Gittins index seen so far on that project:

\displaystyle \delta_t = \min_{t'\le t} \overline{\gamma}(s_{t'}). \ \ \ \ \ (18)

By definition, {\delta_t} is non-increasing in {t}. It follows from property (10) that the accumulated prevailing charges are an upper bound on the reward collected up to any stopping time {T}. More formally, for any stopping time {T},

\displaystyle \mathbb{E}\left[\sum_{t\le T} \beta^t r(s_t) \, \Big| \, s_0=s \right] \le \mathbb{E}\left[\sum_{t\le T} \beta^t \delta_t \, \Big| \, s_0=s \right] \ \ \ \ \ (19)

where equality holds when the stopping time {T} is such that {\overline{\gamma}(s_{T+1})<\delta_T}, i.e., when we stop as soon as the Gittins index falls below the current prevailing charge; by Theorem 1, this is the optimal stopping time associated with a subsidy equal to {\delta_T}. Hence, the prevailing charges act as an “equivalent reward”, which is non-increasing over time.
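As a sanity check, the bound (19) can be verified numerically: simulate one project, track the prevailing charge as the running minimum of the Gittins indices along the trajectory, and compare the two discounted sums for a simple stopping time (here, a fixed horizon). A sketch reusing gittins_indices:

import numpy as np

def check_prevailing_charge_bound(P, r, beta, s0, horizon=30, n_runs=20000, seed=1):
    """Monte Carlo estimate of both sides of (19) for a fixed-horizon stopping time."""
    rng = np.random.default_rng(seed)
    idx = gittins_indices(P, r, beta)
    lhs = rhs = 0.0
    for _ in range(n_runs):
        s, delta, disc = s0, np.inf, 1.0
        for _ in range(horizon):
            delta = min(delta, idx[s])            # prevailing charge, eq. (18)
            lhs += disc * r[s] / n_runs
            rhs += disc * delta / n_runs
            s = rng.choice(len(r), p=P[s])
            disc *= beta
    return lhs, rhs                               # expect lhs <= rhs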

Putting the pieces together. Finally, we have managed to turn our problem into one where the “equivalent rewards” {\delta_t} are non-increasing over time. In this case, we know what the optimal policy looks like, i.e., always selecting the project with highest prevailing charge, which by construction coincides with picking the project with the highest Gittins index.

More formally, let {\pi^*} be the index policy that always chooses the project with the highest Gittins index, and let {R(\pi)} be the long-term discounted reward of strategy {\pi}, as in (1). Then, for any project-selection policy {\pi}, we have:

\displaystyle R(\pi) \le \mathbb E\Big[\sum_{t\ge 0} \beta^t \delta^{\pi_t}_t\Big] \le \mathbb E\Big[\sum_{t\ge 0} \beta^t \delta^{\pi^*_t}_t\Big] = R(\pi^*) \ \ \ \ \ (20)

where the first inequality stems from property (19), the second can be derived from the non-increasing property of the equivalent rewards {\delta} (always picking the largest available charge maximizes the discounted sum), and the final equality comes from the observation that (19) holds with equality when the Gittins policy (which keeps working on a project as long as its Gittins index is not below its prevailing charge) is followed for each project.

References

[1] Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. John Wiley and Sons.

[2] Gittins, J., Glazebrook, K., Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley and Sons.

[3] Ross, S. M. (2014). Introduction to stochastic dynamic programming. Academic press.

[4] Chakravorty, J., Mahajan, A. (2014). Multi-Armed Bandits, Gittins Index, and its Calculation. Methods and applications of statistics in clinical trials: Planning, analysis, and inferential methods, 2, 416-435.

