Algorithm: Time per turn:

Visualize how different algorithms play for three machines where $\mu_1 = -2$ , $\mu_2 = 0$ , and $\mu_3 = 2$ . (Note: epsilon greedy uses $\epsilon=0.2$ )Imagine the following scenario: you have three slot machines in front of you. Each slot machine may or may not reward you when you pull its arm, and the probability that a machine will reward you is different for each machine. For example, machine #1 might give out a reward with a 50% chance, machine #2 might have a 99% probability of giving out a reward, and machine #3 might never give out a reward no matter how many times you play it. If you knew these probabilities beforehand, you would naturally want to pull machine #2 over and over again to get the most rewards. But what if you didn’t know these probabilities beforehand? How would you maximize your reward?

Try out your strategy. The boxes below each represent a slot machine -- click on one to pull its arm, and it will show you a running count of rewards and total pulls. On each click, a very simple algorithm is also running in the background to try to solve the same problem that you are. This algorithm, fittingly called “random pull”, just picks a random machine to pull each time. See if you can beat the algorithm, and if you can, try again with a different number of machines.

Number of machines:

The hypothetical problem stated at the outset is the basic setup of what is known as
the **multi-armed bandit (MAB) problem**.

The multi-armed bandit (short: bandit or MAB) can be seen as a set of real distributions , each distribution being associated with the rewards delivered by one of the levers. Let be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards.

Note that we used discrete rewards in the interactive component above (rewards are either 0 or 1), but for our remaining discussion, we will assume that rewards are defined by normal distributions with varying means but the same standard deviations. Our goal, however, remains the same -- without knowing the distribution of each machine, maximize the profit gained. Several algorithms have been designed to give approximate solutions to the MAB problem. For example, though it is far from effective, one algorithm that we have already seen is to just pick a random arm to pull for each turn. In this article, we will be focusing on two algorithms that are a bit more interesting:

- Epsilon greedy
- UCB1

First, let’s get a general idea of how each one works.

When you were trying out your own strategy to maximize your profit, what did you do? On one hand, since you wanted to get the most rewards possible, you probably favored machines that had a good history of rewards relative to the number of times you pulled that machine. Chances are, however, that you didn’t just stick to one machine. You probably also pulled each of the machines a couple of times to get an idea of how each one behaved. In other words, your strategy was probably a mix of:

*Exploring*: trying out different machines; after all, how are you supposed to know which machine is the “best” if you don’t give each one a shot?*Exploiting*: given the history of each machine, maximize your profit by picking the one with the best proportion of rewards to pulls.

Choosing the right mix of **exploration vs. exploitation** is a difficult balance to achieve.
Exploit too much, and you might miss out on the real best machine.
Explore too much, and you’ll waste turns on subpar machines. Different algorithms
aimed at solving the MAB problem go about balancing exploration and exploitation in
different ways. Our first strategy, the epsilon greedy strategy, essentially leaves
this problem up to the user to solve by having him/her define a
constant $\epsilon$ . $\epsilon$ is then used by the
algorithm in the following way:

- Choose a random machine to pull with probability = $\epsilon$ .
- Choose the best machine to pull with probability = $1 - \epsilon$ .

Here, the algorithm defines the “best” machine very simply -- it is just the one with the highest experimental mean, where the experimental mean is calculated as the sum of the rewards from that machine divided by the number of times that machine has been pulled.

The experimental mean of machine after turns iswhere is the reward given by machine at timestep and is the number of times that machine has been pulled across total turns.

Pseudocode for the algorithm:

```
for t = 1...numTurns:
flip a coin with probability epsilon of heads and 1 - epsilon of tails
if heads:
pull a random arm
else:
calculate the arm with the best experimental mean
pull the arm with the best experimental mean
```

The below component allows you to visualize how the experimental means change as the algorithm runs. Try out different settings for the number of arms, epsilon, and the standard deviation of the arms -- do some settings result in the algorithm finding good estimates for the experimental means faster?

Number of arms:

$\epsilon$ : 0.10 $\sigma$ : 0.50

Time per turn:

The UCB1 algorithm is described as **“optimism in the face of uncertainty.”**
Given a history of rewards and a number of
pulls for each arm, the UCB1 algorithm calculates an “upper confidence bound” (UCB) which
it uses to decide which arm to pull. The optimistic aspect of this algorithm comes
from the fact that despite being uncertain about the estimated means of the
distributions of the $k$ arms, at each timestep $t$
it always selects the arm $i$ that maximizes
the sum $\bar{x}_i + a(i, t)$ , where $\bar{x}_i$ is the
estimated mean of machine $i$ and $a(i, t)$ is the
UCB of machine $i$ at timestep $t$ .

The upper confidence bound of machine at timestep iswhere is the number of times that machine has been pulled at timestep .

Pseudocode for the algorithm:

```
first k turns: initialize experimental means by pulling each arm once
for t = k+1...numTurns:
for i = 1...k:
calculate a(i, t) = 2 * log(t / p(i, t))
pull arm i that maximizes experimental mean + a(i, t)
```

One of the most important features of the UCB is that it not only exponentially decays as the number of pulls on the given machine increases, but also increases as the timestep increases. In other words, arms that have been explored less are given a boost even if their estimated mean is low, especially if we’ve been playing for a while. In this way, the UCB1 algorithm is able to naturally define its own mix of exploration vs. exploitation without depending on a user supplied parameter like epsilon greedy.

The below graph visualizes the UCB for a specific arm at a given timestep $t$ -- notice the shape of the line as the timestep and number of pulls for that arm increases.

Timestep $t$ : 1.00

Now that we know what each algorithm does, let’s find out how well each one performs.

To evaluate the quality of an algorithm, we first have to define how we will measure its performance. One commonly used measure is “regret”. Simply put, regret measures the difference between what could have been achieved by always making the best decision (ie: the one that maximizes reward) and what the algorithm actually chose to do.

Suppose that, for , where is the total number of slot machines, slot machine gives out rewards that are normally distributed with mean and standard deviation ( is the same across all machines). Then we define the total cumulative regret over turns as:where is the reward gained from the decision of the algorithm at timestep and is the highest mean (or, equivalently, the highest expected payout) of all machines.

In the definition above, the right term is the sum of all of the actual rewards
that we gained, while the left term represents the total reward that we could
have gained by always choosing the best arm. Since regret measures how far off the
algorithm was from making the best series of choices, we can say that a “better”
algorithm is one that minimizes its regret. Keep this in mind for all subsequent
discussions -- **lower regret means better performance**.

Epsilon greedy has an $\mathcal{O}(T)$ theoretical bound on its total regret, where $T$ is the total number of turns. This matches our intuition -- since the algorithm always chooses the best arm with probability $1 - \epsilon$ and randomly chooses an arm with probability $\epsilon$ no matter what turn it is on, we expect that the regret will scale linearly with the total number of turns that we play.

The proof of UCB1′s theoretical performance is less intuitive than that of epsilon greedy. Interested readers should refer to Auer, Cesa-Bianchi & Fisher (2002) for a full proof, but for our discussion, it will suffice to know that the total regret of UCB1 over $T$ turns and $k$ arms has a $\mathcal{O}(\sqrt{kT\log(T)})$ bound.

Theoretically, UCB1 is therefore expected to outperform epsilon greedy in the long run. Of course, with a bound like $\mathcal{O}(\sqrt{kT\log(T)})$ , this may not be immediately obvious. Try visualizing the theoretical upper limits for each algorithm in the below interactive graph.

$k$ (number of arms): $T$ (number of steps):

Up until now, we have only considered the theoretical performance of each algorithm. Do these bounds match up with what we see in practice?

Try running your own experiments to find out. Below, you can test how the values of different parameters affect the performance of the epsilon greedy and UCB1 algorithms when run on machines with the same reward distributions. The lines on the chart represent the average total regret after each timestep on the x-axis, where averages are taken across the number of experiments that you specify. Can you find settings where epsilon greedy outperforms UCB1 or vice versa?

Number of experiments:

Arms: Turns per experiment:

Std. dev: 0.10 Epsilon: 0.10

As you may have already discovered by yourself, although UCB1 should theoretically outperform epsilon greedy, in practice, this is not always the case. The panel of graphs below display results for preconfigured combinations of $k$ , the number of arms, and $\sigma$ , the standard deviation across all arms. The left column of graphs shows empirical results (average total regret was taken over 100 experiments with each consisting of 1,000 turns), and the right column shows our predictions based on the theoretical bounds of each algorithm. Note that $\epsilon$ is held constant at $0.2$ for epsilon greedy.

Each click of the buttons below runs a fresh set of experiments for a certain value of $k$ .
In comparing performance, remember again that **lower regret is better**.

In practice, we see that UCB1 tends to outperform epsilon greedy when the number of arms is low and the standard deviation is relatively high, but its performance worsens as the number of arms increases.

This article has explored two approaches to solving the MAB problem: epsilon greedy and UCB1. Both algorithms take different approaches to finding a balance between exploring and exploiting, sometimes with very different end results. In particular, although UCB1′s theoretical performance suggests that it should outperform epsilon greedy, the simpler algorithm is nevertheless still sometimes better in practice. However, if we are faced with a small number of arms and know beforehand that the reward distributions have relatively high standard deviations, UCB1 is likely the better algorithm to choose.

Now that you know more about the two algorithms, see if you can beat each one!

Number of machines: Algorithm:

Besides the two algorithms that we discussed, other strategies for solving the MAB problem exist. For those interested in learning more, we recommend starting with the following:

- Boltzmann exploration (SoftMax): similar to the algorithms discussed above, but each arm is pulled with a probability that is proportional to the average reward given by that arm.
- Adaptive epsilon-greedy based on value differences (VDBE): an extension of the epsilon greedy algorithm that uses a state dependent exploration probability, therefore taking the responsibility of balancing exploration vs. exploitation out of the hands of the implementer.
- LinUCB: an algorithm aimed at solving a variant of the MAB problem called the
*contextual*multi-armed bandit problem. In the contextual version of the problem, each iteration is associated with a feature vector that can help the algorithm decide which arm to pull.