The Multi-Armed Bandit Problem

An exploration of epsilon greedy and UCB1

Algorithm: Time per turn:

Visualize how different algorithms play for three machines where μ1=2\mu_1 = -2 , μ2=0\mu_2 = 0 , and μ3=2\mu_3 = 2 . (Note: epsilon greedy uses ϵ=0.2\epsilon=0.2 )


Imagine the following scenario: you have three slot machines in front of you. Each slot machine may or may not reward you when you pull its arm, and the probability that a machine will reward you is different for each machine. For example, machine #1 might give out a reward with a 50% chance, machine #2 might have a 99% probability of giving out a reward, and machine #3 might never give out a reward no matter how many times you play it. If you knew these probabilities beforehand, you would naturally want to pull machine #2 over and over again to get the most rewards. But what if you didn’t know these probabilities beforehand? How would you maximize your reward?

Try out your strategy. The boxes below each represent a slot machine -- click on one to pull its arm, and it will show you a running count of rewards and total pulls. On each click, a very simple algorithm is also running in the background to try to solve the same problem that you are. This algorithm, fittingly called “random pull”, just picks a random machine to pull each time. See if you can beat the algorithm, and if you can, try again with a different number of machines.

Number of machines:

The hypothetical problem stated at the outset is the basic setup of what is known as the multi-armed bandit (MAB) problem.

Definition: Multi-armed Bandit (MAB) Problem
The multi-armed bandit (short: bandit or MAB) can be seen as a set of real distributions , each distribution being associated with the rewards delivered by one of the levers. Let be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards.

Note that we used discrete rewards in the interactive component above (rewards are either 0 or 1), but for our remaining discussion, we will assume that rewards are defined by normal distributions with varying means but the same standard deviations. Our goal, however, remains the same -- without knowing the distribution of each machine, maximize the profit gained. Several algorithms have been designed to give approximate solutions to the MAB problem. For example, though it is far from effective, one algorithm that we have already seen is to just pick a random arm to pull for each turn. In this article, we will be focusing on two algorithms that are a bit more interesting:

First, let’s get a general idea of how each one works.


Epsilon Greedy

When you were trying out your own strategy to maximize your profit, what did you do? On one hand, since you wanted to get the most rewards possible, you probably favored machines that had a good history of rewards relative to the number of times you pulled that machine. Chances are, however, that you didn’t just stick to one machine. You probably also pulled each of the machines a couple of times to get an idea of how each one behaved. In other words, your strategy was probably a mix of:

Choosing the right mix of exploration vs. exploitation is a difficult balance to achieve. Exploit too much, and you might miss out on the real best machine. Explore too much, and you’ll waste turns on subpar machines. Different algorithms aimed at solving the MAB problem go about balancing exploration and exploitation in different ways. Our first strategy, the epsilon greedy strategy, essentially leaves this problem up to the user to solve by having him/her define a constant ϵ\epsilon . ϵ\epsilon is then used by the algorithm in the following way:

Here, the algorithm defines the “best” machine very simply -- it is just the one with the highest experimental mean, where the experimental mean is calculated as the sum of the rewards from that machine divided by the number of times that machine has been pulled.

Definition: Experimental Mean
The experimental mean of machine after turns is
where is the reward given by machine at timestep and is the number of times that machine has been pulled across total turns.

Pseudocode for the algorithm:

for t = 1...numTurns:
  flip a coin with probability epsilon of heads and 1 - epsilon of tails
  if heads:
    pull a random arm
    calculate the arm with the best experimental mean
    pull the arm with the best experimental mean

The below component allows you to visualize how the experimental means change as the algorithm runs. Try out different settings for the number of arms, epsilon, and the standard deviation of the arms -- do some settings result in the algorithm finding good estimates for the experimental means faster?

Number of arms:

ϵ\epsilon : 0.10 σ\sigma : 0.50

Time per turn:


The UCB1 algorithm is described as “optimism in the face of uncertainty.” Given a history of rewards and a number of pulls for each arm, the UCB1 algorithm calculates an “upper confidence bound” (UCB) which it uses to decide which arm to pull. The optimistic aspect of this algorithm comes from the fact that despite being uncertain about the estimated means of the distributions of the kk arms, at each timestep tt it always selects the arm ii that maximizes the sum x¯i+a(i,t)\bar{x}_i + a(i, t) , where x¯i\bar{x}_i is the estimated mean of machine ii and a(i,t)a(i, t) is the UCB of machine ii at timestep tt .

Definition: Upper Confidence Bound (UCB)
The upper confidence bound of machine at timestep is
where is the number of times that machine has been pulled at timestep .

Pseudocode for the algorithm:

first k turns: initialize experimental means by pulling each arm once
for t = k+1...numTurns:
  for i = 1...k:
    calculate a(i, t) = 2 * log(t / p(i, t))
  pull arm i that maximizes experimental mean + a(i, t)

One of the most important features of the UCB is that it not only exponentially decays as the number of pulls on the given machine increases, but also increases as the timestep increases. In other words, arms that have been explored less are given a boost even if their estimated mean is low, especially if we’ve been playing for a while. In this way, the UCB1 algorithm is able to naturally define its own mix of exploration vs. exploitation without depending on a user supplied parameter like epsilon greedy.

The below graph visualizes the UCB for a specific arm at a given timestep tt -- notice the shape of the line as the timestep and number of pulls for that arm increases.

UCB vs. Number of Pulls at Timestep t

Timestep tt : 1.00

Performance Analysis

Now that we know what each algorithm does, let’s find out how well each one performs.

Theoretical Performance

To evaluate the quality of an algorithm, we first have to define how we will measure its performance. One commonly used measure is “regret”. Simply put, regret measures the difference between what could have been achieved by always making the best decision (ie: the one that maximizes reward) and what the algorithm actually chose to do.

Definition: Regret
Suppose that, for , where is the total number of slot machines, slot machine gives out rewards that are normally distributed with mean and standard deviation ( is the same across all machines). Then we define the total cumulative regret over turns as:
where is the reward gained from the decision of the algorithm at timestep and is the highest mean (or, equivalently, the highest expected payout) of all machines.

In the definition above, the right term is the sum of all of the actual rewards that we gained, while the left term represents the total reward that we could have gained by always choosing the best arm. Since regret measures how far off the algorithm was from making the best series of choices, we can say that a “better” algorithm is one that minimizes its regret. Keep this in mind for all subsequent discussions -- lower regret means better performance.

Epsilon Greedy

Epsilon greedy has an O(T)\mathcal{O}(T) theoretical bound on its total regret, where TT is the total number of turns. This matches our intuition -- since the algorithm always chooses the best arm with probability 1ϵ1 - \epsilon and randomly chooses an arm with probability ϵ\epsilon no matter what turn it is on, we expect that the regret will scale linearly with the total number of turns that we play.


The proof of UCB1′s theoretical performance is less intuitive than that of epsilon greedy. Interested readers should refer to Auer, Cesa-Bianchi & Fisher (2002) for a full proof, but for our discussion, it will suffice to know that the total regret of UCB1 over TT turns and kk arms has a O(kTlog(T))\mathcal{O}(\sqrt{kT\log(T)}) bound.

Theoretically, UCB1 is therefore expected to outperform epsilon greedy in the long run. Of course, with a bound like O(kTlog(T))\mathcal{O}(\sqrt{kT\log(T)}) , this may not be immediately obvious. Try visualizing the theoretical upper limits for each algorithm in the below interactive graph.

Regret Bound vs. Turn

kk (number of arms): TT (number of steps):

Empirical Performance

Up until now, we have only considered the theoretical performance of each algorithm. Do these bounds match up with what we see in practice?

Try running your own experiments to find out. Below, you can test how the values of different parameters affect the performance of the epsilon greedy and UCB1 algorithms when run on machines with the same reward distributions. The lines on the chart represent the average total regret after each timestep on the x-axis, where averages are taken across the number of experiments that you specify. Can you find settings where epsilon greedy outperforms UCB1 or vice versa?

Average Total Regret vs. Turn

Number of experiments:

Arms: Turns per experiment:

Std. dev: 0.10 Epsilon: 0.10

As you may have already discovered by yourself, although UCB1 should theoretically outperform epsilon greedy, in practice, this is not always the case. The panel of graphs below display results for preconfigured combinations of kk , the number of arms, and σ\sigma , the standard deviation across all arms. The left column of graphs shows empirical results (average total regret was taken over 100 experiments with each consisting of 1,000 turns), and the right column shows our predictions based on the theoretical bounds of each algorithm. Note that ϵ\epsilon is held constant at 0.20.2 for epsilon greedy.

Each click of the buttons below runs a fresh set of experiments for a certain value of kk . In comparing performance, remember again that lower regret is better.

k=3k=3 , σ=0.01\sigma = 0.01 :

k=3k=3 , σ=0.1\sigma = 0.1 :

k=3k=3 , σ=1\sigma = 1 :

k=3k=3 , σ=5\sigma = 5 :

In practice, we see that UCB1 tends to outperform epsilon greedy when the number of arms is low and the standard deviation is relatively high, but its performance worsens as the number of arms increases.


This article has explored two approaches to solving the MAB problem: epsilon greedy and UCB1. Both algorithms take different approaches to finding a balance between exploring and exploiting, sometimes with very different end results. In particular, although UCB1′s theoretical performance suggests that it should outperform epsilon greedy, the simpler algorithm is nevertheless still sometimes better in practice. However, if we are faced with a small number of arms and know beforehand that the reward distributions have relatively high standard deviations, UCB1 is likely the better algorithm to choose.

Now that you know more about the two algorithms, see if you can beat each one!

Number of machines: Algorithm:

Besides the two algorithms that we discussed, other strategies for solving the MAB problem exist. For those interested in learning more, we recommend starting with the following: