Thompson Sampling
Created by Chia, Jonathan on Apr 09, 2022
Also referred to as the Bayesian Bandit
This article is written assuming a basic understanding of Bayesian statistics, A/B testing, and conjugate priors
This article is a summary of the lessons found in the below link:
https://udemy.com/course/bayesian-machine-learning-in-python-ab-testing/
Introduction:
=============
Thompson Sampling is one of the best multi-armed bandit A/B testing algorithms. In my opinion, it is also the most beautiful to implement!
But before we jump into Thompson Sampling, here is a quick overview of multi-armed bandit algorithms.
Multi-Armed Bandit Algorithms:
==============================
Multi-armed bandit algorithms address a key problem in A/B testing: half of the test subjects receive the inferior treatment.
Let's say we are running an experiment on a new cancer drug. In order to find out whether the drug actually works, we have to split cancer patients into a control group and a treatment group.
Do you see the problem here? The A/B test prioritizes figuring out whether the drug is effective over saving as many people as possible.
This is the problem known as the Explore-Exploit Dilemma:
Understanding the drug = Explore
Giving as many people as possible the drug (assuming the drug is effective) = Exploit
We could give most of the patients the drug (exploit), but then we wouldn't have enough data to statistically prove that the drug is better than the control. It's all about finding the best balance!
Multi-armed bandit algorithms balance exploring and exploiting, and each algorithm strikes that balance in a different way.
Instead of pre-assigning two groups, multi-armed bandit algorithms adjust and update as the experiment goes on. So for this drug trial, instead of giving half the people the treatment all at once, we would iteratively give either the treatment or the control, and the algorithm would update after each iteration. Once the algorithm starts to see that the drug is better, it will assign more people to the treatment instead of the control.
Thompson Sampling Intuition:
============================
Suppose we have 3 advertisements, each with a fixed but unknown probability of being clicked. We want to find out which advertisement has the highest click-through rate.
Each advertisement has an unknown click probability theta, and our belief about theta can be represented using a beta distribution:
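$p(\theta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}, \qquad 0 \le \theta \le 1$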
The reason we choose a beta is that its support is the range from 0 to 1 - exactly the range of a probability.
Each ad outputs either a click or no click (1 or 0). This gives a Bernoulli likelihood:
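$p(x \mid \theta) = \theta^{x}(1 - \theta)^{1 - x}, \qquad x \in \{0, 1\}$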
Using Bayes' rule, we can find a posterior distribution for the click-through rate.
Because the beta distribution is a conjugate prior for the Bernoulli likelihood, the posterior is also a beta distribution! Concretely, starting from a Beta(alpha, beta) prior, after observing N impressions with k clicks the posterior is:
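$\theta \mid \text{data} \sim \text{Beta}(\alpha + k,\; \beta + N - k)$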
After we get more data, the posterior we just found is used as the prior to find an updated posterior.
That's nice, but what does this have to do with A/B testing?
Think about the benefits of using Bayesian statistics. Each time we gather data on the ads, the estimated click-through rate probability can be updated. That means that as we test the ads, we will be able to figure out which ad is best during the experiment instead of at the very end.
Thompson Sampling Process:
==========================
1. Assign each ad a beta prior of Beta(1,1) - a uniform distribution. We don't know anything about the ads' click-through rates yet, so we start each one with an equally weighted probability distribution.
2. Pull a sample probability from each of the ads' distributions (see the code sketch after this list).
3. Whichever ad has the highest sampled probability, display that ad to the next customer.
4. Observe the outcome - in this example, the customer didn't click.
5. Update that ad's probability distribution - the new probability distribution (posterior) becomes the prior for the next update.
Initial probability distribution:
Updated probability distribution:
6. Repeat steps 2-5.
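To make steps 2 and 3 concrete, here is a minimal sketch, assuming NumPy; the posterior parameters are made-up counts, not values from the article:

    import numpy as np

    # made-up (alpha, beta) posterior parameters for three ads after a few trials
    posteriors = [(2, 5), (3, 4), (6, 2)]

    # step 2: draw one sample from each ad's current beta distribution
    samples = [np.random.beta(a, b) for a, b in posteriors]

    # step 3: the ad with the highest sampled probability is shown next
    next_ad = int(np.argmax(samples))
    print(samples, '-> show ad', next_ad)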
What you end up getting is something like this:
The blue line - representing the worst ad (with a real probability of .2) - has a probability distribution that is denser on the left. Because it has lost 2 times, its probability mass is likely on the lower end.
The red line - representing the best ad (with a real probability of .75) - has a probability distribution that is denser on the right because it has won two times.
Notice that we have shown the red ad 175 times! Why haven't we shown the other ads as much?
Look at steps 2 and 3 in the process. This is the key to how the Thompson Sampling algorithm handles the explore-exploit dilemma.
2. Pull a sample probability from each of the ads' distributions
3. Whichever ad has the highest sampled probability, we display that to the next customer
Based on the probability distributions, the algorithm decides which ad to test.
For example, when we run trial number 201, the algorithm will pull a sample from each of the three distributions.
The red distribution ranges from about .6 to .8, so this time the sample comes out at .75.
The yellow distribution ranges from about 0 to .8, and this time the sample comes out at .4.
The blue distribution yields a sample of .1.
The red distribution has the highest sample, so it will be shown again - but we can see that sometimes the red distribution will not win, due to randomness.
For example, let's say we run trial number 202.
The red distribution pulls a .60.
The yellow distribution pulls a .62.
The blue distribution pulls a .04.
This time the yellow distribution wins, so we will show the yellow ad next.
The algorithm is awesome because initially it tests all three ads, but as it narrows down which ad is better, it will show most of the customers the red ad.
THEREFORE, THOMPSON SAMPLING EXPLORES WELL IN THE BEGINNING, FINDS THE BEST OPTION, AND THEN EXPLOITS THAT OPTION THE MOST AT THE END
Because the probability distributions have spread, there is randomness that helps the algorithm explore.
As the probability distributions narrow, the algorithm is able to exploit more.
Code:
=====
For this code example, we are running a simulation to show that the algorithm actually works; therefore, we start with the true probabilities for the three ads.
We create a class with two functions. The pull function simulates an ad click based on the ad's true probability. The update function updates the beta parameters (which in turn updates the probability distribution) after each additional data point.
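The original code was embedded as an image, so here is a minimal sketch of such a class, assuming NumPy; the class name BanditArm and the attribute names a and b are illustrative, not necessarily the course's own:

    import numpy as np

    class BanditArm:
        def __init__(self, p):
            self.p = p  # true click-through rate (unknown to the algorithm)
            self.a = 1  # beta parameter alpha; Beta(1, 1) is the uniform prior
            self.b = 1  # beta parameter beta

        def pull(self):
            # simulate showing the ad: 1 (click) with probability p, else 0 (no click)
            return int(np.random.random() < self.p)

        def update(self, x):
            # conjugate beta-Bernoulli update: a click increments a, a no-click increments b
            self.a += x
            self.b += 1 - x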
The update function above uses this posterior:
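$\text{Beta}(a + x,\; b + 1 - x)$ for each observed outcome $x \in \{0, 1\}$ - a click increments $a$, and a no-click increments $b$.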
Then we run the experiment:
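A sketch of the experiment loop under the same assumptions; the middle ad's true rate (0.5) and the trial count (2000) are assumptions for illustration, since the article only states the .2 and .75 rates:

    BANDIT_PROBABILITIES = [0.2, 0.5, 0.75]  # 0.5 for the middle ad is assumed
    NUM_TRIALS = 2000  # assumed trial count

    bandits = [BanditArm(p) for p in BANDIT_PROBABILITIES]

    for _ in range(NUM_TRIALS):
        # step 2: sample from each ad's current posterior
        samples = [np.random.beta(b.a, b.b) for b in bandits]
        # step 3: show the ad with the highest sampled probability
        j = int(np.argmax(samples))
        # step 4: observe a click or no click
        x = bandits[j].pull()
        # step 5: update only the shown ad's posterior
        bandits[j].update(x)

    for i, b in enumerate(bandits):
        shows = b.a + b.b - 2  # each update adds exactly 1 to a + b
        print(f"ad {i}: shown {shows} times, posterior mean = {b.a / (b.a + b.b):.3f}")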
Output:
The algorithm figures out the best ad after about 500 trials.
Final distributions:
Estimates of the true probabilities - the algorithm estimated the best ad's rate pretty well but was quite far off for the 1st and 2nd ads. We didn't explore the 1st and 2nd ads enough to get close to their true probabilities, but the benefit is that we exploited the 3rd ad instead.