Contextual multi-armed bandits for content recommendation, or not by Bernoulli alone

VK Team
22 min read · Jan 9, 2023

Why “classic” ML recommendations fall short, how to promote content on social media using multi-armed bandits, and how to take context into account in the process.

My name is Alexander Sukhochev, I work in machine learning and lead the team handling recommendations and development of VK services. I want to share our experience and results from implementing contextual multi-armed bandits for content recommendation, using games and stickers as an example.

This article consists of four parts. If you are familiar with the topic, you may go directly to the second or third part, or read it in its entirety to get the full picture.

  • Introduction talks about the existing approaches to building recommender systems and what multi-armed bandits have to do with it. This section is for those who are not familiar with this approach.
  • Basic algorithms for solving the multi-armed bandit problem: Epsilon-Greedy Approach, Thompson Sampling, Upper Confidence Bound.
  • Contextual multi-armed bandits algorithm: this section covers contextual multi-armed bandits and how they are trained in a particular scenario that we used in our solution.
  • Implementation Notes: describes the implementation details, business requirements and outcomes using the game and sticker recommendation service as an example.

Introduction

A top-level overview of a typical recommender system looks like this:

  • There is a history of interaction between users uₛ ∈ U and content items iᵣ ∈ I.
  • Depending on business objectives, some kinds of interaction are considered positive while some are negative. That is, for each interacting pair (uₛ, iᵣ) we assign the answer yₛᵣ ∈ Y and optionally the weight wₛᵣ.
  • Each pair (uₛ, iᵣ) is assigned the attributes xₛᵣ ∈ X (descriptions of the user or content, or just their ids).
  • Our task is to teach the model to recognize which new pairs of users and pieces of content have a chance to interact positively in the future, based on X and Y.
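As a minimal illustration of this setup (the field names below are hypothetical, not the actual schema used at VK), the training data can be thought of as a log of labeled interactions:

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Interaction:
    """One (user, item) interaction from the history."""
    user_id: int               # u_s
    item_id: int               # i_r
    features: Sequence[float]  # x_sr: description of the user/content, or just ids
    label: int                 # y_sr: 1 = positive interaction, 0 = negative
    weight: float = 1.0        # w_sr: optional importance of the signal

# The model is trained on (features, label, weight) triples and then scores
# unseen (user, item) pairs by the probability of a positive interaction.
log = [
    Interaction(user_id=1, item_id=42, features=[0.3, 0.7], label=1),
    Interaction(user_id=2, item_id=42, features=[0.9, 0.1], label=0, weight=0.5),
]
```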

What are “classic” ML recommendations lacking?

Recommendations based solely on previous interactions between users and content, learned with a non-Bayesian, frequentist approach (for example, various algorithms based on matrix factorization, boosting, etc.), may be good, but these approaches have certain shortcomings that are hard to combat:

  • Algorithms learn from the history which they themselves influenced. However, this fact is not taken into account anywhere and can therefore lead to undesirable results during training.
  • It becomes more difficult for potentially high-quality but authentic content to find its niche.
  • Recommendations focus on content topics that were once of interest to the user instead of offering new topics.
  • It is not clear what to do with the constant appearance of new content on the platform that doesn’t provide any history of interaction to learn from.

All of these problems require a solution. The last three in particular can have a profound impact on the users’ view of the service. If we can help it, we don’t want to waste our users’ time on new content that is not high-quality and/or not suitable for them.

In the literature, this problem is called the exploration-exploitation trade-off:

  • On the one hand, we want to benefit from the recommender system by applying existing knowledge. This is exploitation.
  • On the other hand, by acting in such a “greedy” way, we slowly gain new knowledge that could potentially help us extract even more benefits in the future. Therefore, it would be better sometimes to take risks and try new behavior patterns, offer new content or recommend content to an unusual user group. This is exploration.

What are multi-armed bandits?

In machine learning, the problem of a multi-armed bandit is the problem of efficiently combining exploration and exploitation. It got its name because of its similarity to a casino game:

  • The player comes to a casino with slot machines (also known as one-armed bandits) and naturally wants to win as much as possible.
  • For each of the player’s attempts to pull the arm, the slot machine pays off a set amount of money with a certain probability, but each one-armed bandit has its own payoff rate.
  • It is known for a fact that there is a one-armed bandit whose payoff rate is high enough to make it worth finding.
  • Thus, the goal of the player is to find the most profitable one-armed bandit as quickly as possible and play it instead of the others.
Figure 1: This is how slot machines usually look in real life. The only difference in our case is that the answer is not a combination of three pictures resulting in different payoffs, but a binary value (whether the player won a fixed amount or not).

Since there are many one-armed bandits, it is usually said that we are dealing with one multi-armed bandit, each of whose arms is an arm of a one-armed bandit.

In applied tasks, each arm of a multi-armed bandit is not a choice of the next slot machine to play with, but a choice of one action from a discrete set that should maximize the business metric. Actions and metrics vary depending on field and goal. For example, when it comes to recommendations, the action can be either showing the user a specific piece of content, or choosing a top-level principle according to which recommendations for the user are to be generated in a given time.

The goal is to identify an effective strategy as quickly as possible and then stick with it.

Basic algorithms for solving the multi-armed bandit problem

To understand the algorithms, we first have to formalize the problem. There are different models, but in a general case, the problem of multi-armed bandits can be represented as a diagram, as shown in Figure 2.

Figure 2: The principle of the multi-armed bandit problem (diagram borrowed from web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf)

There is an agent (online decision algorithm), an environment (system) and a potentially infinite sequence of interaction iterations between the agent and the environment, numbered with index t. At each t-th step, the agent decides to perform the action aₜ, the environment in response to this action gives the answer yₜ, which directly or indirectly (through the function r(y)) characterizes the agent’s payoff.

In general, the signal yₜ is stochastic (nondeterministic) in nature. We can say that the action of the agent aₜ at the t-th step sets the probability distribution 𝕐 of the answer, from which a specific realization yₜ is obtained by random sampling (in other words, a particular realization of a random variable is drawn). Next, based on the newly performed action aₜ and the received signal yₜ, we update the approximations θ̂ of the parameters θ of the model that describes the dependence of the environment’s responses on the actions performed (the supervised learning block in the diagram). In what follows, we will call this element the model of the environment for short. Then, based on the latest approximations, the algorithm that decides which actions to perform (the optimizer) chooses the next step to be taken. This is what each iteration of the agent’s interaction with the environment looks like.

The casino problem is perhaps one of the simplest examples of the multi-armed bandit problem, namely, the Bernoulli multi-armed bandit problem. In this problem, the model for the dependence of the environment responses on the actions performed is a set of K Bernoulli distributions: one distribution per arm aₖ, k ∈ {1,…,K}. Each k-th Bernoulli distribution is a distribution of a random variable that can be equal to one (win) with probability θₖ or to zero (loss) with probability 1 − θₖ. These θₖ, k ∈ {1,…, K} are the parameters of this model. Despite its simplicity, this model can be used to solve applied problems, and we will analyze the main approaches to solving the multi-armed bandit problem using it as an example. Next, we will describe its shortcomings and ways to make it more complex that allow us to get rid of them.

Epsilon-greedy approach

The simplest approach is the epsilon-greedy algorithm. In general, it works like this:

  • With probability 1−ε, we choose the bandit’s arm that, based on the data collected so far, has the highest average payoff, and with probability ε we choose a random arm.
  • Update mathematical expectations of payoffs for all arms.
  • Repeat from the first step.

This approach is applied to the Bernoulli bandit problem in a trivial way:

  • The ratio of wins to the number of attempts is calculated for each arm.
  • The arm with the maximum ratio is chosen with probability 1−ε (exploitation), and a random arm is chosen with probability ε (exploration).
  • The ratios are updated based on the newly obtained data (a minimal sketch of this procedure follows this list).
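Here is a minimal sketch of this procedure for Bernoulli arms in plain NumPy, using the standard convention of exploring with probability ε; the payoff rates of the toy environment are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                 # number of arms
wins = np.zeros(K)    # observed payoffs per arm
pulls = np.zeros(K)   # attempts per arm
eps = 0.1             # exploration probability

def choose_arm() -> int:
    """Explore with probability eps (and while some arm is still untried), otherwise exploit."""
    if pulls.min() == 0 or rng.random() < eps:
        return int(rng.integers(K))          # exploration: a uniformly random arm
    return int(np.argmax(wins / pulls))      # exploitation: best win/attempt ratio so far

def update(arm: int, reward: int) -> None:
    """Record the binary outcome of pulling `arm`."""
    pulls[arm] += 1
    wins[arm] += reward

# toy environment with hidden payoff rates (made-up numbers)
true_theta = np.array([0.02, 0.05, 0.01])
for _ in range(10_000):
    a = choose_arm()
    update(a, int(rng.binomial(1, true_theta[a])))

print(wins / pulls)   # empirical payoff estimates per arm
```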

The disadvantage of this approach is irrational exploration. As new data becomes available, it becomes clear that some arms are more promising for research than others. However, they still get equal chances of being recommended to the user at the next iteration of exploration.

For clarity, let’s say that we want to recommend pictures of cats.

After some time, we have collected the following data: the first one got 2,000 likes and 100,000 views, the second got no likes and only 10 views, the third got 1 like and 10,000 views. Here we see likes as payoff and views as attempts.

At the exploitation stage, the first cat will be selected as the most promising one based on existing information.

Now let’s talk about exploration. The second photo received fewer likes, both in absolute terms and relative to the number of views, than the third. However, based on the number of views, we can assume that studying the demand for the second photo is more promising, while the lack of demand for the third photo in recommendations is already very likely. Within the epsilon-greedy approach, both pictures have the same chance of being recommended to the user during the exploration stage.

Of course, you can try to overcome such inefficiency with various heuristics, such as limitations on the number of views and the wins to attempts ratio. But it’s unreliable.

Bayesian theoretical minimum

To further explore how multi-armed bandits work, we have to find a way to train models that will allow:

  1. Combining new data with already existing beliefs about the environment in a streaming mode (that is, without storing the training sample in memory and constantly reusing it), and obtaining a new, more accurate model of the environment as a result.
  2. Obtaining not just point estimates of the model parameters, but also their probability distribution, which helps to understand how confident we are in the current estimates.

The Bayesian approach fits these requirements. Its whole essence can be expressed in one small formula, which is Bayes’ formula itself:

p(B|A) = p(A|B) p(B) / p(A)

Instead of the abstract A and B, we can immediately substitute the notions relevant for supervised machine learning: A = Y|X (model responses given the factors) and B = θ (the machine learning model parameters). As a result, we get the following:

p(θ|Y, X) = p(Y|θ, X) p(θ) / p(Y|X)

where p(θ) stands for our current beliefs about the model, expressed as the distribution density of the model parameters and based on previous data and basic a priori assumptions (or solely on a priori assumptions, if there is no data available yet). In the literature, this factor is called the prior distribution.

p(Y|θ, X) contains the information about the newly received data in the form of their likelihood function. In the literature, this factor is called the likelihood.

p(θ|Y, X) is our updated view of the model, based on the old model combined with the new data. In the literature, this factor is called the posterior distribution.

The factor

p(Y|X) = ∫ p(Y|θ, X) p(θ) dθ

is required for normalization. If it is removed, then the product p(Y|θ, X) p(θ) cannot be considered the distribution density of the parameters, since its integral will not equal one.

Sometimes the normalizing factor can be calculated analytically, in which case we say that we are conducting exact Bayesian inference. However, save for the simplest cases, it is extremely difficult or simply impossible to calculate it analytically. Numerical calculation can be very time consuming, which is why we instead try to approximate the posterior distribution p(θ|Y, X) by some distribution q(θ) that is more convenient to work with for estimating densities and taking samples, thus avoiding the difficult calculation of normalizing constants. This approach is called approximate variational inference.

In practice, both of these methods are used, and we will explore their application later in this article. It is also possible to use Markov chain Monte Carlo methods. However, they are not often used for multi-armed bandits, so we will not go into too much detail.

Let’s see how the posterior distribution in the Bernoulli bandit problem is derived by an analytical calculation of the normalizing constant, that is, how exact Bayesian inference is carried out. In this case, we don’t have any contextual factors, so instead of Y|X we will consider only Y.

The θ parameter is a set of θₖ parameters, each of which in this case describes the probability of getting a reward when using the k-th arm and thus sets the following data distribution density:

p(Yₖ|θₖ) = ∏ᵢ θₖ^[yₖᵢ = 1] · (1 − θₖ)^[yₖᵢ = 0]

Here Yₖ are the answers to the attempts to use the k-th arm, and yₖᵢ is the answer on the i-th iteration of using the k-th arm.

As the a priori distribution of the parameters θₖ we use the beta distribution:

p(θₖ) = θₖ^(αₖ − 1) · (1 − θₖ)^(βₖ − 1) / B(αₖ, βₖ)

αₖ and βₖ are the parameters of the k-th beta distribution (i.e., the distribution parameters of the parameter θₖ), and B(α, β) is the beta function, the particular form of which is not important in this case. Among all families of distributions, we chose the beta distribution because:

  • it defines a distribution on the interval from 0 to 1 (and θ can only take values from this interval);
  • if such a prior distribution is used, then the posterior distribution can be calculated analytically (this will be demonstrated later on).
Shapes of the beta distribution for different parameter values (image from Wikipedia)

Updating the posterior distribution when new data is collected is very simple: the posterior distribution, same as the prior here, is a beta distribution, with the following parameters:

αₖ′ = αₖ + ∑ᵢ[yₖᵢ = 1],  βₖ′ = βₖ + ∑ᵢ[yₖᵢ = 0]

Thus, the posterior distribution is calculated analytically and therefore extremely quickly.

Brief proof of this fact:

In this case, the normalizing constant is

p(Yₖ) = ∫ p(Yₖ|θₖ) p(θₖ) dθₖ,

but we can ignore it, because our posterior distribution looks like this:

p(θₖ|Yₖ) ∝ p(Yₖ|θₖ) p(θₖ) ∝ θₖ^(∑ᵢ[yₖᵢ = 1] + αₖ − 1) · (1 − θₖ)^(∑ᵢ[yₖᵢ = 0] + βₖ − 1)

i.e., the posterior distribution coincides with a beta distribution up to a constant (but now αₖ becomes ∑ᵢ[yₖᵢ = 1] + αₖ, and βₖ becomes ∑ᵢ[yₖᵢ = 0] + βₖ).

Thus, we can easily restore the normalizing constant: it is simply the beta function with the updated parameters, B(∑ᵢ[yₖᵢ = 1] + αₖ, ∑ᵢ[yₖᵢ = 0] + βₖ), taken straight from the definition of the beta distribution.

A small comment: in cases when the prior and posterior distributions belong to the same family, as they do here, the prior is said to be conjugate to the likelihood. Pairs of conjugate distributions, as shown above, are easy to work with, but it is often difficult to find a prior distribution that is conjugate to a given likelihood.
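In code, the conjugate update above is just a couple of additions. A tiny illustration with the numbers from the cat-picture example (2,000 likes out of 100,000 views, starting from a Beta(1, 1) prior):

```python
def update_beta(alpha: float, beta: float, wins: int, attempts: int) -> tuple[float, float]:
    """Conjugate Beta-Bernoulli update: alpha accumulates wins, beta accumulates losses."""
    return alpha + wins, beta + (attempts - wins)

# First cat picture from the example above: 2,000 likes out of 100,000 views, prior Beta(1, 1)
alpha, beta = update_beta(1.0, 1.0, wins=2_000, attempts=100_000)
print(alpha, beta)              # 2001.0 98001.0
print(alpha / (alpha + beta))   # posterior mean ≈ 0.02
```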

Thompson sampling

In general, Thompson sampling works like this:

  • The values are sampled from the given posterior distribution of the environment parameters θ.
  • Considering the results of sampling as the true values of the environment parameters, we choose the action that leads to the highest mathematical expectation of payoff.
  • Having received new data, we recalculate the updated posterior distribution of the parameters θ, considering the previous posterior distribution as a priori.
  • We repeat everything from the first point.

Applied to the Bernoulli bandit problem, this algorithm looks as follows. The environment is described by a set of parameters θₖ (one for each k-th arm), and each parameter has a prior beta distribution. At each iteration we do the following:

  • Sample the values Θₖ ~ p(θₖ) from the beta distribution of the parameter θₖ of each arm.
  • The sampled values Θₖ are treated as the payoff probabilities of the arms, so the optimal action is argmaxₖ Θₖ.
  • Having received new information from the environment in response to our actions, we recalculate the parameters of the beta distributions as described above.
  • We repeat everything from the first step.

Thompson sampling implements a much more efficient exploration-exploitation trade-off strategy. Intuitively, it can be understood this way: when we sample Θ from the distributions p(θₖ), the largest values are obtained either from distributions with a large mathematical expectation (in this case, we are closer to exploitation) or from distributions with a large variance (in this case, we are closer to exploration).
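A minimal sketch of Beta-Bernoulli Thompson sampling, again in plain NumPy and with made-up payoff rates, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
alpha = np.ones(K)   # Beta(1, 1) prior for every arm
beta = np.ones(K)

def choose_arm() -> int:
    """Sample a payoff probability for every arm from its posterior and act greedily on the samples."""
    theta_samples = rng.beta(alpha, beta)
    return int(np.argmax(theta_samples))

def update(arm: int, reward: int) -> None:
    """Conjugate update of the chosen arm's Beta posterior."""
    alpha[arm] += reward
    beta[arm] += 1 - reward

true_theta = np.array([0.02, 0.05, 0.01])   # hidden payoff rates of the toy environment
for _ in range(10_000):
    a = choose_arm()
    update(a, int(rng.binomial(1, true_theta[a])))

print(alpha / (alpha + beta))   # posterior means concentrate on the best arm
```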

Returning to the previous example with the cats, in the situation described above this approach will most likely choose:

  • either the first picture, with α = 2000 + 1 = 2001 (we assume that the a priori value is α = 1) and β = 100000 − 2000 + 1 = 98001 (we assume that the a priori value is β = 1), and therefore the expectation

E[θ] = α / (α + β) = 2001 / 100002 ≈ 0.02

(by the formula for the beta distribution mean). This mathematical expectation places the beta distribution of the first cat’s photo to the right of all the other beta distributions, so samples from this distribution are on average larger than samples from the others;

  • alternatively, the second photo, with α = 0 + 1 = 1 and β = 10 − 0 + 1 = 11, and therefore the variance

Var[θ] = αβ / ((α + β)²(α + β + 1)) = 11 / (12² · 13) ≈ 0.006

With such a relatively large variance, the probability of large Θ values occurring is high.

The third picture has very small chances of being selected (here α = 1 + 1 = 2 and β = 10000 − 1 + 1 = 10000), since it has both a small mathematical expectation

E[θ] = 2 / 10002 ≈ 0.0002

and a small variance

Var[θ] = αβ / ((α + β)²(α + β + 1)) ≈ 2·10⁻⁸

In research, decision-making algorithms, including those for the multi-armed bandit problem, are usually evaluated using the regret metric:

Regret(T) = ∑ₜ₌₁ᵀ ( E(yₜ|a*) − E(yₜ|aₜ) )

where T is the number of decision iterations, t is the number of the current iteration, E(yₜ|a*) is the mathematical expectation of the random variable yₜ if the optimal action a* is performed, and E(yₜ|aₜ) is the mathematical expectation of yₜ if the action aₜ chosen by our algorithm is performed.
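As a small sketch, cumulative regret over a run of T decisions can be computed like this, assuming the true expected payoffs of the arms are known (which is only possible in a simulation):

```python
import numpy as np

def cumulative_regret(true_means: np.ndarray, chosen_arms: np.ndarray) -> float:
    """Sum over iterations of E(y_t | a*) - E(y_t | a_t)."""
    return float(np.sum(true_means.max() - true_means[chosen_arms]))

# three arms with known means; the algorithm pulled arm 0 twice and arm 1 once
print(cumulative_regret(np.array([0.02, 0.05, 0.01]), np.array([0, 0, 1])))   # ≈ 0.06
```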

For example, in this article, Thompson sampling is empirically compared with the epsilon-greedy algorithm on many examples, with regret used as the quality metric.

Upper Confidence Bound

Another popular algorithm for solving the multi-armed bandit problem is the Upper Confidence Bound (UCB) algorithm. In short, at each step it chooses the arm with the largest weighted sum of the expected payoff and a term expressing the uncertainty of the prediction.

This principle of choosing actions at each step is intuitively similar to the one in Thompson sampling. There are theoretical estimates of the UCB algorithm’s regret, but we will not analyze it in more detail, since in practice, with batch updates of the environment parameters and batch generation of recommendations, Thompson sampling allowed us to perform more exploration.
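For reference, here is a sketch of the textbook UCB1 selection rule; it illustrates the “mean plus uncertainty bonus” idea and is not necessarily the exact variant the team evaluated:

```python
import numpy as np

def ucb1_arm(wins: np.ndarray, pulls: np.ndarray, t: int, c: float = 1.0) -> int:
    """Pick the arm with the largest empirical mean plus an uncertainty bonus."""
    untried = np.flatnonzero(pulls == 0)
    if untried.size > 0:
        return int(untried[0])                      # try every arm at least once
    means = wins / pulls
    bonus = c * np.sqrt(2.0 * np.log(t) / pulls)    # shrinks as an arm accumulates pulls
    return int(np.argmax(means + bonus))
```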

Algorithm of contextual multi-armed bandits

In reality, one action is rarely better than another in every situation; usually, almost everything depends on the context. In recommender systems, the context that needs to be taken into account is usually the set of user features.

Naive approach

The easiest way to take context into account is to divide the user feature space into separate cells by feature values and then compute the “bandit” statistics for each cell independently.

In the illustration below, the user space is divided into cells; each cell contains the cases of interaction with content by users whose features fall into that cell.

The circles represent user interactions with the content in the history. For simplicity, we take into account only two factors: gender and age.

For each cell, there is a set of K beta distributions of the parameters θ₁ … θₖ that is recalculated based on the data contained in that cell, independently of the other cells. In other words, for each arm we maintain M beta distributions, where M is the number of cells.

Thus, instead of the K distributions we would have if we did not take the context into account, we have to store and recalculate K × M distributions.
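A sketch of this naive scheme: an independent pair of Beta parameters is kept per (cell, arm), where the cell is simply a discretization of the user features (the bucketing below is a made-up example):

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
N_ARMS = 5

# (cell, arm) -> [alpha, beta]: every cell keeps its own independent Beta posteriors
posteriors = defaultdict(lambda: [1.0, 1.0])

def cell_of(gender: str, age: int) -> tuple:
    """Hand-made discretization of the user features (purely illustrative)."""
    return gender, min(age // 10, 6)        # age buckets 0-9, 10-19, ..., 60+

def choose_arm(gender: str, age: int) -> int:
    cell = cell_of(gender, age)
    samples = [rng.beta(*posteriors[(cell, k)]) for k in range(N_ARMS)]
    return int(np.argmax(samples))

def update(gender: str, age: int, arm: int, reward: int) -> None:
    cell = cell_of(gender, age)
    posteriors[(cell, arm)][0] += reward
    posteriors[(cell, arm)][1] += 1 - reward
```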

This approach has several drawbacks:

  • It is not clear how to correctly divide the space by features. In the example above, splitting by age can be done by expert judgment or based on an accepted classification. But if you add, for example, thematic descriptions of the user, the task becomes much harder.
  • As we increase the number of features taken into account, we divide the space into more cells, and each cell contains fewer training examples.

Logistic regression + variational inference

It’s time to fundamentally change the model of the environment. If earlier the probability of a binary win when using the k-th arm depended on a single parameter θₖ, now we introduce a vector of parameters θₖ = (θₖ¹, …, θₖᵛ) for each arm. Let the context be characterized by the feature vector x = (x¹, …, xᵛ); then the probability of winning when using the k-th arm will be equal to

p(y = 1 | x, θₖ) = σ(θₖᵀx) = 1 / (1 + exp(−θₖᵀx))

The sigmoid here ensures that the value we get from the model lies between 0 and 1.

The likelihood for each k-th arm is also calculated separately and looks like this:

p(Yₖ | θₖ, Xₖ) = ∏ᵢ σ(θₖᵀxₖᵢ)^[yₖᵢ = 1] · (1 − σ(θₖᵀxₖᵢ))^[yₖᵢ = 0]

Yₖ here stands solely for the answers to the attempts to use the k-th arm, yₖᵢ is the answer on the i-th iteration of using the k-th arm, and xₖᵢ is the context on the i-th iteration of using the k-th arm.

In this case, it is already difficult to choose an a priori distribution that would be conjugate to the likelihood and allow analytical calculation of the posterior distribution.

Therefore, as the a priori distribution, we simply take the most popular one, which is the multivariate normal distribution:

p(θₖ) = N(θₖ | μₖ, Σₖ)

The posterior distribution will be obtained using approximate variational inference, namely the Laplace approximation.

This approach means that if we are dealing with a unimodal, twice-differentiable distribution in which a significant part of the probability mass is concentrated around the mode, then it may be reasonable to approximate this distribution by a normal distribution.

The mathematical expectation of this approximation will be the mode of the posterior distribution, which can easily be found by any optimization algorithm:

θₖ* = arg max over θₖ of [ log p(Yₖ|θₖ, Xₖ) + log p(θₖ) ]

The covariance matrix of the approximate posterior distribution is found by the formula:

Σₖ = (−Hₖ)⁻¹, where Hₖ is the Hessian of log p(Yₖ|θₖ, Xₖ) + log p(θₖ) evaluated at the point θₖ* (the derivation is given below).

Derivation of formulas

In order to find the posterior distribution

p(θₖ | Yₖ, Xₖ) = p(Yₖ | θₖ, Xₖ) p(θₖ) / p(Yₖ | Xₖ),

let’s consider separately the numerator

f(θₖ) = p(Yₖ | θₖ, Xₖ) p(θₖ)

and its logarithm

g(θₖ) = log f(θₖ) = log p(Yₖ | θₖ, Xₖ) + log p(θₖ)

After we find the mode θₖ* = arg max g(θₖ), it will be possible to represent g(θₖ) in the vicinity of this point using the Taylor formula up to the second order:

g(θₖ) ≈ g(θₖ*) + ∇g(θₖ*)ᵀ(θₖ − θₖ*) + ½ (θₖ − θₖ*)ᵀ ∇²g(θₖ*) (θₖ − θₖ*)

and since θₖ* is the extremum, ∇g(θₖ*) = 0. Thus, we get:

g(θₖ) ≈ g(θₖ*) + ½ (θₖ − θₖ*)ᵀ ∇²g(θₖ*) (θₖ − θₖ*)

Taking the exponential of both parts, we get:

f(θₖ) ≈ f(θₖ*) · exp( ½ (θₖ − θₖ*)ᵀ ∇²g(θₖ*) (θₖ − θₖ*) )

This is a non-normalized form of the approximation of the posterior distribution. Here, the first factor f(θₖ*) is just a constant with respect to θₖ and is of no interest to us now. The second factor

exp( −½ (θₖ − θₖ*)ᵀ (−∇²g(θₖ*)) (θₖ − θₖ*) )

can be reduced to the form of the exponential factor

exp( −½ (θ − μ)ᵀ Σ⁻¹ (θ − μ) )

from the multivariate normal distribution density formula:

N(θ | μ, Σ) = (2π)^(−V/2) |Σ|^(−1/2) exp( −½ (θ − μ)ᵀ Σ⁻¹ (θ − μ) )

In order to do this, we have to set

μ = θₖ*,  Σ = ( −∇²g(θₖ*) )⁻¹

Using this analogy and knowing the normalizing constant of the exponential factor from the normal distribution density formula, i.e. (2π)^(−V/2) |Σ|^(−1/2), we can also find the normalizing constant for f(θₖ). As a result, we get:

p(θₖ | Yₖ, Xₖ) ≈ N( θₖ | θₖ*, ( −∇²g(θₖ*) )⁻¹ )

Thus, we have obtained a formula for recalculating the posterior distribution.
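Putting the pieces together, below is a compact sketch of Bayesian logistic regression for a single arm with a Laplace approximation and Thompson sampling on top. It is a simplified illustration of the scheme described above, not VK’s production code; in particular, the MAP search is delegated to scipy and each batch update reuses the previous posterior as the new prior:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LaplaceLogisticArm:
    """Bayesian logistic regression for one arm with a Gaussian N(mu, cov) posterior over weights."""

    def __init__(self, dim: int, prior_var: float = 1.0):
        self.mu = np.zeros(dim)              # posterior mean (starts at the prior mean)
        self.cov = prior_var * np.eye(dim)   # posterior covariance (starts at the prior)

    def update(self, X: np.ndarray, y: np.ndarray, w: np.ndarray | None = None) -> None:
        """Laplace update on a batch: find the MAP under the current Gaussian prior,
        then take the inverse Hessian at the MAP as the new covariance."""
        w = np.ones(len(y)) if w is None else w
        prior_prec = np.linalg.inv(self.cov)

        def neg_log_posterior(theta):
            p = sigmoid(X @ theta)
            log_lik = np.sum(w * (y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
            log_prior = -0.5 * (theta - self.mu) @ prior_prec @ (theta - self.mu)
            return -(log_lik + log_prior)

        theta_map = minimize(neg_log_posterior, self.mu, method="L-BFGS-B").x
        p = sigmoid(X @ theta_map)
        # Hessian of the negative log posterior at the MAP: prior precision + weighted X^T diag(p(1-p)) X
        hessian = prior_prec + (X * (w * p * (1 - p))[:, None]).T @ X
        self.mu, self.cov = theta_map, np.linalg.inv(hessian)

    def sample_score(self, x: np.ndarray, rng: np.random.Generator) -> float:
        """Thompson sampling step: draw weights from the posterior and score the context."""
        theta = rng.multivariate_normal(self.mu, self.cov)
        return float(sigmoid(x @ theta))

# Choosing an item for a user: sample a score from every arm and take the best one, e.g.
#   rng = np.random.default_rng(0)
#   arms = [LaplaceLogisticArm(dim=len(user_features)) for _ in items]
#   best_item = max(range(len(items)), key=lambda k: arms[k].sample_score(user_features, rng))
```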

Implementation notes and our results

We applied the method of multi-armed contextual bandits with Laplacian approximation and Thompson sampling to promote games and stickers on the second tab of our platform.

In both cases, the recommended items were considered as arms of the multi-armed bandits.

Decision-making context

The context included the characteristics of the user: gender, age and any features of their interests.

Let’s take a closer look at the interests. On the one hand, they are extremely important, since they characterize the user in a most evident way. On the other hand, there are certain restrictions to their use.

First, the feature vector must be quite small, because the formula requires inverting a matrix whose side length equals the number of features. Secondly, the topics behind these features must be encoded explicitly, without the intricacies characteristic of embeddings, because they are fed to an ordinary logistic regression rather than a neural network.

One option is to carry out topic modeling of the user, treating the user as a document and the communities they are subscribed to as words. But we have taken a different approach.

We already had high-quality community embeddings, obtained while solving the community recommendation problem, that describe community topics in detail. We clustered the communities in this embedding space using an adaptive EM algorithm with a limited number of clusters. The resulting clusters were quite interpretable; for instance, the examples below belong to the IT and popular science cluster.

Next, we decomposed each user over these community clusters, and this decomposition was used as the features of the user’s interests.

What are positive and negative interactions?

Recommendations for stickers and games in mobile apps are shown to the user as a horizontal scroll widget on the Hub tab.

When it comes to stickers, the actions treated as positive signals are viewing all the stickers in a pack and buying the pack.

For games, positive signals include a tap on the game followed by spending a significant amount of time in it (to filter out clickbait and misclicks), as well as adding the game to favorites.

Negative signals are more interesting, since the user usually does not openly show them. But it’s wrong to assume that all the recommendations that the user did not interact with are unsuccessful, because they could have simply been missed (which is easy in a scroll widget that does not initially show all objects).

We decided to consider interactions negative in cases when the user definitely saw the recommended object and deliberately did not show interest in it. In order to do this, we take all objects in horizontal scroll widgets that are located to the left of the object clicked by the user as examples of negative interaction. In the picture below, such examples of negative interactions are “Dryn” and “Donny”.
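A sketch of this labeling rule (the item names besides “Dryn” and “Donny”, and the widget representation, are made up for the example):

```python
def negative_impressions(shown_items: list[str], clicked_item: str) -> list[str]:
    """Items to the left of the clicked one were definitely seen and deliberately skipped."""
    if clicked_item not in shown_items:
        return []        # no click on this widget: we cannot be sure anything was actually seen
    return shown_items[: shown_items.index(clicked_item)]

print(negative_impressions(["Dryn", "Donny", "Vasya"], "Vasya"))   # ['Dryn', 'Donny']
```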

Different signal values

Naturally, not all positive actions are equal. For example, viewing a set of stickers is not the same as buying it. At the same time, negative interactions in our formulation do not give as clear a signal as positive ones do. It would be preferable to handle both kinds of actions within the same model. We solved this problem by assigning weights to certain types of signals: in the likelihood formula, the factors are raised to a power equal to the weight wₖᵢ of the corresponding signal:

p(Yₖ | θₖ, Xₖ) = ∏ᵢ σ(θₖᵀxₖᵢ)^(wₖᵢ·[yₖᵢ = 1]) · (1 − σ(θₖᵀxₖᵢ))^(wₖᵢ·[yₖᵢ = 0])
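In log-likelihood terms, raising a factor to the power of its weight is just a per-example multiplier; a tiny illustration with made-up weights:

```python
import numpy as np

def weighted_log_likelihood(p: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Raising a likelihood factor to the power w = multiplying its log term by w."""
    return float(np.sum(w * (y * np.log(p) + (1 - y) * np.log(1 - p))))

p = np.array([0.7, 0.2, 0.5])   # predicted probabilities of a positive interaction
y = np.array([1, 0, 1])         # observed signals
w = np.array([2.0, 0.5, 1.0])   # e.g. a purchase weighs more, a soft negative weighs less (made-up values)
print(weighted_log_likelihood(p, y, w))
```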

Quasi-random sampling

Experiments have shown that new objects, before being included in Thompson sampling on an equal footing, should first be recommended quasi-randomly.

This means that we don’t just randomly choose the moment to put a new object in the recommendations, but we do it in a way that allows it to be recommended in as many different contexts as possible. This helps us understand where this object will be in demand, and give it a good start.

There are many types of quasi-random sampling. The most popular, perhaps, is quasi-random sampling based on the Sobol sequence. As for us, we use a sampler we created ourselves.
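For illustration, quasi-random coverage of a (normalized) context space with a Sobol sequence can be obtained with scipy.stats.qmc; the team’s own sampler is not public, so this is only an analogue:

```python
import numpy as np
from scipy.stats import qmc

# 2-D contexts (say, normalized age and one interest score), covered more evenly
# by a low-discrepancy Sobol sequence than by plain uniform sampling
sobol = qmc.Sobol(d=2, scramble=True, seed=0)
quasi_points = sobol.random(8)          # 8 points in [0, 1)^2

uniform_points = np.random.default_rng(0).random((8, 2))
print(quasi_points)
print(uniform_points)
```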

Results

First of all, we tested our algorithm on Android games. Using A/B testing, we compared classic ML recommendations mixed with bandit recommendations against pure classic ML recommendations.

At first, the metrics in the test group fell slightly compared to the control group, but after two weeks they began to grow steadily. Two months later, the effect amounted to +8% in game installs and +2% in launches, after which we rolled the new recommendation system out to the entire audience. As for stickers, the revenue from their sales increased by 5% compared to classic ML recommendations.

In both cases, more items — namely games and stickers — began to gain users’ attention.

Now we are working on implementing this approach to the recommendations of other items. We are also developing new approaches for cases where feedback does not come immediately (reinforcement learning solutions) and when we don’t want to be limited by a small number of factors, but instead use all the available information (deep learning modifications such as those described in this article).

You can read more about other ML-related things we do at our company by looking up more blog posts by us. For example, take a look at our articles “Practitioner’s Guide to Statistical Tests” and “PYMK at VK: ML over EGO-NETS”.
