Data Debugging in Collaborative Filtering with Implicit Feedback: Does It Work?

VK Team
May 27, 2021

Hi, my name is Daniel. I’m a member of the Core ML Team at VK.com. I work on recommender systems, which is also the topic of this post.

An interesting paper, "Trading Personalization for Accuracy: Data Debugging in Collaborative Filtering" [2], was presented at NeurIPS 2020: https://papers.nips.cc/paper/2020/hash/019fa4fdf1c04cf73ba25aa2223769cd-Abstract.html.

Its main idea is that if you preprocess the training data for your matrix-factorization-based recommender model in the right way, you can get significant metric improvements on the test set. For example, here are the most impressive results from the paper's supplementary material:

The above table shows that if you modify 10% of the users' ratings from MovieLens in the training data, you get roughly a +100% increase in HitRate@5 and nDCG@5 on the test data ("Original CF" stands for the original collaborative filtering model trained without modifying the dataset; the columns show the percentage of modified ratings in the training data). These results look very promising. Reproducible code is available here: https://github.com/SoftWiser-group/CFDebug. Today, let's dig into it and find out whether we can use this method to boost our metrics.

The overall idea can be summarized as follows: "Are all ratings helpful in collaborative filtering, and if not, how can we mitigate harmful (i.e., overly personalized) ones to improve the overall recommendation accuracy?" To answer this, we split the training data several times into training and validation sets (in an out-of-fold fashion) and try to figure out which ratings in the training set have a negative effect on the value of the validation loss. Then we pick the top k% of training ratings with the strongest negative effect and either delete or modify them; here, modification means moving the rating along the anti-gradient of the loss wrt the ratings. The results of such modifications are presented in the table above. Note that no cheating took place here: the authors did not use the test set when identifying ratings to modify or remove.
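To make the procedure concrete, here is a minimal sketch of that loop in numpy. It is my own simplification, not code from the paper or the repo: `influence_fn` is a hypothetical helper that returns the gradient of the held-out loss wrt each training rating (computing it is exactly the chain-rule exercise covered later in this post), and I take a single gradient step per split instead of the paper's full descent.

```python
import numpy as np

def find_and_fix_mavericks(ratings, influence_fn, n_folds=5, top_frac=0.1, lr=0.1):
    """Out-of-fold search for harmful ("maverick") training ratings."""
    scores = np.zeros_like(ratings, dtype=np.float64)
    folds = np.random.randint(0, n_folds, size=len(ratings))
    for fold in range(n_folds):
        train_idx = np.where(folds != fold)[0]
        valid_idx = np.where(folds == fold)[0]
        # accumulate d(held-out loss)/d(rating) for the training part of the split
        scores[train_idx] += influence_fn(ratings, train_idx, valid_idx)
    # the top-k% ratings with the largest |gradient| are the mavericks
    k = int(top_frac * len(ratings))
    mavericks = np.argsort(-np.abs(scores))[:k]
    # move each maverick against the gradient (or simply drop it instead)
    fixed = ratings.copy()
    fixed[mavericks] -= lr * scores[mavericks]
    return fixed, mavericks
```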

Implicit vs Explicit Feedback

It's worth noting that in this work the authors used "explicit ALS", a model that tries to predict the rating (1, 2, 3, 4, or 5) directly. In practice, however, implicit feedback is much more common. We use implicit ALS a lot at VK, and I can hardly remember a case when we used explicit ALS. This means that before we do anything else, we first have to adapt the algorithm to work with implicit ALS. I used the implicit ALS implementation from https://github.com/benfred/implicit.

My implementation of implicit feedback support for the paper’s code is available here: https://github.com/Danila89/CFDebug.

Implicit ALS has the following loss function:

$$L = \sum_{u,i} c_{ui}\left(p_{ui} - u_u^{\top} v_i\right)^2 + \lambda\left(\sum_u \lVert u_u \rVert^2 + \sum_i \lVert v_i \rVert^2\right)$$

where u are user embeddings, v are item embeddings, lambda is the regularization factor, p is the binary preference (we say that it's 1 when the rating is 4 or 5, and 0 otherwise), and c is the confidence (c = 1 + alpha * p, where alpha is a hyperparameter).
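To make this concrete, here is roughly how such a model is built with the implicit library. This is a sketch under my own assumptions: the function name, the iteration count, and the exact confidence encoding are mine, not code from the repo.

```python
import numpy as np
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

ALPHA, LAMBDA, FACTORS = 5.0, 0.1, 10  # hyperparameters used later in this post

def fit_implicit_als(user_ids, item_ids, ratings, n_users, n_items):
    """user_ids, item_ids, ratings are parallel numpy arrays of interactions."""
    # binarize: ratings 4 and 5 become positive preferences (p = 1)
    mask = ratings >= 4
    # the library treats the matrix values as confidences c (unobserved
    # entries implicitly get c = 1), so we store 1 + alpha for positives
    conf = sp.csr_matrix(
        ((1.0 + ALPHA) * np.ones(mask.sum(), dtype=np.float32),
         (user_ids[mask], item_ids[mask])),
        shape=(n_users, n_items),
    )
    model = AlternatingLeastSquares(
        factors=FACTORS, regularization=LAMBDA, iterations=15
    )
    model.fit(conf)  # implicit >= 0.5 expects a user-item matrix here
    return model
```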

While the paper's authors were modifying ratings, we will modify confidences: they act as the weights of the observations, and reducing the weight of "noisy" observations seems like a straightforward approach to the problem.

We need to compute the gradient of the loss on the validation set (L_gamma) with respect to the confidence matrix of the training dataset (C):

$$\frac{\partial L_{\gamma}}{\partial c_{ui}}$$

Just as the paper's authors did, we will use the chain rule. First, we present the required gradient as the product of the gradient of the model parameters (theta) wrt C and the gradient of the validation loss wrt theta. Then, since our model is just a product of embeddings, we obtain a sum of two products (in the formula, U stands for user embeddings and V for item embeddings):

$$\frac{\partial L_{\gamma}}{\partial c_{ui}} = \left(\frac{\partial U}{\partial c_{ui}}\right)^{\top} \frac{\partial L_{\gamma}}{\partial U} + \left(\frac{\partial V}{\partial c_{ui}}\right)^{\top} \frac{\partial L_{\gamma}}{\partial V}$$

The derivatives of L_gamma wrt U and V are simple to calculate (gamma denotes the validation dataset):

$$\frac{\partial L_{\gamma}}{\partial u_u} = -2 \sum_{i:\,(u,i)\in\gamma} c_{ui}\left(p_{ui} - u_u^{\top} v_i\right) v_i, \qquad \frac{\partial L_{\gamma}}{\partial v_i} = -2 \sum_{u:\,(u,i)\in\gamma} c_{ui}\left(p_{ui} - u_u^{\top} v_i\right) u_u$$
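In numpy, these validation-loss gradients take just a few lines (again a sketch with my own variable names; the arrays hold the validation triples):

```python
import numpy as np

def grad_validation(U, V, val_users, val_items, val_p, val_c):
    """Gradients of the weighted squared validation loss wrt U and V.

    val_users, val_items: (m,) indices of the validation pairs;
    val_p, val_c: (m,) binary preferences and confidences.
    """
    # residuals p - u^T v for every validation pair
    resid = val_p - np.sum(U[val_users] * V[val_items], axis=1)
    w = -2.0 * val_c * resid
    dU, dV = np.zeros_like(U), np.zeros_like(V)
    # scatter-add each pair's contribution to its user's and item's gradient
    np.add.at(dU, val_users, w[:, None] * V[val_items])
    np.add.at(dV, val_items, w[:, None] * U[val_users])
    return dU, dV
```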

Just as was outlined in the paper, to calculate the derivatives of U and V wrt c_u_i, we resort to the KKT conditions (refer to [3, 4] for additional details). First, we need to express U in terms of V, C, P, and V in terms of U, C, P:

$$u_u = \left(V^{\top} C^{u} V + \lambda I\right)^{-1} V^{\top} C^{u} p_u, \qquad v_i = \left(U^{\top} C^{i} U + \lambda I\right)^{-1} U^{\top} C^{i} p_i$$

where C^{u} is a diagonal n*n matrix (n is the number of items) with the (i, i) element equal to c_u_i, p_u is the vector of the user's binary preferences, C^{i} and p_i are defined analogously for items, and lambda is the regularization constant.

The above formulas are exactly one iteration of the implicit ALS algorithm (refer to [1] for additional details).
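For reference, one such update is a few lines of numpy. This is a plain sketch of the standard update from [1] (the variable names are mine), not the library's optimized implementation:

```python
import numpy as np

def als_user_step(V, c_u, p_u, lam):
    """Closed-form update of one user's embedding.

    V: (n_items, k) item embeddings; c_u: (n_items,) confidences
    (1 for unobserved items); p_u: (n_items,) binary preferences.
    """
    k = V.shape[1]
    A = V.T @ (c_u[:, None] * V) + lam * np.eye(k)  # V^T C^u V + lambda*I
    b = V.T @ (c_u * p_u)                           # V^T C^u p_u
    return np.linalg.solve(A, b)
```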

Now we differentiate these formulas wrt c_u_i:

$$\frac{\partial u_u}{\partial c_{ui}} = \left(V^{\top} C^{u} V + \lambda I\right)^{-1} \left(p_{ui} - v_i^{\top} u_u\right) v_i, \qquad \frac{\partial v_i}{\partial c_{ui}} = \left(U^{\top} C^{i} U + \lambda I\right)^{-1} \left(p_{ui} - u_u^{\top} v_i\right) u_u$$

Finally, we have all the derivatives needed to compute the required gradient. Note that in the code I use np.einsum to compute the d(u_u)/d(c_u) matrix (of size: number of items rated by user u * embedding size) at once, instead of the per-element d(u_u)/d(c_u_i) vector. The per-element version, as well as a correctness check of the gradients, can be found in the test_gradients.py module.
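As a rough illustration of the einsum trick (not the repo's exact code; M_inv, V_u, and resid are my names), the whole matrix of per-item derivatives for one user can be produced in a single call:

```python
import numpy as np

def du_dc(V_u, u, p_u, M_inv):
    """d(u)/d(c_u) for all of one user's rated items at once.

    V_u: (n_u, k) embeddings of the user's items; u: (k,) user embedding;
    p_u: (n_u,) binary preferences; M_inv: (k, k) inverse of V^T C^u V + lambda*I.
    Returns an (n_u, k) matrix whose j-th row is d(u)/d(c_uj).
    """
    resid = p_u - V_u @ u  # p_ui - v_i^T u for each of the user's items
    # row j equals resid_j * (M_inv @ v_j)
    return np.einsum('j,jl,kl->jk', resid, V_u, M_inv)
```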

Modifying the Training Dataset

We will refer to the most harmful ratings as maverick ratings. In the case of explicit feedback, we run gradient descent wrt the ratings, find the ones with the maximum absolute change, and call them mavericks. In the case of implicit feedback, we have confidences instead of ratings, and there is a notable difference between the two settings: for a particular user-item pair, c_u_i growing during gradient descent means that p_u_i helps us decrease the loss. This is quite different from the explicit case, where an increase in a rating means that the user should have rated the item higher. So the question of whether we should let c_u_i grow during the maverick rating search (a.k.a. the data debugging process) is quite tricky.

In the paper, the authors force the ratings to stay in the range between 1 and 5 for explicit feedback. This makes it impossible to obtain a value above 5, even if that would decrease the validation loss. In a similar fashion, we will force c_u_i to stay in the range (0, alpha * p_u_i) during gradient descent. Also, following the original paper, we will change c_u_i only where p_u_i = 1.
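In code, this constraint is just a projected gradient step (a sketch; the function and variable names are mine):

```python
import numpy as np

ALPHA = 5.0

def descend_confidences(c, grad, p, lr=0.1):
    """One projected gradient step on the confidences.

    c, grad, p: arrays of the same shape holding the confidences,
    d(validation loss)/d(c), and the binary preferences.
    """
    updated = np.clip(c - lr * grad, 0.0, ALPHA)  # keep c_u_i within (0, alpha)
    return np.where(p > 0, updated, c)            # change c_u_i only where p_u_i = 1
```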

Data and Metrics

I used the MovieLens dataset, the same one as in the paper, and treated ratings of 4 and 5 as relevant. For the metrics, I used ROC AUC and precision@10; for the hyperparameters, I used alpha = 5, lambda = 0.1, and an embedding size of 10.
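For completeness, here is roughly how these per-user metrics can be computed (a sketch, not the exact evaluation code from the repo):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def precision_at_k(scores, relevant, k=10):
    """scores: (n_items,) predicted scores for one user;
    relevant: (n_items,) binary labels (1 if the rating is 4 or 5)."""
    top = np.argsort(-scores)[:k]
    return relevant[top].mean()

def user_metrics(scores, relevant):
    return roc_auc_score(relevant, scores), precision_at_k(scores, relevant)
```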

Experimental Results: Explicit Feedback

As a first step, I reproduced the results from this table:

These are the results for explicit ALS in terms of RMSE; the • sign marks a statistically significant change (p-value < 0.01).

Here are the results wrt ranking metrics:

The results are quite similar to those obtained in the original paper: for explicit feedback, data debugging provides a significant boost in ranking metrics on the test dataset.

Experimental Results: Implicit Feedback

For implicit feedback, we used weighted RMSE (in accordance with the model's loss function). As you can see, the weighted RMSE decreases when the training dataset is modified.

But the problem is that the ranking metrics (ROC AUC, precision@10) do not increase. In some cases, they even decrease.

Unfortunately, these results suggest that we can’t use this method in production to obtain a significant metrics boost.

Discussion

As can be seen from our results, implicit ALS performs much better than explicit ALS in terms of ranking, even when the explicit model benefits from training-data modification. That is probably why the explicit ALS results are less interesting from a practical point of view. Unfortunately, in the case of implicit ALS, our experiments with the proposed approach did not lead to an improvement in the ranking metrics.

At the moment, I can point out two directions to explore in future work:

  1. Change the loss function. Notice that weighted RMSE (the model’s objective) decreased significantly after dropping or modifying the maverick confidences (the weights of the maverick user-item interactions). So the main problem with the proposed approach in the case of implicit ALS is that the ranking metrics don’t improve while the model’s objective does. If we used a ranking loss instead (e.g., BPR [5] or WARP [6]), data debugging might work better; unfortunately, the proposed approach is not straightforwardly applicable to ranking loss functions.
  2. Try larger datasets. In practice, no one trains recommender systems on only 0.8 * 1 million user-item pairs (actually, even 0.8 * 575 thousand after binarization). But I believe that around 500 thousand user-item pairs should be enough to see the effect.

References

[1] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative Filtering for Implicit Feedback Datasets”

[2] L. Chen, Y. Yao, F. Xu, M. Xu, and H. Tong, “Trading Personalization for Accuracy: Data Debugging in Collaborative Filtering”

[3] B. Li, Y. Wang, A. Singh, and Y. Vorobeychik, “Data Poisoning Attacks on Factorization-Based Collaborative Filtering”

[4] S. Mei and X. Zhu, “Using Machine Teaching to Identify Optimal Training-Set Attacks on Machine Learners”

[5] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: Bayesian Personalized Ranking from Implicit Feedback”

[6] J. Weston, S. Bengio, and N. Usunier, “Wsabie: Scaling Up to Large Vocabulary Image Annotation”
