Causal inference pipeline for VK metrics

VK Team
Sep 11, 2020 · 10 min read

Hi, everyone. I’m Anver from the CoreML team at VK, where I design and improve algorithms for news feed ranking.

Improving ranking algorithms usually results in improving the click-through rates (CTRs) of users’ actions (e.g., the CTR of likes, the CTR of comments, etc.). We will refer to these metrics as short-term metrics.

But there is another set of metrics, which includes time spent and the number of sessions per user. These metrics are considered far more significant, since they’re believed to be much better measures of users’ satisfaction with VK. We will refer to these metrics as long-term metrics.

Optimizing long-term metrics directly with ranking algorithms is tricky, whereas short-term metrics are much easier to optimize. If we knew the causal relationships between short-term and long-term metrics, we could focus on optimizing short-term metrics in order to increase long-term ones. In this work, we’ll try to infer causal relationships between short-term and long-term metrics. This work was done as part of my internship at ITMO’s Computer Technologies laboratory, in collaboration with VK.

Links to the project repository

The code to reproduce my results is available here: AnverK

I used a dataset containing the results of more than 6,000 A/B tests run by VK. The dataset is open-source, and you can also find it in my repository.

Link to the sandbox with synthetic examples.

Link to the sandbox with real data examples.

Overcoming spurious correlations

It’s easy to assume that simply knowing the correlation between metrics is enough. Unfortunately, correlation does not imply causation.

Let’s assume that we measure four metrics with the causal relationships between them shown in the picture below:

Without loss of generality, we can assume that the influence along each edge is positive. For instance, the more likes a user leaves, the higher the sessions per user (SPU). In this case, we might assume that an increase in the number of comments on a photo positively affects SPU. Taking this into account, we would try to increase this metric, expecting an increase in SPU.

When the correlation coefficient between two variables is high even though there is no causal relationship between them, it’s called a spurious correlation. This can occur in more situations than just two effects of a common cause: from the same graph, we might also conclude that likes positively affect photo opens. Even this simple example makes it clear that plain correlation analysis can lead to a large number of wrong conclusions.
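
To make this concrete, here is a small simulation (synthetic numbers, not VK data; the variable names simply mirror the example): likes drive both comments and SPU, so comments and SPU correlate strongly even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# "likes" is a common cause of both "comments" and "spu";
# comments have no direct effect on spu in this toy model.
likes = rng.normal(size=n)
comments = 0.7 * likes + rng.normal(scale=0.5, size=n)
spu = 0.8 * likes + rng.normal(scale=0.5, size=n)

# Marginal correlation between comments and spu is strong (~0.7)...
print(np.corrcoef(comments, spu)[0, 1])

# ...but the partial correlation given likes is ~0: the association
# disappears once the common cause is controlled for.
resid_c = comments - np.polyval(np.polyfit(likes, comments, 1), likes)
resid_s = spu - np.polyval(np.polyfit(likes, spu, 1), likes)
print(np.corrcoef(resid_c, resid_s)[0, 1])
```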

Causal inference (CI) helps us extract causal relationships from observational data. In this work, we applied CI algorithms, implemented Python interfaces for them, and proposed modifications of classic algorithms that work better according to the testing performed.

Classic causal inference algorithms

I considered the following CI algorithms: PC (Peter and Clark), FCI (Fast Causal Inference), and FCI+ (a more practical algorithm than FCI, but with the same theoretical guarantees). You can read about them in Causality (J. Pearl, 2009), Causation, Prediction, and Search (P. Spirtes et al., 2000), and Learning Sparse Causal Models is not NP-hard (T. Claassen et al., 2013).

For now, we’re going to explain the main difference between the PC algorithm and the other two. The PC algorithm assumes that we observe every variable that affects at least two observed variables (this assumption is called Causal Sufficiency). FCI and FCI+ permit the presence of unobserved factors that affect the observed variables. In this case, the causal representation is more realistic, as it allows the presence of hidden factors U₁, …, U_k:

An efficient and sufficiently flexible implementation of these algorithms is available in the pcalg library. However, the library is implemented in R (with Rcpp for the most computationally intensive parts), which is why I provided Python wrappers for the methods from this library. In addition, I also wrote some examples of how these wrappers can be used.
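
As a rough illustration of how such a wrapper can bridge Python and pcalg, here is a minimal sketch using rpy2; the actual wrappers in the repository are more complete, and the helper below and its defaults are my own simplification.

```python
# A rough bridge from Python to pcalg via rpy2; assumes R, pcalg and rpy2
# are installed. The wrappers in the repository expose more options.
import numpy as np
from rpy2.robjects import r, FloatVector, ListVector, StrVector
from rpy2.robjects.packages import importr

pcalg = importr("pcalg")

def run_pc(data: np.ndarray, alpha: float = 0.01):
    """Run pcalg's PC algorithm on an (n_samples, n_vars) array."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # gaussCItest expects the correlation matrix and the sample size.
    suff_stat = ListVector({
        "C": r["matrix"](FloatVector(corr.ravel()), nrow=p),
        "n": n,
    })
    labels = StrVector([f"X{i}" for i in range(p)])
    fit = pcalg.pc(suffStat=suff_stat,
                   indepTest=pcalg.gaussCItest,
                   alpha=alpha,
                   labels=labels)
    # 'fit' is an R 'pcAlgo' S4 object; the estimated graph sits in its
    # 'graph' slot (fit.slots["graph"]).
    return fit
```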

I’ll describe the interface of the wrapper and what it produces as output. Users can pass a function for testing conditional independence. Such a test returns a p-value for the null hypothesis that variables X and Y are conditionally independent given a set of variables Z. By default, a conditional independence test based on partial correlation is used, since it is the default in pcalg and is implemented with Rcpp, which makes it very efficient. You can also pass a significance level, which will be treated as the p-value threshold for the test. If you don’t need logging, you can specify the number of CPU cores for the PC and FCI algorithms. There’s no such option for the FCI+ algorithm, but I still recommend using it over FCI, as it’s much faster. However, the proposed modification of edge orientation currently doesn’t work with FCI+ because of the limitations of pcalg.
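
For intuition, a self-contained partial-correlation test of this shape might look as follows; this is an illustrative sketch, not the wrapper’s actual default test, which lives inside pcalg.

```python
import numpy as np
from scipy import stats

def partial_corr_test(data: np.ndarray, x: int, y: int, z: list[int]) -> float:
    """p-value for H0: X is independent of Y given Z (Fisher z-test).

    data is an (n_samples, n_vars) array; x and y are column indices,
    z is a list of column indices to condition on.
    """
    n = data.shape[0]
    if z:
        # Residualize X and Y on Z and correlate the residuals.
        Z = np.column_stack([data[:, z], np.ones(n)])
        rx = data[:, x] - Z @ np.linalg.lstsq(Z, data[:, x], rcond=None)[0]
        ry = data[:, y] - Z @ np.linalg.lstsq(Z, data[:, y], rcond=None)[0]
        rho = np.corrcoef(rx, ry)[0, 1]
    else:
        rho = np.corrcoef(data[:, x], data[:, y])[0, 1]
    # Fisher z-transform; the effective sample size shrinks with |Z|.
    z_stat = np.sqrt(n - len(z) - 3) * np.arctanh(rho)
    return 2 * stats.norm.sf(abs(z_stat))
```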

All algorithms return a PAG (partial ancestral graph) as a list of edges. In the case of the PC algorithm, you should treat the graph it outputs as a classic causal graph (or a Bayesian network), where a directed edge from A to B means that A influences B. Furthermore, an edge can be undirected or bidirected, which means that we couldn’t infer the exact direction. This can happen either because there isn’t sufficient data or because it’s impossible to determine the direction based solely on observational data. Yes, that means that sometimes you can only determine the equivalence class of the true causal graph.

The results of the FCI algorithms are also PAGs, but with a new type of edge end: “o”. This end means that there may or may not be an arrowhead there. The main difference from the PC algorithm’s output is that the presence of a bidirected edge between A and B now means that there’s a hidden factor U affecting both A and B. Accordingly, what would be a bidirected edge in terms of the PC algorithm now looks like an edge with “o” on both ends. A clear illustration can be found in the sandbox with synthetic examples.
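
One simple way to think about such an edge list (purely illustrative; the wrappers’ actual output format may differ) is as tuples of node names and edge-end marks:

```python
# Purely illustrative representation of a returned edge list. Edge ends:
#   '-' tail (no arrowhead), '>' arrowhead, 'o' undetermined (FCI/FCI+ only).
Edge = tuple[str, str, str, str]  # (node_a, end_at_a, end_at_b, node_b)

pc_edges: list[Edge] = [
    ("likes", "-", ">", "spu"),              # directed: likes -> spu
    ("likes", "-", "-", "comments"),         # undirected: direction not inferred
    ("comments", ">", ">", "photo_opens"),   # bidirected (PC): direction unresolved
]

fci_edges: list[Edge] = [
    ("comments", ">", ">", "photo_opens"),   # bidirected (FCI): hidden common cause
    ("likes", "o", ">", "spu"),              # 'o' end: arrowhead presence unknown
]
```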

Orientation algorithm modification

Classic methods have one significant drawback. Even if we allow for hidden factors, these methods still require another strong assumption: an ideal test of conditional independence. Without it, these algorithms guarantee neither the correctness nor the completeness (meaning that no more edges could be oriented) of the resulting graph. Do you know of any ideal tests on a finite number of samples? Neither do I.

Graph skeletons still look convincing in spite of this drawback. However, their edges are oriented too “aggressively”, which is why I proposed a modification to the existing edge-orientation algorithm. As a bonus, this modification provides implicit regularization of the number of oriented edges. Unfortunately, an accessible explanation of this modification requires a very detailed description of causal inference algorithms. You can follow the link at the end of this article to find a description of the modification in my more theoretical article.

Comparing models — 1: estimating graph likelihood

One of the most complicated challenges in causal inference is, strangely enough, evaluating and comparing models. That’s because, for real data, the true causal structure is usually unknown. The same goes for the distribution of the variables, which isn’t known precisely enough to sample data from it. This means that, for the majority of datasets, the ground truth remains unknown.

All of the above leads to a dilemma: we could try to use (semi-)synthetic data with a known ground truth or try to test on real data without knowing the ground truth. In my research, I tried both approaches.

The first of them was the estimation of a graph’s likelihood.
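
The estimate itself appears as an image in the original post; reconstructed from the notation explained below (my transcription, not the original figure), it has the standard form

\log \hat{L}(G) \;=\; N \sum_{X} \Big( I\big(X, \mathrm{PA}_G(X)\big) - H(X) \Big),

where N is the number of samples.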

Here PA_G(X) is the set of parents of a variable X, I(X, Y) is the mutual information between variables X and Y, and H(X) is the entropy of a variable X. Because the second term doesn’t depend on the graph structure, people usually only calculate the first term. However, it’s important to keep in mind that the likelihood doesn’t decrease with new edges added, and it’s necessary to consider this during testing.
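
Under a linear-Gaussian model, the first term can be computed from per-node regressions, since I(X, PA(X)) = −0.5 · ln(1 − R²). A sketch of such a score (my own helper, not code from the repository):

```python
import numpy as np

def likelihood_score(data: np.ndarray, parents: dict[int, list[int]]) -> float:
    """First term of the likelihood estimate, i.e. the sum of I(X, PA_G(X)).

    Assumes a linear-Gaussian model, where I(X, PA(X)) = -0.5 * ln(1 - R^2)
    and R^2 comes from regressing X on its parents. data is an
    (n_samples, n_vars) array; parents[i] lists the parent indices of node i.
    Nodes without parents contribute zero.
    """
    n = data.shape[0]
    score = 0.0
    for node, pa in parents.items():
        if not pa:
            continue
        X = np.column_stack([data[:, pa], np.ones(n)])
        y = data[:, node]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1.0 - resid.var() / y.var()
        score += -0.5 * np.log(1.0 - r2)
    return score  # multiply by the sample size to get the likelihood term itself
```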

In addition, it’s crucial to know that this estimate only applies when comparing Bayesian networks (which are what the PC algorithm outputs), because in actual PAGs (the output of the FCI and FCI+ algorithms) the semantics of bidirected edges are too different.

That’s why my proposed modification was compared with only the classic PC algorithm.

Note that, in comparison with the classic algorithm, the modified orientation of edges significantly increases the likelihood of the graph.

Now, let’s look at the number of edges:

As expected, the number of directed edges has decreased. This means that we can infer more plausible graph structures even with fewer directed edges! You might say that the results here look a bit suspicious, as the likelihood shouldn’t decrease when new edges are added. The reason is that the resulting graph is not a subgraph of the graph built by the PC algorithm (in the directed sense). For instance, bidirected edges could appear instead of regular ones, or edges could change their direction.

Comparing models — 2: a classification approach

Let’s move on to the second comparison method. We’ll build a causal graph with the PC algorithm and choose a random acyclic graph from its equivalence class. After this, we will sample the data for every node as a linear combination (with coefficients from ±[0.2, 0.8]) of the values of its parent nodes, plus additive Gaussian noise. This idea of sampling was taken from Towards Robust and Versatile Causal Discovery for Business Applications (Borboudakis et al., 2016). Nodes that don’t have parents are sampled from a normal distribution with the same parameters as in the real data for that node. A sketch of this sampling scheme is shown below.
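
A minimal version of this sampling procedure, under the assumptions just described (the node order, parent lists, and the use of the real data’s mean and standard deviation for root nodes are my own framing):

```python
import numpy as np

def sample_from_dag(order, parents, real_data, n_samples, rng=None):
    """Sample synthetic data from a DAG, roughly following the scheme above.

    order: node indices in topological order;
    parents[i]: list of parent indices of node i;
    real_data: (n, n_vars) array used to match the scale of root nodes.
    Non-root nodes are linear combinations of their parents with weights
    drawn from +/-[0.2, 0.8], plus standard Gaussian noise.
    """
    rng = rng or np.random.default_rng()
    data = np.zeros((n_samples, real_data.shape[1]))
    for node in order:
        pa = parents[node]
        if not pa:
            # Root nodes: normal with the mean/std of the real metric.
            mu, sigma = real_data[:, node].mean(), real_data[:, node].std()
            data[:, node] = rng.normal(mu, sigma, size=n_samples)
        else:
            signs = rng.choice([-1.0, 1.0], size=len(pa))
            w = signs * rng.uniform(0.2, 0.8, size=len(pa))
            data[:, node] = data[:, pa] @ w + rng.normal(size=n_samples)
    return data
```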

We’ll apply the algorithms we want to evaluate to this sampled data. Since we already have a true causal graph, the only thing left is to figure out how to compare the inferred graphs with the true one. The authors of Robust reconstruction of causal graphical models based on conditional 2-point and 3-point information (Affeldt et al., 2015) suggested the classification terms I’ll be using here. We’ll treat the presence of an edge between two variables as the Positive class and the absence of an edge, respectively, as Negative. So True Positive (TP) stands for inferring a true edge, and False Positive (FP) stands for inferring an edge that doesn’t exist in the true graph. These metrics will be evaluated for skeletons.

Since we also need to consider direction, let’s add new classes. TP* stands for edges that were inferred correctly but with the wrong direction. We then calculate TP′ = TP − TP* and FP′ = FP + TP*. Having done this, we can calculate the F₁-score for both skeletons and graphs. Obviously, the graph’s F₁-score can’t be greater than the skeleton’s. However, in the case of the PC algorithm, I suggested that a bidirected edge should add only 0.5 instead of 1 to TP*, since at least one of the real edges was inferred (without Causal Sufficiency it doesn’t work that way). A sketch of this comparison is given below.
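
Here is how this comparison could be computed, as a simplified sketch: it encodes undirected or bidirected estimated edges by listing both directions, and it omits the 0.5-credit rule for bidirected edges.

```python
def f1(tp, fp, fn):
    return tp / (tp + 0.5 * (fp + fn)) if tp else 0.0

def compare_graphs(true_edges, est_edges):
    """Skeleton and orientation F1-scores for the scheme described above.

    true_edges: set of (a, b) pairs, one per directed edge of the true DAG.
    est_edges:  set of (a, b) pairs for the estimated graph; an undirected
    or bidirected edge is encoded by including both (a, b) and (b, a).
    """
    skeleton = lambda edges: {frozenset(e) for e in edges}
    true_skel, est_skel = skeleton(true_edges), skeleton(est_edges)

    # Skeleton level: an edge counts as Positive regardless of direction.
    tp = len(true_skel & est_skel)
    fp = len(est_skel - true_skel)
    fn = len(true_skel - est_skel)

    # Orientation level: TP* = edges found in the skeleton but oriented wrongly;
    # TP' = TP - TP*, FP' = FP + TP*.
    tp_star = sum(
        1 for a, b in true_edges
        if frozenset((a, b)) in est_skel
        and not ((a, b) in est_edges and (b, a) not in est_edges)
    )
    return f1(tp, fp, fn), f1(tp - tp_star, fp + tp_star, fn)
```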

Let’s compare the algorithms:

The first two results are the skeleton F₁-scores for the classic and modified edge-orientation algorithms; they’re here to show the upper bound for the F₁-score. The other two results compare the same two algorithms taking orientation into account. As you can see, there is no gain yet.

Comparing models — 3: “disabling” Causal Sufficiency

Now let’s “hide” some variables in the true causal graph and in the synthetic data after the sampling. This way, we’ll “disable” Causal Sufficiency. However, we’ll compare the results not with the initial graph (as it contains hidden variables) but with the modified initial graph.

For each hidden variable, we’ll add edges from its parents to its children, as well as bidirected edges between its children (a sketch of this transformation is shown below). Now we can compare the FCI algorithms with the classic and modified orientations.
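
A simple way to build that modified graph, under the assumption that hidden nodes aren’t adjacent to each other (again, my own sketch rather than the repository code):

```python
from itertools import combinations

def marginalize_hidden(true_edges, hidden):
    """Build the comparison graph after 'hiding' some variables.

    true_edges: set of (a, b) pairs of the true DAG; hidden: set of nodes to
    hide (assumed not adjacent to each other). Each hidden node's parents get
    directed edges to its children, every pair of its children gets a
    bidirected edge, and edges touching hidden nodes are dropped.
    """
    directed = set(true_edges)
    bidirected = set()
    for h in hidden:
        parents = {a for a, b in true_edges if b == h}
        children = {b for a, b in true_edges if a == h}
        directed |= {(p, c) for p in parents for c in children}
        bidirected |= {frozenset(pair) for pair in combinations(children, 2)}
    directed = {(a, b) for a, b in directed
                if a not in hidden and b not in hidden}
    return directed, bidirected
```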

And now, with Causal Sufficiency violated, the result of the modified orientation becomes much better!

Another interesting observation is that, in practice, the skeletons built by the PC and the FCI algorithms are almost the same. This is why I also compared the outputs of the PC and the FCI algorithms, both with the proposed orientation.

We can see that these results are almost equal in terms of F₁-score. However, the PC algorithm is only one simple step in the FCI algorithm. Therefore, applying the proposed orientation algorithm to the skeleton from the PC algorithm is a decent solution to speed up causal inference.

Conclusion

What are the main takeaways from this article? After reading it, you should have a general idea of the following:

  1. How IT companies can use Causal Inference to solve problems.
  2. What a spurious correlation is and how it can affect causal inference.
  3. Which existing causal inference algorithms are prevalent in practice.
  4. Which difficulties can appear in practical causal inference tasks.
  5. How to compare causal graphs.

If you found this article interesting, I recommend reading my other article, which touches upon the more theoretical aspects of the problem. There, you can learn the main causal inference terms, get a basic understanding of how these algorithms work, and see what my proposed modification of edge orientation is. You can find that article here.
