A Little Journey of Raw Events: From Logs to Experimental Results

VK Team
Aug 12, 2021 · 24 min read

Introduction

In recent years, many papers on A/B experiments have been published. The vast majority of them focus on statistical tests and the limitations of their use. In this article, however, we are going to look at the entire process of running such experiments, including data pipelines, specific data structures, and statistical procedures.

We believe that well-chosen data transformation and its structure can lead not only to improved performance but also to desirable statistical distribution properties (e.g. symmetry, normality). Thus, our approach might be very useful in cases where dozens of experiments are conducted simultaneously with a large number of metrics and billions of raw events.

In our case, the system is processing 20 billion raw events and delivering 100 thousand experimental results per day, where each experimental result is a comparison of the control and test groups for a specific metric in a certain experiment.

Our goal in this article is to describe our approach to computing experimental results with a strong focus on computational performance. Our approach relies on ClickHouse, but many of the ideas can easily be applied to other DBMSs.

The article is organized as follows:

  • In the next section, we are going to briefly discuss data storage formats that provide flexibility in managing metrics.
  • Then, we will consider ClickHouse features, such as materialized views and specialized table engines, which we actively use for processing raw data.
  • Next, we will say a few words about bucket transformation and the reliability of large-scale data processing.
  • The final section will be devoted to summarizing the results of experiments and our plans for improving our processes.

In this article, we will be describing our approaches to:

  • preparing data for computing A/B results
  • statistical computations themselves (tips and tricks)

The above picture is a general overview of our processes. Now, let’s dive into the details.

Logs and data aggregators used at VK

Before we dive deeper into explaining how to improve the efficiency of computing experimental results, we would like to give a short overview of how we store data using ClickHouse.

As a whole, ClickHouse is "blazing fast" and "easy to use"¹, which are just some of the advantages that make it handy for logging data at VK². But when it comes to actually analyzing data and making business decisions based on the available info, we have to aggregate the data to achieve the following two goals:

  • Additional speed boost. VK users generate so much data that it is reasonable to use data aggregation even when using ClickHouse, since data analysts and machine learning scientists at VK do a number of repetitive tasks every day.
  • Information security compliance. There are several security compliance benefits, but since they are not the subject of this article, we will only note that aggregation enhances our data management capabilities, i.e. mitigates the risks of data security breaches.

In general, we have 2 types of data sources:

  • Raw data logs that we collect from tens of thousands of servers
  • Derived aggregated data tables

There is nothing special about the raw data logs: they contain one row for each event that occurred, such as playing an audio track or making a payment with VK Pay.

As for aggregated data tables, we use two basic tabular data presentation formats: row-wise and column-wise tables. We'll consider the advantages and disadvantages of each format in detail in the next paragraph. In general, column-wise tables are useful when you need to make calculations using different columns. This is a fairly beneficial format for both analytics and data visualization. However, it also has its drawbacks.

The row-wise table format, on the other hand, is more flexible: it allows additional variables to be included and makes it possible to add new measurement occasions. At VK, we use the row-wise (or narrow) table format for quite specific cases, such as serving as the data source for the dashboard used at product team meetings. There are more than 100 diagrams there and counting, so it is really important to be able to add new metrics quickly.

Let’s delve into the details of these table formats.

Column-wise and row-wise tables

Column-wise format

When it comes to column-oriented databases, it’s a good idea to have wide tables with dozens of columns. Queries read only required columns, which optimizes disk utilization and execution time. Our first approach to collecting experimental metrics was based on a column-wise table. Let’s call it the Metrics table.

The Metrics table's rows are experimentation units (user IDs or something else), and each column is a metric. Thus, the size of the table is:

  • number of experimentation units × number of metrics

At the peak of this approach, the Metrics table was extremely large: more than one thousand columns and around sixty million rows per day. Here's what it looked like:

An attentive reader might notice that we store each unit's experiments and groups in arrays with matching order. For example, the unit with the number 5637423 is participating in experiment #56 in the second group (the experiment and the group share the same index in their arrays).
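For illustration, one row of such a wide table might look like this in pandas (the metric names and values are made up):

import pandas as pd

# One row per experimentation unit; `experiments` and `groups` are parallel
# arrays, and every metric lives in its own column (values are illustrative).
wide_metrics = pd.DataFrame([
    {
        "experimentation_unit_id": 5637423,
        "experiments": [17, 56, 103],
        "groups": [1, 2, 1],   # group 2 of experiment #56, as in the example above
        "audio_plays": 42.0,
        "vk_pay_payments": 1.0,
        # ... plus up to a thousand more metric columns
    },
])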

The main advantage of such a table is obvious: you get a human-readable feature database of your users. It was very useful for ad-hoc analysis and product research.

But its drawbacks are also significant, especially for a growing experimentation system covering several products.

First of all, adding new metrics is very expensive. You need to alter the Metrics table (add columns) and then patch a query with a thousand columns (the SQL file for such a query usually runs to hundreds of lines):

SELECT experimentation_unit_id,
       AGG(log1.any_column_1) AS my_awesome_metric_1,
       AGG(log1.any_column_2) AS my_awesome_metric_2,
       …,
       AGG(logk.any_column_k) AS my_awesome_metric_k
FROM my_awesome_log_1 AS log1
LEFT JOIN my_awesome_log_2 AS log2
    ON log1.experimentation_unit_id = log2.experimentation_unit_id
LEFT JOIN my_awesome_log_3 AS log3
    ON log1.experimentation_unit_id = log3.experimentation_unit_id
…
LEFT JOIN my_awesome_log_k AS logk
    ON log1.experimentation_unit_id = logk.experimentation_unit_id
GROUP BY experimentation_unit_id

New metrics can be added as yet another left join in a query or as columns from an already used log. In most cases, this requires changes in data transformation pipelines.

The second problem, closely related to the first, appears when an experimentation unit has several dimensions of one metric. For instance, a user has two gadgets: an Android phone and an iPad. You have to store their metrics in two columns, one for each platform. In the long run, the number of columns tends to grow. For rare dimensions, the vast majority of values will be zeros. In columnar storage this does not directly degrade performance, but it starts to require filtering conditions in queries.

The third problem is measurement accuracy. Metrics are derived directly from highly aggregated data without distinguishing between platform, application version, and other dimensions specific to a given experiment. If only a certain app version is involved in the experiment, then the sensitivity of the metrics can be lower than we'd prefer. There are two sides to this coin, though: false positive results can also appear in experiments due to random changes that are not related to the current experiment. A possible solution in the current paradigm is to add columns for the desired dimensions, but then we run into problems #1 and #2 listed above.

Row-wise format

To solve the problems outlined above, we switched to a different data schema. First, we determined the properties that the observation (log entry) has:

  • owner (the experimentation unit id)
  • name (what exactly is measured)
  • value (the measured value)
  • type (additive or not)

Together, these properties define the schema of the new table:

As you can see in the table above, instead of the column-wise approach we used previously, we’re storing metrics in a row-wise table. With the new approach, adding a metric no longer requires adding a column. The second problem is automatically solved and the third one depends on the method used to fill the table. We’ll take a detailed look at our approach to filling the table in the next section.

Before moving on to the next section, we will briefly discuss the additional advantages of a row-wise table. Let’s say we want to calculate the average, standard deviation, and number of users based on column-wise table data in all experiments with one query (we need these statistics to calculate the t-test). This is what it looks like:

SELECT experiments,
       groups,
       AVG(my_awesome_metric_1) AS avg_metric_1,
       STD(my_awesome_metric_1) AS std_metric_1,
       COUNT(DISTINCT experimentation_unit_id) AS n_1,
       …,
       AVG(my_awesome_metric_k) AS avg_metric_k,
       STD(my_awesome_metric_k) AS std_metric_k,
       COUNT(DISTINCT experimentation_unit_id) AS n_k
FROM the_metrics_table
ARRAY JOIN experiments, groups
GROUP BY experiments, groups

*ARRAY JOIN statement expands an array into a row representation³

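For intuition, here is a rough pandas analogue of that expansion (column names and values are illustrative; DataFrame.explode over several columns needs pandas 1.3+):

import pandas as pd

df = pd.DataFrame({
    "experimentation_unit_id": [5637423],
    "experiments": [[17, 56]],
    "groups": [[1, 2]],
    "my_awesome_metric_1": [42.0],
})

# Each (experiment, group) pair from the parallel arrays becomes its own row,
# which is what allows GROUP BY experiments, groups in a single pass.
expanded = df.explode(["experiments", "groups"])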

With a large number of metrics, the query turns into a giant enumeration of metrics. But when using a row-wise table, we don’t need to enumerate all required metrics:

SELECT experiments,
       groups,
       name,
       AVG(value) AS avg_metric,
       STD(value) AS std_metric,
       COUNT(DISTINCT experimentation_unit_id) AS n
FROM awesome_row_table
ARRAY JOIN experiments, groups
GROUP BY experiments, groups, name

We call such queries end-to-end. Thanks to this simplification, we don’t have to worry about new metrics being added, as they will always be processed correctly.

AggregatingMergeTree and MaterializedViews

We will call our row-wise table with experimentation units and metrics the Metrics table; its Value column is a numeric column that stores the values of a particular experiment's metrics.

Now, let's discuss our approach to putting experimental metric data into this row-wise format. The schema of the Metrics table looks like this:

{
    `dt` Date,
    `user_id` Int32,  -- an example of the experimentation unit identifier
    `metric` LowCardinality(String),
    `metric_type` LowCardinality(String),
    `value` SimpleAggregateFunction(sum, Float64)
}

ClickHouse offers a convenient tool to fill in this table almost in real time using a special MaterializedView engine⁴. This is basically an insert trigger⁵ that grabs a freshly inserted block of data, manipulates it, and pushes the result into the destination table. We additionally use a Buffer engine table that accumulates blocks of data and flushes them into the Metrics table after certain conditions are met.

We use the MaterializedView engine to fill our Metrics table with data from log tables containing streaming raw data. We mentioned these tables at the beginning of this article.

This solution lets us add new metrics to experiments and edit existing ones very quickly. If we need to create a new metric for some experiments, we can just create a new MaterializedView and see new metrics in our Metrics table instantly.

The Metrics table is also worth paying special attention to. Note that a MaterializedView performs computations only on the particular block of data being inserted. There are many such blocks daily, which is why we need a way to aggregate this data inside the Metrics table in order to avoid problems with duplication.

Fortunately, ClickHouse provides us with the AggregatingMergeTree⁶ engine. This type of engine processes the Value column in the Metrics table with a special data type, SimpleAggregateFunction (you can see this column type in the Metrics table schema above). For additive metrics, it looks like SimpleAggregateFunction(sum, Float64).

It is basically a rule for merging data parts in the background: ClickHouse replaces all rows with the same primary key with a single row, and the Value column is aggregated with the function specified in SimpleAggregateFunction. This background computation is a nice property of the AggregatingMergeTree engine.

To sum up, we use insert triggers (MaterializedViews) to put fresh data into our Metrics table. And this table utilizes features of the AggregatingMergeTree engine. This allows us to delegate almost all routine jobs to ClickHouse.
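To make this concrete, here is a minimal sketch of such a setup, sent through the clickhouse-driver Python client. The table, view, and log names are hypothetical, and the ORDER BY key is only illustrative:

from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")

# Destination table: one row per (date, unit, metric), merged in the background
# by the AggregatingMergeTree engine according to SimpleAggregateFunction(sum, ...).
client.execute("""
    CREATE TABLE IF NOT EXISTS metrics
    (
        `dt` Date,
        `user_id` Int32,
        `metric` LowCardinality(String),
        `metric_type` LowCardinality(String),
        `value` SimpleAggregateFunction(sum, Float64)
    )
    ENGINE = AggregatingMergeTree()
    ORDER BY (dt, user_id, metric, metric_type)
""")

# Insert trigger: every freshly inserted block of the raw log is turned into
# Metrics-table rows (here, the number of audio plays per user per day).
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS mv_audio_plays TO metrics AS
    SELECT toDate(event_time)  AS dt,
           user_id,
           'audio_plays'       AS metric,
           'additive'          AS metric_type,
           toFloat64(count())  AS value
    FROM audio_play_log
    GROUP BY dt, user_id
""")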

Bucketization

We also perform bucketization on our Metrics table. This means that we group users into “metausers” (buckets) and do statistical calculations on these buckets.

There are several reasons for using this transformation:

  • We significantly lower the data size for statistical calculations.
  • Bucketized data often has a more symmetric distribution of the Value column than the initial dataset⁷.

These two factors give us a dataset with good properties for statistical tests and linear models, which we will cover later. In addition, this also increases computation speed.
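As a rough illustration of the idea (not our actual ClickHouse pipeline; the column names are assumptions), bucketization of additive metrics can be sketched in pandas like this:

import pandas as pd

N_BUCKETS = 256  # the number of buckets we use in production

def bucketize(metrics: pd.DataFrame) -> pd.DataFrame:
    """Collapse user-level rows into bucket-level rows for additive metrics.

    `metrics` is expected to have columns: user_id, experiment, group, metric, value.
    """
    out = metrics.copy()
    # Deterministically map every experimentation unit to one of 256 "metausers";
    # a production pipeline would use a proper hash function instead of a modulo.
    out["bucket"] = out["user_id"] % N_BUCKETS
    # For additive metrics, the bucket value is simply the sum over its users.
    return (out.groupby(["experiment", "group", "metric", "bucket"], as_index=False)
               ["value"].sum())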

All data flows are in the picture below:

We perform bucketization at the stage between the Metrics table and the Buckets table. As you can see, after the Buckets table there is a Cumulative table. This table is just an aggregated version of the Buckets table without the dt column, which is possible because we use this pipeline only for additive metrics.

Next, we perform t-test calculations on the Cumulative table and write results into the A/B results table. This means that every day we use data starting from the beginning of a particular experiment.

If something goes wrong, we can use the Buckets table to redo the calculations.

Buckets and Cumulative tables have an is_history field, which is additional information that shows if a record is related to an experiment’s time period (is_history = 0) or to a time before an experiment. We use this field to assess the effect of splitting populations into experimental groups. We will cover this in more detail later in the article.

Additive and ratio metrics

In the previous sections, we mentioned AggregatingMergeTree and SimpleAggregateFunction(sum, Float64) for additive metrics. However, we use more than just additive metrics, since our analytics teams want to launch experiments that mainly target ratio metrics. For example, our advertising platform might want to increase CTR (click-through rate), which is not an additive metric.

In order to include ratio metrics into A/B experiment results, we decided to create these metrics only from existing additive metrics. Then we simply add the resulting data to additive metrics data and apply the necessary statistical procedures to this dataset.
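For instance, a CTR-like ratio can be derived from two additive metrics at the bucket level and appended to the same dataset. This is only a sketch with hypothetical metric and column names:

import pandas as pd

def add_ratio_metric(buckets: pd.DataFrame,
                     numerator: str, denominator: str, name: str) -> pd.DataFrame:
    """Derive a ratio metric from two additive metrics and append it.

    `buckets` has columns: experiment, group, bucket, metric, value.
    """
    wide = buckets.pivot_table(index=["experiment", "group", "bucket"],
                               columns="metric", values="value", aggfunc="sum")
    ratio = (wide[numerator] / wide[denominator]).rename("value").reset_index()
    ratio["metric"] = name
    return pd.concat([buckets, ratio], ignore_index=True)

# Example: click-through rate per bucket from two additive counters.
# buckets = add_ratio_metric(buckets, "ad_clicks", "ad_shows", "ctr")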

Reliability

We briefly mentioned earlier that it is possible to quickly restore or fix corrupted data for some experiments with this data flow. Let’s look into this in more detail.

Usually, mistakes can happen in one of these stages:

  • When collecting data into the metrics tables using materialized views
  • During bucketization into the buckets table

Issues with MaterializedView or bucketization

The Metrics table is a destination table for materialized views. But what if a materialized view has a mistake in an underlying query? In this case, we have to drop this view and create a correct one.

In order to fix the mistake downstream, we have to:

  1. delete the data generated by this view in the Metrics table.
  2. determine which experiments contain metrics from this materialized view and delete these rows from the buckets table. Sometimes, it is easier to simply delete all of the corrupted experiments’ data from the Buckets table.
  3. delete all corrupted experiments’ data from the Cumulative table.
  4. delete results of A/B statistical calculations from the A/B results table.

After all these deletions, we have to recollect the data:

  1. Collect data starting only from the moment of recreation of the corrupted view.
  2. Collect necessary aggregates based on the Metrics table, the Buckets table, and the Cumulative table.
  3. Compute the experiments’ results simultaneously with the aggregation of the Cumulative table.

This way of restoring data is also applicable if we need to add new metrics to an already finished experiment, such as if someone forgot to add these metrics. When this happens, we have to create new materialized views or fix the existing ones and go through all the steps previously described.

Sometimes, we can fail while computing buckets from our user metrics, but the recovery procedure is very similar to the one described above.

Previous attempts to ensure reliability

As was mentioned in the previous parts, we tried to use the column-wise format of the Metrics table. When we used this approach, our data went from the Metrics table (each row represented a particular experimentation unit) straight into the Cumulative table.

This process could take several hours, so we decided to dump the Cumulative table into a special Dump table after successful aggregation.

This Dump table could come in handy if aggregation to the Cumulative table failed. In such cases, we were able to simply restore the previous version of the Cumulative table from our Dump table.

This approach is dated now, but it still might be useful in some situations, so it is worth mentioning.

Aggregation tool

We have a tool designed specifically for different aggregation tasks. This aggregation tool can do just about everything we need:

  • Aggregate data into distributed tables on different ClickHouse clusters using a SQL script and a config for a particular cluster.
  • Run aggregation into buckets for A/B experiments.
  • Run statistical computations on the Cumulative table.
  • Deal with serious issues on clusters (switch between replicas if the current one is dead, retry failed operations, send notifications in case of failure, log operations performance into the ClickHouse table, etc.).
  • Perform a number of minor service tasks.

When something goes wrong with ETL processes, we have this tool as a backup and, therefore, encounter almost no problems when fixing things.

Computing experimental results

After we get all the necessary data into the Cumulative table, we are ready to compute statistics for each experiment. We created a special tool for this purpose in Python, which we will call the abtool. At this point, we have metrics for each experiment, grouped by experiment group and by bucket.

The question is whether the metric has changed significantly in the test group compared to the control one. To answer this question, we created a tool for calculating the results of the experiment. In this section, we will consider its evolution from a vectorized t-test to linear regressions. The central feature of the tool is the concept of a block. A block is an atomically processed piece of data. We will discuss all of this in detail below.

Vectorized Welch t-test

In previous versions of the abtool, we used Welch's t-test to compare the average values of metrics across buckets. Although this version is no longer used, it is useful for understanding the Block concept and the vectorization of the calculations.

First, the abtool receives a table with the experiment id, metric name, group, and bucket from ClickHouse as input. For each experiment and each pair of groups, we perform a t-test for every metric, comparing the arrays of metric values in test and control.

The length of these arrays is the number of buckets. The number of buckets into which we divide users is fixed at 256. Still, the lengths of the arrays may differ, since some metrics may not be defined for all users and will therefore be missing from some buckets. Most metrics, however, are defined for all buckets.

The abtool was written in Python and is capable of interacting with ClickHouse. Since it processes dozens of experiments every day, each with five pairs of groups on average⁸ and several hundred metrics, where each metric has 256 buckets, we had to optimize it to ensure adequate runtime. The process goes as follows:

We get the data from the Cumulative table and iterate over it by each experiment_id. Then we determine pairs of groups (control and test groups) for each experiment. For example, in the picture above, we have an experiment with 4 groups (A, B, C and D), with group A as the control one.

We should also briefly describe the Block abstraction. The picture below represents the essence of each block:

The Block is a combination of metadata, data for computation, and methods for transforming and combining them. In order to create Blocks, we perform the following data transformations:

  1. For a fixed experiment and a pair of groups, we calculate the length of the metric arrays in the first and second group for all metrics.

  2. Next, we group metrics by these two calculated lengths and concatenate the metric arrays into matrices. Each row of such a matrix consists of one metric's values over all buckets. The number of rows is the number of metrics with the given array lengths in test and control (the two lengths may differ).

  3. Using numpy and scipy.stats, we apply a vectorized Welch t-test to each Block. This way, we can process multiple metrics of the same length at once, which was an important feature of our computational approach.

Let us explain the main idea of the algorithm using an example. To calculate the average values of all metrics, we take the mean of the matrix along the horizontal axis and get a vector of averages. The scheme of this computation looks like this:

Thanks to numpy, this operation is several times faster than performing the same actions in a loop. All other vector computations required for the t-test can be implemented in a similar way. As a result, we obtain, in vectorized form, the metrics' average values over buckets as well as the p-value of the change (it is often assumed that if the p-value < 0.05, then the change in a metric is significant). The confidence interval for the effect (effect = test mean minus control mean) and the minimum detectable effect⁹ (MDE) are also calculated. If 0 does not belong to the confidence interval for the effect, this indicates that the effect is significant. If the effect is larger than the MDE, this further supports the significance of the change.
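A minimal numpy/scipy sketch of the per-block computation might look like this (the names are illustrative, and the real abtool also computes confidence intervals and the MDE):

import numpy as np
from scipy import stats

def welch_block(test: np.ndarray, control: np.ndarray):
    """Vectorized Welch t-test for one Block.

    `test` and `control` are 2-D arrays of shape (n_metrics, n_buckets):
    one row per metric, one column per bucket. Within a Block all metrics
    share the same array lengths, so they can be stacked into matrices.
    """
    # Row-wise means give the per-metric averages over buckets.
    mean_test = test.mean(axis=1)
    mean_control = control.mean(axis=1)
    effect = mean_test - mean_control
    # scipy runs Welch's t-test for every row (metric) at once.
    t_stat, p_value = stats.ttest_ind(test, control, axis=1, equal_var=False)
    return effect, p_value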

Linear regression in A/B

Interpreting statistical tests

We can generalize the approach from the previous part. The t-test and other common statistical tests are linear models¹⁰. Let's take the two-sample t-test, which is equivalent to the model

exp = a + b*G_2

where exp is a vector of metric values (test and control concatenated) and G_2 is a vector of group indicators taking the values 0 and 1 for control and test, respectively. This way, we get a model in which the metric values exp depend linearly on the group G_2 and a constant.

The Welch t-test is a generalization of the two-sample t-test to the case of unequal variances and is equivalent to the same model,

exp = a + b*G_2

estimated as a generalized (weighted) linear model with special weights on the variables that account for the unequal group variances.

One of the advantages of this approach is that it makes it possible to explore the results in more detail by applying methods used in linear models. For a “well”-trained model:

  • the model coefficient for group G_2 is interpreted as the effect (average in test minus average in control).
  • the coefficient for the constant is the average in control.
  • the p-value of the coefficient for group G_2 is equivalent to the p-value of the t-test (the significance of the effect).
  • confidence intervals for the model coefficients are also calculated. For the group G_2 coefficient, the interval is interpreted as a confidence interval for the effect.

Additionally, you can evaluate the quality of the resulting linear model:

  • (Adjusted) R squared shows how well the model explains the variance. However, a low R squared does not by itself mean that the model cannot be used to verify the effect's significance.
  • The p-value of the F-test of overall significance shows whether the regression model fits the data better than a model with no independent variables.
  • The Durbin-Watson statistic can be used to check for autocorrelation in the regression residuals.
  • The Jarque-Bera test checks whether the residuals are normally distributed.

We collect these statistics to research the properties of experimental metrics and to apply different, more efficient models to some of them in the future. For example, when the distribution of the residuals differs from normal (usually accompanied by a low R squared), it may be worth using nonparametric tests or probabilistic index models. This topic is currently being researched by our team and, if proven successful, will be applied in the next versions of the abtool. Currently, about 30% of the results have residuals that deviate from normality, so the potential for improvement is quite large. See the part about nonparametric regression for more details.
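To make the equivalence concrete, here is a statsmodels sketch for a single metric comparison; the function and the returned fields are illustrative, not the abtool's actual interface:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera

def ttest_as_regression(control: np.ndarray, test: np.ndarray) -> dict:
    # Stack the bucketed metric values of both groups and build the dummy G_2
    # (0 = control, 1 = test), exactly as in the model exp = a + b*G_2.
    exp = np.concatenate([control, test])
    g2 = np.concatenate([np.zeros(len(control)), np.ones(len(test))])
    results = sm.OLS(exp, sm.add_constant(g2)).fit()
    return {
        "effect": results.params[1],         # test mean minus control mean
        "control_mean": results.params[0],   # the intercept
        "p_value": results.pvalues[1],       # significance of the effect
        "effect_ci": results.conf_int()[1],  # confidence interval for the effect
        "adj_r2": results.rsquared_adj,
        "f_pvalue": results.f_pvalue,
        "durbin_watson": durbin_watson(results.resid),
        "jarque_bera_pvalue": jarque_bera(results.resid)[1],
    }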

Pre-experimental data

Another advantage of linear regression is that it makes it possible to add extra parameters (covariates) to the model. Pre-experimental data can be used as such an additional variable.

These are metric values for the same users participating in the experiment, collected over a certain time interval (for example, a week) before the experiment started. In theory, the metrics should not differ significantly between the control and test groups, since the experiment has not yet been launched. The metrics are collected for the same buckets.

A linear model

exp = a + coef_pre*pre + b*G_2

is constructed. Here, exp is the metric values during the experiment, pre is the metric values before the experiment (pre-experimental data), and G_2 is the group indicator (0 or 1), as in the previous model.

The actual effect can then be separated into the treatment effect and selection bias (the effect of partitioning into groups). Practice shows that with big data and random partitioning, some metrics can still be distributed unevenly across groups, since hundreds of different metrics are computed simultaneously.

In the picture below, we can see an example of such uneven partitioning: the groups differed before the start of the experiment, so it is quite possible that the experimental results are driven by this effect alone.

A good split may look like the one in the picture below:

Before the experiment started, the test and control groups had similar levels of some target metrics. After we introduced our experimental feature, the metric in the test group started to grow in comparison with the control group.

Using pre-experimental data helps eliminate selection bias and evaluate the treatment effect. For a “well”-trained model:

  • the coefficient for group G_2 is responsible for the treatment effect of the experiment.
  • the coefficient for the pre-experimental data is responsible for the effect of partitioning.

Together, they give the actual effect. The constant corresponds to the control average, and the significance of the effect is given by the p-value of the group G_2 coefficient.

Another reason to use pre-experimental data is variance reduction. The variance of (exp - coef_pre * pre) is lower than the variance of exp. According to our data, across all metrics and experiments, the decrease amounted to 20%. We also found that using 512 buckets instead of 256 does not lead to changes significant enough to justify the additional computation.
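Under the same conventions, a sketch (not the production code) of the model with a pre-experimental covariate:

import numpy as np
import statsmodels.api as sm

def effect_with_pre_period(exp: np.ndarray, pre: np.ndarray, g2: np.ndarray) -> dict:
    """Fit exp = a + coef_pre*pre + b*G_2 on bucketed data.

    exp -- metric values during the experiment (control and test buckets stacked)
    pre -- the same metric for the same buckets before the experiment started
    g2  -- 0/1 group indicator (0 = control, 1 = test)
    """
    X = sm.add_constant(np.column_stack([pre, g2]))
    results = sm.OLS(exp, X).fit()
    const, coef_pre, effect = results.params
    return {
        "treatment_effect": effect,      # effect of the feature itself
        "split_coefficient": coef_pre,   # captures the effect of the partitioning
        "control_mean": const,
        "p_value": results.pvalues[2],   # significance of the treatment effect
    }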

Blocks in a linear regression framework

How is the block concept used in a linear regression framework? In such cases, one metric is atomically processed in a single comparison. Thus, the block is a comparison of the test and control groups for a specific metric with all the necessary metadata and methods.

Parallelism is achieved by using multiple processes. However, it is possible to combine several regressions (comparisons) into one block (using a diagonal matrix) and use the seemingly unrelated regression¹¹ (SUR) approach.

Generally speaking, we do not need to take into account the covariance between random errors in different equations. In the picture below, there is a simple example of a possible SUR configuration.

  • X are pre-experimental and experimental values of different metrics (likes, stories, calls). Each of z_metric_name / x_metric_name is a vector with length = number of buckets.
  • B is a matrix of vectors:
    • first column = estimates of average control group metrics value (a).
    • second column = splitting effect (PRE).
    • third column = feature or experimental effect (G).
  • E are the residuals of the model.

The idea was to apply our vectorization approach in order to get estimates really fast. However, with the current SUR implementations in Python, we did not get a performance increase (in the single-process case)¹², so this approach was abandoned, since the parallel loop implementation also showed adequate runtimes.

Replacing part of the pandas code with numpy also helped with optimization. For example, concatenation in numpy is faster than adding columns in pandas. Thus, one of the critical sections, which runs tens of thousands of times, was sped up threefold (in the sequential version).

Nonparametric regression

In practice, the bucket transformation does not always ensure the normality of the resulting distribution. As mentioned above, our observations show that about 30% of the metrics in any experiment still differ from the normal distribution after the described transformation (according to the Jarque-Bera test). For these metrics, the power of the t-test is significantly lower than what we could get with a nonparametric criterion, such as the Mann-Whitney test:

In the above picture, we can see the ECDF of p-values for the t-test and the Mann-Whitney test. The data shows that the latter is 22% more powerful in the case of a skewed distribution with a relatively small number of observations (1K) and a given 5% uplift in the test group.
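The gap can be illustrated with a small self-contained simulation; the lognormal distribution and all parameters below are chosen purely for illustration and are not the data behind the picture:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power(pvalue_fn, n_sim=1000, n=1000, uplift=1.05, alpha=0.05):
    # Skewed (lognormal) metric with a 5% multiplicative uplift in the test group.
    rejections = 0
    for _ in range(n_sim):
        control = rng.lognormal(mean=0.0, sigma=1.0, size=n)
        test = rng.lognormal(mean=0.0, sigma=1.0, size=n) * uplift
        rejections += pvalue_fn(test, control) < alpha
    return rejections / n_sim

t_power = power(lambda a, b: stats.ttest_ind(a, b, equal_var=False).pvalue)
mw_power = power(lambda a, b: stats.mannwhitneyu(a, b, alternative="two-sided").pvalue)
print(f"Welch t-test power: {t_power:.2f}, Mann-Whitney power: {mw_power:.2f}")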

Is it possible to replace the Mann-Whitney test with an appropriate linear model? Absolutely. There are two options for it: regression with the rank instead of the target value, and a probabilistic index model. The first option is simple:

signed_rank(Y) = a + b*X

But the second option is more sophisticated and promising. The probabilistic index reflects the general hypothesis underlying the Mann-Whitney test: the probability that a randomly selected observation from one sample will be greater than an observation from the other sample,

PI = P(Y < Y') + 0.5*P(Y = Y')

A probabilistic index model is a special form of generalized linear model,

P(Y < Y' | X, X') + 0.5*P(Y = Y' | X, X') = m(b*(X' - X))

where m is a link function, X and X' are covariates, and Y and Y' are targets in the control and test groups, respectively. According to the paper¹³, probabilistic index models outperform ranked regressions in terms of model quality and test power. We are currently investigating PIMs on our data and will publish the results as soon as we have meaningful findings.

Network effects

Sometimes, we conduct experiments on features that involve interaction between our users. In these cases, we should always take into account the so-called spillover effect (also known as the network effect). The spillover effect occurs when users in the control group (not exposed to the new experimental feature) communicate with treated users, which may in turn affect the behavior of the control group.

In the picture below, we can see that our users (initial experimental units) interact with each other very often, which can affect experimental results.

We are planning to continue using our approach with linear models to estimate experimental effects. In order to estimate the effect of users’ interaction, we will expand our linear model with a new factor: the number of links with other treated users. This idea is based on an approach described in another paper¹⁴.

First, we plan to construct an adjacency matrix that represents links between our users, which could consist of friendships, direct messages, etc. Then, we will compute the number of each user's treated friends in a particular experiment. Finally, this user-level data will be converted into buckets.
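A sketch of how this exposure term could be computed; the names are assumptions, and at VK's scale the adjacency matrix would of course be a very large sparse structure:

import numpy as np
from scipy import sparse

def treated_neighbor_counts(adjacency: sparse.csr_matrix,
                            treated: np.ndarray) -> np.ndarray:
    """Number of treated friends for every user.

    adjacency -- sparse user-by-user adjacency matrix (friendships, messages, ...)
    treated   -- 0/1 vector, 1 if the user is in the test group
    """
    # A single sparse matrix-vector product gives, for each user, how many of
    # their neighbours received the treatment. This user-level vector is then
    # bucketized and enters the linear model as an extra covariate.
    return adjacency @ treated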

The main challenge is to compute this new component quickly and correctly, taking into account the number of users we have and the way we store this user-level data.

After we define the ETL pipeline for this task, we plan to research the performance of our new linear model. It is possible that we will see some unexpected effects or some corner cases.

Dependent observations

In addition to the network effect, there is a more general case of dependent observations. The dependence can be driven both by the presence of edges between the vertices of the social graph and by hidden variables that determine behavior (so-called homophilic or socio-demographic characteristics of the audience). Audience groups can be distinguished using clustering, and the dependence between observations within groups can be taken into account using generalized estimating equations or mixed models. In the first case, the difficulty lies in choosing the correct correlation structure; in the second, in configuring the random effects. In addition, the transition from clustered users to buckets also presents certain difficulties.
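As one possible direction, here is a purely illustrative sketch on synthetic data of how statsmodels' GEE can account for within-cluster dependence (this is not something we currently run in production):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy user-level frame: a metric value, a 0/1 group dummy and a cluster label
# (e.g. obtained by clustering the social graph). All values are synthetic.
rng = np.random.default_rng(0)
n, n_clusters = 1000, 50
df = pd.DataFrame({
    "cluster": rng.integers(0, n_clusters, size=n),
    "g2": rng.integers(0, 2, size=n),
})
cluster_effects = 0.3 * rng.normal(size=n_clusters)
df["value"] = rng.normal(size=n) + 0.1 * df["g2"] + cluster_effects[df["cluster"]]

# GEE with an exchangeable working correlation treats observations that share
# a cluster as dependent when estimating the standard error of the effect.
gee_fit = smf.gee("value ~ g2", groups="cluster", data=df,
                  cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee_fit.summary())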

Conclusion

Our efforts have significantly improved the performance and flexibility of our experimentation platform. We achieved a speedup of about 4-5 times compared to the original version, and adding new metrics now takes about 15 minutes instead of 30 to 40. Most importantly, we have started using linear regression, an extensible framework that lets us test hypotheses in a wide variety of situations, from nonparametric criteria to exotic zero-inflated models. In addition, we can now use covariates that increase the power of our tests.

¹ Why ClickHouse might be the right choice?

² How VK inserts data into ClickHouse from tens of thousands of servers

³ https://clickhouse.tech/docs/en/sql-reference/statements/select/array-join/

⁴ https://clickhouse.tech/docs/en/sql-reference/statements/create/view/#materialized

⁵ https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf

⁶ https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree/

⁷ As follows from the central limit theorem.

⁸ We also additionally split each group (‘A’, ‘B’, ‘C’, …) randomly into two parts. For an ‘AB’ comparison (parts 1 and 2 belong to ‘A’, parts 3 and 4 to ‘B’), we compare the pairs of parts (1, 3), (1, 4), (2, 3), (2, 4) and then average all the results. This smooths out individual outliers and also makes it possible to run the (1, 2) comparison (‘AA’). If the differences in the ‘AA’ comparison are significant, then the metric is unevenly distributed relative to the given partition, and the results of the ‘AB’ comparisons may be incorrect.

⁹ https://www.mdrc.org/sites/default/files/full_533.pdf

¹⁰ https://lindeloev.github.io/tests-as-linear/

¹¹ https://en.wikipedia.org/wiki/Seemingly_unrelated_regressions

¹² We may return to this approach in the future.

¹³ https://biblio.ugent.be/publication/5855936/file/5855955.pdf

¹⁴ http://proceedings.mlr.press/v28/toulis13.pdf
