Avoiding loss of money: notifications of drops in product metrics

Jan 31, 2023

When you try to keep track of many different metrics and snapshots on dashboards, it is easy to miss an important change that points to an issue, and you can lose audience or revenue if you fail to respond in time. Here we explain how we automated notifications of drops (or unhealthy surges) in product metrics so that we could immediately estimate the cost of any issue in money, and how this benefited the product. Our experience will be most useful for analysts and product managers.

Hi all! We are Ivan Yeremeyev, Head of Product Analytics, and Sergey Nefedov and Tatyana Gaychenkova, Product Analysts. We work in the business unit responsible for media strategy and development of services. Our team provides analytics for three products:

Relap.io — one of the largest native ad networks in RuNet;
Pulse — personal content recommendation service;
Media Projects — news and topics (Lady/Hitech/Auto/Cinema, etc.) on www.mail.ru home page and in News.Mail.ru mobile app.

Finding the best approach

We tasked ourselves with finding a comprehensive solution to the problem of losing focus when tracking values of metrics.

Of course, many notification solutions already exist, and we reviewed the available options before trying to invent something of our own. Commonly used IT and BI systems (such as Grafana, Redash, and Power BI) have built-in ways to report issues; however, they were not flexible enough to adjust all processes and notification conditions to our requirements, and they could not estimate financial losses in advance.

Having tried out off-the-shelf solutions, we concluded that it made sense to build an in-house automated notification system that would help us track the impact of product changes and respond to issues in time. We also came up with the idea of forecasting metrics so that we could catch issues before they grow.

First, we asked ourselves:

What do we consider anomalies in our metrics?
When should notifications of changes in metrics be sent?
What is the optimal frequency that always captures critical deviations without becoming a nuisance to colleagues?

Having answered these questions, we tried out several approaches to notifications one by one: comparison of the current period with the previous one, moving average, moving boundaries, and ML methods.

Now let’s talk about advantages and drawbacks of these approaches drawing on the example of a time series.

A time series is a series of values of one or more variables recorded at certain time intervals (regular or irregular). Any time series can be divided into components:

Trend: a function describing long-term growth or decline;
Seasonality: recurring fluctuations in data over a year, a week, a day, etc.;
Holidays and events: occasions that may significantly affect the behavior of a time series;
Noise: random disturbances.

Current vs. last period

We started with the simplest approach: comparing the current period with the previous one (for example, month on month, as on the graph): November vs. October, October vs. September, etc. Simplicity is its main advantage, but the approach has one fatal drawback: it does not properly account for seasonality and trend in the metric. For example, on the graph a low-season period, when DAU is usually below its typical level for the business, is compared month on month with a period in which the high season has already started. Naturally, such a comparison could not give satisfactory results. We received false positive notifications too often, so we decided to try other options.
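The comparison described above can be sketched in a few lines. This is only an illustration, not our production code: the function names, the 10% threshold and the DAU figures are hypothetical.

```python
def period_over_period_change(current: float, previous: float) -> float:
    """Relative change of the current period against the previous one."""
    if previous == 0:
        raise ValueError("previous period value must be non-zero")
    return (current - previous) / previous


def is_period_alert(current: float, previous: float, threshold: float = 0.10) -> bool:
    # `threshold` is a hypothetical tolerance, not a value from the article.
    return abs(period_over_period_change(current, previous)) > threshold


# Hypothetical average DAU: a low-season month followed by a high-season month.
october_dau = 800_000
november_dau = 1_000_000

# The seasonal rise alone exceeds the threshold, which is exactly
# the kind of false positive this approach produces.
print(is_period_alert(november_dau, october_dau))
```

Because the check knows nothing about seasonality, an ordinary seasonal swing is indistinguishable from a real problem.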

Moving average

A moving average is the average value of a metric over a selected period in the past. For example, we tried comparing current DAU values with DAU averaged over the previous 30, 60 and 90 days: a notification was sent when the difference was large. This generally smoothed out minor fluctuations and partly reflected seasonality and trend. However, the method was too sensitive and produced false positive notifications, so we had to reject it.
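A minimal sketch of such a check, assuming a relative threshold (the article only says "when the difference was big", so the 15% value and the function name are our illustration):

```python
from statistics import mean


def moving_average_alert(history, current, window=30, threshold=0.15):
    """Return True when `current` deviates from the mean of the last
    `window` values by more than `threshold` (relative deviation)."""
    if len(history) < window:
        raise ValueError("not enough history for the selected window")
    baseline = mean(history[-window:])
    if baseline == 0:
        raise ValueError("baseline is zero, relative deviation is undefined")
    return abs(current - baseline) / baseline > threshold


# Hypothetical flat DAU history of 30 days.
dau_history = [100_000] * 30
print(moving_average_alert(dau_history, 80_000))  # 20% below the baseline
print(moving_average_alert(dau_history, 95_000))  # 5% below the baseline
```

The same function can be called with `window=60` or `window=90` to compare against longer baselines.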

Moving boundaries

Then we modified the approach and calculated moving upper and lower boundaries for each value (for example, the 90th and 10th percentiles of the metric distribution). We also tried boundaries based on standard deviations (for example, ±2σ and ±3σ). An alert was triggered whenever these limits were exceeded.

We used this approach as an interim solution until we could find a more suitable option. Illustration of the approach:

The main drawback here is low sensitivity: moderate changes in a metric can be missed. In the situation shown on the graph, relatively large deviations from the normal level were flagged as anomalies, but the drop of November 21, 2022 went undetected. The metric declined inside the broad boundaries and therefore was not noticed by the model.
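Both kinds of boundaries can be sketched with the standard library. The window values below are hypothetical, and the helper names are ours; the example is chosen to show how a noisy window produces wide limits that hide a noticeable drop.

```python
from statistics import mean, quantiles, stdev


def sigma_bounds(window, k=2.0):
    """Lower and upper limits as mean ± k standard deviations."""
    m, s = mean(window), stdev(window)
    return m - k * s, m + k * s


def percentile_bounds(window, lower_pct=10, upper_pct=90):
    """Lower and upper limits as percentiles of the window."""
    # With n=100 the result has 99 cut points; cuts[i] is the (i+1)-th percentile.
    cuts = quantiles(window, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def is_outside(value, bounds):
    low, high = bounds
    return value < low or value > high


# Hypothetical noisy window with a mean of 100.
window = [100, 120, 80, 110, 90, 130, 70]
bounds = sigma_bounds(window, k=2.0)

print(is_outside(75, bounds))  # a 25% drop still sits inside the ±2σ limits
print(is_outside(50, bounds))  # only a much larger drop is flagged
```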

ML methods

Next, we decided to test machine learning options. We considered tools based on different approaches, such as the XGBoost library, regressions, econometric models, and ARIMA. We also tried the Twitter Anomaly Detection library, but we didn’t like that it works as a black box and often produces false alarms. Eventually, we wrote an algorithm that forecasts time series with Prophet, a widely used and user-friendly library, because it showed the lowest MAPE (mean absolute percentage error) on our metrics. We chose MAPE because it is one of the most common and easily interpreted measures of forecast accuracy. MAPE is a ratio, which makes it possible to compare forecast errors between different metrics (for example, DAU and clicks per unique user).
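For reference, MAPE is the mean of |actual − forecast| / |actual|, expressed as a percentage. A small sketch with hypothetical numbers shows why it is comparable across metrics of very different scale:

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical values: DAU is in the thousands, clicks per user is around 2,
# yet both forecasts are off by about 5% and get about the same MAPE.
print(mape([1000, 1100], [950, 1155]))
print(mape([2.0, 2.5], [1.9, 2.625]))
```

Note that MAPE is undefined when an actual value is zero, so it suits metrics such as DAU that stay well above zero.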

We liked an idea similar to the synthetic control method (SCM). First, a predictive model is trained on historical data, taking trend, seasonality and holidays into account. Then a point forecast is built together with a 95-percent interval around it. Finally, actual values are checked against this interval: if they fall outside it, a notification is sent.

Model training on historical data:

Metric forecasting:

Our algorithm

We wrote a script that collects metrics via the API of Redash (a service that lets you work with large amounts of data, query databases, and build visualizations). Then we tried different combinations of hyperparameters and selected those that gave the lowest MAPE between actual and forecast values. To improve forecast quality, we applied a Box-Cox transformation before training the Prophet model.
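The Box-Cox transformation mentioned above maps positive values y to (y^λ − 1) / λ, or to ln(y) when λ = 0, and the forecast is mapped back with the inverse. The sketch below uses a fixed, hypothetical λ; in practice λ is usually estimated from the data (for example, `scipy.stats.boxcox` can do this), and the article does not say which value was used.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def inverse_box_cox(values, lam):
    """Map transformed values (for example, a forecast) back to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in values]
    return [(lam * v + 1) ** (1 / lam) for v in values]


dau = [90_000.0, 100_000.0, 125_000.0]     # hypothetical metric values
transformed = box_cox(dau, lam=0.5)        # the model would be trained on these
restored = inverse_box_cox(transformed, lam=0.5)
print(restored)
```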

The output served as the baseline against which we compare actual metric values. The forecast comes with upper and lower values, which are the limits of its uncertainty interval. In Prophet the width of this interval is set by the interval_width parameter, whose default is 0.80; we used 0.95. Once a day, the script checks whether actual metric values fall outside these limits and, if they do, triggers an alarm.
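The daily check itself is simple once the forecast exists. The sketch below assumes the forecast has already been produced; the field names mirror the `yhat`, `yhat_lower` and `yhat_upper` columns of a Prophet forecast, while the `ForecastPoint` class, the function name and the numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ForecastPoint:
    day: date
    yhat: float        # point forecast
    yhat_lower: float  # lower limit of the interval
    yhat_upper: float  # upper limit of the interval


def detect_anomaly(actual: float, point: ForecastPoint) -> Optional[str]:
    """Return 'drop' or 'surge' when the actual value leaves the interval."""
    if actual < point.yhat_lower:
        return "drop"
    if actual > point.yhat_upper:
        return "surge"
    return None


point = ForecastPoint(date(2022, 11, 21), yhat=100_000, yhat_lower=92_000, yhat_upper=108_000)
print(detect_anomaly(88_000, point))   # below the lower limit
print(detect_anomaly(101_500, point))  # inside the interval, no alarm
```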

This is how it looks in practice:

We’ve trained the model to forecast well: MAPE does not exceed 10% and is often much lower. The model can detect even minor metric fluctuations, but we are interested in sharp spikes caused by an event or a series of events.

We didn’t restrict ourselves to Prophet, because sometimes a simpler solution makes more sense. Forecasting with Prophet is our main method, but in some cases we compare planned vs. actual values, and a notification is sent if the metric deviates from the plan by more than a set percentage.

In parallel with finding the best approach, we also reflected on the system’s friendliness because otherwise nobody would want to use it. So we’ve put focus on the user experience as well.

User experience

Initially, we used off-the-shelf solutions for visual display which turned out to be not quite suitable for us, so we decided to write our own.

Redash

We used Redash which features a ready-made solution for sending email notifications. First, an analyst writes an SQL query for any anomalies, indicating the check frequency:

Here, it is possible to view the list of configured notifications: metric name and status at time of updating query data. “OK” means normal status, and “TRIGGERED” indicates an anomaly.

You can configure notification sending by:

Specifying a value beyond which a metric is not supposed to go; or
Using an additional boolean field depending on the condition specified in SQL query.

You can add an email list for sending notifications:

However, such emails only reported the fact that an anomaly had been found, while we wanted to include a detailed description. Furthermore, a notification can be bound to only one field, so we were unable to set more complex check conditions.

Email notifications

We then decided to write our own script for sending email notifications with all the information on metrics and snapshots that we wanted. This is where we introduced the Prophet-based model. The script processed all collected metrics every day and sent out emails whenever an alarm was triggered.

The emails contained detailed information: metric name, description, links to queries and dashboards, etc.:

However, alarm notifications were sometimes confused with working discussions because the emails had similar subjects. Therefore, we decided to try a Telegram bot for messaging.

Telegram bot

We surveyed colleagues who were interested in our notification system and found that a messenger was indeed more convenient for most of them: many preferred messages from a Telegram bot to emails.

The bot allows you to select the desired metrics and subscribe to automatic notifications on them, or to view the complete list of metrics and export manually the information you want: graph, description, link to dashboard and Excel file with detailed data.

The bot sends the list of main commands upon start:

/alerts — to automatically subscribe to periodic alert notifications for particular metrics;
/metrics — to manually view graphs, receive Excel files, etc. for any metric at any time. For example, you can review a metric first, and then subscribe to alert notifications for that metric using /alerts.

There are metric buttons in the /alerts section. Users can subscribe to a metric notification by pressing a button.

All metrics are recalculated automatically once a day. At 3:00 p.m. the bot reports all cases where acceptable limits were exceeded.
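As an illustration of how such a report could be delivered, the sketch below builds a request to the `sendMessage` method of the Telegram Bot API using only the standard library. The message layout, the token, the chat ID and the dashboard URL are placeholders, and this is not the bot's actual code.

```python
from urllib.parse import urlencode
from urllib.request import Request


def format_alert(metric: str, actual: float, lower: float, upper: float, dashboard_url: str) -> str:
    """Compose the text of an alert (hypothetical layout)."""
    direction = "below" if actual < lower else "above"
    return (
        f"{metric}: actual value {actual:,.0f} is {direction} the expected range "
        f"[{lower:,.0f}, {upper:,.0f}]\nDashboard: {dashboard_url}"
    )


def build_send_message_request(token: str, chat_id: int, text: str) -> Request:
    """Build (but do not send) a POST request to the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return Request(url, data=payload, method="POST")


message = format_alert("RPM, mobile full-text ads", 41.0, 45.0, 55.0, "https://example.com/dashboard")
request = build_send_message_request("TEST_TOKEN", 42, message)
# urllib.request.urlopen(request) would actually send the message;
# it is left out here so that the sketch has no side effects.
print(message)
```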

The /metrics section has buttons to download detailed data on particular metrics:

The Telegram bot turned out to be more convenient than emails. It helped us notice drops in money metrics in time and saved us several million rubles. For example, thanks to a timely notification, we noticed a drop in the RPM metric of full-text ads on mobile devices.

We were able to respond the very next day, which reduced potential losses by several hundred thousand rubles per day. Had we responded a week later, we would have lost over a million rubles.

Another example is a drop in the CPM dtd (target+direct) metric, where potential losses could have reached several hundred thousand rubles per day.

There are also metrics for which we want to receive notifications more frequently than once a day. So we added hourly notifications as well. These are based on a different logic, without the use of Prophet. To make things simple, consider this example:

We have a pay-per-click ad format. For each partner that earns money from clicks we have a revenue forecast, and we know each partner’s daily limits. There is also a plan according to which we must provide traffic to a partner throughout the day. The bot tracks ad impressions every hour and compares planned vs. actual values. To avoid being a nuisance, notifications are sent at most once every three hours. When the difference between planned and actual values exceeds 10%, a notification is sent:

If the issue has been fixed and potential losses compensated within the previous three hours, no notification is sent next time. For example, if ads have not been shown as planned by noon, the bot sends a notification, and colleagues start working on the issue. If clicks are replenished to the planned value before 3:00 p.m., the bot will not send another alert.
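The hourly logic can be sketched as a comparison of cumulative planned and actual values at each check time. The 10% threshold comes from the text above; the function names and all figures are hypothetical.

```python
def plan_deviation(planned: float, actual: float) -> float:
    """Relative deviation of the actual value from the plan."""
    if planned == 0:
        raise ValueError("planned value must be non-zero")
    return (actual - planned) / planned


def checks_to_notify(planned_by_hour, actual_by_hour, threshold=0.10):
    """Return (hour, deviation) pairs for check times where the cumulative
    actual value differs from the plan by more than `threshold`."""
    alerts = []
    for hour in sorted(planned_by_hour):
        deviation = plan_deviation(planned_by_hour[hour], actual_by_hour[hour])
        if abs(deviation) > threshold:
            alerts.append((hour, deviation))
    return alerts


# Hypothetical cumulative impressions at the three-hourly checks.
planned = {9: 3000, 12: 4000, 15: 5000}
actual = {9: 2950, 12: 3200, 15: 4900}

# Noon is 20% behind plan, so it is reported; by 15:00 delivery has
# caught up to within 2% of the plan, so no further alert is produced.
print(checks_to_notify(planned, actual))
```

Using cumulative values is what makes the follow-up alert disappear once the shortfall has been made up.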

Hourly messages help us see whether the impressions plan is being fulfilled during the day, and they literally save money: we can respond to deviations from the plan and avoid excessive display traffic.

Tangible effect

The bot can already do a lot, and it can be developed further. For example, you will receive too many messages if you subscribe to all metrics. To avoid overloading our colleagues, we are now working on prioritizing metrics so that the most important changes are reported first. There are a myriad of metrics, and each affects the company’s revenue differently. We added conversion from “native” measurement units into money, so all indicators can now be compared directly. Consider RPM and DAU, for example.

We calculate the impact of each of the two metrics as the absolute value of the difference between the actual and forecast values, then multiply it by the corresponding multiplier (hits, that is page loads, for RPM, and ARPU DAU for DAU) and get the result in rubles.

By sorting the converted metrics, you can subscribe first to notifications for those with the highest potential loss.
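The conversion and sorting can be sketched as follows. All figures are hypothetical, and we assume here that RPM is expressed per thousand hits, so the multiplier for RPM is hits divided by 1000; if RPM is defined per hit, the multiplier is simply the number of hits.

```python
def money_impact(actual: float, forecast: float, rubles_per_unit: float) -> float:
    """Potential loss in rubles: |actual - forecast| times a money multiplier."""
    return abs(actual - forecast) * rubles_per_unit


hits = 2_000_000   # hypothetical daily page loads
arpu = 0.5         # hypothetical revenue per daily active user, in rubles

impacts = {
    # Assumption: RPM is revenue per 1000 hits.
    "RPM": money_impact(actual=45.0, forecast=50.0, rubles_per_unit=hits / 1000),
    "DAU": money_impact(actual=950_000, forecast=1_000_000, rubles_per_unit=arpu),
}

# Metrics with the highest potential loss come first.
ranked = sorted(impacts.items(), key=lambda item: item[1], reverse=True)
for name, rubles in ranked:
    print(f"{name}: {rubles:,.0f} rubles")
```

Once every metric is expressed in rubles, a deviation in DAU can be compared directly with a deviation in RPM, which is what makes prioritization possible.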

Conclusions

The main advantage of our Telegram bot is its smart prioritization, which lets us respond first to the most critical anomalies, where the potential financial loss is substantial. As a result, we save money as well as the time spent on detecting an issue and estimating the lost revenue.

Today, our notification system forecasts tens of product metrics and reports anomalies every day, which lets us track deviations on both a daily and an hourly basis. A notable advantage of the Telegram bot is the convenience of the messenger and its API: employees receive messages according to specified conditions and schedules, with all the necessary descriptions, graphs and links to detailed statistics in the monitoring system. There is no need to spend time on manual tracking anymore: the system automatically alerts you to possible issues.