Improving Multimodal Data Crowdsourcing: Fewer Assessors, More Layers!

VK Team
20 min read · Aug 18, 2020

Abstract

Hello, we are researchers from the ML lab at ITMO University and the CoreML team at VK. Automatic post classification is an important task for VK: it's used not only to compose thematic feeds for users, but also to identify inappropriate content. Assessors are involved in solving this task, and the cost of their work can be significantly reduced by machine learning techniques such as active learning. In this article, we discuss the use of active learning for classifying multimodal data. We cover the general principles and methods of active learning, the implementation and usage details with respect to the described task, as well as the insights obtained during our study.

Introduction

Active learning is a branch of supervised machine learning in which a student model interacts with a teacher, requesting only the training data that will allow it to learn better and, as a result, faster.

This technique could be useful for companies that hire assessors for data labeling (for example, through services such as Amazon Mechanical Turk and Yandex.Toloka) and are looking for ways to make this process cheaper. For example, ReCAPTCHA gets annotation for Google Street View done for free by having users select photos where, say, traffic lights are present. Active learning could be used instead of this method.

Some companies are already using active learning to optimize crowdsourcing and talking about it publicly.

One of them is Voyage, a company that specializes in self-driving cars. In their article, they discussed how active learning can be used and concluded that it not only saves on data labeling but also increases the maximum accuracy of the model. Their approach to active learning is very similar to the one we used in our research.

Amazon describes the DALC (Deep Active Learning from targeted Crowds) framework, which explores the concept of active learning in terms of neural networks, the Bayesian approach and crowdsourcing. The study also uses the Monte Carlo dropout technique, which we used in our research as well. They also introduce the concept of "noisy annotation": while most studies on active learning assume that the assessor "tells the truth and nothing but the truth", there it is assumed that some degree of human error might be present.

Another study from Amazon can be found here. It deals with the concept of hierarchical labeling: instead of assigning a single class to an object, the assessor gives a binary answer (yes/no) about whether the object belongs to a certain superclass or class in the hierarchy. The binary questions are selected by the algorithm itself, along with the object to be labeled. Thus, the final labeling may be incomplete (only the object's category may be determined instead of the final class), but this is sufficient for training.

Enough talking about how active learning can be used. Let’s go ahead and define it :)

There are several basic approaches, or scenarios, of active learning. In our study, the model interacts with the teacher in a pool-based sampling scenario.

Fig. 1. General scheme of pool-based active learning scenarios

The essence of this scenario is as follows: suppose there is a certain amount of labeled data on which the model is already trained (we call this the passive phase). Later, using this model in its current state, it’s possible to evaluate unlabeled data on its “usefulness” for training.

The “most useful” data is sent to an expert for labeling, then returned to train the model further (this is the active phase). At the same time, the data sent for labeling is called a query. In a pool-based scenario, queries are grouped in a pool. The methods for selecting pool objects to send in a request to an expert, or, in other words, methods for assessing the “usefulness” of data, are called active learning strategies. Next, we describe the problem and the dataset, and consider specific strategies for active learning on the example of the considered task.

Task and dataset

As you might remember, our general task is the classification of VK posts, where each post is a multimodal object consisting of an image and text. The provided dataset includes ~250 thousand post embeddings, where each object (post) optionally contains (1) a vector representation (embedding) of the post's picture and (2) a vector representation of its text, and is labeled with one of 50 classes (post topics). It is worth noting that the dataset is highly unbalanced (see Fig. 2).

Fig. 2 — Histogram of class distribution

Baseline classification model

An important step in solving any active learning task, as well as any other machine learning task in general, is to choose the optimal baseline model.

One of the key requirements for the model is the absence of overfitting, since active learning implies constant fine-tuning of the model. If the model is overfitted, no matter how we choose to select new data, the accuracy will not increase significantly, or may even decrease. Of course, it’s possible to train the model from scratch at every step of the active learning phase using early stopping to avoid overfitting. However, this will make the experiments take too long, since instead of only one epoch of fine-tuning, dozens of epochs will be needed.

In this article, we experimentally studied various configurations of baseline deep neural networks. We tried the following techniques and architectures: residual connections, highway blocks, and encoders. We also considered the following techniques based on fusion to take into account the multimodality of data: attention for multimodal data and matrix fusion. Some methods of accounting for multimodality of data, such as alignment and training based on various representations, could not be applied to this task because of the given dataset, which was provided to us in the form of pre-trained embedding vectors.

However, since the subject of this article is not the classification of multimodal data itself, but active learning, we omit the detailed description of the model selection process, and describe only the final result.

The criterion for choosing the final model was the maximization of validation accuracy. Therefore, the following architecture was chosen as a classifier (Fig. 3):

Fig. 3. Baseline architecture for classification

In this model, a late fusion of modalities is performed. The idea is that the embeddings of the picture and the text are first processed separately (the picture is encoded), and only then combined. This approach allows us to reduce the size of the neural network, which first extracts the necessary information from each modality and then combines them for the final prediction. In addition, the three heads of the model (text only, image only, mixed) force the network to learn weights that extract as much classification-relevant information from each modality as possible.

The red and blue blocks in Fig. 3 have the following form:

Fig. 4. Description of the main blocks of the basic neural network model for classification

Initially, a model with only one output was implemented. While we planned to use the additional outputs for various active learning strategies, it turned out that the chosen architecture achieves higher accuracy than a similar model with a single output for the two modalities.
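For illustration, here is a minimal PyTorch sketch of such a three-head, late-fusion classifier. The embedding sizes, hidden widths and encoder internals are assumptions on our part rather than the exact configuration from Fig. 3 and Fig. 4:

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Minimal sketch of a three-head late-fusion classifier.

    The embedding sizes, hidden widths and encoder internals are assumptions;
    the actual model additionally uses the blocks shown in Fig. 4.
    """
    def __init__(self, img_dim=512, txt_dim=300, hidden=256, n_classes=50, p_drop=0.3):
        super().__init__()
        # Each modality is processed separately before fusion.
        self.img_encoder = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop)
        )
        self.txt_encoder = nn.Sequential(
            nn.Linear(txt_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop)
        )
        # Three heads: image-only, text-only, and the fused ("mixed") head.
        self.img_head = nn.Linear(hidden, n_classes)
        self.txt_head = nn.Linear(hidden, n_classes)
        self.mix_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_classes)
        )

    def forward(self, img_emb, txt_emb):
        h_img = self.img_encoder(img_emb)
        h_txt = self.txt_encoder(txt_emb)
        h_mix = torch.cat([h_img, h_txt], dim=1)  # late fusion of the two modalities
        return self.img_head(h_img), self.txt_head(h_txt), self.mix_head(h_mix)
```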

One important issue related to the chosen architecture is how to properly calculate the loss function. Possible options are (1) a simple summation of the loss components from the different heads, (2) a weighted loss function with manually chosen weights for the heads' components (e.g. via grid search), (3) a weighted loss function with learned weights for the heads' components. We chose the third option and, inspired by an article on Bayesian deep learning that takes into account the aleatoric uncertainty of the model, which arises from the noisy nature of the data, used a loss function of the form

L = (1/σ₁²)·L₁ + (1/σ₂²)·L₂ + (1/σ₃²)·L₃ + log σ₁ + log σ₂ + log σ₃,

where L₁, L₂, L₃ are the loss functions of the model's different outputs (in our case, categorical cross-entropy), and σ₁, σ₂, σ₃ are learned parameters representing the variance, i.e. the noise in the data.
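A minimal PyTorch sketch of this kind of uncertainty-weighted multi-head loss could look as follows (parameterizing log σ² instead of σ is our own choice here, made for numerical stability):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyWeightedLoss(nn.Module):
    """Sketch of the weighted multi-head loss with learned noise parameters.

    We parameterize log(sigma_i^2) rather than sigma_i directly; this detail is
    our own choice for numerical stability, not taken from the original setup.
    """
    def __init__(self, n_heads=3):
        super().__init__()
        self.log_sigma2 = nn.Parameter(torch.zeros(n_heads))

    def forward(self, logits_per_head, target):
        total = 0.0
        for i, logits in enumerate(logits_per_head):
            ce = F.cross_entropy(logits, target)         # L_i, categorical cross-entropy
            precision = torch.exp(-self.log_sigma2[i])   # 1 / sigma_i^2
            total = total + precision * ce + 0.5 * self.log_sigma2[i]  # + log sigma_i
        return total

# Usage: loss = criterion(model(img_emb, txt_emb), labels)
```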

Pool-based sampling

After choosing the baseline model, we implemented and evaluated various strategies for active learning. According to the pool-based sampling scenario, the following experimental pipeline was used:

  1. Sample a number of random objects from the training dataset.
  2. Train the model on these objects.
  3. Make a query to select a new data pool from the remaining training set based on the selected strategy, and add them to the labeled data.
  4. Fine-tune the model.
  5. Get the values of metrics (validation accuracy).
  6. Repeat steps 3–5 until a specific criterion is reached (for example, until the entire training dataset is exhausted).

The first two steps correspond to the passive training phase, while steps 3–6 correspond to the active phase.
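Putting the pipeline together, a schematic Python version of the loop might look like this (the `strategy` function, the model interface with `fit`/`fine_tune`/`score`, and the default sizes are placeholders of ours; the values we actually used are discussed below):

```python
import numpy as np

def active_learning_loop(model, X_pool, y_pool, X_val, y_val,
                         strategy, init_size=2000, query_size=20, n_steps=100):
    """Sketch of the pool-based sampling pipeline described above.

    `strategy(model, X_unlabeled)` is assumed to return a "usefulness" score per
    object (higher = more worth labeling); `fit`, `fine_tune` and `score` are a
    hypothetical model interface.
    """
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), size=init_size, replace=False))   # step 1
    unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

    model.fit(X_pool[labeled], y_pool[labeled])                              # step 2: passive phase

    history = []
    for _ in range(n_steps):                                                 # steps 3-6: active phase
        scores = strategy(model, X_pool[unlabeled])
        query = [unlabeled[i] for i in np.argsort(scores)[-query_size:]]     # "most useful" objects
        labeled += query                                                     # labels come from the assessor
        unlabeled = [i for i in unlabeled if i not in set(query)]
        model.fine_tune(X_pool[labeled], y_pool[labeled])                    # step 4
        history.append(model.score(X_val, y_val))                            # step 5: validation accuracy
    return history
```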

In addition to the strategy itself, the two following parameters are significant in this pipeline:

  1. The size of the initial dataset on which the model is trained during the passive phase. If this parameter is too small, it will be difficult to assess the effect of active learning compared to fine-tuning on randomly chosen data, since the accuracy will rapidly increase in both cases. If the size of the initial labeled set is too large, the model will already be well trained in the passive phase, and the increase in accuracy in the active phase will be weak regardless of the training method. In our case, the optimal size of the initial dataset is 2000.
  2. The size of the query to the assessor. Ideally, objects would be sent to the assessor one at a time: only the first object in the query truly maximizes the criterion of the active learning strategy (when objects are sorted in descending order of the criterion), and the remaining objects may lose their usefulness once the model has been trained on that first object. However, selecting objects one at a time would significantly increase the duration of the experiments and complicate the study in general, so we chose a query size of 20 objects.

In addition, the number of steps in the active learning phase can be varied. Obviously, as the number of steps increases, the accuracy of the model would increase as well. However, since the main goal of the project was not to achieve the maximum possible classification accuracy, but to study the effectiveness of active learning, we used a fixed number of steps equal to 100 or 200.

Now that we have described how and on what to test the active learning strategies, let's move on to their implementation.

Insight #1: effect of batch size

As a baseline, let’s consider how the model trains with random data selection (passive learning) (Fig. 5).

Fig. 5. Plot of passive training of baseline models. The result of five runs with a confidence interval is given.

For reliability, this and all subsequent experiments were run five times with different random states, while the plots show the average accuracy of runs with a confidence interval.

Here we have our first insight on active learning for our task. As you can see, the learning curve dips at certain intervals, even though intuitively it might seem that accuracy should increase monotonically.

Tuning the batch size parameter helped eliminate this. Due to the large number of classes (50), the batch size was set to 512 by default. However, it turned out that, with the finite size of the labeled dataset and a fixed batch size, the last batch can be extremely small, which introduced noise into the gradient and negatively affected the training of the whole model. The following solutions to this problem were tested: (1) upsampling the data so that all batches have the same length, and (2) increasing the number of training epochs so that the influence of a small batch is dispersed across subsequent batches. The solution that worked for us was an adaptive batch size: at each step of the active learning phase, the batch size was recalculated from the original batch size b and the current size n of the labeled dataset, so that no batch ends up much smaller than the others.
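One simple way to implement such an adaptive batch size (a sketch of the idea, not necessarily the exact formula we used) is to split the n labeled objects into ⌈n/b⌉ batches of nearly equal size:

```python
import math

def adaptive_batch_size(b, n):
    """Split the n labeled objects into ceil(n / b) batches of (almost) equal size,
    so that no batch ends up drastically smaller than the others."""
    n_batches = math.ceil(n / b)
    return math.ceil(n / n_batches)

# Example: with b = 512 and n = 2100, a fixed batch size leaves a last batch of
# only 52 objects, while the adaptive size gives five batches of 420 objects each.
print(adaptive_batch_size(512, 2100))  # -> 420
```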

The adaptive approach helped smooth out the accuracy curve and produce a monotonically increasing plot (Fig. 6).

Fig. 6. Comparison of a fixed batch size ("passive" on the plot) and an adaptive one ("passive + flexible" on the plot)

Note: the plots given are for a model with one output, but, without loss of generality, adaptive batch size can be applied to a model with three outputs, which was used in further experiments.

Now we go directly to the study of active learning methods for our task.

Uncertainty sampling

As the first method, we implemented the simplest active learning strategies from a review article, namely uncertainty sampling. As the name implies, the query is composed of the objects about whose predictions the model is most uncertain.

The review article provides three ways of measuring this uncertainty:

  1. Least confident sampling.

In this strategy, the object passed to the expert for labeling is the one whose most probable predicted class the model is least confident about:

x*_LC = argmaxₓ (1 − P(ŷ | x)),

where ŷ is the class considered most likely by the model, y is one of the possible classes, x is one of the dataset objects, and x*_LC is the object selected by the least confident strategy.

This measure can be understood as follows. Suppose the loss function on an object is 1 − P(ŷ | x). The strategy then selects the object with the worst expected value of this loss, trains on it, and thereby reduces the value of the loss function.

However, this method has a drawback. Suppose that for one object the model predicted the class distribution {0.5; 0.49; 0.01}, and for another, {0.49; 0.255; 0.255}. The algorithm will select the second object, since its most probable prediction (0.49) is smaller than the most probable prediction of the first object (0.5), even though it is intuitively clear that the first object carries greater information gain for learning, since the probabilities of its first and second classes are almost equal. To take such situations into account, the algorithm needs to be modified.

2. Margin sampling

According to this strategy, the algorithm sends for labeling the objects whose two most probable classes have nearly equal probabilities:

x*_M = argminₓ (P(ŷ₁ | x) − P(ŷ₂ | x)),

where ŷ₁ is the most probable class for an object x, and ŷ₂ is the second most probable class.

From the point of view of information gain, this method is more advantageous, since the algorithm takes into account twice as much information about the probability distribution over classes. However, the method is still not ideal, since the probabilities of all the other classes are ignored. For example, the popular MNIST dataset of handwritten digits contains ten classes, so only 1/5 of the distribution information is taken into account. Entropy sampling is intended to overcome this drawback.
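Returning to the numerical example above, margin sampling resolves it the way intuition suggests (a tiny illustration):

```python
# Class distributions from the example above.
p1 = [0.50, 0.49, 0.01]
p2 = [0.49, 0.255, 0.255]

least_confident = [1 - max(p) for p in (p1, p2)]            # [0.50, 0.51] -> picks the second object
margin = [sorted(p)[-1] - sorted(p)[-2] for p in (p1, p2)]  # [0.01, 0.235] -> picks the first object
```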

3. Entropy sampling

In this strategy, entropy is used to measure the model's uncertainty:

x*_H = argmaxₓ ( −Σᵢ P(yᵢ | x) · log P(yᵢ | x) ),

where P(yᵢ | x) is the probability of the i-th class for an object x as predicted by the model.

The entropy method is convenient because it generalizes the two methods described above: it selects both objects whose most probable prediction has low confidence and objects for which the two most probable classes have similar probabilities.
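All three criteria reduce to simple operations on the predicted class probabilities; here is a minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def uncertainty_scores(probs, method="margin"):
    """probs: array of shape (n_objects, n_classes) with predicted class probabilities.
    Returns one score per object; the objects with the highest scores go into the query."""
    if method == "least_confident":
        return 1.0 - probs.max(axis=1)                        # 1 - P(y_hat | x)
    if method == "margin":
        top2 = np.sort(probs, axis=1)[:, -2:]                 # two largest probabilities
        return -(top2[:, 1] - top2[:, 0])                     # small margin -> high score
    if method == "entropy":
        return -(probs * np.log(probs + 1e-12)).sum(axis=1)   # H(P(y | x))
    raise ValueError(f"unknown method: {method}")

# query = indices of the q most "useful" objects:
# query = np.argsort(uncertainty_scores(probs, "margin"))[-q:]
```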

According to a review article, each of the listed methods takes into account more information than the previous one; therefore, it was initially expected that the entropy sampling method would be the most effective.

However, practical results for our problem showed a discrepancy with the theoretical assumptions (Fig. 7).

Fig. 7. Comparison results of various types of uncertainty sampling strategy with passive training (from left to right: the method of least confidence, the method of margin sampling, the method of maximum entropy)

As can be seen, the least confident and entropy sampling methods performed worse than passive training with random selection of objects, while margin sampling turned out to be the most effective.

To prevent the reader’s suspicion of bugs in the implementation of the methods, it is worth noting that all the methods were also tested on the MNIST dataset, on which, for example, the entropy sampling strategy demonstrated results that did not contradict the theoretical effectiveness of the method. Thus, we can conclude that the practical effectiveness of the described methods is ambiguous and depends on the specific problem being solved.

The listed methods are simple to implement and have low computational complexity. The complexity of one query to an expert can be estimated as O(p log q), where p is the size of the unlabeled dataset, and q is the number of objects in the query to the expert. Also, these methods are easy to apply in practice since they do not require making changes to the model.

BALD

The next strategy that will be discussed is BALD sampling (Bayesian Active Learning by Disagreement). BALD is a Bayesian approach to measuring the uncertainty of a model committee.

From the point of view of active learning classification methods, this method is a part of the query-by-committee strategy, the main idea of which is to use the predictions of several models with competing hypotheses. With several models we can, for example, use their average prediction as the basis for uncertainty sampling. We can also calculate the disagreement of the models and choose objects for labeling among those about which the models disagree the most in their predictions. Experiments were conducted with the QBC method based on Monte Carlo Dropout, which will be discussed later on.

The problem with classical Bayesian methods for deep learning is the need to tune a large number of parameters, which makes training models twice as expensive. Therefore, the authors suggested using dropout as a method for Bayesian approximation. The difference between this approach and how dropout is usually used is that in this method, dropout is used during the inference (testing) stage. For each sample object, prediction is made several times by the same model but with different dropout masks (Fig. 8). This sampling method is called Monte Carlo Dropout (MC Dropout) and does not require an increase in memory cost for training the model. Thus, using one model, several predictions can be obtained, which could be different for the same object. Model disagreement (where models differ only in dropout masks) is considered based on Mutual Information (MI). MI here also represents the epistemic uncertainty, or uncertainty of the committee. It’s a kind of uncertainty that becomes smaller with the addition of new data, which is consistent with the concept of active learning in general.

Fig. 8. Monte Carlo Dropout for BALD method illustration

To start with, we used the averaged prediction of the MC Dropout committee (QBC) and applied the various uncertainty sampling methods to it. Compared with the corresponding methods that use a single prediction, this did not lead to an improvement (Fig. 9).

Fig. 9. Comparison results of various types of uncertainty sampling strategies based on QBC and without it with passive training (on the left, the method of least confidence; in the center, the method of minimum margin; on the right, the method of maximum entropy)

The next step was to use BALD, the committee's measure of disagreement. As already mentioned, the mutual information of the committee models is used for this:

I(x) = H( P̄(y | x) ) − (1/k) Σⱼ H( Pⱼ(y | x) ),

where P̄(y | x) = (1/k) Σⱼ Pⱼ(y | x) is the averaged prediction of the committee, Pⱼ is the prediction obtained with the j-th dropout mask, H(P) = −Σᵢ P(yᵢ | x) · log P(yᵢ | x) is the entropy over the n classes, and k is the number of models in the committee.

The first term is the entropy of the averaged prediction of the committee, and the second term is the average entropy of each model separately. Thus, only the objects about whose predictions the committee disagrees the most are selected. A sketch of the computation is given below; the results of applying the BALD method are presented in Fig. 10.
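Below is a compact PyTorch sketch of MC Dropout inference and the BALD score; for simplicity it assumes a single-output model (for our three-head classifier, one of the heads, e.g. the mixed one, would be used), and the number of stochastic passes k is illustrative:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_probs(model, x, k=10):
    """k stochastic forward passes with dropout kept active at inference time.
    Returns a tensor of shape (k, n_objects, n_classes)."""
    model.eval()                       # freeze batch norm statistics
    for m in model.modules():
        if isinstance(m, nn.Dropout):  # ...but keep dropout stochastic
            m.train()
    return torch.stack([torch.softmax(model(x), dim=-1) for _ in range(k)])

def bald_scores(probs, eps=1e-12):
    """BALD score = H(averaged prediction) - average H(individual predictions)."""
    mean_p = probs.mean(dim=0)                                        # committee average
    h_mean = -(mean_p * (mean_p + eps).log()).sum(dim=-1)             # entropy of the average
    mean_h = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)   # average entropy
    return h_mean - mean_h                                            # mutual information

# probs = mc_dropout_probs(model, x_unlabeled)
# query = bald_scores(probs).topk(20).indices   # 20 objects per query, as above
```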

Fig. 10. Results of applying the BALD strategy in comparison with passive learning

Unfortunately, this method did not give the expected result over the long run of the experiment, despite an improvement over the passive method at the beginning of training.

The complexity of query-by-committee strategies in general and BALD in particular is proportional to the number of predictions made for each object. In turn, the prediction complexity per object is similar to that of the uncertainty sampling methods. Thus, the complexity of one query is O(kp log q), where p is the size of the unlabeled dataset, q is the number of objects in the query to the expert, and k is the number of predictions calculated for one object. In practice, applying the BALD method can be difficult in the tf.keras framework, as it does not have sufficient flexibility for working with layers. Therefore, for the purposes of this project, the PyTorch framework was chosen. It allowed us not only to easily implement dropout during inference but also to disable batch normalization during the active phase, which will be discussed later.

Insight #2: disabling batch normalization in the active phase

The selected classification model uses batch normalization layers in its architecture. The idea of batch normalization is to learn data normalization parameters during training and apply them during inference. We used the idea of treating the active learning phase as an inference stage and disabling batch normalization training during it. In addition, it seems intuitive that such an approach should help avoid model bias. To our knowledge, this issue has not been investigated in relation to active learning methods. For the experiments, the BALD method described above was taken as a baseline. Let's consider the results of the experiments (Fig. 11).

Fig. 11. BALD strategy with disabled batch normalization compared to the standard method and passive learning

As can be seen, this approach allowed the strategy to outperform passive learning, and thus we get yet another under-explored aspect of active learning.

An important aspect for using disabled batch normalization successfully is having a sufficiently large and diverse dataset for training in the passive phase, since the accuracy of the model strongly depends on the normalization parameters found for the initial sample.
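In PyTorch, freezing batch normalization for the active phase can be done by switching the corresponding layers to evaluation mode before each fine-tuning step (a sketch; the exact layer types depend on the architecture):

```python
import torch.nn as nn

def freeze_batch_norm(model):
    """Put all batch norm layers into eval mode, so that the running statistics
    collected during the passive phase are used and no longer updated."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.eval()

# During each fine-tuning step of the active phase:
# model.train()            # the rest of the network is still being fine-tuned
# freeze_batch_norm(model)
# ... forward pass / loss / backward pass / optimizer step ...
```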

Learning loss

Now let’s considerthe problem of active learning from another angle. Suppose that the uncertainty of the model with respect to the classes of certain objects is proportional to the values of the loss function for these objects. However, until we know the real class of the object, we cannot calculate the value of the loss function.

We created an auxiliary model that receives the outputs of the intermediate and last layers of the main model. The goal of the auxiliary model is to predict the value of the loss function. For the query, we select the objects with the maximum predicted loss value. This method is called learning loss; more about it can be found here. A sketch of the idea is given below, and the results of the first experiments with the method for the baseline model are shown in Fig. 12.
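The original learning loss paper uses a ranking-based objective; purely for illustration, the simplified sketch below assumes a plain MSE regression onto the per-object loss, and the dimensions and interface are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LossPredictor(nn.Module):
    """Auxiliary head that predicts the per-object loss of the main model
    from one of its intermediate representations (dimensions are assumptions)."""
    def __init__(self, feat_dim=256, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, features):
        return self.net(features).squeeze(-1)

# Joint training step (sketch; `return_features=True` is a hypothetical interface):
# logits, features = main_model(x, return_features=True)
# per_object_ce = F.cross_entropy(logits, y, reduction="none")
# pred_loss = loss_predictor(features.detach())
# total = per_object_ce.mean() + F.mse_loss(pred_loss, per_object_ce.detach())
# For the query, pick the unlabeled objects with the largest predicted loss.
```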

Fig. 12. Learning loss application results for the baseline model compared to passive learning

Compared to passive learning on randomly selected objects, the learning loss method did not lead to an improvement. The next step would be either to try this method with other model architectures or to consider it ineffective for our task.

Instead, we tried the following experiment. While real class labels are unknown to the model in the usual active learning scenario, they are known in our task, which allows us to calculate an "ideal" learning loss. Knowing the real class label of an object, we compute the value of the loss function on it and add to the labeled dataset the objects with the largest loss values. We call this strategy ideal learning loss (Fig. 13).

Fig. 13. Results of applying ideal learning loss to the baseline model in comparison with its passive learning

Despite the intuitive effectiveness of this approach, it proved to be worse than the basic learning loss method.

We hypothesized that the value of the loss function is only weakly related, or even inversely related, to the accuracy of the model.

To check this, we can measure the correlation between the accuracy of a model trained on a sample and the average value of the loss function over the objects of that sample. As a result, we have the following experimental pipeline:

  1. Train the model on the initial sample (2,000 objects), same as in active learning.
  2. Select 10,000 objects from the entire set of unlabeled data (to speed up the calculation).
  3. Calculate the values of the loss function for the selected objects of the unlabeled sample.
  4. Sort objects by the values received.
  5. Split objects into groups of 100.
  6. Train the model on each group in parallel, starting from the weights obtained in step 1.
  7. Get the resulting accuracy.

Next, we compute the Spearman correlation between the accuracy of the model trained on a particular group and the average value of the loss function over the objects of that group. We also calculate how the accuracy of the model correlates with the average value of the margin metric (from the margin sampling method).
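The correlation itself is straightforward to compute: for each group of 100 objects we have one accuracy value and one average metric value (the numbers below are illustrative placeholders, not our measurements):

```python
from scipy.stats import spearmanr

# One value per group of 100 objects (illustrative placeholders):
accuracies   = [0.61, 0.63, 0.60, 0.64, 0.62]   # validation accuracy after fine-tuning on the group
mean_losses  = [2.10, 1.40, 2.30, 1.20, 1.80]   # average loss of the group's objects before fine-tuning
mean_margins = [0.05, 0.20, 0.04, 0.25, 0.10]   # average margin of the group's objects

rho_loss, p_loss = spearmanr(accuracies, mean_losses)
rho_margin, p_margin = spearmanr(accuracies, mean_margins)
print(f"loss vs accuracy:   rho = {rho_loss:.2f}, p = {p_loss:.2f}")
print(f"margin vs accuracy: rho = {rho_margin:.2f}, p = {p_margin:.2f}")
```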

Table 1. Accuracy and active learning metrics’ correlations for the VK posts dataset

As can be seen, while the correlation for the margin sampling method is weakly positive (that is, knowing the margin tells us something about an object's usefulness for training the model), in the case of the loss function the correlation is weakly negative.

This leads to the following question. What if we try to choose objects with the lowest values of the loss function for labeling?

Oddly enough, the experiments showed that this option also does not work as expected (Fig. 14):

Fig. 14. Results of applying the reverse ideal learning loss for the baseline model in comparison with direct ideal learning loss and passive learning

Despite the poor results for the target dataset, the following correlation values were obtained for the MNIST dataset:

Table 2. Accuracy and active learning metrics’ correlations for the MNIST dataset

Moreover, on MNIST the ideal learning loss method itself works as expected (Fig. 15).

Fig. 15. Active learning of the digit classifier from the MNIST dataset with the ideal learning loss strategy. Blue graph — ideal learning loss, orange — passive learning

It turns out that the method can work well, and the assumption that it is more profitable for the model to train on the data with the highest loss values can hold. However, it does not hold for our dataset.

The complexity of the learning loss method is the same as for the uncertainty sampling methods: O(p log q), where p is the size of the unlabeled dataset, and q is the number of objects in the query to the assessor. When applying it, it's important to keep in mind that not only the basic model but also the auxiliary one needs to be trained. This method is more complicated to apply in practice than the previous ones, since it requires an auxiliary model and training a cascade of models.

Conclusion

This article isn’t long enough to go into all of the applied methods and experiments. Here we described only the main and most interesting ones. It’s worth noting that one of the first and simplest methods, margin sampling, turned out to be the most effective. The results of a long run of it can be seen in Fig. 16.

Fig. 16. Comparison of training on randomly selected data (passive training) and on data selected by the margin sampling strategy

The plot shows that by training the model with active learning (in our case, the margin sampling strategy), the same peak accuracy as that of a model trained in the passive way can be achieved using ~25,000 fewer objects. As a result, about 25% of the crowdsourcing resources are saved, which is quite significant.

It’s also worth pointing out that in our task, the effectiveness of the methods was limited by the size of the dataset, whereas in real life, the sample can be much larger, and therefore, the methods of active learning will have more options for queries and higher potential efficiency.

In addition, all of the methods described are simple to implement and have low computational complexity. However, there are some points to pay attention to during implementation: the choice of batch size, and whether certain approaches that have proven themselves in deep neural networks, such as batch normalization, are advisable in the active learning setting.
