Read Instead of Listen: How Speech Recognition Works on VKontakte
When it comes to messages, reading is faster than listening. It’s also easier to skim through text to find and verify details. However, sometimes it’s a lot more convenient to send a voice message than to type everything out.
My name is Nadya Zueva. In this article, I’ll discuss how we at VKontakte were able to help broker peace between lovers and haters of voice messages using automatic speech recognition. I’ll share with you how we came up with our solution, which models we use, what data we trained them on, and how we optimized them to work quickly in production.
We started conducting research on voice message speech recognition in 2018. At that time, we thought that it could become a cool product feature and a true challenge for our applied research team. Voice messages are recorded in conditions that are far from ideal, and people speak using a lot of slang and don’t really care much for proper diction. And at the same time, speech recognition needs to be fast. Spending 10 minutes transcribing a 10-second voice message is not an option.
In the beginning, we conducted all of our experiments using English speech, as there are a lot of good data sets in English, and learned how to recognize it. However, a large portion of the VK audience speaks Russian, and there were no open-access Russian data sets that we could use to train our models. Now, the situation with Russian data sets is better as there is Golos from Sber, Common Voice from Mozilla, and several others. But before this, it was a separate problem that we had to solve by creating our own data set.
The first version of our model was based on wav2letter++ from Facebook AI Research and was ready for experiments in production in 2019. We launched it in silent mode as a feature for searching for voice messages. With this, we were able to verify that speech recognition could be useful for turning voice messages into text, and we began investing more resources into creating this technology.
At the beginning of 2020, our task was more than simply creating a precise model. We had to increase performance to serve an audience of many millions. An additional challenge for us was slang. We had no choice but to figure out how to parse it.
Now, the speech recognition pipeline consists of three models. The first is the Acoustic model, which is responsible for recognizing sounds. The second is the Language model, which forms words from the sounds. And the third is the Punctuation model, which adds punctuation to the text. We’ll go over each of these models, but first, let’s prepare the input data.
For ASR tasks, the first thing you have to do is transform the sound into a format that a neural network can work with. By itself, the sound is saved in the computer’s memory as an array of amplitude values sampled over time. The sample rate is usually tens of thousands of samples per second (tens of kHz), so the resulting track turns out to be very long and difficult to work with. Therefore, before running it through the neural network, the sound is preprocessed. It’s converted into a spectrogram, which shows the intensity of sound vibrations at various frequencies over time.
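As a rough illustration of this conversion, the sketch below computes a magnitude spectrogram with a short-time Fourier transform in plain Python. The frame size, hop, Hann window, and the `spectrogram` function itself are assumptions chosen for readability. The article doesn’t specify the exact preprocessing, and real systems typically rely on optimized FFT libraries.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame holds one magnitude per frequency bin."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        # Naive DFT, keeping only the non-negative frequencies.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Illustrative input: a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)

# Each bin spans sample_rate / frame_size = 125 Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` describes one short slice of time, and each column describes one frequency band, so the strongest column for this tone sits at the band containing 1 kHz.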
The approach using a spectrogram is considered to be conservative. There are other options as well, such as wav2vec (which is similar to NLP’s word2vec, but for sound). Despite the fact that state-of-the-art ASR models currently use wav2vec, this approach didn’t provide an improvement in quality for our architecture.
After a raw signal has been converted into a convenient format to use with neural networks, we are ready to recognize speech by getting a probability distribution of phonemes over time from the sound.
Most approaches first create phonetic transcriptions (basically, “what is heard is what is written”), and then a separate language model “combs through” the result, correcting grammatical and orthographic mistakes, and removing extra letters.
Markov models used to serve as simple acoustic models for speech recognition (for example, in aligners). Neural networks have since replaced them for full speech recognition, but Markov models are still used for narrower tasks, such as breaking up long audio signals into smaller fragments.
In 2019, when we were actively working on this project, a considerable number of speech recognition architectures already existed, such as DeepSpeech2 (SOTA in 2018 on the LibriSpeech data set). It combines two types of layers: recurrent and convolutional. Recurrent layers allow phrase continuations to be generated with careful attention to the previously generated words. Convolutional layers are responsible for extracting features from spectrograms. In the paper on this architecture, the authors used CTC loss for training. This makes it possible for the model to recognize words like “Privye-e-e-et” and “Privyet” (in English, “Hello-o-o-o” and “Hello”) as the same without stumbling over the length of the sound. This loss function is also used for recognizing handwritten text.
A bit later, wav2letter++ by FAIR was released. What made it unique was that it only used convolutional layers without autoregression (with autoregression we look at the previously generated words and go over them consecutively, which slows down the neural network). The creators of wav2letter++ focused on speed, which is why it was created using C++. We started with this architecture when developing our voice message search.
Using fully convolutional approaches opened new possibilities for researchers. Not long after, the Jasper architecture appeared, which was also fully convolutional but used the idea of residual connections, just like ResNet or transformers. Then came QuartzNet from NVIDIA, which was based on Jasper. This is the one we used.
At the time of writing this article, the SOTA solution is Conformer.
Regardless of which architecture we choose, the neural network receives a spectrogram as input and outputs a matrix of each phoneme’s probabilities over time. This matrix is also called an emission set.
Using a greedy decoder, we could already get an answer from the emission set data by selecting the most likely sound for each moment in time.
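A minimal sketch of greedy decoding is shown below. It assumes a CTC-style output, where the alphabet includes a blank symbol and repeated symbols are collapsed. The toy alphabet, the probabilities, and the `greedy_decode` function are made up for illustration.

```python
BLANK = "_"


def greedy_decode(emissions, alphabet):
    """emissions: one probability list per time step, aligned with alphabet."""
    # Pick the most likely symbol at every time step.
    best = [alphabet[max(range(len(frame)), key=frame.__getitem__)]
            for frame in emissions]
    decoded = []
    previous = None
    for symbol in best:
        # Collapse repeats and drop blanks, as CTC decoding does.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "h", "i"]
emissions = [
    [0.10, 0.80, 0.10],  # h
    [0.20, 0.70, 0.10],  # h (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # i
    [0.10, 0.20, 0.70],  # i (repeat, collapsed)
]
text = greedy_decode(emissions, alphabet)
```

Because every time step is decided independently, nothing in this procedure checks whether the resulting string is a plausible word.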
But this approach doesn’t know much about proper spelling and is likely to provide answers with many mistakes. To fix this, we use beam search decoding, weighting the hypotheses with a language model.
After we get an emission set, we need to generate text. Decoding is not only done using probabilities that our acoustic model gives us. It also takes the “opinion” of the language model into consideration. It can let us know how likely it is to come across such a combination of characters or words in a language.
For decoding, we use the beam search algorithm. The idea behind it is that we don’t only choose the most likely sound for a given moment but also calculate the likelihood of the entire chain, taking previous words into account and keeping the top candidates at every step. As a result, we select the most likely variant.
When selecting candidates, we assign a probability to each, taking the answers from the acoustic and language models into account. You can see the formula in the picture below.
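To make the idea of combining the two models concrete, here is a sketch of a scoring rule that is commonly used in ASR decoders: the acoustic log-probability plus a weighted language model log-probability plus a word-count bonus. This is an assumption and not necessarily the exact formula used in our service. The weights `alpha` and `beta`, the function names, and the probabilities are hypothetical, and the code shows only the scoring and pruning of candidates, not a full beam search decoder.

```python
import math


def combined_score(acoustic_logprob, lm_logprob, num_words, alpha=0.5, beta=0.5):
    """Weighted sum of model scores; alpha and beta are hypothetical weights."""
    return acoustic_logprob + alpha * lm_logprob + beta * num_words


def prune_beam(hypotheses, beam_width=2):
    """hypotheses: list of (text, acoustic_logprob, lm_logprob) tuples."""
    ranked = sorted(
        hypotheses,
        key=lambda h: combined_score(h[1], h[2], len(h[0].split())),
        reverse=True,
    )
    # Keep only the top candidates for the next decoding step.
    return ranked[:beam_width]


# Made-up probabilities: the acoustic model slightly prefers the first
# candidate, while the language model strongly prefers the second.
candidates = [
    ("wreck a nice beach", math.log(0.5), math.log(0.0001)),
    ("recognize speech", math.log(0.4), math.log(0.05)),
]
best = prune_beam(candidates, beam_width=1)
```

With these numbers the language model outweighs the small acoustic advantage, so the surviving candidate is the phrase that is more plausible as text.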
Okay, we’ve gone over beam search decoding. Now we need to look at what the language model can do.
We used n-grams as a language model. From an architecture standpoint, this approach works quite well as long as we’re talking about server-side (not mobile) solutions. Below, you can see an example for n=2.
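For readers who prefer code, a bigram (n=2) model can be sketched as simple counting. The tiny corpus and the function names are made up for illustration, and smoothing for unseen word pairs, which practical n-gram models need, is omitted.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and adjacent word pairs, with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(unigrams, bigrams, prev, word):
    """Estimate P(word | prev) from raw counts, without smoothing."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


corpus = ["see you soon", "see you later", "see me later"]
unigrams, bigrams = train_bigram(corpus)

p_you_after_see = bigram_prob(unigrams, bigrams, "see", "you")    # 2 of 3 cases
p_later_after_you = bigram_prob(unigrams, bigrams, "you", "later")  # 1 of 2 cases
```

The probability of a whole phrase is then the product of these pairwise probabilities, which is what the decoder uses to prefer word sequences that actually occur in the language.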
What’s much more interesting here is how we preprocess the data for training.
When writing back and forth, people often use abbreviations, numbers and other symbols. Our acoustic model only knows letters, so in our training data we need to differentiate situations when “1” means “first” and when it means “one” or “a single”. It’s hard to find a lot of texts with casual speech, where people write “give me back three hundred and eighty-six rubles by December twentieth” instead of “give me back 386 rub by December 20th”. Therefore, we trained an additional model for normalization.
For its architecture, we chose a transformer, a model that is often used for machine translation. Our task is similar to MT in its own way. We need to translate denormalized language into normalized language where only alphabet characters are used.
The language model gives us a sequence of words that are in the language and “go together” with each other. It’s a lot easier to read and understand than the output of the acoustic model. But for long messages, the result is still not that great because there could be some ambiguity.
After we get a readable string of words, we can add punctuation. This is particularly useful when the text is long. It’s a lot easier to read sentences that are separated by periods, and in Russian, other types of punctuation, such as commas, are actively used. Even short sentences need to have them.
Our punctuation model is based on a transformer encoder followed by a linear layer. It performs classification, predicting for each word whether a period, comma, dash, or colon is needed after it, or no punctuation at all.
The approach to generating training data here is similar. We take texts with punctuation marks in them and artificially “spoil” this data by removing the punctuation. Then we train the model to put them back in.
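The “spoiling” step can be sketched as follows: punctuation is stripped from a text, and each word receives a label for the mark that followed it. The label names and the `make_training_example` function are hypothetical, and the tokenization is deliberately naive (whitespace splitting, a single trailing mark per word).

```python
PUNCT_LABELS = {".": "PERIOD", ",": "COMMA", "-": "DASH", ":": "COLON"}


def make_training_example(text):
    """Return (words without punctuation, one label per word)."""
    words, labels = [], []
    for token in text.split():
        if token in PUNCT_LABELS:
            # A standalone mark (such as a dash) labels the preceding word.
            if labels:
                labels[-1] = PUNCT_LABELS[token]
            continue
        label = "NONE"
        if token[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[token[-1]]
            token = token[:-1]
        words.append(token)
        labels.append(label)
    return words, labels


words, labels = make_training_example("Hi, I am late: wait - please.")
```

The model sees only `words` as input and is trained to predict `labels`, so any punctuated text can serve as training data without manual annotation.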
As I mentioned at the beginning of the article, when we started our research, there were no open Russian-language data sets available for training speech recognition systems. For some time, we experimented with English. We then came to understand that in any case, we needed recordings of speech that’s as casual as possible instead of professionally read audiobooks, like in LibriSpeech.
In the end, we decided to collect the training data ourselves. To do so, we got VK Testers involved. We prepared short texts of 3–30 words which they dictated to us in voice messages. We created the texts that were to be recorded ourselves using a separate model, which we trained on comments from public communities. This way we got a distribution from the same domain, one where slang and casual speech are common. We asked the testers to record the voice messages in different conditions and speak as they normally would so that our training data resembled what would be used in real-life situations as closely as possible.
Launching into Production
As everyone knows, going from models described in articles (and even their implementations by ML specialists) to actually using machine learning in production is a long journey. Therefore, our VKontakte infrastructure team started working on a voice message recognition service at the very beginning of 2020, when we still had our first version of the model.
The infrastructure team helped turn our ML solutions into a high-load, reliable service that has high performance and is efficient with server resources.
One of the problems was that we in the research team work with model files and the C++ code that launches them. But the VKontakte infrastructure is primarily written using Go, and our colleagues had to find a way to make C++ work with Go. For this, we used the CGO extension so that the higher-level code could be written in Go and decoding and communication with models remained in C++.
Our next task was to process voice messages within several seconds while staying within our hardware limitations and using server resources efficiently. To make this possible, we made speech recognition run on the same servers as other services. This caused a problem with shared access to the GPU and CUDA kernels from several processes. We solved this using MPS technology from NVIDIA. MPS minimizes the impact of blocking and idle time, making it possible for us to use the GPU to the fullest without the need to rewrite the client.
Another important point to consider is the grouping of data into batches for effective processing on the GPU. Within a batch for the acoustic model, all audio tracks must be the same length. Therefore, we needed to equalize them by adding zeros to shorter tracks. However, these zeros also go through the acoustic model and take up GPU resources. As a result of equalization, short messages took more time and resources to parse.
It’s impossible to get rid of the extra zeros completely because recording lengths vary, but their number can be reduced. To do this, the infrastructure team came up with a way to split long voice messages into 23–25 second fragments, sort all of the tracks, and group ones of similar length into small batches, which are then sent to the GPU. The voice messages are split using the VAD algorithm from WebRTC. It helps recognize pauses so that full words, not fragments of them, are sent to the acoustic model. The 23–25 second length was chosen as a result of experiments. Shorter fragments reduced recognition quality metrics, and longer ones needed to be equalized more often.
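The effect of sorting before batching can be illustrated with a small calculation. The track lengths, the batch size, and both function names below are made up; the sketch only compares how many padding zeros are needed when tracks are batched in arrival order versus sorted by length.

```python
def make_batches(lengths, batch_size, sort_first):
    """Group track lengths into batches, optionally sorting them first."""
    ordered = sorted(lengths) if sort_first else list(lengths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_waste(batches):
    """Fraction of everything processed that is padding rather than audio."""
    total = padded = 0
    for batch in batches:
        longest = max(batch)  # every track is padded up to the longest one
        total += longest * len(batch)
        padded += sum(longest - length for length in batch)
    return padded / total


# Track lengths in seconds, in arrival order (illustrative values).
lengths = [2, 24, 3, 25, 5, 23, 4, 24]

arrival_batches = make_batches(lengths, batch_size=4, sort_first=False)
sorted_batches = make_batches(lengths, batch_size=4, sort_first=True)

waste_arrival = padding_waste(arrival_batches)  # 86 / 196
waste_sorted = padding_waste(sorted_batches)    # 10 / 120
```

In this toy case, mixing short and long tracks wastes over 40% of the work on zeros, while grouping similar lengths brings that below 10%.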
The approach with splitting up long recordings, aside from optimizing performance, made it possible for us to transcribe practically any length of audio into text and opened a field for experiments with ASR for other product tasks, such as automatic subtitles.
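The splitting logic itself might look roughly like the sketch below. It assumes that a VAD has already produced a list of pause timestamps. The WebRTC VAD is not called here, and the `split_at_pauses` function, its greedy strategy, and the fallback to a hard cut are all assumptions for illustration.

```python
def split_at_pauses(duration, pauses, min_len=23.0, max_len=25.0):
    """Return (start, end) fragments in seconds, cutting at pauses where possible."""
    fragments = []
    start = 0.0
    while duration - start > max_len:
        # Pauses that would give a fragment of 23 to 25 seconds.
        window = [p for p in pauses if start + min_len <= p <= start + max_len]
        # Prefer the latest such pause; otherwise fall back to a hard cut.
        cut = max(window) if window else start + max_len
        fragments.append((start, cut))
        start = cut
    # Whatever remains is short enough to be processed as is.
    fragments.append((start, duration))
    return fragments


# Illustrative pause timestamps for a 60-second message.
pauses = [10.0, 23.5, 24.2, 40.0, 48.0, 55.0]
fragments = split_at_pauses(60.0, pauses)
```

Allowing the cut point to move within a two-second window is what gives the splitter room to land on a pause instead of the middle of a word.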
In June 2020, we launched voice message recognition into production for messages up to 30 seconds long (about 90% of all voice messages fit within this duration). We then optimized the service by integrating a smarter way of cutting up tracks. Since November 2020, we have been able to recognize voice messages up to two minutes long (99% of all voice messages). Our infrastructure is ready for future projects and will allow us to process audio files that are several hours long.
Of course, since its launch a year ago, a lot has changed. We’ve updated the acoustic and language model architecture and added noise suppression.
Currently, the entire pipeline looks like this:
- We get an audio track, pre-process it and turn it into a spectrogram.
- If the track is longer than 25 seconds, we cut it up using VAD into fragments of 23–25 seconds. This variation helps us avoid cutting words in half.
- Next, all fragments are run through the acoustic (based on QuartzNet) and language (using n-grams) models.
- Then all of the fragments are pieced back together and are put through the punctuation model with our custom architecture. Before this stage, we also break up texts that are too long into 400-word segments.
- We put all the segments together and give the user a text transcription that they can quickly read whenever and wherever, saving them time.
All of this together forms a unique service. It not only recognizes casual speech, slang, swearing and new words in noisy environments, but also does so quickly and efficiently. Here are the percentiles of full voice message processing times (not counting download and sending to the client):
- 95th percentile: 1.5 seconds
- 99th percentile: 1.9 seconds
- 99.9th percentile: 2.5 seconds
The number of voice messages sent on VKontakte rose by 24% year-over-year. 33 million people listen to and send voice messages every month. For us, this means that our users need voice technology and it’s worth investing in developing new solutions.
There is a wide range of possibilities and prospects for ASR. To evaluate the prospects of implementing it in your product, you don’t need a large staff of researchers. To begin with, you can try to fine-tune open-source solutions to fit your tasks, and then conduct your own research and create new technologies.