Neural Networks vs. Vacuum Cleaners, or How We Denoised VK Calls

These days, how else would you implement noise suppression if not with neural networks? But we aren’t here just to state the obvious and vanish into thin air.

My name is Vitaly Shutov, and I’m a Machine Learning Engineer at VK.com. In this article we’ll talk about the development of noise suppression and speech enhancement technologies. We’ll look at the options for implementing both and at the setup for video calls we ended up with at VK.com.

What we’re going to consider here is a fairly general task: how can we distinguish human speech from ambient noise and cut out all the extraneous sounds? A solution is useful for any kind of further processing, because identification systems, acoustic event detection and other applications that work with human speech perform better on noise-free signals. For us, such technology improves the quality of video calls on VK.com, Odnoklassniki and Mail.ru by preventing background noise from interfering with communication between call participants.

From a product perspective, the problem statement is as follows: during a call, we want to hear the people we’re talking to, not the noise produced by machinery, animals or streets.

In terms of development, we need to process a noisy audio signal in such a way as to filter out ambient noises and to amplify the target speaker’s speech.

In doing so, we impose the following requirements on the technology:

  • Real-time operation: no more than 20 ms of added delay.
  • Lightweight: capable of running on user devices.
  • Quality: comparable with other solutions (Zoom and Krisp) and even outclassing the competition.

Before reviewing the available approaches and delving into neural networks, let’s go through some basic concepts.

A sound is a wave. A microphone reacts to its propagation and produces an analogue, i.e. continuous, signal. To be transmitted and processed, the analogue signal is converted into a digital one. There are different hardware and software solutions for this, but the most common one is pulse-code modulation, or PCM. The sampling frequency (44,100 Hz, for example) determines the quality of the digitized sound, and each amplitude value is usually encoded in 16 bits.

All noise filtering methods deal with a digital signal, which is a set of signal level values recorded at regular intervals.
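To make the digitization step concrete, here is a minimal sketch in Python with NumPy. The 440 Hz test tone, the function name to_pcm16 and the scaling by 32767 are illustrative assumptions, not part of our pipeline.

```python
import numpy as np

SAMPLE_RATE = 44_100  # samples per second, as in the example above


def to_pcm16(signal: np.ndarray) -> np.ndarray:
    """Quantize a float signal in [-1.0, 1.0] to 16-bit signed PCM."""
    clipped = np.clip(signal, -1.0, 1.0)
    return np.round(clipped * 32767).astype(np.int16)


# One second of a 440 Hz tone standing in for the analogue signal.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
analogue = 0.5 * np.sin(2 * np.pi * 440 * t)
pcm = to_pcm16(analogue)  # signal level values recorded at regular intervals
```

The array pcm is exactly the kind of digital signal that the filtering methods below work with.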

One of the oldest known methods of speech enhancement is Spectral Subtraction. The idea is very simple: we subtract the noise spectrum from the noisy spectrum to get the spectrum of clear speech alone, then transform it back from a spectrum to a signal.

A noisy signal can be represented as y[n] = s[n] + d[n], where s[n] is the clear speech and d[n] is the noise.

The Spectral Subtraction algorithm is mathematically described as follows:

ŝ(w) = X(w) − d̂(w)

where ŝ(w) is the clear speech spectrum, d̂(w) is the assumed noise spectrum, and X(w) is the input signal spectrum.
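The subtraction can be sketched in a few lines of Python with NumPy. This is a toy version for a single frame under our own assumptions: the noise magnitude is taken from a separate noise-only frame, negative magnitudes are floored at zero, and the phase of the noisy input is reused. The function name and the test tones are hypothetical.

```python
import numpy as np


def spectral_subtraction(noisy_frame, noise_frame):
    """Denoise one frame by subtracting an estimated noise magnitude spectrum."""
    noisy_spec = np.fft.rfft(noisy_frame)
    noise_mag = np.abs(np.fft.rfft(noise_frame))  # assumed noise estimate

    # Subtract magnitudes and floor at zero: a magnitude cannot be negative.
    clean_mag = np.maximum(np.abs(noisy_spec) - noise_mag, 0.0)

    # Reuse the noisy phase, then go back from the spectrum to a signal.
    clean_spec = clean_mag * np.exp(1j * np.angle(noisy_spec))
    return np.fft.irfft(clean_spec, n=len(noisy_frame))


# Toy check: a 200 Hz "speech" tone plus a 3 kHz "noise" tone at 16 kHz.
n = np.arange(512)
speech = np.sin(2 * np.pi * 200 * n / 16000)
noise = 0.5 * np.sin(2 * np.pi * 3000 * n / 16000)
enhanced = spectral_subtraction(speech + noise, noise)
```

In a real call the noise estimate is not known exactly, which is one reason this method struggles with noise that changes over time.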

Later, a more advanced method, called the Wiener filter, gained popularity. Its idea is to minimise the distance between ŝ and s. This is formulated as follows:

H(w) = s_s(w) / (s_s(w) + s_n(w))

where H(w) is the gain applied to the input spectrum, s_s(w) is the calculated speech power spectrum and s_n(w) is the noise power spectrum.

See a general scheme for speech enhancement via a Wiener Filter below.

However, Wiener estimation, like almost all strict mathematical results, rests on assumptions that are not always met in practice. As a result, it handles loud noise poorly and offers far fewer ways of adapting the algorithm to external conditions than the machine learning algorithms that have replaced it.
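For reference, the textbook form of the Wiener gain can be written as a short Python function. It is a sketch that assumes the speech and noise power spectra are already known, which is precisely the assumption that is hard to satisfy in practice. The names and the toy numbers are ours.

```python
import numpy as np


def wiener_gain(speech_psd, noise_psd, eps=1e-12):
    """Per-frequency Wiener gain from speech and noise power spectra."""
    # eps only guards against division by zero in empty bins.
    return speech_psd / (speech_psd + noise_psd + eps)


speech_psd = np.array([4.0, 1.0, 0.0])
noise_psd = np.array([1.0, 1.0, 1.0])
gain = wiener_gain(speech_psd, noise_psd)  # close to 0.8, 0.5 and 0.0
```

Bins dominated by speech pass almost unchanged, while bins dominated by noise are attenuated towards zero.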

Neural Networks for Noise Filtering

These days, complex mathematical transformations have been replaced by the magic of neural networks.

RNNoise was one of the first well-known and successful networks to tackle speech enhancement. It combines classical algorithms with a recurrent neural network (RNN). The basic idea is to use a neural network to unify the three main components of the system: VAD (Voice Activity Detection), estimation of the noise spectrum, and subtraction of the noise spectrum from the original signal.

The network takes the spectrum of a noisy signal as input and returns two tensors. The first predicts whether a given frame contains speech. The second holds the values by which the frame should be multiplied in order to get clear speech.

Of course, the entire signal is not processed in one pass, just a small window: in the original RNNoise, the window is 20 ms. In each subsequent iteration the window is shifted by 10 ms, so consecutive windows overlap and part of each window is analysed again.
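The windowing itself is easy to sketch in Python with NumPy. The helper name and the 48 kHz sample rate in the example are assumptions made for illustration, and the signal is assumed to be at least one window long.

```python
import numpy as np


def split_into_frames(signal, sample_rate, window_ms=20, hop_ms=10):
    """Cut a 1-D signal into overlapping windows (20 ms long, shifted by 10 ms)."""
    window = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    count = 1 + (len(signal) - window) // hop
    return np.stack([signal[i * hop : i * hop + window] for i in range(count)])


# One second at 48 kHz (an assumed sample rate) gives 99 frames of 960 samples.
frames = split_into_frames(np.arange(48_000), 48_000)
```

The second half of every frame reappears as the first half of the next one, which is the overlap described above.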

The architecture of the RNNoise neural network itself looks something like this:

A more advanced approach was suggested by specialists from Microsoft. In NSNet, the magnitude is extracted from the frame after the STFT (Short-Time Fourier Transform) and LPS (Log-Power Spectra) transformations.

Features are computed over a 32 ms window with an overlap of 24 ms. In other words, the network returns the denoised speech in 8 ms frames.
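As a rough illustration of such features, the sketch below computes log-power spectra with NumPy for a 32 ms window and an 8 ms step. The 16 kHz sample rate, the Hann taper and the function name are our assumptions, not details taken from the NSNet paper.

```python
import numpy as np


def log_power_spectra(signal, sample_rate=16_000, window_ms=32, hop_ms=8):
    """STFT magnitudes turned into log-power spectra, one row per frame."""
    window = sample_rate * window_ms // 1000   # 512 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000         # 128 samples at 16 kHz
    taper = np.hanning(window)
    count = 1 + (len(signal) - window) // hop
    frames = np.stack([signal[i * hop : i * hop + window] for i in range(count)])
    magnitude = np.abs(np.fft.rfft(frames * taper, axis=1))
    return np.log(magnitude ** 2 + 1e-12)  # small constant avoids log(0)


# One second of random noise as a stand-in for a recording.
features = log_power_spectra(np.random.default_rng(0).standard_normal(16_000))
```

Each row of features describes one 32 ms window, and consecutive rows are 8 ms apart.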

A clear signal is augmented with noise (see data augmentation), and the resulting spectrogram is fed to the network during training. For training, NSNet uses an RMS distance between the spectrogram of the original clear signal and the spectrogram produced by the neural network.

The next step on the way to speech enhancement was DCCRN. It has a U-net-like architecture in which all operations are complex-valued. The basic concept of the method is to use not only the signal magnitude but the phase as well. This improves the quality of denoising compared to the previous NSNet solution.

But DCCRN’s high quality comes at a cost. Firstly, the original network cannot operate in real time. The authors solved this problem by inserting a recurrent layer between the encoder and the decoder. Secondly, the model is difficult to implement and is not adapted to windows of arbitrary sizes. The third disadvantage of DCCRN is the model size and its operating speed. According to the authors, it takes about 3 ms to process 32 ms of audio on an Intel i5-8250U. That is, for this network with a window step of 8 ms, the Real Time Factor (RT), a metric that shows how many times faster a signal is processed than it is received, will be less than 3, since each 3 ms of computation yields only 8 ms of new audio. On weaker hardware, real-time processing likely cannot be achieved with such parameters.
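The arithmetic behind this estimate fits in a few lines of Python. The function name is ours, and the figures are the ones quoted above.

```python
def real_time_factor(audio_ms: float, processing_ms: float) -> float:
    """How many times faster audio is processed than it arrives."""
    return audio_ms / processing_ms


# DCCRN figures quoted above: each 32 ms window costs about 3 ms, and with
# an 8 ms step only 8 ms of new audio arrives per processed window.
dccrn_rtf = real_time_factor(audio_ms=8, processing_ms=3)  # about 2.67
```

A factor below 1 would mean the model cannot keep up with the incoming audio at all.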

PoCoNet is Amazon’s state-of-the-art Speech Enhancement solution that improves upon all of the previous approaches.

The PoCoNet architecture is similar to DCCRN, as it is also U-net-like. But instead of complex blocks, the authors introduce the Dense Attention Block. The neural network includes self-attention layers to capture the global context.

The main difference between PoCoNet and previous works lies in the introduction of frequency positional embeddings. This mechanism is similar to the Positional Encoding of the Transformer architecture: the input spectrogram is concatenated with positional embeddings, which are computed as follows:

p(t, f) = (cos(π f / F), cos(2π f / F), …, cos(2^(k−1) π f / F))

where k = 10 and F is the frequency bandwidth. The embedding depends only on the frequency f, not on the time t.
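A minimal NumPy sketch of such an embedding is shown below. It assumes the cosine form cos(2^i · π · f / F) for i = 0 … k−1, as we read it in the PoCoNet paper, takes F to be the number of frequency bins, and uses a (channels, frequency, time) layout. The function names and the layout are our assumptions.

```python
import numpy as np


def frequency_positional_embedding(num_bins: int, k: int = 10) -> np.ndarray:
    """Return a (k, num_bins) array: row i is cos(2**i * pi * f / F)."""
    f = np.arange(num_bins)
    scales = 2.0 ** np.arange(k)
    return np.cos(np.outer(scales, np.pi * f / num_bins))


def add_embedding(spectrogram: np.ndarray, k: int = 10) -> np.ndarray:
    """Concatenate the embedding to a (channels, freq, time) spectrogram."""
    _, num_bins, num_frames = spectrogram.shape
    emb = frequency_positional_embedding(num_bins, k)
    # The same values are repeated for every frame: no dependence on time.
    emb = np.repeat(emb[:, :, None], num_frames, axis=2)
    return np.concatenate([spectrogram, emb], axis=0)


with_pos = add_embedding(np.zeros((2, 257, 50)))  # shape (12, 257, 50)
```

The ten extra channels tell every layer which frequency band a given value belongs to.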

However, PoCoNet has significant drawbacks:

  • Slow training speed: it takes four days to train such a model on eight Tesla V100s.
  • Slow operation: 1 second of audio takes 0.65 seconds to process on a Tesla V100.

Having analysed the current works in the field of speech enhancement, we settled upon DTLN. If set up properly, it produces quality that is comparable to DCCRN and PoCoNet, while being small and having a good real-time rate.

The idea behind DTLN (Dual-Signal Transformation LSTM Network) involves two stages. First, as in previous works, we apply a Short-Time Fourier Transform (STFT), pass the magnitude to the neural network and get a vector by which we multiply the magnitude. Then we apply an Inverse Fast Fourier Transform (IFFT) and feed the resulting signal to the second part of the network, whose output is the fully denoised signal.
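The data flow of the two stages can be sketched for a single frame as follows. This is only an outline in Python with NumPy: magnitude_mask and second_stage are hypothetical callables standing in for the trained parts of the network, and reusing the phase of the noisy input before the IFFT is our reading of the original DTLN.

```python
import numpy as np


def dtln_like_frame(frame, magnitude_mask, second_stage):
    """Push one frame through the two stages described above.

    magnitude_mask and second_stage are hypothetical stand-ins for the
    trained network parts; here they are plain Python callables.
    """
    # Stage 1: spectrum of the frame, mask predicted from the magnitude.
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    masked = magnitude * magnitude_mask(magnitude)

    # Back to the time domain, keeping the phase of the noisy input.
    phase = np.exp(1j * np.angle(spectrum))
    intermediate = np.fft.irfft(masked * phase, n=len(frame))

    # Stage 2: the second part of the network works on the time-domain frame.
    return second_stage(intermediate)


# With an all-ones mask and an identity second stage the frame passes unchanged.
frame = np.random.default_rng(0).standard_normal(512)
out = dtln_like_frame(frame, lambda m: np.ones_like(m), lambda x: x)
```

In the real model both callables are learned, and the processed frames are stitched back together with overlap.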

The DTLN advantages that matter to us are that the model takes up less than 4 MB and needs 0.65 ms to process 32 ms of audio on an Intel i5 6600k. This is about 5 times faster than DCCRN and incomparably faster than PoCoNet. We’ll consider quality metrics in more detail below, but in general DTLN lags behind the best models by about 10%. Since it meets the rest of our requirements, that’s fine with us.

Unfortunately, the original model is incapable of processing reverberated signals (those reflected from the walls, which causes a flutter echo effect). After processing, a subjectively unpleasant rustle appears. That is why we decided to refine the model and fix the problems ourselves.

Currently, we are using a DTLN-like architecture, which is slightly heavier than the original one, but faster. Thanks to implementing it in C++, the operation speed is ~0.2 ms per 10 ms of speech. The network operates with a window of 30 ms and an overlap of 20 ms. While choosing the window size, we shifted away from the classic combination of 32 ms and 24 ms overlap to allow for more seamless integration with WebRTC and Opus.

As we all know, the output of a neural network is highly dependent on the data used to train it. Luckily, there are some great, large open-access datasets with annotated noises and clear human speech: MUSAN, AudioSet, WHARM!, LibriVox, RIR.

For ideal reference samples, we’re going to use recordings of human speech made under ideal conditions. With the help of noise, we will prepare training samples for the neural network. In order to do this, we combine speech and noise in certain proportions, as shown in the diagram below.

We give the resulting noisy signal to the neural network for enhancement and compare the result with the reference original speech recording, gradually training the network.
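Mixing speech and noise in a given proportion can be sketched like this in Python with NumPy. The function name, the random stand-in signals and the 5 dB target are illustrative assumptions. The sketch also assumes the noise is at least as long as the speech and is not silent.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / scaled_noise_power == 10^(snr/10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)         # stand-in for a clean recording
noise = rng.standard_normal(16_000)         # stand-in for a noise recording
noisy = mix_at_snr(clean, noise, snr_db=5)  # network input; clean is the target
```

Drawing the SNR at random for every sample is a simple way to expose the network to both mild and heavy noise.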

Note that we don’t separate noises by type — we just use their total volume (right now we have about 900 hours of clear speech and 10,000 hours of noises for training). Since a neural network has to train on a large volume of varied data, it can eventually filter out both stationary and non-stationary noise. Right now, filtering out non-stationary noise remains a difficult task.

Training our modification of a DTLN-like network on such data takes about two days on a Tesla T4.

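Most articles use MSE as the loss function:

MSE = (1/N) · Σ (s[n] − ŝ[n])²

where s is the reference, ŝ is the network output and N is the number of values compared.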

We chose SNR (signal-to-noise ratio) as the training loss because, in our experiments, it produced higher-quality results.
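One common way to turn SNR into a loss is to negate it, so that minimising the loss maximises the ratio. The NumPy sketch below shows the idea. It is not our production code, and in a training framework the same formula would be written with that framework’s differentiable operations.

```python
import numpy as np


def negative_snr_loss(clean, estimate, eps=1e-8):
    """Negative SNR in dB: lower is better, so it can be minimised."""
    signal_power = np.sum(clean ** 2)
    error_power = np.sum((clean - estimate) ** 2)
    # eps keeps the ratio and the logarithm finite for a perfect estimate.
    return -10.0 * np.log10(signal_power / (error_power + eps) + eps)


clean = np.sin(np.linspace(0, 20, 1000))
good = negative_snr_loss(clean, clean + 0.01)  # small error, low loss
bad = negative_snr_loss(clean, clean + 0.5)    # large error, higher loss
```

Unlike MSE, this loss is scaled by the power of the reference, so quiet and loud recordings contribute comparably.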

The measure of “high quality” and how it can be assessed is not a simple issue here. A simple comparison by deviation from the benchmark will not indicate whether the user experience has actually improved.

Therefore, quality metrics in speech enhancement tasks are quite complex and try to estimate how intelligible the speech in the denoised sample turned out to be. PESQ (Perceptual Evaluation of Speech Quality) is the metric usually used for model comparison. It was specially designed to model subjective evaluation of the perception of human speech in telecommunications. To some approximation, a PESQ score should correspond to an average expert grade on a scale from 1 (poor) to 5 (excellent). But since absolute values depend heavily on the initial test conditions, we use the metric only to track the progress of the solution intended for production.

In the table below, besides the PESQ metric itself and the neural network properties (reverberation handling as well as window and increment size) there are two characteristics that can have a significant impact on the architecture choice: the Real Time Factor and the memory.

In practice, the Real Time Factor shows the operation speed: the larger it is, the faster we process the incoming signal. The RT Factor of 45.6 for our DTLN modification means that 1 second of audio is processed in less than 25 ms. The networks as given, by contrast, need more than 80 ms to process the same second. ROM directly affects the final size of the application and is very important when, as in our case, noise suppression is only one of the functions.

However, a metric is only a metric: it cannot take absolutely everything into account. For example, a signal processed by the original DTLN gets an excellent metric value despite containing an unpleasant whistle. That is why we also check the results aurally: for example, we test the settings at work meetings and see whether we can still notice the imaginary (noiseless) pets.

While working on improving the DTLN model, we tried out different ideas that sometimes can help while addressing issues in training neural networks. Some of them, listed below, turned out to be useless.

  • We experimented with loss functions, trying MSE and a loss function based on Quality-Net, but that did not improve model quality. The SNR function we chose turned out to be the most successful.
  • We tried to increase the number of layers in a neural network, but it did not help improve the quality either.
  • Generating data for training without adding new noises or clear speech does not lead to improvement. If one generates an extra thousand hours of noisy signals from the same raw noises and speech recordings, it will take longer to train, but will not enhance the quality of speech improvement.

Implications and Tips

Currently, our speech enhancement model runs on the backend server for calls, on iOS and Android, and in the native application for Windows, macOS, and Linux. On iOS and macOS, the model runs on Core ML, and we use TensorFlow Lite for other platforms.

We invite readers to judge the performance of the resulting noise suppressor for themselves. Client-side noise suppression is already working in the VK app for iOS and Android, and is available on desktop in the native client for calls.

The basic recipe that you can replicate for yourselves needs the following ingredients: a DTLN network, open-source data for learning, signal-to-noise ratio as a loss function, and the ability to check the quality of noise suppression aurally. The result will make voices much better heard in recordings or audio calls.

Here are some tips that helped us reach better results than expected.

  • If a neural network is trained to work with reverberated signals, the sound of improved speech will be more pleasant, even if it might not show on the metrics.
  • Instead of the traditional 32 ms window, a 30 ms window with a 20 ms overlap works fine. There is no difference in network performance, but integration with WebRTC and the audio codec becomes more convenient.
  • The custom speech enhancement module can be combined with WebRTC’s built-in noise suppression to better reduce microscopic noises.
  • In general, it’s best to use WebRTC and its capabilities to the fullest. For example, WebRTC Voice Activity Detection with weak settings before noise suppression can reduce the load several times. While no one is talking, the window comes with zeros and one can just scroll through the buffer. With this configuration on the server version, we got a 3% CPU load instead of 11%.
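The gating idea from the last tip can be outlined in Python as follows. Here is_speech and denoise are hypothetical callables standing in for a voice activity detector and for the neural network, and the energy threshold in the example is only a toy replacement for WebRTC VAD.

```python
import numpy as np


def process_stream(frames, is_speech, denoise):
    """Run the expensive denoiser only on frames flagged as speech.

    is_speech and denoise are hypothetical callables standing in for a
    voice activity detector and for the neural network.
    """
    output, denoiser_calls = [], 0
    for frame in frames:
        if is_speech(frame):
            output.append(denoise(frame))
            denoiser_calls += 1
        else:
            # Nobody is talking: emit silence and skip the network.
            output.append(np.zeros_like(frame))
    return output, denoiser_calls


# Toy stand-ins: an energy threshold as the detector, identity as the model.
frames = [np.zeros(480), np.ones(480), np.zeros(480), 0.5 * np.ones(480)]
out, calls = process_stream(frames, lambda f: np.mean(f ** 2) > 0.01, lambda f: f)
```

In this toy run the network is called for two frames out of four, and the savings grow with the share of silence in a call.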

All of this is included in the new pipeline for VK calls, which is designed for thousands of simultaneous participants. In the short run, we’re most interested in ensuring stable model performance when in production as well as feedback (you can help us there in the comments — professional experience is incredibly valuable). Later, perhaps, we will go ahead with the experiments and try, for example, to make a personalized noise suppressor using the speaker’s vector for training and better distinguishing a specific person’s voice from the noise.
