For the past few weeks, I’ve been working on the TensorFlow Speech Recognition Challenge on Kaggle. The task is to recognize a one-second audio clip, where the clip contains one of a small number of words, like “yes”, “no”, “stop”, “go”, “left”, and “right”.
In general, speech recognition is a difficult problem, but it’s much easier when the vocabulary is limited to a handful of words. We don’t need a full pipeline that detects phonemes with acoustic models and then strings the phonemes into words with a language model, like Kaldi does for general speech recognition. Instead, a convolutional neural network works quite well.
First Steps
The dataset consists of about 64,000 audio files, which have already been split into training / validation / testing sets. You are then asked to make predictions on about 150,000 audio files for which the labels are unknown.
Actually, this dataset had already appeared in the academic literature, and people have published code to solve the same problem. I started with GCommandPytorch by Yossi Adi, which implements a speech recognition CNN in PyTorch.
Its first step is to convert the audio file into a spectrogram, which is an image representation of sound. This is easily done using LibRosa.
Above: Sample spectrograms of “yes” and “no”
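For a concrete sense of what this step looks like, here’s a minimal sketch of the conversion using LibRosa. The sampling rate, mel parameters, and padding below are my own assumptions for illustration, not necessarily the exact settings GCommandPytorch uses:

```python
import librosa
import numpy as np

def audio_to_spectrogram(path, sr=16000, n_mels=64):
    """Load a one-second clip and turn it into a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    # Pad or trim to exactly one second so every input has the same shape
    y = librosa.util.fix_length(y, size=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Convert power to decibels, which is what the spectrogram images show
    return librosa.power_to_db(mel, ref=np.max)
```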
Now we’ve converted the problem to an image classification problem, which is well studied. To an untrained human observer, all the spectrograms may look the same, but neural networks can learn things that humans can’t. Convolutional neural networks work very well for classifying images, for example VGG16:
Above: A Convolutional Neural Network (LeNet). VGG16 is similar, but has even more layers.
For more details about this approach, refer to these papers:
- Convolutional Neural Networks for Small-footprint Keyword Spotting
- Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting
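To make the image-classification framing concrete, here’s a small VGG-style network in PyTorch. It’s only a sketch: the layer sizes and default number of output classes are my assumptions, not the exact architecture from GCommandPytorch.

```python
import torch.nn as nn

class SmallVGG(nn.Module):
    """A small VGG-style CNN that classifies spectrograms as 1-channel images."""

    def __init__(self, n_classes=12):  # number of output labels (assumption)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # collapse the remaining time/frequency grid
            nn.Flatten(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):  # x has shape (batch, 1, n_mels, time_frames)
        return self.classifier(self.features(x))
```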
Voice Activity Detection
You might ask: if somebody already implemented this, then what’s left to do other than run their code? Well, the test data contains “silence” samples, which have background noise but no human speech. It also contains words outside the set we care about, which we need to label as “unknown”. The PyTorch CNN reaches about 95% validation accuracy by itself, but the accuracy is much lower once we add these two additional requirements.
For silence detection, I first tried the simplest thing I could think of: taking the maximum absolute value of the waveform and deciding it’s “silence” if that value is below a threshold. When combined with VGG16, this gets a score of 0.78 on the leaderboard. It’s a crude heuristic, because sufficiently loud noise still gets treated as speech.
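A sketch of that heuristic (the threshold value here is just illustrative, not the one I actually used):

```python
import numpy as np

def is_silence(waveform, threshold=0.02):
    """Label a clip as 'silence' if its peak absolute amplitude
    never exceeds a fixed threshold (illustrative value)."""
    return float(np.max(np.abs(waveform))) < threshold
```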
Next, I tried running openSMILE, which I use in my research to extract various acoustic features from audio. It implements an LSTM for voice activity detection: every 0.05 seconds, it outputs a probability that someone is talking. Combining the openSMILE output with the VGG16 prediction gave a score of 0.81.
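The combination can be sketched roughly like this; the aggregation rule and threshold are illustrative assumptions rather than the exact logic I used:

```python
def combine_with_vad(vad_probs, cnn_label, vad_threshold=0.5):
    """Override the CNN prediction with 'silence' when the voice activity
    detector never sees speech anywhere in the clip.

    vad_probs: speech probabilities from openSMILE, one per 0.05 s frame.
    """
    if max(vad_probs) < vad_threshold:
        return "silence"
    return cnn_label
```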
More Improvements
I tried a bunch of things to improve my score:
- Fiddled around with the neural network hyperparameters, which boosted my score to 0.85. Each epoch took about 10 minutes on a GPU, and the whole model took about 2 hours to train. Somehow, Adam didn’t produce good results; SGD with momentum worked better.
- Took 100% of the data for training and used the public leaderboard for validation (don’t do this in real life lol). This improved my score to 0.86.
- Trained an ensemble of 3 copies of the same neural network, with the same hyperparameters but different randomly initialized weights, and took a majority vote for the final prediction (see the sketch after this list). This improved the score to 0.87. I would’ve liked to train more copies, but other people in my research group needed to use the GPUs.
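Here’s a rough sketch of the training and ensembling setup described above; the learning rate, momentum, epoch count, and tie-breaking behaviour are illustrative assumptions:

```python
from collections import Counter

import torch
import torch.optim as optim

def train_copy(model, train_loader, loss_fn, epochs=30):
    """Train one copy of the network with SGD + momentum."""
    opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def ensemble_predict(models, x):
    """Majority vote over copies trained from different random initializations.

    x: a batch containing a single spectrogram.
    """
    with torch.no_grad():
        votes = [m(x).argmax(dim=1).item() for m in models]
    return Counter(votes).most_common(1)[0][0]
```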
In the end, the top-scoring model had a score of 0.91, which beat my model by 4 percentage points. Although not enough to win a Kaggle medal, my model was in the top 15% of all submissions. Not bad!
My source code for the contest is available here.