http://blog.wolfram.com/2018/05/24/learning-to-listen-neural-networks-application-for-recognizing-speech/

Learning to Listen: Neural Networks Application for Recognizing Speech

May 24, 2018 — Carlo Giacometti, Kernel Developer, Algorithms R&D

Introduction

Recognizing words is one of the simplest tasks a human can do, yet it has proven extremely difficult for machines to achieve similar levels of performance. Things have changed dramatically with the ubiquity of machine learning and neural networks, though: modern techniques achieve far better results than what was possible just a few years ago. In this post, I’m excited to show a reduced but practical and educational version of the speech recognition problem—the assumption is that we’ll consider only a limited set of words. This has two main advantages: first of all, we have easy access to a dataset through the Wolfram Data Repository (the Spoken Digit Commands dataset), and, maybe most importantly, all of the classifiers/networks I’ll present can be trained in a reasonable time on a laptop.

It’s been about two years since the initial introduction of the Audio object into the Wolfram Language, and we are thrilled to see so many interesting applications of it. One of the main additions to Version 11.3 of the Wolfram Language was tight integration of Audio objects into our machine learning and neural net framework, and this will be a cornerstone in all of the examples I’ll be showing today.

Without further ado, let’s squeeze out as much information as possible from the Spoken Digit Commands dataset!

Spoken Digit Commands dataset

The Data

Let’s get started by accessing and inspecting the dataset a bit:

ro=ResourceObject["Spoken Digit Commands"]

The dataset is a subset of the Speech Commands dataset released by Google. We wanted to have a “spoken MNIST” that would let us produce small, self-contained examples of machine learning on audio signals. Since the Spoken Digit Commands dataset is a ResourceObject, it’s easy to get all the training and testing data within the Wolfram Language:

trainingData=ResourceData[ro,"TrainingData"];
testingData=ResourceData[ro,"TestData"];
RandomSample[trainingData,3]//Dataset

One important thing we made sure of is that the speakers in the training and testing sets are different. This means that in the testing phase, the trained classifier/network will encounter speakers that it has never heard before.

Intersection[trainingData[[All,"SpeakerID"]],testingData[[All,"SpeakerID"]]]

The possible output values are the digits from 0 to 9:

classes=Union[trainingData[[All,"Output"]]]

Conveniently, the length of each input signal is between 0.5 and 1 second, with the majority of the signals being one second long:

Dataset[trainingData][Histogram[#, ScalingFunctions -> "Log"] & @* Map[Duration], "Input"]

Encoders

In Version 11.3, we built a collection of audio encoders in NetEncoder and properly integrated it into the rest of the machine learning and neural net framework. Now we can seamlessly extract features from a large collection of audio recordings; inject them into a net; and train, test and evaluate networks for a variety of applications.

Since there are multiple features that one might want to extract from an audio signal, we decided that it was a good idea to have one encoder per feature rather than a single generic "Audio" one. Here is the full list:

• "Audio"
• "AudioSTFT"
• "AudioSpectrogram"
• "AudioMelSpectrogram"
• "AudioMFCC"
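
All of these encoders can be attached directly to the input port of a network, so feature extraction becomes part of the model itself. As a quick illustration, here is a minimal sketch of a recurrent classifier with an "AudioMFCC" encoder attached (this is not the architecture used for the actual experiments later in the post, and the layer sizes are arbitrary):

(* toy recurrent classifier: the encoder turns each Audio input into a sequence of MFCC vectors *)
sketchNet = NetChain[
  {LongShortTermMemoryLayer[32], SequenceLastLayer[], LinearLayer[10], SoftmaxLayer[]},
  "Input" -> NetEncoder["AudioMFCC"],
  "Output" -> NetDecoder[{"Class", classes}]]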

The first step, which is common to all of the encoders, is the preprocessing: the signal is reduced to a single channel, resampled to a fixed sample rate and, optionally, padded or trimmed to a specified duration.
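
These preprocessing steps can also be reproduced by hand with the audio functions of the language. Here is a rough sketch (the mono downmix, the 16kHz sample rate and the one-second trim are arbitrary illustrative choices, not the encoder defaults):

(* downmix to mono, resample to 16 kHz and keep at most the first second *)
a = RandomChoice[trainingData]["Input"];
AudioTrim[AudioResample[AudioChannelMix[a, "Mono"], 16000], 1]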

The simplest one is NetEncoder["Audio"], which just returns the raw waveform:

encoder=NetEncoder["Audio"]

encoder[RandomChoice[trainingData]["Input"]]//Flatten//ListLinePlot

The starting point for all of the other audio encoders is the short-time Fourier transform, where the signal is partitioned into (potentially overlapping) chunks, and the Fourier transform is computed on each of them. This way we get both time information (since each chunk occurs at a specific time) and frequency information (thanks to the Fourier transform). We can visualize this process by using the Spectrogram function:

a=AudioGenerator[{"Sin",TimeSeries[{{0,1000},{1,4000}}]},2];
Spectrogram[a]

The main parameters of this operation, common to all of the frequency domain features, are WindowSize and Offset, which control the size of each chunk and the spacing between consecutive chunks.
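
To get a feel for their effect, we can compare the output dimensions produced by two encoders that differ only in their window settings (the specific values below are arbitrary):

(* larger windows and offsets produce fewer time frames *)
sample = RandomChoice[trainingData]["Input"];
shortWindows = NetEncoder[{"AudioMelSpectrogram", "WindowSize" -> 512, "Offset" -> 256}];
longWindows = NetEncoder[{"AudioMelSpectrogram", "WindowSize" -> 2048, "Offset" -> 1024}];
Dimensions /@ {shortWindows[sample], longWindows[sample]}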

Each NetEncoder supports the "TargetLength" option. If this is set to a specific number, the input audio will be trimmed or padded to the correct duration; otherwise, the length of the output of the NetEncoder will depend on the length of the original signal.
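
For example, we can compare a fixed-length encoder with a variable-length one on the same recording (assuming here that a numeric "TargetLength" counts output frames for the frequency domain encoders):

(* a numeric "TargetLength" pads or trims the output to a fixed number of frames; All keeps the natural length *)
fixedLength = NetEncoder[{"AudioMelSpectrogram", "TargetLength" -> 100}];
variableLength = NetEncoder[{"AudioMelSpectrogram", "TargetLength" -> All}];
Dimensions /@ {fixedLength[sample], variableLength[sample]}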

For the scope of this blog post, I’ll be using the "AudioMFCC" NetEncoder, since it is a feature that packs a lot of information about the signal while keeping the dimensionality low:

encoder = NetEncoder[{"AudioMFCC", "TargetLength" -> All, "SampleRate" -> 16000, "WindowSize" -> 1024, "Offset" -> 570, "NumberOfCoefficients" -> 28, "Normalization" -> True}]
encoder[RandomChoice[trainingData]["Input"]] // Transpose // MatrixPlot
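
As a quick check of the claim that the MFCC feature keeps the dimensionality low, we can compare the size of the raw waveform with the size of the MFCC representation for the same recording:

(* the MFCC matrix is much smaller than the raw sample values *)
sample = RandomChoice[trainingData]["Input"];
{Dimensions[NetEncoder["Audio"][sample]], Dimensions[encoder[sample]]}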
