Classify Sound Using Deep Learning

This example shows how to classify a sound using deep learning.

Create a Data Set

Generate 1000 white noise signals, 1000 brown noise signals, and 1000 pink noise signals. Each signal represents a duration of 0.5 seconds, assuming a 44.1 kHz sample rate.

fs = 44.1e3;
duration = 0.5;
N = duration*fs;

wNoise = 2*rand([N,1000]) - 1;
wLabels = repelem(categorical("white"),1000,1);

bNoise = filter(1,[1,-0.999],wNoise);
bNoise = bNoise./max(abs(bNoise),[],'all');
bLabels = repelem(categorical("brown"),1000,1);

pNoise = pinknoise([N,1000]);
pLabels = repelem(categorical("pink"),1000,1);

classNames = ["white", "brown", "pink"];
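As a quick check on the generation step, the following sketch rebuilds a smaller version of the data set and inspects its layout. The count of 10 signals per class and the variable names numCheck, wCheck, bCheck, pCheck, and peakPerSignal are assumptions made for illustration and are not part of the original example. Because the brown noise is normalized by the maximum over the whole matrix, only one signal should reach a peak of exactly 1. The others peak somewhat lower.

```matlab
fs = 44.1e3;
duration = 0.5;
N = duration*fs;                 % 22050 samples per signal
numCheck = 10;                   % assumption: small count, for speed only

wCheck = 2*rand([N,numCheck]) - 1;           % uniform white noise in (-1,1)
bCheck = filter(1,[1,-0.999],wCheck);        % leaky integrator applied to each column
bCheck = bCheck./max(abs(bCheck),[],'all');  % one global scale factor
pCheck = pinknoise([N,numCheck]);

% Each matrix holds one signal per column.
disp(size(wCheck))
disp(size(pCheck))

% Peak amplitude of each brown noise signal after global normalization.
peakPerSignal = max(abs(bCheck),[],1)
```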

Explore the Data Set

Listen to a white noise signal and visualize it using the melSpectrogram function.

sound(wNoise(:,1),fs)
melSpectrogram(wNoise(:,1),fs)
title('White Noise')

[Figure: mel spectrogram titled White Noise; x-axis Time (ms), y-axis Frequency (kHz).]

Inspect a brown noise signal.

sound(bNoise(:,1),fs)
melSpectrogram(bNoise(:,1),fs)
title('Brown Noise')

[Figure: mel spectrogram titled Brown Noise; x-axis Time (ms), y-axis Frequency (kHz).]

Inspect a pink noise signal.

sound(pNoise(:,1),fs)
melSpectrogram(pNoise(:,1),fs)
title('Pink Noise')

[Figure: mel spectrogram titled Pink Noise; x-axis Time (ms), y-axis Frequency (kHz).]
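To compare the three noise types on one set of axes, you can average each mel spectrogram over time. The following is a minimal sketch, not part of the original example. It generates one fresh signal per class, and the names w, b, p, and toDB are chosen here for illustration. Brown noise should roll off fastest with frequency and pink noise more gently, while white noise stays comparatively flat.

```matlab
fs = 44.1e3;
N = 0.5*fs;

% One fresh signal per class, generated as in the data set.
w = 2*rand(N,1) - 1;
b = filter(1,[1,-0.999],w);
b = b./max(abs(b));
p = pinknoise(N);

% melSpectrogram returns one row per mel band and one column per frame.
[Sw,F] = melSpectrogram(w,fs);
Sb = melSpectrogram(b,fs);
Sp = melSpectrogram(p,fs);

% Average over time and convert to decibels. eps guards against log of zero.
toDB = @(S) 10*log10(mean(S,2) + eps);

semilogx(F,toDB(Sw),F,toDB(Sb),F,toDB(Sp))
legend("white","brown","pink")
xlabel("Mel band center frequency (Hz)")
ylabel("Mean band power (dB)")
```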

Separate the Data Set into Train and Validation Sets

Create a training set that consists of 800 of the white noise signals, 800 of the brown noise signals, and 800 of the pink noise signals.

audioTrain = [wNoise(:,1:800),bNoise(:,1:800),pNoise(:,1:800)];
labelsTrain = [wLabels(1:800);bLabels(1:800);pLabels(1:800)];

Create a validation set using the remaining 200 white noise signals, 200 brown noise signals, and 200 pink noise signals.

audioValidation = [wNoise(:,801:end),bNoise(:,801:end),pNoise(:,801:end)];
labelsValidation = [wLabels(801:end);bLabels(801:end);pLabels(801:end)];
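Before training, it can help to confirm that both sets are balanced and that the category order of the labels matches classNames. The following sketch rebuilds only the labels so that it runs on its own. The names trainCounts, validationCounts, and categoryOrder are introduced here for illustration. The order matters later, when score columns are mapped back to class names by position. This assumes the training function encodes categorical targets in category order.

```matlab
wLabels = repelem(categorical("white"),1000,1);
bLabels = repelem(categorical("brown"),1000,1);
pLabels = repelem(categorical("pink"),1000,1);
classNames = ["white", "brown", "pink"];

labelsTrain = [wLabels(1:800);bLabels(1:800);pLabels(1:800)];
labelsValidation = [wLabels(801:end);bLabels(801:end);pLabels(801:end)];

% Number of observations per class in each set.
trainCounts = countcats(labelsTrain)
validationCounts = countcats(labelsValidation)

% Category order after concatenation, as a row of strings.
categoryOrder = string(categories(labelsTrain))'
orderMatches = isequal(categoryOrder,classNames)
```

Each training count should be 800 and each validation count 200.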

Extract Features

Audio data is highly dimensional and typically contains redundant information. You can reduce the dimensionality by first extracting features and then training your model using the extracted features. Create an audioFeatureExtractor object to extract the centroid and slope of the mel spectrum over time.

aFE = audioFeatureExtractor(SampleRate=fs, ...
    SpectralDescriptorInput="melSpectrum", ...
    spectralCentroid=true, ...
    spectralSlope=true);

Call extract to extract the features from the audio training data.

featuresTrain = extract(aFE,audioTrain);
[numHopsPerSequence,numFeatures,numSignals] = size(featuresTrain)

numHopsPerSequence = 42

numFeatures = 2

numSignals = 2400
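The number of hops per sequence follows from the analysis window and overlap of the extractor. The following sketch reads both from the object rather than hard-coding them, and it assumes that the extractor discards any trailing partial frame. The names winLength, hopLength, expectedHops, and featureColumns are introduced here for illustration. The info function reports which output column holds each feature.

```matlab
fs = 44.1e3;
N = 0.5*fs;

aFE = audioFeatureExtractor(SampleRate=fs, ...
    SpectralDescriptorInput="melSpectrum", ...
    spectralCentroid=true, ...
    spectralSlope=true);

% Frame parameters taken from the extractor itself.
winLength = numel(aFE.Window);
hopLength = winLength - aFE.OverlapLength;

% Number of complete frames that fit in N samples.
expectedHops = floor((N - winLength)/hopLength) + 1

% Map from feature name to output column index.
featureColumns = info(aFE)
```

If the assumption holds, expectedHops agrees with the first dimension of the extracted feature array.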

Extract the validation features.

featuresValidation = extract(aFE,audioValidation);
featuresValidation = squeeze(num2cell(featuresValidation,[1,2]));
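The squeeze and num2cell combination converts a hops-by-features-by-signals array into a column cell array with one matrix per signal. The following toy sketch shows the reshaping on random data. The array X, its 5-signal size, and the cell array C are assumptions chosen for illustration.

```matlab
% Toy array: 42 hops, 2 features, 5 signals.
X = rand(42,2,5);

% Group dimensions 1 and 2 into cells, then drop the singleton dimensions.
C = squeeze(num2cell(X,[1,2]));

cellArraySize = size(C)        % one cell per signal, in a column
sequenceSize = size(C{1})      % each cell keeps the hops-by-features layout
sameData = isequal(C{3},X(:,:,3))
```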

Define and Train the Network

Define the network architecture. See List of Deep Learning Layers (Deep Learning Toolbox) for more information.

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(50,OutputMode="last")
    fullyConnectedLayer(numel(unique(labelsTrain)))
    softmaxLayer];
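To get a feel for the size of this network, you can count its learnable parameters by hand. The following sketch assumes the standard LSTM parameterization, in which each of the four gates has input weights, recurrent weights, and a bias. The variable names are introduced here for illustration.

```matlab
numFeatures = 2;        % spectral centroid and spectral slope
numHiddenUnits = 50;    % hidden size passed to lstmLayer
numClasses = 3;         % white, brown, pink

% LSTM layer: four gates, each with input weights, recurrent weights, and a bias.
lstmInputWeights = 4*numHiddenUnits*numFeatures;
lstmRecurrentWeights = 4*numHiddenUnits*numHiddenUnits;
lstmBias = 4*numHiddenUnits;
lstmTotal = lstmInputWeights + lstmRecurrentWeights + lstmBias

% Fully connected layer: one weight per hidden unit per class, plus one bias per class.
fcTotal = numClasses*numHiddenUnits + numClasses

totalLearnables = lstmTotal + fcTotal
```

Under this assumption, the LSTM layer accounts for 10600 of the 10753 learnable parameters.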

To define the training options, use trainingOptions (Deep Learning Toolbox).

options = trainingOptions("adam", ...
    Shuffle="every-epoch", ...
    ValidationData={featuresValidation,labelsValidation}, ...
    Plots="training-progress", ...
    Metrics="accuracy", ...
    Verbose=false);

To train the network, use trainnet.

net = trainnet(featuresTrain,labelsTrain,layers,"crossentropy",options);
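After training, you can compute the validation accuracy directly and inspect the per-class errors. The following sketch is not part of the original example. It assumes that net, featuresValidation, labelsValidation, and classNames from the preceding steps are in the workspace, and that minibatchpredict is available in your release. The names scoresValidation, predictedValidation, and accuracy are introduced here for illustration.

```matlab
% Assumes net, featuresValidation, labelsValidation, and classNames
% already exist in the workspace from the preceding steps.
scoresValidation = minibatchpredict(net,featuresValidation);
predictedValidation = scores2label(scoresValidation,classNames);

% Fraction of validation signals that were classified correctly.
accuracy = mean(predictedValidation(:) == labelsValidation(:))

% Rows are true classes and columns are predicted classes.
confusionchart(labelsValidation(:),predictedValidation(:))
```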

Test the Network

Use the trained network to classify new white noise, brown noise, and pink noise signals.

wNoiseTest = 2*rand([N,1]) - 1;
scores = predict(net,extract(aFE,wNoiseTest));
scores2label(scores,classNames)

ans = categorical
     white

bNoiseTest = filter(1,[1,-0.999],wNoiseTest);
bNoiseTest = bNoiseTest./max(abs(bNoiseTest),[],'all');
scores = predict(net,extract(aFE,bNoiseTest));
scores2label(scores,classNames)

ans = categorical
     brown

pNoiseTest = pinknoise(N);
scores = predict(net,extract(aFE,pNoiseTest));
scores2label(scores,classNames)

ans = categorical
     pink
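The three tests repeat the same extract, predict, and label sequence. One way to package that sequence is an anonymous function. The following sketch assumes that net, aFE, classNames, and N from the preceding steps are in the workspace. The names classifyNoise and label are hypothetical and are introduced here for illustration.

```matlab
% Assumes net, aFE, classNames, and N already exist in the workspace.
% classifyNoise is a hypothetical helper, not a toolbox function.
classifyNoise = @(x) scores2label(predict(net,extract(aFE,x)),classNames);

% Classify a fresh pink noise signal.
label = classifyNoise(pinknoise(N))
```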