Main Content

Keyword Spotting in Noise Code Generation with Intel MKL-DNN

This example demonstrates code generation for keyword spotting using a Bidirectional Long Short-Term Memory (BiLSTM) network and mel frequency cepstral coefficient (MFCC) feature extraction. MATLAB® Coder™ with Deep Learning Support enables the generation of a standalone executable (.exe) file. Communication between the MATLAB® (.mlx) file and the generated executable file occurs over asynchronous User Datagram Protocol (UDP). The incoming speech signal is displayed using a timescope. A mask is shown as a blue rectangle surrounding spotted instances of the keyword, YES. For more details on MFCC feature extraction and deep learning network training, visit Keyword Spotting in Noise Using MFCC and LSTM Networks.

Example Requirements

  • MATLAB® Coder Interface for Deep Learning Support Package

  • Intel® Xeon® processor with support for Intel Advanced Vector Extensions 2 (Intel AVX2)

  • Intel Math Kernel Library for Deep Neural Networks (MKL-DNN)

  • Environment variables for Intel MKL-DNN

For supported versions of libraries and for information about setting up environment variables, see Prerequisites for Deep Learning with MATLAB Coder (MATLAB Coder).

Pretrained Network Keyword Spotting Using MATLAB and Streaming Audio from Microphone

The sample rate of the pretrained network is 16 kHz. Set the window length to 512 samples, with an overlap length of 384 samples, and a hop length defined as the difference between the window and overlap lengths. Define the rate at which the mask is estimated. A mask is generated once for every numHopsPerUpdate audio frames.

fs = 16e3;
windowLength = 512;
overlapLength = 384;
hopLength = windowLength - overlapLength;
numHopsPerUpdate = 16;
maskLength = hopLength*numHopsPerUpdate;

Create an audioFeatureExtractor object to perform MFCC feature extraction.

afe = audioFeatureExtractor('SampleRate',fs, ...
                            'Window',hann(windowLength,'periodic'), ...
                            'OverlapLength',overlapLength, ...
                            'mfcc',true, ...
                            'mfccDelta',true, ...
                            'mfccDeltaDelta',true);   

Download and load the pretrained network, as well as the mean (M) and the standard deviation (S) vectors used for Feature Standardization.

url = 'http://ssd.mathworks.com/supportfiles/audio/KeywordSpotting.zip';
downloadNetFolder = './';
netFolder = fullfile(downloadNetFolder,'KeywordSpotting');
if ~exist(netFolder,'dir')
    disp('Downloading pretrained network and audio files (4 files - 7 MB) ...')
    unzip(url,downloadNetFolder)
end
load(fullfile(netFolder,'KWSNet.mat'),"KWSNet","M","S");

Call generateMATLABFunction on the audioFeatureExtractor object to create the feature extraction function. You will use this function in the processing loop.

generateMATLABFunction(afe,'generateKeywordFeatures','IsStreaming',true);

Define an Audio Device Reader that can read audio from your microphone. Set the frame length equal to the hop length. This enables you to compute a new set of features for every new audio frame from the microphone.

frameLength = hopLength;
adr = audioDeviceReader('SampleRate',fs, ...
                        'SamplesPerFrame',frameLength);

Create a Time Scope to visualize the speech signals and estimated mask.

scope = timescope('SampleRate',fs, ...
                  'TimeSpanSource','property', ...
                  'TimeSpan',5, ...
                  'TimeSpanOverrunAction','Scroll', ...
                  'BufferLength',fs*5*2, ...
                  'ShowLegend',true, ...
                  'ChannelNames',{'Speech','Keyword Mask'}, ...
                  'YLimits',[-1.2 1.2], ...
                  'Title','Keyword Spotting');

Initialize a buffer for the audio data, a buffer for the computed features, and a buffer to plot the input audio and the output speech mask.

dataBuff = dsp.AsyncBuffer(windowLength);
featureBuff = dsp.AsyncBuffer(numHopsPerUpdate);
plotBuff = dsp.AsyncBuffer(numHopsPerUpdate*windowLength);

Perform keyword spotting on speech received from your microphone. To run the loop indefinitely, set timeLimit to Inf. To stop the simulation, close the scope.

timeLimit = 20;
show(scope);
tic
while toc < timeLimit && isVisible(scope)
    
    data = adr();
    write(dataBuff,data);
    write(plotBuff,data);  
        
    frame = read(dataBuff,windowLength,overlapLength);
    features = generateKeywordFeatures(frame,fs);
    write(featureBuff,features.');

    if featureBuff.NumUnreadSamples == numHopsPerUpdate
        
        featureMatrix = read(featureBuff);
        featureMatrix(~isfinite(featureMatrix)) = 0;       
        featureMatrix = (featureMatrix - M)./S;
                
        [keywordNet, v] = classifyAndUpdateState(KWSNet,featureMatrix.');
                
        v = double(v) - 1;
        v = repmat(v,hopLength,1);
        v = v(:);
        v = mode(v);
        predictedMask = repmat(v,numHopsPerUpdate*hopLength,1);
        
        data = read(plotBuff);        
        scope([data,predictedMask]);
        
        drawnow limitrate;
    end
end

release(adr)
hide(scope)

The helperKeywordSpotting supporting function encapsulates capturing the audio, feature extraction and network prediction process demonstrated previously. To make feature extraction compatible with code generation, feature extraction is handled by the generated generateKeywordFeatures function. To make the network compatible with code generation, the supporting function uses the coder.loadDeepLearningNetwork (MATLAB Coder) (MATLAB Coder) function to load the network.

The supporting function uses a dsp.UDPSender System object to send the input data along with the output mask predicted by the network to MATLAB. The MATLAB script uses the dsp.UDPReceiver System object to receive the input data along with the output mask predicted by the network running in the supporting function.

Generate Executable on Desktop

Create a code generation configuration object to generate an executable. Specify the target language as C++.

cfg = coder.config('exe');
cfg.TargetLang = 'C++';

Create a configuration object for deep learning code generation with the MKL-DNN library. Attach the deep learning configuration object to the code generation configuration object.

dlcfg = coder.DeepLearningConfig('mkldnn');
cfg.DeepLearningConfig = dlcfg;

Generate the C++ main file required to produce the standalone executable.

cfg.GenerateExampleMain = 'GenerateCodeAndCompile';

Generate helperKeywordSpotting, a supporting function that encapsulates the audio capture, feature extraction, and network prediction processes. You get a warning in the code generation logs that you can disregard because helperKeywordSpotting has an infinite loop that continuously looks for an audio frame from MATLAB.

codegen helperKeywordSpotting -config cfg -report
Warning: Function 'helperKeywordSpotting' does not terminate due to an infinite loop.

Warning in ==> helperKeywordSpotting Line: 73 Column: 1
Code generation successful (with warnings): View report

Prepare Dependencies and Run the Generated Executable

In this section, you generate all the required dependency files and put them into a single folder. During the build process, MATLAB Coder generates buildInfo.mat, a file that contains the compilation and run-time dependency information for the standalone executable.

Set the project name to helperKeywordSpotting.

projName = 'helperKeywordSpotting';
packageName = [projName,'Package'];
if ispc
    exeName = [projName,'.exe'];
else
    exeName = projName;
end

Load buildinfo.mat and use packNGo (MATLAB Coder) to produce a .zip package.

load(['codegen',filesep,'exe',filesep,projName,filesep,'buildInfo.mat']);
packNGo(buildInfo,'fileName',[packageName,'.zip'],'minimalHeaders',false);

Unzip the package and place the executable file in the unzipped directory.

unzip([packageName,'.zip'],packageName);
copyfile(exeName, packageName,'f');

To invoke a standalone executable that depends on the MKL-DNN Dynamic Link Library, append the path to the MKL-DNN library location to the environment variable PATH.

setenv('PATH',[getenv('INTEL_MKLDNN'),filesep,'lib',pathsep,getenv('PATH')]);

Run the generated executable.

if ispc
    system(['start cmd /k "title ',packageName,' && cd ',packageName,' && ',exeName]);
else
    cd(packageName);
    system(['./',exeName,' &']);
    cd ..;
end

Perform Keyword Spotting Using Deployed Code

Create a dsp.UDPReceiver System object to receive speech data and the predicted speech mask from the standalone executable. Each UDP packet received from the executable consists of maskLength mask samples and speech samples. The maximum message length for the dsp.UDPReceiver object is 65507 bytes. Calculate the buffer size to accommodate the maximum number of UDP packets.

sizeOfFloatInBytes = 4;
speechDataLength = maskLength; 
numElementsPerUDPPacket = maskLength + speechDataLength;

maxUDPMessageLength = floor(65507/sizeOfFloatInBytes);
samplesPerPacket = 1 + numElementsPerUDPPacket; 
numPackets = floor(maxUDPMessageLength/samplesPerPacket);
bufferSize = numPackets*samplesPerPacket*sizeOfFloatInBytes;

UDPReceive = dsp.UDPReceiver('LocalIPPort',20000, ...
    'MessageDataType','single', ...
    'MaximumMessageLength',samplesPerPacket, ...
    'ReceiveBufferSize',bufferSize);

To run the keyword spotting indefinitely, set timelimit to Inf. To stop the simulation, close the scope.

tic;
timelimit = 20;
show(scope);

while toc < timelimit && isVisible(scope)
    data = UDPReceive();
    if ~isempty(data)
        plotMask = data(1:maskLength);
        plotAudio = data(maskLength+1 : maskLength+speechDataLength);
        scope([plotAudio,plotMask]);
    end
    drawnow limitrate;
end

hide(scope);

Release the system objects and terminate the standalone executable.

release(UDPReceive);
release(scope);
if ispc
    system(['taskkill /F /FI "WindowTitle eq ',projName,'* " /T']);
else
    system(['killall ',exeName]);
end
SUCCESS: The process with PID 4644 (child process of PID 21188) has been terminated. 
SUCCESS: The process with PID 20052 (child process of PID 21188) has been terminated. 
SUCCESS: The process with PID 21188 (child process of PID 22940) has been terminated. 

Evaluate Execution Time Using Alternative MEX Function Workflow

A similar workflow involves using a MEX file instead of the standalone executable. Perform MEX profiling to measure the computation time for the workflow.

Create a code generation configuration object to generate the MEX function. Specify the target language as C++.

cfg = coder.config('mex');
cfg.TargetLang = 'C++';

Create a configuration object for deep learning code generation with the MKL-DNN library. Attach the deep learning configuration object to the code generation configuration object.

dlcfg = coder.DeepLearningConfig('mkldnn');
cfg.DeepLearningConfig = dlcfg;

Call codegen to generate the MEX function for profileKeywordSpotting.

inputAudioFrame = ones(hopLength,1,'single');
codegen profileKeywordSpotting -config cfg -args {inputAudioFrame} -report
Code generation successful: View report

Measure the execution time of the MATLAB code.

x = pinknoise(hopLength,1,'single');
numPredictCalls = 100;
totalNumCalls = numPredictCalls*numHopsPerUpdate;
exeTimeStart = tic;
for call = 1:totalNumCalls
    [outputMask,inputData,plotFlag] = profileKeywordSpotting(x);
end
exeTime = toc(exeTimeStart);
fprintf('MATLAB execution time per %d ms of audio = %0.4f ms\n',int32(1000*numHopsPerUpdate*hopLength/fs),(exeTime/numPredictCalls)*1000);
MATLAB execution time per 128 ms of audio = 24.9238 ms

Measure the execution time of the MEX function.

exeTimeMexStart = tic; 
for call = 1:totalNumCalls
    [outputMask,inputData,plotFlag] = profileKeywordSpotting_mex(x);
end
exeTimeMex = toc(exeTimeMexStart);
fprintf('MEX execution time per %d ms of audio = %0.4f ms\n',int32(1000*numHopsPerUpdate*hopLength/fs),(exeTimeMex/numPredictCalls)*1000);
MEX execution time per 128 ms of audio = 5.2710 ms

Compare total execution time of the standalone executable approach with the MEX function approach. This performance test is done on a machine using an NVIDIA Quadro® P620 (Version 26) GPU and an Intel Xeon W-2133 CPU running at 3.60 GHz.

PerformanceGain = exeTime/exeTimeMex
PerformanceGain = 4.7285