Detect boundaries of speech in audio signal using AI
roi = detectspeechnn(audioIn,fs) returns regions of interest (ROIs), in samples, that contain speech in the input audio signal.
roi = detectspeechnn(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example,
detectspeechnn(audioIn,fs,MergeThreshold=0.5) merges speech regions
that are separated by 0.5 seconds or less.
detectspeechnn(___) with no output arguments plots the
input signal and the detected speech regions.
This function requires both Audio Toolbox™ and Deep Learning Toolbox™.
Detect Speech in Audio Signal
Read in an audio signal containing speech and music and listen to the sound.
[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
sound(audioIn,fs)
Call detectspeechnn on the signal to obtain the regions of interest (ROIs), in samples, containing speech.
roi = detectspeechnn(audioIn,fs)

roi = 2×2

           1       63120
       83600      150000
Convert the ROIs from samples to seconds.
roiSeconds = (roi-1)/fs

roiSeconds = 2×2

         0    3.9449
    5.2249    9.3749
Plot the audio waveform with the speech regions.
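One way to do this is sketched below. The mask-based overlay is an assumption for illustration; the original example may plot the regions differently.

% Overlay the detected speech regions on the waveform (illustrative sketch).
t = (0:numel(audioIn)-1)/fs;
mask = sigroi2binmask(roi,numel(audioIn));    % logical mask of speech samples

plot(t,audioIn)
hold on
plot(t,mask*max(abs(audioIn)),LineWidth=1.5)  % overlay detected speech regions
hold off
xlabel("Time (s)")
legend("Audio","Detected Speech")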
Refine Speech Regions with Energy-Based VAD
Read in an audio signal containing a speaker repeating the phrase "volume up".
[audioIn,fs] = audioread("MaleVolumeUp-16-mono-6secs.ogg");
Compare detected speech regions by calling
detectspeechnn with and without the application of an energy-based voice activity detector (VAD) in postprocessing.
tiledlayout(2,1)

nexttile()
detectspeechnn(audioIn,fs)

nexttile()
detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)
Adjust Postprocessing Parameters for Detecting Speech
Read in an audio signal.
[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
Call detectspeechnn with no output arguments to display a plot of the detected speech regions.
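For example, calling the function with only the signal and sample rate produces the plot:

detectspeechnn(audioIn,fs)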
Modify the parameters used in the postprocessing algorithm and see how they affect the detected speech regions. For more information about the VAD postprocessing algorithm, see Postprocessing.
mergeThreshold = 1.3; % seconds
lengthThreshold = 0.25; % seconds
activationThreshold = 0.5; % probability
deactivationThreshold = 0.25; % probability
applyEnergyVAD = false;

detectspeechnn(audioIn,fs,MergeThreshold=mergeThreshold, ...
    LengthThreshold=lengthThreshold, ...
    ActivationThreshold=activationThreshold, ...
    DeactivationThreshold=deactivationThreshold, ...
    ApplyEnergyVAD=applyEnergyVAD)
Detect Speech in Streaming Audio
Use detectspeechnn to detect the presence of speech in a streaming audio signal.
Create a dsp.AudioFileReader object to stream an audio file for processing. Set the
SamplesPerFrame property to read 100 ms nonoverlapping chunks from the signal.
afr = dsp.AudioFileReader("MaleVolumeUp-16-mono-6secs.ogg");
analysisDuration = 0.1; % seconds
afr.SamplesPerFrame = floor(analysisDuration*afr.SampleRate);
The neural network architecture of
detectspeechnn does not retain state between calls, and it performs best when analyzing larger chunks of audio. When you use
detectspeechnn in a streaming scenario, your application requirements for accuracy, computational efficiency, and latency dictate the analysis duration and whether to overlap analysis chunks.
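One possible buffering strategy is sketched below: accumulate the 100 ms frames into longer analysis chunks with a dsp.AsyncBuffer before calling detectspeechnn. This sketch is an assumption for illustration and is not part of the original example; the 500 ms analysis length is arbitrary.

% Accumulate streamed frames into longer analysis chunks (illustrative sketch).
buff = dsp.AsyncBuffer(afr.SampleRate);      % hold up to 1 s of audio
analysisLength = round(0.5*afr.SampleRate);  % analyze 500 ms at a time

while ~isDone(afr)
    write(buff,afr());                       % push the next 100 ms frame
    if buff.NumUnreadSamples >= analysisLength
        chunk = read(buff,analysisLength);   % pop one 500 ms analysis chunk
        roi = detectspeechnn(chunk,afr.SampleRate);
        % ... roi indices are relative to the current chunk
    end
end
reset(afr)                                   % rewind the reader for the loop below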
Create a timescope object to plot the audio signal and the detected speech regions. Create an
audioDeviceWriter object to play the audio as you stream it.
scope = timescope(NumInputPorts=2, ...
    SampleRate=afr.SampleRate, ...
    TimeSpanSource="property",TimeSpan=5, ...
    YLimits=[-1.2,1.2], ...
    ShowLegend=true,ChannelNames=["Audio","Detected Speech"]);
adw = audioDeviceWriter(afr.SampleRate);
In a streaming loop:
Read in a 100 ms chunk from the audio file.
Call detectspeechnn to detect any regions of speech in the frame. Use
sigroi2binmask to convert the region indices to a binary mask.
Plot the audio signal and the detected speech.
Play the audio with the device writer.
while ~isDone(afr)
    audioIn = afr();
    segments = detectspeechnn(audioIn,afr.SampleRate,LengthThreshold=0.01);
    mask = sigroi2binmask(segments,afr.SamplesPerFrame);
    scope(audioIn,mask)
    adw(audioIn);
end
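When the loop finishes, you can optionally release the System objects to free resources. This cleanup step is an addition, not shown in the original example.

release(afr)
release(scope)
release(adw)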
audioIn — Audio input
Audio input signal, specified as a column vector (single channel).
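If your recording has multiple channels, one common approach (an assumption, not a requirement stated here) is to mix it down to a single channel before calling the function:

% Mix a multichannel signal down to mono by averaging the channels.
if size(audioIn,2) > 1
    audioIn = mean(audioIn,2);
end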
fs — Sample rate (Hz)
Sample rate in Hz, specified as a positive scalar.
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where
Name is the argument name and
Value is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose
Name in quotes.
MergeThreshold — Merge threshold
0.25 (default) | nonnegative scalar
Merge threshold in seconds, specified as a nonnegative scalar. The function merges
speech regions that are separated by a duration less than or equal to the specified
threshold. Set the threshold to
Inf to not merge any detected regions.
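For example, to keep every detected region separate (assuming audioIn and fs are defined as in the earlier examples):

% Disable merging of detected speech regions.
roi = detectspeechnn(audioIn,fs,MergeThreshold=Inf);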
LengthThreshold — Length threshold
0.25 (default) | nonnegative scalar
Length threshold in seconds, specified as a nonnegative scalar. The function does not return speech regions that have a duration less than or equal to the specified threshold.
ActivationThreshold — Probability threshold to start a speech segment
0.5 (default) | scalar in the range [0, 1]
Probability threshold to start a speech segment, specified as a scalar in the range [0, 1].
DeactivationThreshold — Probability threshold to end a speech segment
0.25 (default) | scalar in the range [0, 1]
Probability threshold to end a speech segment, specified as a scalar in the range [0, 1].
ApplyEnergyVAD — Apply energy-based voice activity detector
false (default) | true
Apply energy-based voice activity detector (VAD) to the speech regions detected by
the neural network, specified as true or false.
roi — Speech regions
Speech regions, returned as an N-by-2 matrix of indices into the input signal, where N is the number of individual speech regions detected. The first column contains the index of the start of a speech region, and the second column contains the index of the end of a region.
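For example, one way to use the returned indices (a sketch, assuming audioIn, fs, and roi from the first example) is to extract and play the first detected speech region:

% Extract the first detected speech region and listen to it.
firstRegion = audioIn(roi(1,1):roi(1,2));
sound(firstRegion,fs)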
Preprocessing
The detectspeechnn function preprocesses the audio data using the following steps, approximated in the sketch after this list.
Resample the audio to 16 kHz.
Compute a centered short-time Fourier transform (STFT) using a 25 ms periodic Hamming window and 10 ms hop length. Pad the signal so that the first window is centered at 0 s.
Convert the STFT to a power spectrogram.
Apply a mel filter bank with 40 bands to obtain a mel spectrogram.
Convert the mel spectrogram to a log scale.
Standardize each of the mel bands to have zero mean and standard deviation of 1.
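The following sketch approximates these steps with standard Audio Toolbox and Signal Processing Toolbox functions, assuming audioIn and fs as in the earlier examples. It illustrates the pipeline and is not the exact internal implementation; the log base, the omitted centering padding, and the filter bank design details are assumptions.

% Approximate the preprocessing pipeline (illustrative sketch only).
fsTarget = 16e3;
x = resample(audioIn,fsTarget,fs);                 % resample to 16 kHz

windowLength = round(0.025*fsTarget);              % 25 ms window
hopLength = round(0.010*fsTarget);                 % 10 ms hop
win = hamming(windowLength,"periodic");

% Centered padding of the signal is omitted here for brevity.
S = stft(x,fsTarget,Window=win, ...
    OverlapLength=windowLength-hopLength, ...
    FrequencyRange="onesided");
powerSpec = abs(S).^2;                             % power spectrogram

fb = designAuditoryFilterBank(fsTarget, ...
    FFTLength=windowLength,NumBands=40);           % 40-band mel filter bank
melSpec = fb*powerSpec;                            % mel spectrogram
logMelSpec = log(melSpec + eps);                   % log scale (base is an assumption)

% Standardize each mel band to zero mean and unit standard deviation over time.
features = (logMelSpec - mean(logMelSpec,2)) ./ std(logMelSpec,0,2);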
Neural Network Inference
The preprocessed data is passed to a pretrained VAD neural network. The network outputs represent the probability of speech in each frame of audio in the input spectrogram.
The neural network is a ported version of the
pretrained model provided by SpeechBrain, which combines
convolutional, recurrent, and fully connected layers.
Postprocessing
The detectspeechnn function postprocesses the VAD network output using the following steps, illustrated in the sketch after this list.
Apply activation and deactivation thresholds to posterior probabilities to determine candidate speech regions.
Optionally, apply energy-based VAD to refine the detected speech regions.
Merge speech regions that are close to each other according to the merge threshold.
Remove speech regions that are shorter than or equal to the length threshold.
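The following sketch illustrates the thresholding, merging, and length-filtering steps on a hypothetical vector of per-frame speech probabilities. It shows the logic described above, not the function's internal code; the example probabilities and the frame-based thresholds are assumptions.

% Hypothetical per-frame speech probabilities from a VAD network (assumption).
p = [0.1 0.2 0.7 0.9 0.6 0.3 0.1 0.8 0.9 0.2];
activation = 0.5;      % ActivationThreshold
deactivation = 0.25;   % DeactivationThreshold

% Apply activation and deactivation thresholds to find candidate speech frames.
isSpeech = false(size(p));
inSpeech = false;
for k = 1:numel(p)
    if ~inSpeech && p(k) >= activation
        inSpeech = true;           % start a candidate region
    elseif inSpeech && p(k) < deactivation
        inSpeech = false;          % end the candidate region
    end
    isSpeech(k) = inSpeech;
end

% Postprocess the candidate regions. In the real function the thresholds are
% specified in seconds and converted to frames; here they are frame counts.
regions = binmask2sigroi(isSpeech(:));   % candidate regions, in frame indices
regions = mergesigroi(regions,2);        % merge regions separated by 2 frames or fewer
regions = removesigroi(regions,1)        % drop regions of 1 frame or fewer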
Ravanelli, Mirco, et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv, 8 June 2021. arXiv.org, http://arxiv.org/abs/2106.04624.
Introduced in R2023a