Machine Learning and Deep Learning for Audio

Dataset management, labeling, and augmentation; segmentation and feature extraction for audio, speech, and acoustic applications

Audio Toolbox™ provides functionality to develop audio, speech, and acoustic applications using machine learning and deep learning. Use audioDatastore to manage and load large data sets. Use Audio Labeler to interactively define and visualize ground truth. Use audioDataAugmenter to enlarge data sets using audio-specific augmentation techniques. Use audioFeatureExtractor to create efficient and modular feature extraction pipelines.
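
As a minimal sketch of how these pieces compose (the folder name and the specific augmentation and feature settings below are illustrative placeholders, not defaults):

% Manage a labeled collection of audio files ('audioDataset' is a placeholder folder).
ads = audioDatastore('audioDataset', ...
    'IncludeSubfolders',true,'LabelSource','foldernames');
[audioIn,info] = read(ads);        % read one signal from the datastore
fs = info.SampleRate;

% Enlarge the data set with audio-specific augmentations.
augmenter = audioDataAugmenter('NumAugmentations',2, ...
    'TimeStretchProbability',0.5,'PitchShiftProbability',0.5);
augmented = augment(augmenter,audioIn,fs);   % table of augmented signals

% Extract a modular set of per-frame features.
afe = audioFeatureExtractor('SampleRate',fs, ...
    'mfcc',true,'spectralCentroid',true,'pitch',true);
features = extract(afe,augmented.Audio{1});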

Apps

Audio Labeler - Define and visualize ground-truth labels

Live Editor Tasks

Extract Audio Features - Streamline audio feature extraction in the Live Editor

Functions

audioDatastore - Datastore for collection of audio files
mfcc - Extract MFCC, log energy, delta, and delta-delta of audio signal
gtcc - Extract gammatone cepstral coefficients, log energy, delta, and delta-delta
cepstralFeatureExtractor - Extract cepstral features from audio segment
audioDataAugmenter - Augment audio data
audioTimeScaler - Apply time scaling to streaming audio
shiftPitch - Shift audio pitch
stretchAudio - Time-stretch audio
erb2hz - Convert from equivalent rectangular bandwidth (ERB) scale to hertz
bark2hz - Convert from Bark scale to hertz
mel2hz - Convert from mel scale to hertz
hz2erb - Convert from hertz to equivalent rectangular bandwidth (ERB) scale
hz2bark - Convert from hertz to Bark scale
hz2mel - Convert from hertz to mel scale
phon2sone - Convert from phon to sone
sone2phon - Convert from sone to phon
designAuditoryFilterBank - Design auditory filter bank
integratedLoudness - Measure integrated loudness and loudness range
loudnessMeter - Standard-compliant loudness measurements
harmonicRatio - Harmonic ratio
pitch - Estimate fundamental frequency of audio signal
detectSpeech - Detect boundaries of speech in audio signal
voiceActivityDetector - Detect presence of speech in audio signal
audioFeatureExtractor - Streamline audio feature extraction
spectralCentroid - Spectral centroid for audio signals and auditory spectrograms
spectralCrest - Spectral crest for audio signals and auditory spectrograms
spectralDecrease - Spectral decrease for audio signals and auditory spectrograms
spectralEntropy - Spectral entropy for audio signals and auditory spectrograms
spectralFlatness - Spectral flatness for audio signals and auditory spectrograms
spectralFlux - Spectral flux for audio signals and auditory spectrograms
spectralKurtosis - Spectral kurtosis for audio signals and auditory spectrograms
spectralRolloffPoint - Spectral rolloff point for audio signals and auditory spectrograms
spectralSkewness - Spectral skewness for audio signals and auditory spectrograms
spectralSlope - Spectral slope for audio signals and auditory spectrograms
spectralSpread - Spectral spread for audio signals and auditory spectrograms
melSpectrogram - Mel spectrogram
kbdwin - Kaiser-Bessel-derived window
mdct - Modified discrete cosine transform
imdct - Inverse modified discrete cosine transform
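
As a hedged illustration of how several of the functions above compose, the following sketch detects speech in a recording and extracts features from the first detected region; the file name 'speech.wav' is a placeholder for your own recording.

% Read an audio file ('speech.wav' is a placeholder).
[audioIn,fs] = audioread('speech.wav');

% Find speech regions as an N-by-2 matrix of start and end sample indices.
idx = detectSpeech(audioIn,fs);
segment = audioIn(idx(1,1):idx(1,2));   % first detected speech region

coeffs = mfcc(segment,fs);        % per-frame mel frequency cepstral coefficients
f0 = pitch(segment,fs);           % fundamental frequency estimates
S = melSpectrogram(segment,fs);   % mel-scaled spectrogram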

Blocks

Voice Activity Detector - Detect presence of speech in audio signal
Cepstral Feature Extractor - Extract cepstral features from audio segment
Loudness Meter - Standard-compliant loudness measurements

Topics

Label Audio Using Audio Labeler

Interactively define and visualize ground-truth labels for audio datasets.

Speech-to-Text Transcription

Perform speech-to-text transcription in MATLAB® using third-party cloud-based APIs.

Text-to-Speech Conversion

Perform text-to-speech conversion in MATLAB using third-party cloud-based APIs.

Spectral Descriptors

Overview and applications of spectral descriptors.

Featured Examples

Speaker Verification Using i-Vectors

Speaker verification, or authentication, is the task of confirming that a speaker is who they purport to be. Speaker verification has been an active research area for many years. An early performance breakthrough was to use a Gaussian mixture model and universal background model (GMM-UBM) [1] on acoustic features (usually MFCC). For an example, see Speaker Verification Using Gaussian Mixture Models.

One of the main difficulties of GMM-UBM systems is intersession variability. Joint factor analysis (JFA) was proposed to compensate for this variability by separately modeling inter-speaker variability and channel or session variability [2] [3]. However, [4] discovered that the channel factors in JFA also contained information about the speakers and proposed combining the channel and speaker spaces into a total variability space. Intersession variability was then compensated for by backend procedures, such as linear discriminant analysis (LDA) and within-class covariance normalization (WCCN), followed by scoring, such as the cosine similarity score. [5] proposed replacing the cosine similarity scoring with a probabilistic LDA (PLDA) model.

While i-vectors were originally proposed for speaker verification, they have been applied to many problems, such as language recognition, speaker diarization, emotion recognition, age estimation, and anti-spoofing [10]. Recently, deep learning techniques have been proposed to replace i-vectors with d-vectors or x-vectors [8] [6].
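
As a concrete illustration of the cosine similarity scoring step described above, the following sketch scores two i-vectors; the vectors and the decision threshold are placeholders, not values produced by a trained system.

% Placeholder enrollment and test i-vectors (in practice, outputs of the
% trained total variability system, typically a few hundred dimensions).
wEnroll = randn(400,1);
wTest = randn(400,1);

% Cosine similarity score in [-1, 1]: the cosine of the angle between them.
score = (wEnroll'*wTest)/(norm(wEnroll)*norm(wTest));

threshold = 0.5;                 % example threshold, tuned on development data
isVerified = score > threshold;  % accept or reject the identity claim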