separateSpeakers

Separate signal by speakers

Since R2023b

    Description

    y = separateSpeakers(audioIn,fs) separates audio with overlapping speech into signals containing the isolated speech of each speaker.

    y = separateSpeakers(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example, separateSpeakers(audioIn,fs,NumSpeakers=3) separates a speech signal that is known to contain three speakers.

    [y,r] = separateSpeakers(___) also returns the residual signal after performing iterative speaker separation. Use this syntax in combination with any of the input arguments in previous syntaxes. This syntax does not apply if NumSpeakers is 2 or 3.
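    For example, a minimal sketch (assuming a mixed single-channel signal x with sample rate fs):

    % Separate the speakers and keep the leftover portion of the mix.
    [y,r] = separateSpeakers(x,fs);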

    separateSpeakers(___) with no output arguments plots the input signal and the separated speaker signals. This function also plots the residual signal if NumSpeakers is unspecified or set to 1.

    This function requires both Audio Toolbox™ and Deep Learning Toolbox™.

    Examples

    Try calling separateSpeakers at the command line. If the required model files are not installed, then the function throws an error and provides a link to download them. Click the link, and unzip the downloaded file to a location on the MATLAB path.

    Alternatively, execute the following commands to download and unzip the separateSpeakers model files to your temporary directory.

    % Download and unzip the pretrained model files, then add them to the path.
    zipFile = fullfile(tempdir,"separateSpeakers.zip");
    loc = websave(zipFile,"https://ssd.mathworks.com/supportfiles/audio/separateSpeakers.zip");
    modelsLocation = tempdir;
    unzip(loc,modelsLocation)
    addpath(fullfile(modelsLocation,"separateSpeakers"))

    Create an audio signal that combines the speech of two speakers. Listen to the mixed signal.

    [s,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
    s = sum(s(:,1:2),2);
    x = s./max(abs(s));
    sound(x,fs)

    Call separateSpeakers to separate the individual speakers from the signal. By default, separateSpeakers estimates how many speakers to separate from the input. Inspect the output dimensions to see how many speakers the function separates. In this case, separateSpeakers correctly detects two different speakers.

    y = separateSpeakers(x,fs);
    size(y)
    ans = 1×2

           40000           2

    Listen to the first separated speaker.

    sound(y(:,1),fs)

    Listen to the second separated speaker.

    sound(y(:,2),fs)

    Call separateSpeakers with no output arguments to plot the input signal, the separated signals, the residual, and the reconstructed input.

    separateSpeakers(x,fs)

    Figure: the input mix with its reconstruction overlaid, the two separated speaker signals, and the residual, plotted against time (s).

    Create an audio signal that combines the speech of three speakers with different scaling factors. Listen to the mixed signal.

    [s,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
    x = sum(s(:,1:3).*[1,0.5,0.1],2);
    x = x./max(abs(x));
    sound(x,fs)

    Call separateSpeakers with NumSpeakers set to 3 to separate the three known speakers from the signal.

    y = separateSpeakers(x,fs,NumSpeakers=3);

    Listen to the first separated speaker.

    sound(y(:,1),fs)

    Listen to the second separated speaker.

    sound(y(:,2),fs)

    Listen to the third separated speaker.

    sound(y(:,3),fs)

    Call separateSpeakers with no output arguments to plot the input signal, the three separated signals, and the reconstructed input. Because NumSpeakers is 3, the function does not plot a residual.

    separateSpeakers(x,fs,NumSpeakers=3)

    Figure: the input mix with its reconstruction overlaid and the three separated speaker signals, plotted against time (s).

    If you do not specify NumSpeakers, or if you specify NumSpeakers as 1, separateSpeakers also returns the residual signal. The residual is the part of the original signal that is "left over" after separating out the speaker or speakers.

    Create an audio signal that combines the speech of two speakers. Listen to the mixed signal.

    [s,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
    s = sum(s(:,1:2),2);
    x = s./max(abs(s));
    sound(x,fs)

    Call separateSpeakers with NumSpeakers set to 1 to separate a single speaker signal from the input. Specify an additional output argument r to obtain the residual. Listen to the separated speaker.

    [y,r] = separateSpeakers(x,fs,NumSpeakers=1);
    sound(y,fs)

    Listen to the residual signal.

    sound(r,fs)
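
    You can also quantify how much of the mix remains in the residual. A minimal sketch:

    % Residual-to-mix energy ratio in decibels; lower (more negative)
    % values mean less leftover signal.
    residualEnergydB = 10*log10(sum(r.^2)/sum(x.^2))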

    Call detectspeechnn with no output arguments to plot the detected speech in the residual.

    detectspeechnn(r,fs)

    Figure: detected speech regions in the residual signal, plotted as amplitude against time (s).

    Create an audio signal that combines the speech of two speakers. Listen to the mixed signal.

    [s,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
    s = sum(s(:,1:2),2);
    x = s./max(abs(s));
    sound(x,fs)

    Call separateSpeakers with ConserveEnergy set to false to scale each output signal to a maximum absolute value of 1 instead of conserving the input energy. Call the function with no output arguments to plot the signals.

    separateSpeakers(x,fs,ConserveEnergy=false)

    Figure: the input mix, the two separated speaker signals, and the residual, plotted against time (s). With ConserveEnergy set to false, no reconstruction is overlaid on the mix.

    Input Arguments

    audioIn — Audio input, specified as a column vector (single channel).

    Data Types: single | double

    fs — Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: separateSpeakers(audioIn,fs,NumSpeakers=3,ConserveEnergy=false)

    NumSpeakers — Number of speakers to separate, specified as 1, 2, or 3. If you do not specify NumSpeakers, separateSpeakers estimates the number of speakers. For more information, see One-And-Rest Speech Separation.
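
    For example, a minimal sketch (assuming a mixed signal x at sample rate fs) that forces a two-speaker separation:

    % Request exactly two speakers; y has one column per speaker.
    y = separateSpeakers(x,fs,NumSpeakers=2);
    size(y,2)   % returns 2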

    Data Types: single | double

    ConserveEnergy — Option to scale the output signals to conserve the input energy, specified as true or false.

    • If ConserveEnergy is true, separateSpeakers attempts to scale the output signals so that their sum reconstructs the input signal. The energy conservation algorithm includes the residual signal r if NumSpeakers is unspecified or set to 1.

    • If ConserveEnergy is false, separateSpeakers scales each speaker signal and the residual to have a maximum absolute value of 1.
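
    For example, a minimal sketch (assuming a mixed signal x at sample rate fs) that checks how closely the energy-conserving outputs reconstruct the input:

    % Separate with the default ConserveEnergy=true and rebuild the mix.
    [y,r] = separateSpeakers(x,fs);
    reconstruction = sum(y,2) + r;

    % Relative reconstruction error; a small value means the sum of the
    % outputs closely matches the input.
    relativeError = norm(x - reconstruction)/norm(x)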

    Data Types: logical

    Output Arguments

    y — Audio signal separated by speakers, returned as an N-by-C matrix that contains an individual speaker signal in each column.

    • N is the length of the input signal audioIn in samples.

    • C is the number of speakers. You can define the number of speakers by specifying NumSpeakers. Otherwise, the function estimates the number of speakers through One-And-Rest Speech Separation.

    The separated speaker signals have the same sample rate as the input signal.
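
    For example, a minimal sketch that reads C from the output when the function estimates the number of speakers:

    y = separateSpeakers(audioIn,fs);   % let the function estimate the speaker count
    C = size(y,2)                       % number of separated speakers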

    Data Types: single

    r — Residual signal from iterative speaker separation, returned as an N-by-1 vector, where N is the length of the input signal audioIn in samples. For more information about iterative speaker separation, see One-And-Rest Speech Separation.

    The separateSpeakers function does not return a residual if you specify NumSpeakers as 2 or 3.

    Data Types: single

    Algorithms

    The separateSpeakers function uses a pretrained deep learning model to separate the individual speaker signals. The model that it uses depends on the NumSpeakers argument.

    • If you do not specify NumSpeakers, separateSpeakers uses a Conv-TasNet [1] model with "one-and-rest" iterative speaker separation [4].

    • If you set NumSpeakers to 1, separateSpeakers uses a Conv-TasNet model with one iteration of one-and-rest separation.

    • If you set NumSpeakers to 2, separateSpeakers uses a SepFormer [3] model trained to output two speaker signals. This neural network uses pretrained weights from the sepformer-libri2mix model provided by SpeechBrain [2].

    • If you set NumSpeakers to 3, separateSpeakers uses a SepFormer model trained to output three speaker signals. This neural network uses pretrained weights from the sepformer-libri3mix model provided by SpeechBrain [2].

    One-And-Rest Speech Separation

    The separateSpeakers function can separate speech from a signal with an unknown number of speakers using a model that is trained to perform one-and-rest speech separation. In one-and-rest speech separation, the model takes a mixed speech signal and returns two signals: the speech of one individual speaker and the "rest" of the signal, which is the residual of the original signal after separating out the one speaker.

    The function uses one-and-rest speech separation iteratively. First, it separates the mixed speech signal into one speaker and the residual. Then, it uses voice activity detection (VAD) to determine if the residual contains more speakers. If it does not detect speech in the residual, the function stops and returns the separated speakers. Otherwise, it repeats the process and performs one-and-rest speech separation on the residual signal.
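
    The following is a conceptual sketch of this loop, not the internal implementation. Here separateOneAndRest is a hypothetical stand-in for one pass of the pretrained one-and-rest model, and the function's actual voice activity detection step may differ from the detectspeechnn call shown.

    % Conceptual sketch of iterative one-and-rest speech separation.
    speakers = zeros(numel(mixture),0);
    residual = mixture;
    moreSpeech = true;
    while moreSpeech
        % Split off the speech of one speaker; keep the rest as the residual.
        % separateOneAndRest is hypothetical, standing in for the model.
        [one,residual] = separateOneAndRest(residual);
        speakers = [speakers one];

        % Use voice activity detection to decide whether the residual still
        % contains speech. detectspeechnn returns detected speech regions.
        roi = detectspeechnn(residual,fs);
        moreSpeech = ~isempty(roi);
    end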

    Figure: flow chart showing the first two iterations of one-and-rest speech separation.

    References

    [1] Luo, Yi, and Nima Mesgarani. “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation.” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, Aug. 2019, pp. 1256–66. https://doi.org/10.1109/TASLP.2019.2915167.

    [2] Ravanelli, Mirco, et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv, 8 June 2021. http://arxiv.org/abs/2106.04624.

    [3] Subakan, Cem, et al. “Attention Is All You Need in Speech Separation.” ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 21–25. https://doi.org/10.1109/ICASSP39728.2021.9413901.

    [4] Takahashi, Naoya, et al. “Recursive Speech Separation for Unknown Number of Speakers.” Interspeech 2019, ISCA, 2019, pp. 1348–52. https://doi.org/10.21437/Interspeech.2019-1550.

    Extended Capabilities

    GPU Arrays
    Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
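
    A minimal sketch, assuming Parallel Computing Toolbox and a supported GPU:

    % Move the signal to the GPU, run the separation there, then gather
    % the result back to host memory.
    xg = gpuArray(single(x));
    y = separateSpeakers(xg,fs);
    y = gather(y);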

    Version History

    Introduced in R2023b