separateSpeakers

Separate signal by speakers

Since R2023b

    Description

    y = separateSpeakers(audioIn,fs) separates audio with overlapping speech into signals containing the isolated speech of each speaker.

    y = separateSpeakers(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example, separateSpeakers(audioIn,fs,NumSpeakers=3) separates a speech signal that is known to contain three speakers.

    [y,r] = separateSpeakers(___) also returns the residual signal after performing iterative speaker separation. Use this syntax in combination with any of the input arguments in previous syntaxes. This syntax does not apply if NumSpeakers is 2 or 3.
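    For example, a minimal sketch (assuming a mixed single-channel signal x with sample rate fs):

    % Separate the speakers and keep the leftover portion of the mix.
    [y,r] = separateSpeakers(x,fs);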

    separateSpeakers(___) with no output arguments plots the input signal and the separated speaker signals. This function also plots the residual signal if NumSpeakers is unspecified or set to 1.

    This function requires both Audio Toolbox™ and Deep Learning Toolbox™.

    Examples

    Try calling separateSpeakers at the command line. If the required model files are not installed, then the function throws an error and provides a link to download them. Click the link, and unzip the downloaded file to a location on the MATLAB path.

    Alternatively, execute the following commands to download and unzip the separateSpeakers model files to your temporary directory.

    % Download and unzip the pretrained model files, then add them to the path.
    zipFile = fullfile(tempdir,"separateSpeakers.zip");
    loc = websave(zipFile,"https://ssd.mathworks.com/supportfiles/audio/separateSpeakers.zip");
    modelsLocation = tempdir;
    unzip(loc,modelsLocation)
    addpath(fullfile(modelsLocation,"separateSpeakers"))

    Create an audio signal that combines the speech of two speakers. Listen to the mixed signal.

    [s,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
    s = sum(s(:,1:2),2);
    x = s./max(abs(s));
    sound(x,fs)

    Call separateSpeakers to separate the individual speakers from the signal. By default, separateSpeakers estimates how many speakers to separate from the input. Inspect the output dimensions to see how many speakers the function separates. In this case, separateSpeakers correctly detects two different speakers.

    y = separateSpeakers(x,fs);
    size(y)
    ans = 1×2

           40000           2

    Listen to the first separated speaker.

    sound(y(:,1),fs)

    Listen to the second separated speaker.

    sound(y(:,2),fs)

    Call separateSpeakers with no output arguments to plot the input signal, the separated signals, the residual, and the reconstructed input.

    separateSpeakers(x,fs)

    Figure: the input mix with its reconstruction overlaid, the two separated speaker signals, and the residual, plotted against time (s).

    Create an audio signal that combines the speech of three speakers with different scaling factors. Listen to the mixed signal.

    [s,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
    x = sum(s(:,1:3).*[1,0.5,0.1],2);
    x = x./max(abs(x));
    sound(x,fs)

    Call separateSpeakers with NumSpeakers set to 3 to separate the three known speakers from the signal.

    y = separateSpeakers(x,fs,NumSpeakers=3);

    Listen to the first separated speaker.

    sound(y(:,1),fs)

    Listen to the second separated speaker.

    sound(y(:,2),fs)

    Listen to the third separated speaker.

    sound(y(:,3),fs)

    Call separateSpeakers with no output arguments to plot the input signal, the three separated signals, and the reconstructed input. Because NumSpeakers is 3, the function does not plot a residual.

    separateSpeakers(x,fs,NumSpeakers=3)

    Figure: the input mix with its reconstruction overlaid and the three separated speaker signals, plotted against time (s).

    If you do not specify NumSpeakers, or if you specify NumSpeakers as 1, separateSpeakers also returns the residual signal. The residual is the part of the original signal that is "left over" after separating out the speaker or speakers.

    Create an audio signal that combines the speech of two speakers. Listen to the mixed signal.

    [s,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
    s = sum(s(:,1:2),2);
    x = s./max(abs(s));
    sound(x,fs)

    Call separateSpeakers with NumSpeakers set to 1 to separate a single speaker signal from the input. Specify an additional output argument r to obtain the residual. Listen to the separated speaker.

    [y,r] = separateSpeakers(x,fs,NumSpeakers=1);
    sound(y,fs)

    Listen to the residual signal.

    sound(r,fs)
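
    You can also quantify how much of the mix remains in the residual. A minimal sketch:

    % Residual-to-mix energy ratio in decibels; lower (more negative)
    % values mean less leftover signal.
    residualEnergydB = 10*log10(sum(r.^2)/sum(x.^2))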

    Call detectspeechnn with no output arguments to plot the detected speech in the residual.

    detectspeechnn(r,fs)

    Figure: detected speech regions in the residual signal, plotted as amplitude against time (s).

    Create an audio signal that combines the speech of two speakers. Listen to the mixed signal.

    [s,fs] = audioread("MultipleSpeakers-16-8-4channel-5secs.flac");
    s = sum(s(:,1:2),2);
    x = s./max(abs(s));
    sound(x,fs)

    Call separateSpeakers with ConserveEnergy set to false to scale each output signal to a maximum absolute value of 1 instead of conserving the input energy. Call the function with no output arguments to plot the signals.

    separateSpeakers(x,fs,ConserveEnergy=false)

    Figure: the input mix, the two separated speaker signals, and the residual, plotted against time (s). With ConserveEnergy set to false, no reconstruction is overlaid on the mix.

    Input Arguments

    audioIn — Audio input, specified as a column vector (single channel).

    Data Types: single | double

    fs — Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: separateSpeakers(audioIn,fs,NumSpeakers=3,ConserveEnergy=false)

    NumSpeakers — Number of speakers to separate, specified as 1, 2, or 3. If you do not specify NumSpeakers, separateSpeakers estimates the number of speakers. For more information, see One-And-Rest Speech Separation.
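
    For example, a minimal sketch (assuming a mixed signal x at sample rate fs) that forces a two-speaker separation:

    % Request exactly two speakers; y has one column per speaker.
    y = separateSpeakers(x,fs,NumSpeakers=2);
    size(y,2)   % returns 2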

    Data Types: single | double

    ConserveEnergy — Option to scale the output signals to conserve the input energy, specified as true or false.

    • If ConserveEnergy is true, separateSpeakers attempts to scale the output signals so that their sum reconstructs the input signal. The energy conservation algorithm includes the residual signal r if NumSpeakers is unspecified or set to 1.

    • If ConserveEnergy is false, separateSpeakers scales each speaker signal and the residual to have a maximum absolute value of 1.
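
    For example, a minimal sketch (assuming a mixed signal x at sample rate fs) that checks how closely the energy-conserving outputs reconstruct the input:

    % Separate with the default ConserveEnergy=true and rebuild the mix.
    [y,r] = separateSpeakers(x,fs);
    reconstruction = sum(y,2) + r;

    % Relative reconstruction error; a small value means the sum of the
    % outputs closely matches the input.
    relativeError = norm(x - reconstruction)/norm(x)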

    Data Types: logical

    Output Arguments

    y — Audio signal separated by speakers, returned as an N-by-C matrix that contains an individual speaker signal in each column.

    • N is the length of the input signal audioIn in samples.

    • C is the number of speakers. You can define the number of speakers by specifying NumSpeakers. Otherwise, the function estimates the number of speakers through One-And-Rest Speech Separation.

    The separated speaker signals have the same sample rate as the input signal.
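
    For example, a minimal sketch that reads C from the output when the function estimates the number of speakers:

    y = separateSpeakers(audioIn,fs);   % let the function estimate the speaker count
    C = size(y,2)                       % number of separated speakers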

    Data Types: single

    r — Residual signal from iterative speaker separation, returned as an N-by-1 vector, where N is the length of the input signal audioIn in samples. For more information about iterative speaker separation, see One-And-Rest Speech Separation.

    The separateSpeakers function does not return a residual if you specify NumSpeakers as 2 or 3.

    Data Types: single

    Algorithms

    The separateSpeakers function uses a pretrained deep learning model to separate the individual speaker signals. The model that it uses depends on the NumSpeakers argument.

    • If you do not specify NumSpeakers, separateSpeakers uses a Conv-TasNet [1] model with "one-and-rest" iterative speaker separation [4].

    • If you set NumSpeakers to 1, separateSpeakers uses a Conv-TasNet model with one iteration of one-and-rest separation.

    • If you set NumSpeakers to 2, separateSpeakers uses a SepFormer [3] model trained to output two speaker signals. This neural network uses pretrained weights from the sepformer-libri2mix model provided by SpeechBrain [2].

    • If you set NumSpeakers to 3, separateSpeakers uses a SepFormer model trained to output three speaker signals. This neural network uses pretrained weights from the sepformer-libri3mix model provided by SpeechBrain [2].

    One-And-Rest Speech Separation

    The separateSpeakers function can separate speech from a signal with an unknown number of speakers using a model that is trained to perform one-and-rest speech separation. In one-and-rest speech separation, the model takes a mixed speech signal and returns two signals: the speech of one individual speaker and the "rest" of the signal, which is the residual of the original signal after separating out the one speaker.

    The function uses one-and-rest speech separation iteratively. First, it separates the mixed speech signal into one speaker and the residual. Then, it uses voice activity detection (VAD) to determine if the residual contains more speakers. If it does not detect speech in the residual, the function stops and returns the separated speakers. Otherwise, it repeats the process and performs one-and-rest speech separation on the residual signal.
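
    The following is a conceptual sketch of this loop, not the internal implementation. Here separateOneAndRest is a hypothetical stand-in for one pass of the pretrained one-and-rest model, and the function's actual voice activity detection step may differ from the detectspeechnn call shown.

    % Conceptual sketch of iterative one-and-rest speech separation.
    speakers = zeros(numel(mixture),0);
    residual = mixture;
    moreSpeech = true;
    while moreSpeech
        % Split off the speech of one speaker; keep the rest as the residual.
        % separateOneAndRest is hypothetical, standing in for the model.
        [one,residual] = separateOneAndRest(residual);
        speakers = [speakers one];

        % Use voice activity detection to decide whether the residual still
        % contains speech. detectspeechnn returns detected speech regions.
        roi = detectspeechnn(residual,fs);
        moreSpeech = ~isempty(roi);
    end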

    Figure: flow chart showing the first two iterations of one-and-rest speech separation.

    References

    [1] Luo, Yi, and Nima Mesgarani. “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation.” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, Aug. 2019, pp. 1256–66. https://doi.org/10.1109/TASLP.2019.2915167.

    [2] Ravanelli, Mirco, et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv, 8 June 2021. http://arxiv.org/abs/2106.04624.

    [3] Subakan, Cem, et al. “Attention Is All You Need in Speech Separation.” ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 21–25. https://doi.org/10.1109/ICASSP39728.2021.9413901.

    [4] Takahashi, Naoya, et al. “Recursive Speech Separation for Unknown Number of Speakers.” Interspeech 2019, ISCA, 2019, pp. 1348–52. https://doi.org/10.21437/Interspeech.2019-1550.

    Extended Capabilities

    GPU Arrays
    Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
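
    A minimal sketch, assuming Parallel Computing Toolbox and a supported GPU:

    % Move the signal to the GPU, run the separation there, then gather
    % the result back to host memory.
    xg = gpuArray(single(x));
    y = separateSpeakers(xg,fs);
    y = gather(y);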

    Version History

    Introduced in R2023b