speech2text

Transcribe speech signal to text

Since R2022b

    Description

    transcript = speech2text(audioIn,fs) transcribes speech in the input audio signal to text using a pretrained wav2vec 2.0 model.

    Note

    Using wav2vec 2.0 requires Deep Learning Toolbox™ and installation of the pretrained model.

    transcript = speech2text(audioIn,fs,Client=clientObj) transcribes speech using the specified pretrained deep learning model or third-party speech service.

    Note

    Using the Emformer pretrained model requires Deep Learning Toolbox and Audio Toolbox™ Interface for SpeechBrain and Torchaudio Libraries. You can download this support package from the Add-On Explorer. For more information, see Get and Manage Add-Ons.

    To use third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    [transcript,rawOutput] = speech2text(___) also returns the unprocessed server output from the third-party speech service.

    Examples

    Download and install the pretrained wav2vec 2.0 model for speech-to-text transcription.

    Enter speechClient("wav2vec2.0") at the command line. If the pretrained wav2vec 2.0 model is not installed, the function provides a download link. To install the model, click the link to download the file and unzip it to a location on the MATLAB path.

    Alternatively, execute the following commands to download the wav2vec 2.0 model, unzip it to your temporary directory, and then add it to your MATLAB path.

    downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/wav2vec2-base-960.zip");
    wav2vecLocation = fullfile(tempdir,"wav2vec");
    unzip(downloadFile,wav2vecLocation)
    addpath(wav2vecLocation)

    Check that the installation is successful by entering speechClient("wav2vec2.0") at the command line. If the model is installed, the function returns a Wav2VecSpeechClient object.

    speechClient("wav2vec2.0")
    ans = 
      Wav2VecSpeechClient with properties:
    
        Segmentation: 'word'
          TimeStamps: 0
    
    

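    The properties shown in the display suggest how the client can be configured. As a sketch, assuming speechClient accepts name-value arguments that mirror those properties:

    ```matlab
    % Create a wav2vec 2.0 client that segments the transcript by word.
    % The Segmentation and TimeStamps name-value arguments are assumed to
    % mirror the properties displayed above.
    clientObj = speechClient("wav2vec2.0",Segmentation="word",TimeStamps=true);

    % Pass the configured client to speech2text.
    [y,fs] = audioread("speech_dft.wav");
    transcript = speech2text(y,fs,Client=clientObj);
    ```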
    Read in an audio file containing speech and listen to it.

    [y,fs] = audioread("speech_dft.wav");
    sound(y,fs)

    Use speech2text to transcribe the audio signal using the wav2vec 2.0 pretrained network. This requires that the pretrained network is installed. If it is not, the function provides a link with instructions to download and install it.

    transcript = speech2text(y,fs)
    transcript = 
    "the discreet forier transform of a real valued signal is conjugate symmetric"
    

    Create a speechClient object that uses the Emformer pretrained model.

    emformerSpeechClient = speechClient("emformer");

    Create a dsp.AudioFileReader object to read in an audio file. In a streaming loop, read in frames of the audio file and transcribe the speech using speech2text with the Emformer speechClient. The Emformer speechClient object maintains an internal state to perform the streaming speech-to-text transcription.

    afr = dsp.AudioFileReader("Counting-16-44p1-mono-15secs.wav");
    txtTotal = "";
    while ~isDone(afr)
        x = afr();
        txt = speech2text(x,afr.SampleRate,Client=emformerSpeechClient);
        txtTotal = txtTotal + txt;
    end
    
    txtTotal
    txtTotal = 
    "one two three four five six seven eight nine"
    

    Input Arguments

    audioIn — Audio input signal

    Audio input signal, specified as a column vector (single channel).

    Data Types: single | double
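    Because audioIn must be a single channel, a multichannel recording needs a mixdown before transcription. A minimal sketch (the file name is a placeholder):

    ```matlab
    % Read a (possibly multichannel) audio recording.
    % "myRecording.wav" is a placeholder file name.
    [x,fs] = audioread("myRecording.wav");

    % speech2text expects a single-channel column vector, so mix
    % a multichannel signal down to mono by averaging the channels.
    if size(x,2) > 1
        x = mean(x,2);
    end

    transcript = speech2text(x,fs);
    ```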

    fs — Sample rate in Hz

    Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double

    clientObj — Client object

    Client object, specified as an object returned by speechClient. The object is an interface to a pretrained model or to a third-party speech service. By default, speech2text uses a wav2vec 2.0 client object.

    Using speech2text with wav2vec 2.0 requires Deep Learning Toolbox and installation of the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.

    Using the Emformer model requires Deep Learning Toolbox and Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries. If this support package is not installed, calling speechClient with "emformer" provides a link to the Add-On Explorer, where you can download and install the support package.

    To use any of the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    Example: speechClient("wav2vec2.0")

    Output Arguments

    transcript — Speech transcript

    Speech transcript of the input audio signal, returned as a table with a column containing the transcript and another column containing the associated confidence metrics. If the Segmentation property of clientObj is "none", speech2text returns the transcript as a string.

    The returned table can have additional columns depending on the speechClient properties and server options.

    Data Types: table | string
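    As a sketch of working with the table output (the variable names Transcript and Confidence are assumptions based on the description above, not confirmed column names):

    ```matlab
    % Transcribe with word-level segmentation so each table row is one word.
    [y,fs] = audioread("speech_dft.wav");
    clientObj = speechClient("wav2vec2.0",Segmentation="word");
    transcript = speech2text(y,fs,Client=clientObj);

    % Assumed column names: Transcript (string) and Confidence (numeric).
    words  = transcript.Transcript;
    scores = transcript.Confidence;

    % Keep only the words transcribed with high confidence.
    confidentWords = words(scores > 0.9);
    ```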

    rawOutput — Unprocessed server output

    Unprocessed server output, returned as a matlab.net.http.ResponseMessage object containing the HTTP response from the third-party speech service. If the third-party speech service is Amazon®, speech2text returns the server output as a structure.

    This output argument does not apply if clientObj interfaces with a pretrained model.
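    As a sketch of inspecting the raw response from a third-party service (the service name "IBM" is an assumption for illustration; any configured third-party speechClient would work):

    ```matlab
    % This assumes the extended Audio Toolbox functionality from File
    % Exchange is installed and a third-party service is configured.
    clientObj = speechClient("IBM");   % hypothetical third-party client

    [y,fs] = audioread("speech_dft.wav");
    [transcript,rawOutput] = speech2text(y,fs,Client=clientObj);

    % rawOutput is a matlab.net.http.ResponseMessage; its StatusCode and
    % Body properties expose the HTTP status and payload from the server.
    rawOutput.StatusCode
    rawOutput.Body.Data
    ```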

    References

    [1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.

    Version History

    Introduced in R2022b