speech2text

Transcribe speech signal to text

Since R2022b

    Description

    transcript = speech2text(audioIn,fs) transcribes speech in the input audio signal to text using a pretrained wav2vec 2.0 model.

    Note

    Using wav2vec 2.0 requires Deep Learning Toolbox™ and installation of the pretrained model.

    transcript = speech2text(audioIn,fs,Client=clientObj) transcribes speech using the specified pretrained deep learning model or third-party speech service.

    Note

    Using the Emformer pretrained model requires Deep Learning Toolbox and Audio Toolbox™ Interface for SpeechBrain and Torchaudio Libraries. You can download this support package from the Add-On Explorer. For more information, see Get and Manage Add-Ons.

    To use third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    [transcript,rawOutput] = speech2text(___) also returns the unprocessed server output from the third-party speech service.

    Examples

    Download and install the pretrained wav2vec 2.0 model for speech-to-text transcription.

    Enter speechClient("wav2vec2.0") at the command line. If the pretrained wav2vec 2.0 model is not installed, the function provides a download link. To install the model, click the link to download the file and unzip it to a location on the MATLAB path.

    Alternatively, execute the following commands to download the wav2vec 2.0 model, unzip it to your temporary directory, and then add it to your MATLAB path.

    downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/wav2vec2-base-960.zip");
    wav2vecLocation = fullfile(tempdir,"wav2vec");
    unzip(downloadFile,wav2vecLocation)
    addpath(wav2vecLocation)

    Check that the installation is successful by entering speechClient("wav2vec2.0") at the command line. If the model is installed, the function returns a Wav2VecSpeechClient object.

    speechClient("wav2vec2.0")
    ans = 
      Wav2VecSpeechClient with properties:
    
        Segmentation: 'word'
          TimeStamps: 0
    
    

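    The properties shown in the display suggest how the client can be configured. As a sketch, assuming speechClient accepts name-value arguments that mirror those properties:

    ```matlab
    % Create a wav2vec 2.0 client that segments the transcript by word.
    % The Segmentation and TimeStamps name-value arguments are assumed to
    % mirror the properties displayed above.
    clientObj = speechClient("wav2vec2.0",Segmentation="word",TimeStamps=true);

    % Pass the configured client to speech2text.
    [y,fs] = audioread("speech_dft.wav");
    transcript = speech2text(y,fs,Client=clientObj);
    ```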
    Read in an audio file containing speech and listen to it.

    [y,fs] = audioread("speech_dft.wav");
    sound(y,fs)

    Use speech2text to transcribe the audio signal using the wav2vec 2.0 pretrained network. This requires that the pretrained network is installed. If it is not, the function provides a link with instructions to download and install it.

    transcript = speech2text(y,fs)
    transcript = 
    "the discreet forier transform of a real valued signal is conjugate symmetric"
    

    Create a speechClient object that uses the Emformer pretrained model.

    emformerSpeechClient = speechClient("emformer");

    Create a dsp.AudioFileReader object to read in an audio file. In a streaming loop, read in frames of the audio file and transcribe the speech using speech2text with the Emformer speechClient. The Emformer speechClient object maintains an internal state to perform the streaming speech-to-text transcription.

    afr = dsp.AudioFileReader("Counting-16-44p1-mono-15secs.wav");
    txtTotal = "";
    while ~isDone(afr)
        x = afr();
        txt = speech2text(x,afr.SampleRate,Client=emformerSpeechClient);
        txtTotal = txtTotal + txt;
    end
    
    txtTotal
    txtTotal = 
    "one two three four five six seven eight nine"
    

    Input Arguments

    audioIn — Audio input signal

    Audio input signal, specified as a column vector (single channel).

    Data Types: single | double
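    Because audioIn must be a single channel, a multichannel recording needs a mixdown before transcription. A minimal sketch (the file name is a placeholder):

    ```matlab
    % Read a (possibly multichannel) audio recording.
    % "myRecording.wav" is a placeholder file name.
    [x,fs] = audioread("myRecording.wav");

    % speech2text expects a single-channel column vector, so mix
    % a multichannel signal down to mono by averaging the channels.
    if size(x,2) > 1
        x = mean(x,2);
    end

    transcript = speech2text(x,fs);
    ```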

    fs — Sample rate in Hz

    Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double

    clientObj — Client object

    Client object, specified as an object returned by speechClient. The object is an interface to a pretrained model or to a third-party speech service. By default, speech2text uses a wav2vec 2.0 client object.

    Using speech2text with wav2vec 2.0 requires Deep Learning Toolbox and installation of the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.

    Using the Emformer model requires Deep Learning Toolbox and Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries. If this support package is not installed, calling speechClient with "emformer" provides a link to the Add-On Explorer, where you can download and install the support package.

    To use any of the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    Example: speechClient("wav2vec2.0")

    Output Arguments

    transcript — Speech transcript

    Speech transcript of the input audio signal, returned as a table with a column containing the transcript and another column containing the associated confidence metrics. If the Segmentation property of clientObj is "none", speech2text returns the transcript as a string.

    The returned table can have additional columns depending on the speechClient properties and server options.

    Data Types: table | string
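    As a sketch of working with the table output (the variable names Transcript and Confidence are assumptions based on the description above, not confirmed column names):

    ```matlab
    % Transcribe with word-level segmentation so each table row is one word.
    [y,fs] = audioread("speech_dft.wav");
    clientObj = speechClient("wav2vec2.0",Segmentation="word");
    transcript = speech2text(y,fs,Client=clientObj);

    % Assumed column names: Transcript (string) and Confidence (numeric).
    words  = transcript.Transcript;
    scores = transcript.Confidence;

    % Keep only the words transcribed with high confidence.
    confidentWords = words(scores > 0.9);
    ```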

    rawOutput — Unprocessed server output

    Unprocessed server output, returned as a matlab.net.http.ResponseMessage object containing the HTTP response from the third-party speech service. If the third-party speech service is Amazon®, speech2text returns the server output as a structure.

    This output argument does not apply if clientObj interfaces with a pretrained model.
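    As a sketch of inspecting the raw response from a third-party service (the service name "IBM" is an assumption for illustration; any configured third-party speechClient would work):

    ```matlab
    % This assumes the extended Audio Toolbox functionality from File
    % Exchange is installed and a third-party service is configured.
    clientObj = speechClient("IBM");   % hypothetical third-party client

    [y,fs] = audioread("speech_dft.wav");
    [transcript,rawOutput] = speech2text(y,fs,Client=clientObj);

    % rawOutput is a matlab.net.http.ResponseMessage; its StatusCode and
    % Body properties expose the HTTP status and payload from the server.
    rawOutput.StatusCode
    rawOutput.Body.Data
    ```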

    References

    [1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.

    Version History

    Introduced in R2022b