speech2text
Syntax
transcript = speech2text(audioIn,fs)
transcript = speech2text(audioIn,fs,Name=Value)
[transcript,rawOutput] = speech2text(___)
Description
transcript = speech2text(audioIn,fs) transcribes speech in the input audio signal to text using a pretrained wav2vec 2.0 model [1].
Using wav2vec 2.0 requires Deep Learning Toolbox™ and installing the pretrained model.
transcript = speech2text(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example, speech2text(x,fs,Language="es") transcribes a signal containing Spanish-language speech.
[transcript,rawOutput] = speech2text(___) also returns the unprocessed server output when the client interfaces with a third-party speech service.
Examples
Download wav2vec 2.0 Functionality
Type speechClient("wav2vec2.0") into the command line. If the required model files are not installed, then the function throws an error and provides a link to download them. Click the link, and unzip the file to a location on the MATLAB path.
Alternatively, execute the following commands to download and unzip the wav2vec model files to your temporary directory.
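% Download the zipped wav2vec 2.0 model files into the temporary directory
downloadFolder = fullfile(tempdir,"wav2vecDownload");
loc = websave(downloadFolder,"https://ssd.mathworks.com/supportfiles/audio/asr-wav2vec2-librispeech.zip");
% Unzip the files and add the model folder to the MATLAB path
modelsLocation = tempdir;
unzip(loc,modelsLocation)
addpath(fullfile(modelsLocation,"asr-wav2vec2-librispeech"))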
Check that the installation is successful by typing speechClient("wav2vec2.0") into the command line. If the files are installed, then the function returns a Wav2VecSpeechClient object.
speechClient("wav2vec2.0")
ans = 
  Wav2VecSpeechClient with properties:
    Segmentation: 'word'
      TimeStamps: 0
        Language: 'english'
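Because speechClient throws an error when the model files are missing, a script can check for the installation programmatically before it tries to transcribe anything. The following is a minimal sketch, assuming that catching the error is enough for your workflow; the variable names client and modelInstalled are illustrative only.

```matlab
% Try to create the default wav2vec 2.0 client. If the pretrained model
% files are not installed, speechClient throws an error, and the error
% message points to the download.
try
    client = speechClient("wav2vec2.0");
    modelInstalled = true;
catch err
    modelInstalled = false;
    disp(err.message)
end
```

When modelInstalled is false, the displayed message tells you where to get the model files.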
Perform Speech-to-Text Transcription
Read in an audio file containing speech and listen to it.
[y,fs] = audioread("speech_dft.wav");
sound(y,fs)
Use speech2text to transcribe the audio signal using the wav2vec 2.0 pretrained network. This requires installing the pretrained network. If the network is not installed, the function provides a link with instructions to download and install the pretrained model.
transcript = speech2text(y,fs)
transcript = "the discreet forier transform of a real valued signal is conjugate symmetric"
Use Emformer for Streaming Speech-to-Text
Create a speechClient object that uses the Emformer pretrained model.
emformerSpeechClient = speechClient("emformer");
Create a dsp.AudioFileReader object to read in an audio file. In a streaming loop, read in frames of the audio file and transcribe the speech using speech2text with the Emformer speechClient. The Emformer speechClient object maintains an internal state to perform the streaming speech-to-text transcription.
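% Read the file frame by frame to simulate a live audio stream
afr = dsp.AudioFileReader("Counting-16-44p1-mono-15secs.wav");
txtTotal = "";
while ~isDone(afr)
    x = afr();
    % The Emformer client keeps state between calls, so each frame
    % continues the transcription of the previous frames
    txt = speech2text(x,afr.SampleRate,Client=emformerSpeechClient);
    txtTotal = txtTotal + txt;
end
txtTotal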
txtTotal = "one two three four five six seven eight nine"
Transcribe Speech Containing Spanish
Read in an audio file containing speech in the Spanish language and listen to it.
[x,fs] = audioread("spanish.wav");
sound(x,fs)
Use speech2text with Language set to "spanish" to transcribe the speech.
transcript = speech2text(x,fs,Language="spanish")
transcript = "la inductancia mutua de los circuitos depende exclusivamente de la geometría de los mismos."
Translate and Transcribe Speech with Whisper
Create a speechClient object that uses a Whisper pretrained model. Set Task to "translate" to translate other languages into English when performing speech-to-text with this object.
whisperSpeechClient = speechClient("whisper",Task="translate");
Read in a speech signal containing Polish and listen to it.
[x,fs] = audioread("polish.wav");
sound(x,fs)
Call speech2text on the signal with Client set to the Whisper client object to translate and transcribe the speech simultaneously.
translatedTranscript = speech2text(x,fs,Client=whisperSpeechClient)
translatedTranscript = "Good day, I am Polish."
Input Arguments
audioIn
— Audio input
column vector
Audio input signal, specified as a column vector (single channel).
Data Types: single | double
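Because audioIn must be a single-channel column vector, a multichannel recording (for example, a stereo file returned by audioread) has to be reduced to one channel first. The sketch below averages the channels, which is one common choice but an assumption here; selecting one channel with y(:,1) also yields a column vector. The synthetic two-channel tone is a stand-in for real speech.

```matlab
% Synthetic two-channel (stereo) signal standing in for a real recording
fs = 16000;                          % sample rate in Hz
t = (0:fs-1)'/fs;                    % one second of time samples (column)
stereo = [sin(2*pi*440*t), 0.5*sin(2*pi*440*t)];

% Average across channels (dimension 2) to get a single-channel column vector
mono = mean(stereo,2);

% mono now has the shape expected for audioIn, for example:
% transcript = speech2text(mono,fs);
```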
fs
— Sample rate (Hz)
positive scalar
Sample rate in Hz, specified as a positive scalar.
Data Types: single | double
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: speech2text(x,fs,Language="es")
Client
— Client object
speechClient("wav2vec2.0") (default) | speechClient object
Client object, specified as an object returned by speechClient. The object is an interface to a pretrained model or to a third-party speech service. By default, speech2text uses a wav2vec 2.0 client object.
Using speech2text with wav2vec 2.0 requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.
Using the Emformer or Whisper models requires Deep Learning Toolbox and Audio Toolbox™ Interface for SpeechBrain and Torchaudio Libraries. If this support package is not installed, calling speechClient with "emformer" provides a link to the Add-On Explorer, where you can download and install the support package.
To use any of the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.
Example: speechClient("wav2vec2.0")
Language
— Language spoken in input signal
"english" (default) | "spanish" | "italian" | "french" | "german"
Language spoken in the input signal, specified as "english", "spanish", "italian", "french", or "german". You can also specify the ISO language codes ("en", "es", "it", "fr", "de").
This argument applies only when using the default Client. If you specify Client, set the Language property on the client object.
Data Types: char | string
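When you pass your own client, the Language name-value argument does not apply, so the language has to be set on the client object. A minimal sketch, assuming the spanish.wav file from the Spanish transcription example and assuming the wav2vec 2.0 client accepts "spanish" for its Language property just as the default client accepts it for the Language argument:

```matlab
% Read a Spanish-language recording (file from the example above)
[x,fs] = audioread("spanish.wav");

% Create a wav2vec 2.0 client and set its Language property directly,
% because the Language name-value argument applies only to the default client
client = speechClient("wav2vec2.0");
client.Language = "spanish";   % assumption: same values as the Language argument

transcript = speech2text(x,fs,Client=client);
```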
Output Arguments
transcript
— Speech transcript
table | string
Speech transcript of the input audio signal, returned as a table with a column containing the transcript and another column containing the associated confidence metrics. If the Segmentation property of Client is "none", speech2text returns the transcript as a string.
The returned table can have additional columns depending on the speechClient properties and server options.
Data Types: table | string
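Because transcript can be either a table or a string depending on the client configuration, code that consumes it may need to handle both cases. A minimal sketch, assuming the transcribed text is in the first table column (the column order is an assumption based on the description above, not a documented guarantee) and that fullText is just an illustrative variable name:

```matlab
[y,fs] = audioread("speech_dft.wav");
transcript = speech2text(y,fs);

if istable(transcript)
    % Assumption: the first column holds the transcribed text segments
    segments = string(transcript{:,1});
    fullText = join(segments," ");
else
    % transcript is already a single string
    fullText = transcript;
end
disp(fullText)
```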
rawOutput
— Unprocessed server output
ResponseMessage | structure
Unprocessed server output, returned as a matlab.net.http.ResponseMessage object containing the HTTP response from the third-party speech service. If the third-party speech service is Amazon®, speech2text returns the server output as a structure.
This output argument does not apply if Client interfaces with a pretrained model.
References
[1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.
Extended Capabilities
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
The speech2text function accepts gpuArray input only when using wav2vec 2.0. If you are using Emformer or Whisper models, set the ExecutionEnvironment property on the speechClient object to use a GPU.
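A minimal sketch of both GPU paths, assuming a supported GPU and Parallel Computing Toolbox are available. Passing ExecutionEnvironment as a name-value argument mirrors how Task is set in the Whisper example, and the value "gpu" is an assumption; check the speechClient reference page for the accepted values.

```matlab
[y,fs] = audioread("speech_dft.wav");

% wav2vec 2.0 (default client): pass the audio as a gpuArray
transcriptW2V = speech2text(gpuArray(y),fs);

% Emformer or Whisper: keep the input on the CPU and ask the client to
% run on the GPU through its ExecutionEnvironment property
% (assumption: "gpu" is an accepted value)
whisperClient = speechClient("whisper",ExecutionEnvironment="gpu");
transcriptWhisper = speech2text(y,fs,Client=whisperClient);
```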
Version History
Introduced in R2022b