
This example shows how to define a text encoder model function.

In the context of deep learning, an encoder is the part of a deep learning network that maps the input to some latent space. You can use the resulting latent vectors for various tasks. For example:

- Classification by applying a softmax operation to the encoded data and using cross entropy loss.
- Sequence-to-sequence translation by using the encoded vector as a context vector.
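As a minimal illustration of the classification case, the following sketch applies a softmax and a cross-entropy loss to a single score vector using base MATLAB only. The scores `z` and the one-hot target `t` are made-up values, and a real model would typically first map the encoded data to one score per class.

```matlab
% Hypothetical class scores for one observation (three classes).
z = [2; 1; 0.1];

% Hypothetical one-hot target: the true class is class 1.
t = [1; 0; 0];

% Softmax, shifted by max(z) for numerical stability.
p = exp(z - max(z));
p = p/sum(p);

% Cross-entropy loss between the target and the predicted probabilities.
loss = -sum(t.*log(p));
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability shrinks.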

The file `sonnets.txt` contains all of Shakespeare's sonnets in a single text file.

Read the Shakespeare's Sonnets data from the file `"sonnets.txt"`.

```
filename = "sonnets.txt";
textData = fileread(filename);
```

The sonnets are indented by two whitespace characters. Remove the indentations using `replace` and split the text into separate lines using the `split` function. Remove the header from the first nine elements and the short sonnet titles.

```
textData = replace(textData,"  ","");
textData = split(textData,newline);
textData(1:9) = [];
textData(strlength(textData)<5) = [];
```
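To see what these cleanup steps do, here is a small sketch on a made-up four-line text instead of the real file. The sample text, its one-line header, and the variable names are assumptions for illustration only.

```matlab
% Hypothetical stand-in for the file contents: a header line, a short
% title, and two indented lines of verse.
sampleText = join([ ...
    "SONNETS HEADER", ...
    "  I", ...
    "  From fairest creatures we desire increase,", ...
    "  That thereby beauty's rose might never die,"],newline);

% Remove the two-space indentation and split into lines.
lines = replace(sampleText,"  ","");
lines = split(lines,newline);

% Drop the one-line header and any line shorter than five characters
% (here, the title "I").
lines(1) = [];
lines(strlength(lines)<5) = [];

numLines = numel(lines);
```

Only the two lines of verse remain. Note that the pattern passed to `replace` is two spaces, so single spaces between words are preserved.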

Create a function that tokenizes and preprocesses the text data. The function `preprocessText`, listed at the end of the example, performs these steps:

- Prepends and appends each input string with the specified start and stop tokens, respectively.
- Tokenizes the text using `tokenizedDocument`.

Preprocess the text data and specify the start and stop tokens `"<start>"` and `"<stop>"`, respectively.

```
startToken = "<start>";
stopToken = "<stop>";
documents = preprocessText(textData,startToken,stopToken);
```

Create a word encoding object from the tokenized documents.

```
enc = wordEncoding(documents);
```

When training a deep learning model, the input data must be a numeric array containing sequences of a fixed length. Because the documents have different lengths, you must pad the shorter sequences with a padding value.
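The idea of right-padding can be sketched in base MATLAB without any toolbox functions. The sequences and the padding value below are hypothetical; later in the example, `doc2sequence` performs the padding on the real data.

```matlab
% Hypothetical sequences of word indices with different lengths.
sequences = {[4 7 2], [9 1], [5 3 8 6]};

% Hypothetical index reserved for the padding token.
paddingValue = 10;

% Allocate a matrix filled with the padding value, one row per sequence.
maxLength = max(cellfun(@numel,sequences));
padded = paddingValue*ones(numel(sequences),maxLength);

% Copy each sequence into the left part of its row (right-padding).
for n = 1:numel(sequences)
    padded(n,1:numel(sequences{n})) = sequences{n};
end
```

Each row of `padded` now has the length of the longest sequence, with the padding value filling the positions after the end of the shorter sequences.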

Recreate the word encoding to also include a padding token and determine the index of that token.

```
paddingToken = "<pad>";
newVocabulary = [enc.Vocabulary paddingToken];
enc = wordEncoding(newVocabulary);
paddingIdx = word2ind(enc,paddingToken)
```

```
paddingIdx = 3595
```

The goal of the encoder is to map sequences of word indices to vectors in some latent space.

Initialize the parameters for the following model.

This model uses three operations:

- The embedding maps word indices in the range 1 through `vocabularySize` to vectors of dimension `embeddingDimension`, where `vocabularySize` is the number of words in the encoding vocabulary and `embeddingDimension` is the number of components learned by the embedding.
- The LSTM operation takes as input sequences of word vectors and outputs 1-by-`numHiddenUnits` vectors, where `numHiddenUnits` is the number of hidden units in the LSTM operation.
- The fully connected operation multiplies the input by a weight matrix, adds a bias, and outputs vectors of size `latentDimension`, where `latentDimension` is the dimension of the latent space.
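Conceptually, the embedding step is a table lookup: each word index selects one column of an `embeddingDimension`-by-`vocabularySize` weight matrix. The sketch below shows this with plain indexing and small made-up sizes; it illustrates the idea only and does not reproduce how the `embed` function is implemented.

```matlab
% Small hypothetical sizes so that the values are easy to inspect.
embeddingDimension = 4;
vocabularySize = 6;

% Weight matrix with one column per word in the vocabulary.
weights = reshape(1:embeddingDimension*vocabularySize, ...
    embeddingDimension,vocabularySize);

% A hypothetical sequence of three word indices.
wordIndices = [2 5 5];

% Looking up the embedding selects the matching columns.
vectors = weights(:,wordIndices);
```

The result has one `embeddingDimension`-by-1 vector per word, and repeated word indices yield identical vectors.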

Specify the dimensions of the parameters.

```
embeddingDimension = 100;
numHiddenUnits = 150;
latentDimension = 50;
vocabularySize = enc.NumWords;
```

Create a struct for the parameters.

```
parameters = struct;
```

Initialize the weights of the embedding with a Gaussian distribution using the `initializeGaussian` function, which is attached to this example as a supporting file. Specify a mean of 0 and a standard deviation of 0.01. To learn more, see Gaussian Initialization.

```
mu = 0;
sigma = 0.01;
parameters.emb.Weights = initializeGaussian([embeddingDimension vocabularySize],mu,sigma);
```
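The supporting file itself is not listed here. As a rough sketch of what Gaussian initialization amounts to, the anonymous function `gaussianInit` below (a hypothetical name, not the attached file) scales and shifts standard normal samples. The attached function may differ, for example by returning a `dlarray`.

```matlab
% Hypothetical stand-in for a Gaussian initializer: samples of the
% requested size with mean mu and standard deviation sigma.
gaussianInit = @(sz,mu,sigma) randn(sz,'single')*sigma + mu;

% Small made-up sizes for illustration.
W = gaussianInit([4 6],0,0.01);
```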

Initialize the learnable parameters for the encoder LSTM operation:

- Initialize the input weights with the Glorot initializer using the `initializeGlorot` function, which is attached to this example as a supporting file. To learn more, see Glorot Initialization.
- Initialize the recurrent weights with the orthogonal initializer using the `initializeOrthogonal` function, which is attached to this example as a supporting file. To learn more, see Orthogonal Initialization.
- Initialize the bias with the unit forget gate initializer using the `initializeUnitForgetGate` function, which is attached to this example as a supporting file. To learn more, see Unit Forget Gate Initialization.

The sizes of the learnable parameters depend on the size of the input. Because the inputs to the LSTM operation are sequences of word vectors from the embedding operation, the number of input channels is `embeddingDimension`.

- The input weight matrix has size `4*numHiddenUnits`-by-`inputSize`, where `inputSize` is the dimension of the input data.
- The recurrent weight matrix has size `4*numHiddenUnits`-by-`numHiddenUnits`.
- The bias vector has size `4*numHiddenUnits`-by-1.

```
sz = [4*numHiddenUnits embeddingDimension];
numOut = 4*numHiddenUnits;
numIn = embeddingDimension;

parameters.lstmEncoder.InputWeights = initializeGlorot(sz,numOut,numIn);
parameters.lstmEncoder.RecurrentWeights = initializeOrthogonal([4*numHiddenUnits numHiddenUnits]);
parameters.lstmEncoder.Bias = initializeUnitForgetGate(numHiddenUnits);
```
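The three initializers are provided as supporting files and are not listed here. The sketch below shows one common way to compute each kind of initial value in base MATLAB, following the textbook definitions of the schemes. The sizes are small made-up values, the variable names are illustrative, and the attached functions may differ in details such as data type.

```matlab
% Small hypothetical sizes for readability.
numHiddenUnits = 3;
inputSize = 5;

% Glorot (Xavier) uniform: sample from the interval [-bound, bound].
numOut = 4*numHiddenUnits;
numIn = inputSize;
bound = sqrt(6/(numIn + numOut));
inputWeights = bound*(2*rand(numOut,numIn) - 1);

% Orthogonal: use the Q factor of a random Gaussian matrix.
[Q,R] = qr(randn(4*numHiddenUnits,numHiddenUnits),0);
d = sign(diag(R));
d(d == 0) = 1;   % guard against a zero on the diagonal of R
recurrentWeights = Q*diag(d);

% Unit forget gate: zeros everywhere except ones for the forget gate,
% assuming the gate order input, forget, cell candidate, output.
bias = zeros(4*numHiddenUnits,1);
bias(numHiddenUnits+1:2*numHiddenUnits) = 1;
```

The columns of `recurrentWeights` are orthonormal, and setting the forget gate bias to 1 encourages the LSTM to retain its cell state early in training.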

Initialize the learnable parameters for the encoder fully connected operation:

- Initialize the weights with the Glorot initializer.
- Initialize the bias with zeros using the `initializeZeros` function, which is attached to this example as a supporting file. To learn more, see Zeros Initialization.

The sizes of the learnable parameters depend on the size of the input. Because the inputs to the fully connected operation are the outputs of the LSTM operation, the number of input channels is `numHiddenUnits`. To make the fully connected operation output vectors with size `latentDimension`, specify an output size of `latentDimension`.

- The weights matrix has size `outputSize`-by-`inputSize`, where `outputSize` and `inputSize` correspond to the output and input dimensions, respectively.
- The bias vector has size `outputSize`-by-1.

```
sz = [latentDimension numHiddenUnits];
numOut = latentDimension;
numIn = numHiddenUnits;

parameters.fcEncoder.Weights = initializeGlorot(sz,numOut,numIn);
parameters.fcEncoder.Bias = initializeZeros([latentDimension 1]);
```
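The size bookkeeping for the fully connected operation can be checked with ordinary matrix arithmetic. This sketch uses random stand-in values (not the initializers above) and a hypothetical mini-batch of 32 LSTM outputs. It mirrors the multiply-and-add that the operation performs, not the `fullyconnect` function itself.

```matlab
latentDimension = 50;
numHiddenUnits = 150;
miniBatchSize = 32;   % hypothetical number of observations

% Stand-in weights, bias, and LSTM outputs.
W = randn(latentDimension,numHiddenUnits);
b = zeros(latentDimension,1);
X = randn(numHiddenUnits,miniBatchSize);

% Multiply by the weights and add the bias to every column.
Z = W*X + b;
outputSize = size(Z);
```

The output has one `latentDimension`-by-1 column per observation.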

Create the function `modelEncoder`, listed in the Encoder Model Function section of the example, that computes the output of the encoder model. The `modelEncoder` function takes as input the model parameters, sequences of word indices, and the sequence lengths, and returns the corresponding latent feature vector.

To train the model using a custom training loop, you must iterate over mini-batches of data and convert them into the format required for the encoder model and the model gradients function. This section of the example illustrates the steps needed for preparing a mini-batch of data inside the custom training loop.

Prepare an example mini-batch of data. Select a mini-batch of 32 documents from `documents`. This represents the mini-batch of data used in an iteration of a custom training loop.

```
miniBatchSize = 32;
idx = 1:miniBatchSize;
documentsBatch = documents(idx);
```

Convert the documents to sequences using the `doc2sequence` function and specify to right-pad the sequences with the word index corresponding to the padding token.

```
X = doc2sequence(enc,documentsBatch, ...
    'PaddingDirection','right', ...
    'PaddingValue',paddingIdx);
```

The output of the `doc2sequence` function is a cell array, where each element is a row vector of word indices. Because the encoder model function requires numeric input, concatenate the rows of the data using the `cat` function and specify to concatenate along the first dimension. The output has size `miniBatchSize`-by-`sequenceLength`, where `sequenceLength` is the length of the longest sequence in the mini-batch.

```
X = cat(1,X{:});
size(X)
```

```
ans = 1×2

    32    14
```

Convert the data to a `dlarray` with format `'BTC'` (batch, time, channel). The software automatically rearranges the output to have format `'CBT'` so the output has size `1`-by-`miniBatchSize`-by-`sequenceLength`.

```
dlX = dlarray(X,'BTC');
size(dlX)
```

```
ans = 1×3

     1    32    14
```
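The reported size corresponds to the dimension order channel, batch, time, with a singleton channel dimension. The same reordering can be mimicked on a plain numeric array with `permute`. This sketch uses made-up word indices and only illustrates the resulting layout, not how `dlarray` stores data internally.

```matlab
miniBatchSize = 32;
sequenceLength = 14;

% Hypothetical batch-by-time matrix of word indices.
X = randi(100,miniBatchSize,sequenceLength);

% Move the (singleton) third dimension to the front:
% channel first, then batch, then time.
Xcbt = permute(X,[3 1 2]);
layoutSize = size(Xcbt);

% The sequence of observation 5 is unchanged by the reordering.
seq5 = squeeze(Xcbt(1,5,:))';
```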

For masking, calculate the unpadded sequence lengths of the input data using the `doclength` function with the mini-batch of documents as input.

```
sequenceLengths = doclength(documentsBatch);
```

This code snippet shows an example of preparing a mini-batch in a custom training loop.

```
iteration = 0;

% Loop over epochs.
for epoch = 1:numEpochs

    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch
        iteration = iteration + 1;

        % Read mini-batch.
        idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
        documentsBatch = documents(idx);

        % Convert to sequences.
        X = doc2sequence(enc,documentsBatch, ...
            'PaddingDirection','right', ...
            'PaddingValue',paddingIdx);
        X = cat(1,X{:});

        % Convert to dlarray.
        dlX = dlarray(X,'BTC');

        % Calculate sequence lengths.
        sequenceLengths = doclength(documentsBatch);

        % Evaluate model gradients.
        % ...

        % Update learnable parameters.
        % ...
    end
end
```
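The snippet leaves `numEpochs` and `numIterationsPerEpoch` undefined. One common choice, shown here as an assumption and not as part of the original example, is to use only complete mini-batches, so the number of iterations per epoch is the number of observations divided by the mini-batch size, rounded down. The observation count of 100 is a made-up value.

```matlab
numObservations = 100;   % hypothetical number of documents
miniBatchSize = 32;

% Use only complete mini-batches; leftover observations are skipped.
numIterationsPerEpoch = floor(numObservations/miniBatchSize);

for i = 1:numIterationsPerEpoch
    % Same index arithmetic as in the training loop snippet.
    idx = (i-1)*miniBatchSize+1:i*miniBatchSize;
    fprintf("Iteration %d uses observations %d to %d\n",i,idx(1),idx(end));
end
```

With these values, the loop runs three times and the last four observations are not used in the epoch.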

When training a deep learning model with a custom training loop, you must calculate the gradients of the loss with respect to the learnable parameters. This calculation depends on the output of a forward pass of the model function.

To perform a forward pass of the encoder, use the `modelEncoder` function directly with the parameters, data, and sequence lengths as input. The output is a `latentDimension`-by-`miniBatchSize` matrix.

```
dlZ = modelEncoder(parameters,dlX,sequenceLengths);
size(dlZ)
```

```
ans = 1×2

    50    32
```

This code snippet shows an example of using a model encoder function inside the model gradients function.

```
function gradients = modelGradients(parameters,dlX,sequenceLengths)

    dlZ = modelEncoder(parameters,dlX,sequenceLengths);

    % Calculate loss.
    % ...

    % Calculate gradients.
    % ...

end
```

This code snippet shows an example of evaluating the model gradients in a custom training loop.

```
iteration = 0;

% Loop over epochs.
for epoch = 1:numEpochs

    % Loop over mini-batches.
    for i = 1:numIterationsPerEpoch
        iteration = iteration + 1;

        % Prepare mini-batch.
        % ...

        % Evaluate model gradients.
        gradients = dlfeval(@modelGradients, parameters, dlX, sequenceLengths);

        % Update learnable parameters.
        [parameters,trailingAvg,trailingAvgSq] = adamupdate(parameters,gradients, ...
            trailingAvg,trailingAvgSq,iteration);
    end
end
```

The `modelEncoder` function takes as input the model parameters, sequences of word indices, and the sequence lengths, and returns the corresponding latent feature vector.

Because the input data contains padded sequences of different lengths, the padding can have adverse effects on loss calculations. For the LSTM operation, instead of returning the output of the last time step of the sequence (which likely corresponds to the LSTM state after processing many padding values), determine the actual last time step given by the `sequenceLengths` input.

```
function dlZ = modelEncoder(parameters,dlX,sequenceLengths)

    % Embedding.
    weights = parameters.emb.Weights;
    dlZ = embed(dlX,weights);

    % LSTM.
    inputWeights = parameters.lstmEncoder.InputWeights;
    recurrentWeights = parameters.lstmEncoder.RecurrentWeights;
    bias = parameters.lstmEncoder.Bias;

    numHiddenUnits = size(recurrentWeights,2);
    hiddenState = zeros(numHiddenUnits,1,'like',dlX);
    cellState = zeros(numHiddenUnits,1,'like',dlX);

    dlZ1 = lstm(dlZ,hiddenState,cellState,inputWeights,recurrentWeights,bias);

    % Output mode 'last' with masking.
    miniBatchSize = size(dlZ1,2);
    dlZ = zeros(numHiddenUnits,miniBatchSize,'like',dlZ1);
    dlZ = dlarray(dlZ,'CB');

    for n = 1:miniBatchSize
        t = sequenceLengths(n);
        dlZ(:,n) = dlZ1(:,n,t);
    end

    % Fully connect.
    weights = parameters.fcEncoder.Weights;
    bias = parameters.fcEncoder.Bias;
    dlZ = fullyconnect(dlZ,weights,bias);

end
```
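The masking loop in `modelEncoder` picks, for each observation, the LSTM output at that observation's last unpadded time step. The following sketch reproduces just that selection on a plain numeric array with made-up sizes and values, so that the result can be checked by hand.

```matlab
% Small hypothetical sizes.
numHiddenUnits = 2;
miniBatchSize = 3;
sequenceLength = 4;

% Stand-in for the LSTM output, laid out as channel-by-batch-by-time.
Y = reshape(1:numHiddenUnits*miniBatchSize*sequenceLength, ...
    numHiddenUnits,miniBatchSize,sequenceLength);

% Hypothetical unpadded lengths of the three sequences.
sequenceLengths = [4 2 3];

% For each observation, keep the output at its last unpadded time step.
Z = zeros(numHiddenUnits,miniBatchSize);
for n = 1:miniBatchSize
    t = sequenceLengths(n);
    Z(:,n) = Y(:,n,t);
end
```

Only the first observation uses the final time step. The other two columns come from time steps 2 and 3, so values computed on padding are ignored.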

The function `preprocessText` performs these steps:

- Prepends and appends each input string with the specified start and stop tokens, respectively.
- Tokenizes the text using `tokenizedDocument`.

```
function documents = preprocessText(textData,startToken,stopToken)

    % Add start and stop tokens.
    textData = startToken + textData + stopToken;

    % Tokenize the text.
    documents = tokenizedDocument(textData,'CustomTokens',[startToken stopToken]);

end
```
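The first step of `preprocessText` relies on the `+` operator for strings, which concatenates element-wise without inserting whitespace. The short sketch below, using two made-up lines of text, shows the result of that step on its own.

```matlab
% Hypothetical input: a string array with two lines of text.
textData = ["From fairest creatures"; "That thereby beauty's rose"];

startToken = "<start>";
stopToken = "<stop>";

% Scalar strings expand across the array; no spaces are added.
wrapped = startToken + textData + stopToken;
```

Because the tokens end up directly attached to the neighboring words, passing them through the `'CustomTokens'` option is presumably what lets the tokenizer still treat each of them as a single token.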

`dlfeval` | `dlgradient` | `dlarray`