Develop Custom Mini-Batch Datastore

A mini-batch datastore is an implementation of a datastore with support for reading data in batches. You can use a mini-batch datastore as a source of training, validation, test, and prediction data sets for deep learning applications that use Deep Learning Toolbox™.

To preprocess sequence, time series, or text data, build your own mini-batch datastore using the framework described here. For an example showing how to use a custom mini-batch datastore, see Train Network Using Custom Mini-Batch Datastore for Sequence Data.

Overview

Build your custom datastore interface using the custom datastore classes and objects. Then, use the custom datastore to bring your data into MATLAB®.

Designing your custom mini-batch datastore involves inheriting from the matlab.io.Datastore and matlab.io.datastore.MiniBatchable classes, and implementing the required properties and methods. You can optionally add support for shuffling during training.

Processing Needs	Classes
Mini-batch datastore for training, validation, test, and prediction data sets in Deep Learning Toolbox	`matlab.io.Datastore` and `matlab.io.datastore.MiniBatchable` See Implement MiniBatchable Datastore.
Mini-batch datastore with support for shuffling during training	`matlab.io.Datastore`, `matlab.io.datastore.MiniBatchable`, and `matlab.io.datastore.Shuffleable` See Add Support for Shuffling.

Implement `MiniBatchable` Datastore

To implement a custom mini-batch datastore named MyDatastore, create a class definition file MyDatastore.m. The file must be on the MATLAB path and must contain code that inherits from the appropriate classes and defines the required properties and methods. The code for creating a mini-batch datastore for training, validation, test, and prediction data sets in Deep Learning Toolbox must:

Inherit from the classes matlab.io.Datastore and matlab.io.datastore.MiniBatchable.
Define these properties: MiniBatchSize and NumObservations.
Define these methods: hasdata, read, reset, and progress.

In addition to these steps, you can define any other properties or methods that you need to process and analyze your data.

Note

If you are training a network and trainingOptions specifies 'Shuffle' as 'once' or 'every-epoch', then you must also inherit from the matlab.io.datastore.Shuffleable class. For more information, see Add Support for Shuffling.

The datastore read function must return data in a table. The table elements must be scalars, row vectors, or 1-by-1 cell arrays containing a numeric array.

For networks with a single input layer, the first and second columns specify the predictors and responses, respectively.

Tip

To train a network with multiple input layers or multiple outputs, use the combine and transform functions to create a datastore that outputs a cell array with (numInputs + numOutputs) columns, where numInputs is the number of network inputs and numOutputs is the number of network outputs. The first numInputs columns specify the predictors for each input, and the last numOutputs columns specify the responses. The InputNames and OutputNames properties of the neural network determine the order of the inputs and outputs, respectively.

The format of the predictors depends on the type of data.

Data	Format of Predictors
2-D image	h-by-w-by-c numeric array, where h, w, and c are the height, width, and number of channels of the image, respectively.
3-D image	h-by-w-by-d-by-c numeric array, where h, w, d, and c are the height, width, depth, and number of channels of the image, respectively.
Vector sequence	s-by-c matrix, where s is the sequence length and c is the number of features of the sequence.
1-D image sequence	h-by-c-by-s array, where h and c correspond to the height and number of channels of the image, respectively, and s is the sequence length. Each sequence in the mini-batch must have the same sequence length.
2-D image sequence	h-by-w-by-c-by-s array, where h, w, and c correspond to the height, width, and number of channels of the image, respectively, and s is the sequence length. Each sequence in the mini-batch must have the same sequence length.
3-D image sequence	h-by-w-by-d-by-c-by-s array, where h, w, d, and c correspond to the height, width, depth, and number of channels of the image, respectively, and s is the sequence length. Each sequence in the mini-batch must have the same sequence length.
Features	c-by-1 column vector, where c is the number of features.

The table elements must contain a numeric scalar, a numeric row vector, or a 1-by-1 cell array containing a numeric array.

The format of the responses depends on the type of task.

Task	Format of Responses
Classification	Categorical scalar
Regression	Scalar, numeric vector, or 3-D numeric array representing an image
Sequence-to-sequence classification	1-by-s sequence of categorical labels, where s is the sequence length of the corresponding predictor sequence.
Sequence-to-sequence regression	R-by-s matrix, where R is the number of responses and s is the sequence length of the corresponding predictor sequence.

The table elements must contain a categorical scalar, a numeric scalar, a numeric row vector, or a 1-by-1 cell array containing a numeric array.
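
As an illustration of the table format described above, the following minimal sketch builds by hand a table of the kind that read is expected to return for a sequence classification task. The sequence sizes, label names, and variable names are hypothetical and chosen only for this example.

```matlab
% Three hypothetical sequences with 4 features each and different lengths.
% Each predictor is stored in a 1-by-1 cell containing a numeric array.
predictors = {rand(4,10); rand(4,12); rand(4,7)};

% One categorical label per observation (classification task).
responses = categorical({'walk'; 'run'; 'walk'});

% Predictors go in the first column and responses in the second column.
data = table(predictors,responses);
```

Because the predictors variable is a cell array, each table element in the first column is a 1-by-1 cell, which matches the requirement for predictors that are not scalars or row vectors.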

This example shows how to create a custom mini-batch datastore for processing sequence data. Save the class definition in a file called MySequenceDatastore.m.

Begin defining your class. Inherit from the base class matlab.io.Datastore and the matlab.io.datastore.MiniBatchable class.
Define properties.
- Redefine the MiniBatchSize and NumObservations properties. You optionally can assign additional property attributes to either property. For more information, see Property Attributes.
- You can also define properties unique to your custom mini-batch datastore.
Define methods.
- Implement the custom mini-batch datastore constructor.
- Implement the hasdata method.
- Implement the read method, which must return data as a table with the predictors in the first column and responses in the second column.
  For sequence data, the sequences must be matrices of size c-by-s, where c is the number of features and s is sequence length. The value of s can vary between mini-batches.
- Implement the reset method.
- Implement the progress method.
- You can also define methods unique to your custom mini-batch datastore.
End the classdef section.

classdef MySequenceDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.MiniBatchable
    
    properties
        Datastore
        Labels
        NumClasses
        SequenceDimension
        MiniBatchSize
    end
    
    properties(SetAccess = protected)
        NumObservations
    end

    properties(Access = private)
        % Index of the next file to read from the underlying datastore
        CurrentFileIndex
    end


    methods
        
        function ds = MySequenceDatastore(folder)
            % Construct a MySequenceDatastore object

            % Create a file datastore. The readSequence function is
            % defined following the class definition.
            fds = fileDatastore(folder, ...
                'ReadFcn',@readSequence, ...
                'IncludeSubfolders',true);
            ds.Datastore = fds;

            % Read labels from folder names
            numObservations = numel(fds.Files);
            for i = 1:numObservations
                file = fds.Files{i};
                filepath = fileparts(file);
                [~,label] = fileparts(filepath);
                labels{i,1} = label;
            end
            ds.Labels = categorical(labels);
            ds.NumClasses = numel(unique(labels));
            
            % Determine sequence dimension. When you define the LSTM
            % network architecture, you can use this property to
            % specify the input size of the sequenceInputLayer.
            X = preview(fds);
            ds.SequenceDimension = size(X,1);
            
            % Initialize datastore properties.
            ds.MiniBatchSize = 128;
            ds.NumObservations = numObservations;
            ds.CurrentFileIndex = 1;
        end

        function tf = hasdata(ds)
            % Return true if more data is available
            tf = ds.CurrentFileIndex + ds.MiniBatchSize - 1 ...
                <= ds.NumObservations;
        end

        function [data,info] = read(ds)            
            % Read one mini-batch of data
            miniBatchSize = ds.MiniBatchSize;
            info = struct;
            
            for i = 1:miniBatchSize
                predictors{i,1} = read(ds.Datastore);
                responses(i,1) = ds.Labels(ds.CurrentFileIndex);
                ds.CurrentFileIndex = ds.CurrentFileIndex + 1;
            end
            
            data = preprocessData(ds,predictors,responses);
        end

        function data = preprocessData(ds,predictors,responses)
            % data = preprocessData(ds,predictors,responses) preprocesses
            % the data in predictors and responses and returns the table
            % data
            
            miniBatchSize = ds.MiniBatchSize;
            
            % Pad data to length of longest sequence.
            sequenceLengths = cellfun(@(X) size(X,2),predictors);
            maxSequenceLength = max(sequenceLengths);
            for i = 1:miniBatchSize
                X = predictors{i};
                
                % Pad sequence with zeros.
                if size(X,2) < maxSequenceLength
                    X(:,maxSequenceLength) = 0;
                end
                
                predictors{i} = X;
            end
            
            % Return data as a table.
            data = table(predictors,responses);
        end

        function reset(ds)
            % Reset to the start of the data
            reset(ds.Datastore);
            ds.CurrentFileIndex = 1;
        end
        
    end 

    methods (Hidden = true)

        function frac = progress(ds)
            % Determine fraction of data read from datastore (0 to 1)
            frac = (ds.CurrentFileIndex - 1) / ds.NumObservations;
        end

    end

end % end class definition

The implementation of the read method of your custom datastore uses a function called readSequence. You must create this function to read sequence data from a MAT-file.

function data = readSequence(filename)
% data = readSequence(filename) reads the sequence X from the MAT-file
% filename

S = load(filename);
data = S.X;
end
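
The following sketch shows one way the class might be exercised once MySequenceDatastore.m and readSequence.m are on the MATLAB path. It assumes the folder layout that the constructor expects: one subfolder per class, with each MAT-file holding a variable X of size c-by-s. The temporary folder, class names, file names, and data sizes are hypothetical and exist only for this illustration.

```matlab
% Build a small hypothetical data set: one subfolder per class, each MAT-file
% holding a variable X with 3 features and a varying sequence length.
rootFolder = tempname;
classNames = ["classA","classB"];
for k = 1:numel(classNames)
    classFolder = fullfile(rootFolder,classNames(k));
    mkdir(classFolder);
    for n = 1:4
        X = rand(3,5 + n);
        save(fullfile(classFolder,"seq" + n + ".mat"),"X");
    end
end

% Construct the datastore. The constructor sets MiniBatchSize to 128, so
% choose a smaller value that fits this 8-file data set.
ds = MySequenceDatastore(rootFolder);
ds.MiniBatchSize = 4;

% Read mini-batches until fewer than MiniBatchSize observations remain.
numBatches = 0;
while hasdata(ds)
    data = read(ds);    % table with predictors and responses columns
    numBatches = numBatches + 1;
end
```

Note that hasdata, as implemented above, returns true only while a full mini-batch remains, so any trailing observations that do not fill a mini-batch are not read.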

Add Support for Shuffling

To add support for shuffling, first follow the instructions in Implement MiniBatchable Datastore and then update your implementation code in MySequenceDatastore.m to:

Inherit from an additional class matlab.io.datastore.Shuffleable.
Define the additional method shuffle.

This example code adds shuffling support to the MySequenceDatastore class. Vertical ellipses indicate where you should copy code from the MySequenceDatastore implementation.

Update the class definition to also inherit from the matlab.io.datastore.Shuffleable class.
Add the definition for shuffle to the existing methods section.

classdef MySequenceDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.MiniBatchable & ...
                       matlab.io.datastore.Shuffleable
   
   % previously defined properties 
   .
   .
   . 


   methods

        % previously defined methods
        .
        .
        . 
   
        function dsNew = shuffle(ds)
            % dsNew = shuffle(ds) shuffles the files and the
            % corresponding labels in the datastore.
            
            % Create a copy of datastore
            dsNew = copy(ds);
            dsNew.Datastore = copy(ds.Datastore);
            fds = dsNew.Datastore;
            
            % Shuffle files and corresponding labels
            numObservations = dsNew.NumObservations;
            idx = randperm(numObservations);
            fds.Files = fds.Files(idx);
            dsNew.Labels = dsNew.Labels(idx);
        end

     end

end
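
As a rough check of the shuffle method, the following sketch confirms that each file in the shuffled datastore still lines up with its label. It assumes that ds is an existing MySequenceDatastore object with shuffling support, created from a folder that contains one subfolder per class, and that the folder names contain no periods. The variable names are hypothetical.

```matlab
% Assumes ds is a MySequenceDatastore object that already exists.
dsShuffled = shuffle(ds);

% Recover the parent folder name of each file in the shuffled datastore.
files = dsShuffled.Datastore.Files;
[~,folderNames] = cellfun(@(f) fileparts(fileparts(f)),files, ...
    'UniformOutput',false);

% Each label should still equal the name of the folder that holds its file.
labelsMatch = all(string(folderNames) == string(dsShuffled.Labels));

% Shuffling returns a copy, so the number of observations is unchanged.
sameCount = dsShuffled.NumObservations == ds.NumObservations;
```

If labelsMatch is false, the files and labels were permuted with different indices, which would silently corrupt training.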

Validate Custom Mini-Batch Datastore

If you have followed all the instructions presented here, then the implementation of your custom mini-batch datastore is complete. Before using this datastore, qualify it using the guidelines presented in Testing Guidelines for Custom Datastores.
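
Before working through the full testing guidelines, a few quick sanity checks can catch common mistakes. The function below is a hypothetical helper, not part of any toolbox and not a substitute for the guidelines. It assumes only the methods and properties that this topic requires: hasdata, read, reset, progress, MiniBatchSize, and NumObservations.

```matlab
function checkMiniBatchDatastore(ds)
% checkMiniBatchDatastore(ds) runs basic sanity checks on a custom
% mini-batch datastore. Hypothetical helper for illustration only.

% After a reset, no data should have been consumed.
reset(ds);
assert(progress(ds) == 0,'progress must be 0 after reset.')

numRead = 0;
while hasdata(ds)
    data = read(ds);

    % read must return a table with predictors and responses columns.
    assert(istable(data) && width(data) == 2, ...
        'read must return a two-column table.')
    assert(height(data) <= ds.MiniBatchSize, ...
        'A mini-batch must not exceed MiniBatchSize observations.')

    numRead = numRead + height(data);
    frac = progress(ds);
    assert(frac >= 0 && frac <= 1,'progress must stay between 0 and 1.')
end

% The datastore must not return more data than it reports holding.
assert(numRead <= ds.NumObservations, ...
    'read returned more observations than NumObservations.')

% A second reset must rewind the datastore to the start.
reset(ds);
assert(progress(ds) == 0,'reset must rewind the datastore.')
end
```

Save the function in a file called checkMiniBatchDatastore.m and call it with a datastore object, for example checkMiniBatchDatastore(ds). The function returns silently if every check passes and throws an error at the first check that fails.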

Related Examples

Train Network Using Custom Mini-Batch Datastore for Sequence Data