matlab.io.datastore.Subsettable Class

Namespace: matlab.io.datastore

Add subset and fine-grained parallelization support to datastore

Since R2022b

Description

matlab.io.datastore.Subsettable is an abstract mixin class that adds subset and fine-grained parallelization support to your custom datastore for use with Parallel Computing Toolbox™ and MATLAB® Parallel Server™. The class provides the subset method for creating fine-grained subsets, the partition method for creating coarse-grained partitions, and the shuffle method for randomizing the order of the data.

Use matlab.io.datastore.Subsettable only if every read of your datastore can be accessed independently of the others, because this independence is what makes the finer granularity possible. If reads are not independent, such as in TabularTextDatastore workflows, then matlab.io.datastore.Partitionable is more appropriate.
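
As a rough illustration of the three operations, the following sketch uses the built-in arrayDatastore as a stand-in for a custom datastore whose reads are independent. The variable names are illustrative. The sketch makes no assumption about how partition distributes the rows or how shuffle orders them.

```matlab
% Stand-in datastore: each row of the array is one independent read.
ds = arrayDatastore((1:10)', OutputType="same");

% Fine-grained subset: select individual reads by index.
subds = subset(ds, [2 4 6]);
subsetData = readall(subds);

% Coarse-grained partitions: split the datastore into two parts.
part1 = partition(ds, 2, 1);
part2 = partition(ds, 2, 2);
partitionData = [readall(part1); readall(part2)];

% Randomization: the same reads, returned in a random order.
shuffledData = readall(shuffle(ds));
```

Taken together, the two partitions and the shuffled datastore each contain all ten rows. Only the subset contains a reduced selection.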

To use this mixin class, inherit from the matlab.io.datastore.Subsettable class, in addition to inheriting from the matlab.io.Datastore base class. Type this syntax as the first line of your class definition file:

classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.Subsettable
    ...
end

To add support for parallel processing to your custom datastore, you must:

  • Inherit from the class matlab.io.datastore.Subsettable in addition to matlab.io.Datastore.

  • Define the method maxpartitions.

  • Define the method subsetByReadIndices. The subset method provided by Subsettable calls your implementation of subsetByReadIndices.

For more details and steps to create your custom datastore with parallel processing support, see Develop Custom Datastore.
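
Once a datastore supports partitioning, a common pattern is to give each parfor iteration its own partition. The following is a minimal sketch of that pattern, again using the built-in arrayDatastore as a stand-in for a custom datastore. The fixed partition count of 4 and the variable names are assumptions made for illustration.

```matlab
% Stand-in datastore holding the integers 1 to 100, one per read.
ds = arrayDatastore((1:100)', OutputType="same");

numParts = 4;   % assumed partition count, chosen for illustration
total = 0;

% Each iteration processes its own partition. Without a parallel pool,
% parfor runs the iterations serially and gives the same result.
parfor ii = 1:numParts
    subds = partition(ds, numParts, ii);
    partial = 0;
    while hasdata(subds)
        partial = partial + sum(read(subds));
    end
    total = total + partial;   % reduction across iterations
end
```

Because the partitions together cover every read exactly once, total does not depend on how the rows are split across partitions.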

Class Attributes

Sealed: false

For information on class attributes, see Class Attributes.

Methods

Examples

Build a datastore with subset processing support and use it to bring your data into MATLAB®.

Create a class definition file that contains the code implementing your datastore. Save this file in your working folder or in a folder that is on the MATLAB path. The name of the .m file must be the same as the name of your object constructor function. In this example, create the MyHDF5Datastore class in a file named MyHDF5Datastore.m. The .m class definition contains the following steps:

  • Step 1: Inherit from the matlab.io.Datastore and matlab.io.datastore.Subsettable classes.

  • Step 2: Define the constructor as well as the subsetByReadIndices and maxpartitions methods.

  • Step 3: Define your custom file-reading function. Here, the MyHDF5Datastore class file defines and uses the local function listHDF5Datasets, which lists the datasets under a location in the HDF5 file.

%% STEP 1
classdef MyHDF5Datastore < matlab.io.Datastore ...
                       & matlab.io.datastore.Subsettable

    properties
        Filename            (1, 1) string
        Datasets            (:, 1) string {mustBeNonmissing} = "/"
        CurrentDatasetIndex (1, 1) double {mustBeInteger, mustBePositive} = 1
    end

%% STEP 2
    methods
        function ds = MyHDF5Datastore(Filename, Location)
            arguments
                Filename (1, 1) string
                Location (1, 1) string {mustBeNonmissing} = "/"
            end

            ds.Filename = Filename;
            ds.Datasets = listHDF5Datasets(ds.Filename, Location);
        end

        function [data, info] = read(ds, varargin)
            if ~hasdata(ds)
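                % error expects an identifier of the form "component:key";
                % the identifier used here is illustrative.
                error("MyHDF5Datastore:noMoreData", "No more datasets to read.");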
            end

            dataset = ds.Datasets(ds.CurrentDatasetIndex);
            data = { h5read(ds.Filename, dataset, varargin{:}) };
            if nargout > 1
                info = h5info(ds.Filename, dataset);
            end

            ds.CurrentDatasetIndex = ds.CurrentDatasetIndex + 1;
        end

        function tf = hasdata(ds)
            tf = ds.CurrentDatasetIndex <= numel(ds.Datasets);
        end

        function reset(ds)
            ds.CurrentDatasetIndex = 1;
        end
    end

    methods (Access = protected)
        function subds = subsetByReadIndices(ds, indices)
            datasets = ds.Datasets(indices);

            subds = copy(ds);
            subds.Datasets = datasets;
            reset(subds);
        end

        function n = maxpartitions(ds)
            n = numel(ds.Datasets);
        end
    end
end

%% STEP 3
function datasets = listHDF5Datasets(filename, location, args)
    arguments
        filename (1, 1) string
        location (1, 1) string
        args.IncludeSubGroups (1, 1) logical = true
    end

    if strlength(location) == 0
        location = "/";
    end

    info = h5info(filename, location);

    datasets = listDatasetsInH5infoStruct(info, location, IncludeSubGroups=args.IncludeSubGroups);
end

function datasets = listDatasetsInH5infoStruct(S, location, args)
    arguments
        S (1, 1) struct
        location (1, 1) string
        args.IncludeSubGroups (1, 1) logical = true
    end

    datasets = string.empty(0, 1);

    if isfield(S, "Datatype")
        datasets = location;
    elseif isfield(S, "Datasets")
        if ~isempty(S.Datasets)
            datasets = location + "/" + {S.Datasets.Name}';
        end

        if args.IncludeSubGroups
            listFcn = @(group) listDatasetsInH5infoStruct(group, group.Name, IncludeSubGroups=true);
        else
            listFcn = @(group) string(group.Name);
        end

        childDatasets = arrayfun(listFcn, S.Groups, UniformOutput=false);
        childDatasets = vertcat(childDatasets{:});

        datasets = [datasets; childDatasets];
    end

end

Create a subset of datasets from a specific group of an HDF5 file.

First, create a datastore from all datasets under the /g4 group of the HDF5 file. Use the MyHDF5Datastore.m class definition file from the Build Datastore with Subset Support example.

g4ds = MyHDF5Datastore("example.h5","/g4");
data = readall(g4ds)
data=4×1 cell array
    {19x1  double}
    {36x1  double}
    {10x1  double}
    {36x19 double}

Select specific datasets from the g4ds datastore using the subset function.

subds = subset(g4ds,[2 4]);
data = readall(subds)
data=2×1 cell array
    {36x1  double}
    {36x19 double}

Tips

  • For your custom datastore implementation, a best practice is not to implement the numpartitions method.

Version History

Introduced in R2022b