# matlab.tall.reduce

Reduce arrays by applying reduction algorithm to blocks of data

## Syntax

``tA = matlab.tall.reduce(fcn,reducefcn,tX)``
``tA = matlab.tall.reduce(fcn,reducefcn,tX,tY,...)``
``[tA,tB,...] = matlab.tall.reduce(fcn,reducefcn,tX,tY,...)``
``[tA,tB,...] = matlab.tall.reduce(___,'OutputsLike',{PA,PB,...})``

## Description

`tA = matlab.tall.reduce(fcn,reducefcn,tX)` applies the function `fcn` to each block of array `tX` to generate partial results. Then the function applies `reducefcn` to the vertical concatenation of partial results repeatedly until it has one final result, `tA`.
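The two stages can be pictured with a small in-memory sketch. This is only a conceptual model, not how `matlab.tall.reduce` is implemented: the block size of 4 and the names `blockSize`, `partial`, and `result` are assumptions chosen for illustration, and the real function chooses block boundaries itself and may apply `reducefcn` more than once.

```matlab
% Conceptual model of the transform and reduce stages on an in-memory vector.
X = (1:10)';            % stand-in for the tall array tX
blockSize = 4;          % hypothetical block size; real blocks are chosen by MATLAB
fcn = @numel;           % transform function applied to each block
reducefcn = @sum;       % reduction function applied to the partial results

partial = zeros(0,1);   % vertical concatenation of the per-block results
for n = 1:blockSize:size(X,1)
    m = min(n + blockSize - 1, size(X,1));
    partial = [partial; fcn(X(n:m,:))]; %#ok<AGROW>
end

result = reducefcn(partial);   % single final result
```

Here the blocks hold 4, 4, and 2 rows, so `partial` is `[4; 4; 2]` and `result` is `10`, the total number of elements.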

`tA = matlab.tall.reduce(fcn,reducefcn,tX,tY,...)` specifies several arrays `tX,tY,...` that are inputs to `fcn`. The same rows of each array are operated on by `fcn`; for example, `fcn(tX(n:m,:),tY(n:m,:))`. Inputs with a height of one are passed to every call of `fcn`. With this syntax, `fcn` must return one output, and `reducefcn` must accept one input and return one output.
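A minimal sketch of the multiple-input syntax, assuming two small tall vectors built from in-memory data and a scalar with a height of one. The variable names `scale` and `total` and the choice of functions are illustrative only.

```matlab
% Two tall column vectors with the same height, plus a scalar (height one).
tX = tall((1:5)');
tY = tall((6:10)');
scale = 2;              % hypothetical scale factor, passed whole to every call of fcn

fcn = @(x,y,c) sum(c .* (x + y), 1);   % one partial result per block
reducefcn = @(p) sum(p, 1);            % add the partial results together

tA = matlab.tall.reduce(fcn, reducefcn, tX, tY, scale);
total = gather(tA);     % 2*(15 + 40) = 110
```

Both functions sum along the first dimension explicitly, so they also return a valid result for a block with a height of 0.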

`[tA,tB,...] = matlab.tall.reduce(fcn,reducefcn,tX,tY,...)`, where `fcn` and `reducefcn` are functions that return multiple outputs, returns arrays `tA,tB,...`, each corresponding to one of the output arguments of `fcn` and `reducefcn`. This syntax has these requirements:

• `fcn` must return the same number of outputs as were requested from `matlab.tall.reduce`.

• `reducefcn` must have the same number of inputs and outputs as the number of outputs requested from `matlab.tall.reduce`.

• Each output of `fcn` and `reducefcn` must be the same type as the first input `tX`.

• Corresponding outputs of `fcn` and `reducefcn` must have the same height.
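As a sketch of the multiple-output syntax, the script below finds the minimum and maximum of a tall vector in one call. The function names `blockMinMax` and `reduceMinMax` are hypothetical, and the listing assumes it is saved as a single script with the local functions at the end.

```matlab
% Request two outputs, so fcn returns two outputs and reducefcn
% accepts two inputs and returns two outputs.
tX = tall([3; 9; 1; 7]);
[tLo, tHi] = matlab.tall.reduce(@blockMinMax, @reduceMinMax, tX);
[lo, hi] = gather(tLo, tHi);   % lo is 1 and hi is 9

function [a, b] = blockMinMax(x)
% Transform stage: minimum and maximum of one block.
a = min(x, [], 1);
b = max(x, [], 1);
end

function [a, b] = reduceMinMax(a, b)
% Reduction stage: combine the stacked per-block minima and maxima.
a = min(a, [], 1);
b = max(b, [], 1);
end
```

Both outputs are `double`, like `tX`, and each pair of outputs has the same height, which is what the requirements above ask for.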

`[tA,tB,...] = matlab.tall.reduce(___,'OutputsLike',{PA,PB,...})` specifies that the outputs `tA,tB,...` have the same data types as the prototype arrays `PA,PB,...`, respectively. You can use any of the input argument combinations in previous syntaxes.
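A short sketch of `'OutputsLike'`, assuming a `double` tall vector and a reduction whose result is `logical`. The `threshold` value and the variable names are made up for illustration.

```matlab
% Check whether any element exceeds a threshold. The input is double but
% the result is logical, so a logical prototype is passed to 'OutputsLike'.
tX = tall([12; 47; 5; 30]);
threshold = 100;                      % hypothetical threshold

fcn = @(x) any(x > threshold, 1);     % per-block logical result
reducefcn = @(p) any(p, 1);           % combine the per-block results

tF = matlab.tall.reduce(fcn, reducefcn, tX, 'OutputsLike', {true});
anyAbove = gather(tF);                % logical false for this data
```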

## Examples

Create a tall table, extract a tall vector from the table, and then find the total number of elements in the vector.

Create a tall table for the `airlinesmall.csv` data set. The data contains information about arrival and departure times of US flights. Extract the `ArrDelay` variable, which is a vector of arrival delays.

```
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay' 'DepDelay'};
tt = tall(ds);
tX = tt.ArrDelay;
```

Use `matlab.tall.reduce` to count the total number of non-`NaN` elements in the tall vector. The first function `numel` counts the number of elements in each block of data, and the second function `sum` adds together all of the counts for each block to produce a scalar result.

`s = matlab.tall.reduce(@numel,@sum,tX)`
```
s =

  MxNx... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :
```

Gather the result into memory.

`s = gather(s)`
```
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.5 sec
Evaluation completed in 1.8 sec
```
```
s = 123523
```

Create a tall table, extract two tall vectors from the table, and then calculate the mean value of each vector.

Create a tall table for the `airlinesmall.csv` data set. The data contains information about arrival and departure times of US flights. Extract the `ArrDelay` and `DepDelay` variables, which are vectors of arrival and departure delays.

```
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay' 'DepDelay'};
tt = tall(ds);
tt = rmmissing(tt);
tX = tt.ArrDelay;
tY = tt.DepDelay;
```

In the first stage of the algorithm, calculate the sum and element count for each block of data in the vectors. To do this you can write a function that accepts two inputs and returns one output with the sum and count for each input. This function is listed as a local function at the end of the example.

```
function bx = sumcount(tx,ty)
bx = [sum(tx) numel(tx) sum(ty) numel(ty)];
end
```

In the reduction stage of the algorithm, you need to add together all of the intermediate sums and counts. Thus, `matlab.tall.reduce` returns the overall sum of elements and number of elements for each input vector, and calculating the mean is then a simple division. For this step you can apply the `sum` function to the first dimension of the 1-by-4 vector outputs from the first stage.

```
reducefcn = @(x) sum(x,1);
s = matlab.tall.reduce(@sumcount,reducefcn,tX,tY)
```
```
s =

  MxNx... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :
```

`s = gather(s)`
```
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 2.4 sec
Evaluation completed in 2.9 sec
```
```
s = 1×4

      860584      120866      982764      120866
```

The first two elements of `s` are the sum and count for `tX`, and the last two elements are the sum and count for `tY`. Dividing each sum by its count yields the mean values, which you can compare to the answer returned by the `mean` function.

`my_mean = [s(1)/s(2) s(3)/s(4)]`
```
my_mean = 1×2

    7.1201    8.1310
```

`m = gather(mean([tX tY]))`
```
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.64 sec
Evaluation completed in 0.86 sec
```
```
m = 1×2

    7.1201    8.1310
```

Local Functions

Listed here is the `sumcount` function that `matlab.tall.reduce` calls to calculate the intermediate sums and element counts.

```
function bx = sumcount(tx,ty)
bx = [sum(tx) numel(tx) sum(ty) numel(ty)];
end
```

Create a tall table, then calculate the mean flight delay for each year in the data.

Create a tall table for the `airlinesmall.csv` data set. The data contains information about arrival and departure times of US flights. Remove rows of missing data from the table and extract the `ArrDelay`, `DepDelay`, and `Year` variables. These variables are vectors of arrival and departure delays and of the associated years for each flight in the data set.

```
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay' 'DepDelay' 'Year'};
tt = tall(ds);
tt = rmmissing(tt);
```

Use `matlab.tall.reduce` to apply two functions to the tall table. The first function combines the `ArrDelay` and `DepDelay` variables to find the total mean delay for each flight. The function determines how many unique years are in each chunk of data, and then cycles through each year and calculates the average total delay for flights in that year. The result is a two-variable table containing the year and mean total delay. This intermediate data needs to be reduced further to arrive at the mean delay per year. Save this function in your current folder as `transform_fcn.m`.

`type transform_fcn`
```
function t = transform_fcn(a,b,c)
ii = gather(unique(c));

for k = 1:length(ii)
    jj = (c == ii(k));
    d = mean([a(jj) b(jj)], 2);

    if k == 1
        t = table(c(jj),d,'VariableNames',{'Year' 'MeanDelay'});
    else
        t = [t; table(c(jj),d,'VariableNames',{'Year' 'MeanDelay'})];
    end
end

end
```

The second function uses the results from the first function to calculate the mean total delay for each year. The output from `reduce_fcn` is compatible with the output from `transform_fcn`, so that blocks of data can be concatenated in any order and continually reduced until only one row remains for each year.

`type reduce_fcn`
```
function TT = reduce_fcn(t)
[groups,Y] = findgroups(t.Year);
D = splitapply(@mean, t.MeanDelay, groups);

TT = table(Y,D,'VariableNames',{'Year' 'MeanDelay'});
end
```

Apply the transform and reduce functions to the tall vectors. Since the inputs (type `double`) and outputs (type `table`) have different data types, use the `'OutputsLike'` name-value pair to specify that the output is a table. A simple way to specify the type of the output is to call the transform function with dummy inputs.

```
a = tt.ArrDelay;
b = tt.DepDelay;
c = tt.Year;

d1 = matlab.tall.reduce(@transform_fcn, @reduce_fcn, a, b, c, 'OutputsLike',{transform_fcn(0,0,0)})
```
```
d1 =

  Mx2 tall table

    Year    MeanDelay
    ____    _________

     ?          ?
     ?          ?
     ?          ?
     :          :
     :          :
```

Gather the results into memory to see the mean total flight delay per year.

`d1 = gather(d1)`
```
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 2.5 sec
Evaluation completed in 2.9 sec
```
```
d1=22×2 table
    Year    MeanDelay
    ____    _________

    1987     7.6889
    1988     6.7918
    1989     8.0757
    1990     7.1548
    1991     4.0134
    1992     5.1767
    1993     5.4941
    1994     6.0303
    1995     8.4284
    1996     9.6981
    1997     8.4346
    1998     8.3789
    1999     8.9121
    2000     10.595
    2001     6.8975
    2002     3.4325
      ⋮
```

Alternative Approach

Another way to calculate the same statistics by group is to use `splitapply` to call `matlab.tall.reduce` (rather than using `matlab.tall.reduce` to call `splitapply`).

Using this approach, you call `findgroups` and `splitapply` directly on the data. The function `mySplitFcn` that operates on each group of data includes a call to `matlab.tall.reduce`. The transform and reduce functions employed by `matlab.tall.reduce` do not need to group the data, so those functions just perform calculations on the pregrouped data that `splitapply` passes to them.

`type mySplitFcn`
```
function T = mySplitFcn(a,b,c)
T = matlab.tall.reduce(@non_group_transform_fcn, @non_group_reduce_fcn, ...
    a, b, c, 'OutputsLike', {non_group_transform_fcn(0,0,0)});

    function t = non_group_transform_fcn(a,b,c)
        d = mean([a b], 2);
        t = table(c,d,'VariableNames',{'Year' 'MeanDelay'});
    end

    function TT = non_group_reduce_fcn(t)
        D = mean(t.MeanDelay);
        TT = table(t.Year(1),D,'VariableNames',{'Year' 'MeanDelay'});
    end

end
```

Call `findgroups` and `splitapply` to operate on the data and apply `mySplitFcn` to each group of data.

```
groups = findgroups(c);
d2 = splitapply(@mySplitFcn, a, b, c, groups);
d2 = gather(d2)
```
```
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.91 sec
- Pass 2 of 2: Completed in 2.2 sec
Evaluation completed in 4.1 sec
```
```
d2=22×2 table
    Year    MeanDelay
    ____    _________

    1987     7.6889
    1988     6.7918
    1989     8.0757
    1990     7.1548
    1991     4.0134
    1992     5.1767
    1993     5.4941
    1994     6.0303
    1995     8.4284
    1996     9.6981
    1997     8.4346
    1998     8.3789
    1999     8.9121
    2000     10.595
    2001     6.8975
    2002     3.4325
      ⋮
```

Calculate weighted standard deviation and variance of a tall array using a vector of weights. This is one example of how you can use `matlab.tall.reduce` to work around functionality that tall arrays do not support yet.

Create two tall vectors of random data. `tX` contains random data, and `tP` contains corresponding probabilities such that `sum(tP)` is `1`. These probabilities are suitable to weight the data.

```
rng default
tX = tall(rand(1e4,1));
p = rand(1e4,1);
tP = tall(normalize(p,'scale',sum(p)));
```

Write an identity function that returns outputs equal to the inputs. This approach skips the transform step of `matlab.tall.reduce` and passes the data directly to the reduction step, where the reduction function is repeatedly applied to reduce the size of the data.

`type identityTransform.m`
```
function [A,B] = identityTransform(X,Y)
A = X;
B = Y;
end
```

Next, write a reduction function that operates on blocks of the tall vectors to calculate the weighted variance and standard deviation.

`type weightedStats.m`
```
function [wvar, wstd] = weightedStats(X, P)
wvar = var(X,P);
wstd = std(X,P);
end
```

Use `matlab.tall.reduce` to apply these functions to the blocks of data in the tall vectors.

`[tX_var_weighted, tX_std_weighted] = matlab.tall.reduce(@identityTransform, @weightedStats, tX, tP)`
```
tX_var_weighted =

  MxNx... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :

tX_std_weighted =

  MxNx... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :
```

## Input Arguments

### `fcn`

Transform function to apply, specified as a function handle or anonymous function. Each output of `fcn` must be the same type as the first input `tX`. You can use the `'OutputsLike'` option to return outputs of different data types. If `fcn` returns more than one output, then the outputs must all have the same height.

The general functional signature of `fcn` is

`[a, b, c, ...] = fcn(x, y, z, ...)`

`fcn` must satisfy these requirements:

1. Input Arguments — The inputs `[x, y, z, ...]` are blocks of data that fit in memory. The blocks are produced by extracting data from the respective tall array inputs `[tX, tY, tZ, ...]`. The inputs `[x, y, z, ...]` satisfy these properties:

• All of `[x, y, z, ...]` have the same size in the first dimension after any allowed expansion.

• The blocks of data in `[x, y, z, ...]` come from the same index in the tall dimension, assuming the tall array is nonsingleton in the tall dimension. For example, if `tX` and `tY` are nonsingleton in the tall dimension, then the first set of blocks might be `x = tX(1:20000,:)` and `y = tY(1:20000,:)`.

• If the first dimension of any of `[tX, tY, tZ, ...]` has a size of `1`, then the corresponding block `[x, y, z, ...]` consists of all the data in that tall array.

2. Output Arguments — The outputs `[a, b, c, ...]` are blocks that fit in memory, to be sent to the respective outputs `[tA, tB, tC, ...]`. The outputs `[a, b, c, ...]` satisfy these properties:

• All of `[a, b, c, ...]` must have the same size in the first dimension.

• All of `[a, b, c, ...]` are vertically concatenated with the respective results of previous calls to `fcn`.

• All of `[a, b, c, ...]` are sent to the same index in the first dimension in their respective destination output arrays.

3. Functional Rules — `fcn` must satisfy the functional rule:

• `F([inputs1; inputs2]) == [F(inputs1); F(inputs2)]`: Applying the function to the concatenation of the inputs should be the same as applying the function to the inputs separately and then concatenating the results.

4. Empty Inputs — Ensure that `fcn` can handle an input that has a height of 0. Empty inputs can occur when a file is empty or if you have done a lot of filtering on the data.
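One way to check the empty-input requirement before running a long calculation is to call a candidate function on a block with a height of 0. The sketch below uses two hypothetical candidates, `robustFcn` and `fragileFcn`, which are not part of `matlab.tall.reduce`.

```matlab
% Call candidate transform functions on a block with a height of 0.
emptyBlock = zeros(0,1);

robustFcn  = @(x) sum(x, 1);   % summing along the first dimension returns 0
fragileFcn = @(x) x(1);        % indexing the first element fails when x is empty

robustOut = robustFcn(emptyBlock);

fragileFailed = false;
try
    fragileFcn(emptyBlock);
catch
    fragileFailed = true;      % the fragile candidate cannot handle empty blocks
end
```

Specifying the dimension explicitly, as in `sum(x,1)`, keeps the behavior predictable for empty blocks and for blocks that contain a single row.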

For example, this function accepts two input arrays, squares them, and returns two output arrays:

```
function [xx,yy] = sqInputs(x,y)
xx = x.^2;
yy = y.^2;
end
```

After you save this function to an accessible folder, you can invoke the function to square `tX` and `tY` and find the maximum value with this command:

`tA = matlab.tall.reduce(@sqInputs, @max, tX, tY)`

Example: `tC = matlab.tall.reduce(@numel,@sum,tX,tY)` finds the number of elements in each block, and then it sums the results to count the total number of elements.

Data Types: `function_handle`

### `reducefcn`

Reduction function to apply, specified as a function handle or anonymous function. Each output of `reducefcn` must be the same type as the first input `tX`. You can use the `'OutputsLike'` option to return outputs of different data types. If `reducefcn` returns more than one output, then the outputs must all have the same height.

The general functional signature of `reducefcn` is

`[rA, rB, rC, ...] = reducefcn(a, b, c, ...)`

`reducefcn` must satisfy these requirements:

1. Input Arguments — The inputs `[a, b, c, ...]` are blocks that fit in memory. The blocks of data are either outputs returned by `fcn`, or a partially reduced output from `reducefcn` that is being operated on again for further reduction. The inputs `[a, b, c, ...]` satisfy these properties:

• The inputs `[a, b, c, ...]` have the same size in the first dimension.

• For a given index in the first dimension, every row of the blocks of data `[a, b, c, ...]` either originates from the input, or originates from the same previous call to `reducefcn`.

• For a given index in the first dimension, every row of the inputs `[a, b, c, ...]` for that index originates from the same index in the first dimension.

2. Output Arguments — All outputs `[rA, rB, rC, ...]` must have the same size in the first dimension. Additionally, they must be vertically concatenable with the respective inputs `[a, b, c, ...]` to allow for repeated reductions when necessary.

3. Functional Rules — `reducefcn` must satisfy these functional rules (up to roundoff error):

• `F(input) == F(F(input))`: Applying the function repeatedly to the same inputs should not change the result.

• `F([input1; input2]) == F([input2; input1])`: The result should not depend on the order of concatenation.

• `F([input1; input2]) == F([F(input1); F(input2)])`: Applying the function once to the concatenation of some intermediate results should be the same as applying it separately, concatenating, and applying it again.

4. Empty Inputs — Ensure that `reducefcn` can handle an input that has a height of 0. Empty inputs can occur when a file is empty or if you have done a lot of filtering on the data. For this call, all input blocks are empty arrays of the correct type and size in dimensions beyond the first.

Some examples of suitable reduction functions are built-in dimension reduction functions such as `sum`, `prod`, `max`, and so on. These functions can work on intermediate results produced by `fcn` and return a single scalar. These functions have the properties that the order in which concatenations occur and the number of times the reduction operation is applied do not change the final answer. Some functions, such as `mean` and `var`, should generally be avoided as reduction functions because the number of times the reduction operation is applied can change the final answer.
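The difference between `sum` and `mean` as reduction functions can be seen with two small blocks of unequal height. This is an in-memory sketch that does not call `matlab.tall.reduce`; the blocks `block1` and `block2` are made-up data.

```matlab
% Compare reducing all rows at once with reducing per block and then
% reducing the partial results again.
block1 = [1; 2; 3];
block2 = 10;

sumDirect  = sum([block1; block2]);                 % 16
sumStaged  = sum([sum(block1); sum(block2)]);       % 16, same answer

meanDirect = mean([block1; block2]);                % 4
meanStaged = mean([mean(block1); mean(block2)]);    % 6, different answer
```

The staged mean gives each block equal weight regardless of how many rows it holds, which is why carrying sums and counts through the reduction and dividing at the end is the safer pattern.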

Example: `tC = matlab.tall.reduce(@numel,@sum,tX)` finds the number of elements in each block, and then it sums the results to count the total number of elements.

Data Types: `function_handle`

### `tX,tY,...`

Input arrays, specified as scalars, vectors, matrices, or multidimensional arrays. The input arrays are used as inputs to the transform function `fcn`. The input arrays `tX,tY,...` must have compatible heights. Two inputs have compatible heights when they have the same height or when one of them has a height of one.

### `'OutputsLike'`

Prototype of output arrays, specified as a cell array of prototype arrays. When you specify `'OutputsLike'`, the output arrays `tA,tB,...` returned by `matlab.tall.reduce` have the same data types and attributes as the specified arrays `{PA,PB,...}`.

Example: `tA = matlab.tall.reduce(fcn,reducefcn,tX,'OutputsLike',{int8(1)});`, where `tX` is a double-precision tall array, returns `tA` as `int8` instead of `double`.

## Output Arguments

### `tA,tB,...`

Output arrays, returned as scalars, vectors, matrices, or multidimensional arrays. If any input to `matlab.tall.reduce` is tall, then all output arguments are also tall. Otherwise, all output arguments are in-memory arrays.

The size and data type of the output arrays depend on the specified functions `fcn` and `reducefcn`. In general, the outputs `tA,tB,...` must all have the same data type as the first input `tX`. However, you can specify `'OutputsLike'` to return different data types. The output arrays `tA,tB,...` all have the same height.

## More About

### Tall Array Blocks

When you create a tall array from a datastore, the underlying datastore facilitates the movement of data during a calculation. The data moves in discrete pieces called blocks or chunks, where each block is a set of consecutive rows that can fit in memory. For example, one block of a 2-D array (such as a table) is `X(n:m,:)`, for some subscripts `n` and `m`. The size of each block is based on the value of the `ReadSize` property of the datastore, but the block might not be exactly that size.

For the purposes of `matlab.tall.reduce`, a tall array is considered to be the vertical concatenation of many such blocks.

For example, if you use the `sum` function as the transform function, the intermediate result is the sum per block. Therefore, instead of returning a single scalar value for the sum of the elements, the result is a vector with length equal to the number of blocks.

```
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay' 'DepDelay'};
tt = tall(ds);
tX = tt.ArrDelay;

f = @(x) sum(x,'omitnan');
s = matlab.tall.reduce(f, @(x) x, tX);
s = gather(s)
```
```
s =

      140467
      101065
      164355
      135920
      111182
      186274
       21321
```

Introduced in R2018b
