# incrementalLearner

## Syntax

`IncrementalForest = incrementalLearner(forest)`

`IncrementalForest = incrementalLearner(forest,Name=Value)`

## Description

`IncrementalForest = incrementalLearner(forest)` returns a robust random cut forest (RRCF) model `IncrementalForest` for anomaly detection, initialized using the parameters provided in the RRCF model `forest`. Because its property values reflect the knowledge gained from `forest`, `IncrementalForest` can detect anomalies given new observations, and it is *warm*, meaning that the incremental `fit` function can return scores and detect anomalies.

`IncrementalForest = incrementalLearner(forest,Name=Value)` specifies additional options using one or more name-value arguments. For example, `ScoreWarmupPeriod=500` specifies to process 500 observations before score computation and anomaly detection.
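The two syntaxes can be sketched as follows. This is a minimal illustration on synthetic data, not part of the shipped examples; the names `Xtrain`, `warmForest`, and `delayedForest` are assumptions made for this sketch.

```matlab
rng(0,"twister");          % For reproducibility
Xtrain = randn(500,3);     % Synthetic training data (assumption for this sketch)

% Traditionally trained RRCF model
forest = rrcforest(Xtrain);

% First syntax: convert with default options
warmForest = incrementalLearner(forest);

% Second syntax: convert and delay scoring until 500 observations are processed
delayedForest = incrementalLearner(forest,ScoreWarmupPeriod=500);

% Both results are incremental model objects
disp(class(warmForest))
disp(class(delayedForest))
```

Both calls start from the same trained `forest`; only the name-value argument differs.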

## Examples

### Perform Incremental RRCF Anomaly Detection with Categorical Predictor Data

Train an incremental robust random cut forest (RRCF) model and perform anomaly detection on a data set with categorical predictors.

**Load Data**

Load `census1994.mat`. The data set consists of demographic data from the US Census Bureau.

```matlab
load census1994.mat
```

`incrementalRobustRandomCutForest` does not use observations with missing values. Remove missing values in the data to reduce memory consumption and speed up training. Keep only the first 1000 observations in the training data set and the first 2000 observations in the test data set.

```matlab
adultdata = rmmissing(adultdata);
adulttest = rmmissing(adulttest);
Xtrain = adultdata(1:1000,:);
Xstream = adulttest(1:2000,:);
```
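To see what `rmmissing` does to a table before applying it to the census data, here is a small sketch on a toy table. The table `T` and its variable names `x` and `s` are made up for illustration.

```matlab
% Toy table with one missing numeric value and one missing string
T = table([1;NaN;3],["a";"b";missing],VariableNames=["x","s"]);

% rmmissing drops every row that contains at least one missing entry
Tclean = rmmissing(T);

height(Tclean)   % Returns 1: only the first row has no missing values
```

Rows 2 and 3 are removed because each contains a missing entry in one variable, even though the other variable is present.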

**Train RRCF Model**

Fit an RRCF model to the training data. Specify an anomaly contamination fraction of 0.001.

```matlab
rng(0,"twister"); % For reproducibility
TTforest = rrcforest(Xtrain,ContaminationFraction=0.001);
details(TTforest)
```

```
  RobustRandomCutForest with properties:

        CollusiveDisplacement: 'maximal'
                  NumLearners: 100
    NumObservationsPerLearner: 256
                           Mu: []
                        Sigma: []
        CategoricalPredictors: [2 4 6 7 8 9 10 14 15]
        ContaminationFraction: 1.0000e-03
               ScoreThreshold: 55.5745
               PredictorNames: {'age' 'workClass' 'fnlwgt' 'education' 'education_num' 'marital_status' 'occupation' 'relationship' 'race' 'sex' 'capital_gain' 'capital_loss' 'hours_per_week' 'native_country' 'salary'}
```

`TTforest` is a `RobustRandomCutForest` model object representing a traditionally trained RRCF model. The software identifies nine variables in the data as categorical predictors because they contain string arrays.

**Convert Trained Model**

Convert the traditionally trained RRCF model to an RRCF model for incremental learning.

```matlab
Incrementalforest = incrementalLearner(TTforest);
```

`Incrementalforest` is an `incrementalRobustRandomCutForest` model object that is ready for incremental learning and anomaly detection.

**Fit Incremental Model and Detect Anomalies**

Perform incremental learning on the `Xstream` data by using the `fit` function. To simulate a data stream, fit the model in chunks of 100 observations at a time. At each iteration:

- Process 100 observations.
- Overwrite the previous incremental model with a new one fitted to the incoming observations.
- Store `medianscore`, the median score value of the data chunk, to see how it evolves during incremental learning.
- Store `threshold`, the score threshold value for anomalies, to see how it evolves during incremental learning.
- Store `numAnom`, the number of detected anomalies in the chunk, to see how it evolves during incremental learning.

```matlab
n = numel(Xstream(:,1));
numObsPerChunk = 100;
nchunk = floor(n/numObsPerChunk);
medianscore = zeros(nchunk,1);
numAnom = zeros(nchunk,1);
threshold = zeros(nchunk,1);

% Incremental fitting
for j = 1:nchunk
    ibegin = min(n,numObsPerChunk*(j-1) + 1);
    iend = min(n,numObsPerChunk*j);
    idx = ibegin:iend;
    [Incrementalforest,tf,scores] = fit(Incrementalforest,Xstream(idx,:));
    medianscore(j) = median(scores);
    numAnom(j) = sum(tf);
    threshold(j) = Incrementalforest.ScoreThreshold;
end
```

**Analyze Incremental Model During Training**

To see how the median score, score threshold, and number of detected anomalies per chunk evolve during training, plot them on separate tiles.

```matlab
tiledlayout(3,1);
nexttile
plot(medianscore)
ylabel("Median Score")
xlabel("Iteration")
xlim([0 nchunk])
nexttile
plot(threshold)
ylabel("Score Threshold")
xlabel("Iteration")
xlim([0 nchunk])
nexttile
plot(numAnom,"+")
ylabel("Anomalies")
xlabel("Iteration")
xlim([0 nchunk])
ylim([0 max(numAnom)+0.2])
```

```matlab
totalanomalies=sum(numAnom)
```

```
totalanomalies = 1
```

```matlab
anomfrac= totalanomalies/n
```

```
anomfrac = 5.0000e-04
```

`fit` updates the model and returns the observation scores and the indices of observations with scores above the score threshold value as anomalies. A low score value indicates a normal observation, and a high value indicates an anomaly. The median score fluctuates between approximately 230 and 270. The score threshold rises from a value of 260 after the first iteration and steadily approaches 285 after 12 iterations. The software detected 1 anomaly in the `Xstream` data, yielding a total contamination fraction of 0.0005.

### Incrementally Train RRCF Model on Shingled Data

Train a robust random cut forest (RRCF) model on a simulated, noisy, periodic shingled time series containing no anomalies by using `rrcforest`. Convert the trained model to an incremental learner object, and then incrementally fit the time series and detect anomalies.

**Create Simulated Data Stream**

Create a simulated data stream of observations representing a noisy sinusoid signal.

```matlab
rng(0,"twister"); % For reproducibility
period = 100;
n = 2001+period;
sigma = 0.04;
a = linspace(1,n,n)';
b = sin(2*pi*(a-1)/period)+sigma*randn(n,1);
```

Introduce an anomalous region into the data stream. Plot the data stream portion that contains the anomalous region, and circle the anomalous data points.

```matlab
c = 2*(sin(2*pi*(a-35)/period)+sigma*randn(n,1));
b(1150:1170) = c(1150:1170);
scatter(a,b,".")
xlim([900,1200])
xlabel("Observation")
hold on
scatter(a(1150:1170),b(1150:1170),"r")
hold off
```

Convert the single-featured data set `b` into a multi-featured data set by shingling [1] with a shingle size equal to the period of the signal. The $$i$$th shingled observation is a vector of $$k$$ features with values $${b}_{i}$$, $${b}_{i+1}$$, ..., $${b}_{i+k-1}$$, where $$k$$ is the shingle size.

```matlab
X = [];
shingleSize = period;
for i = 1:n-shingleSize
    X = [X;b(i:i+shingleSize-1)'];
end
```
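The shingling loop grows `X` one row at a time. As an alternative sketch (not the approach used in this example), the same matrix can be built in one step by indexing `b` with a matrix of indices. The toy signal, the shingle size `k`, and the names `Xloop`, `Xvec`, and `idx` are assumptions made for illustration.

```matlab
% Toy signal and shingle size (hypothetical values)
b = (1:10)';
k = 4;
n = numel(b);

% Loop-based shingling, following the pattern used in the example
Xloop = [];
for i = 1:n-k
    Xloop = [Xloop; b(i:i+k-1)'];
end

% Index-based shingling: row i of idx holds i, i+1, ..., i+k-1
idx = (1:n-k)' + (0:k-1);   % (n-k)-by-k index matrix via implicit expansion
Xvec = b(idx);              % Indexing a vector with a matrix keeps the matrix shape

isequal(Xloop,Xvec)         % Returns logical 1 (true)
```

The index-based form avoids repeated reallocation of `X`, which can matter for long streams.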

**Train Model and Perform Incremental Anomaly Detection**

Fit a robust random cut forest model to the first 1000 shingled observations, specifying a contamination fraction of 0. Convert the model to an `incrementalRobustRandomCutForest` model object. Specify to keep the 100 most recent observations relevant for anomaly detection.

```matlab
Mdl = rrcforest(X(1:1000,:),ContaminationFraction=0);
IncrementalMdl = incrementalLearner(Mdl,NumObservationsToKeep=100);
```

To simulate a data stream, process the full shingled data set in chunks of 100 observations at a time. At each iteration:

- Process 100 observations.
- Calculate scores and detect anomalies using the `isanomaly` function.
- Store `anomIdx`, the indices of shingled observations marked as anomalies.
- If the chunk contains fewer than three anomalies, fit and update the previous incremental model.

```matlab
n = numel(X(:,1));
numObsPerChunk = 100;
nchunk = floor(n/numObsPerChunk);
anomIdx = [];
allscores = [];

% Incremental fitting
rng("default"); % For reproducibility
for j = 1:nchunk
    ibegin = min(n,numObsPerChunk*(j-1) + 1);
    iend = min(n,numObsPerChunk*j);
    idx = ibegin:iend;
    [isanom,scores] = isanomaly(IncrementalMdl,X(idx,:));
    allscores = [allscores;scores];
    anomIdx = [anomIdx;find(isanom)+ibegin-1];
    if (sum(isanom) < 3)
        IncrementalMdl = fit(IncrementalMdl,X(idx,:));
    end
end
```

**Analyze Incremental Model During Training**

At each iteration, the software calculates a score value for each observation in the data chunk. A small positive score value indicates a normal observation, and a large positive value indicates an anomaly. Plot the anomaly score for the observations in the vicinity of the anomaly. Circle the scores of shingles that the software returns as anomalous.

```matlab
figure
scatter(a(1:2000),allscores,".")
hold on
scatter(a(anomIdx),allscores(anomIdx),20,"or")
xlim([900,1200])
xlabel("Shingle")
ylabel("Score")
hold off
```

Because the introduced anomalous region begins at observation 1150, and the shingle size is 100, shingle 1051 is the first to show a high anomaly score. Some shingles between 1050 and 1170 have scores lying just below the anomaly score threshold, due to the noise in the sinusoidal signal. The shingle size affects the performance of the model by defining how many subsequent consecutive data points in the original time series the software uses to calculate the anomaly score for each shingle.
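The shingle arithmetic behind this paragraph can be checked directly. In this sketch, the names `anomStart`, `anomEnd`, `firstShingle`, `lastShingle`, and `numAffected` are introduced only for illustration; the values come from the example above.

```matlab
shingleSize = 100;
anomStart = 1150;   % First anomalous observation
anomEnd = 1170;     % Last anomalous observation

% Shingle i spans observations i through i+shingleSize-1, so the first
% shingle that contains observation anomStart starts shingleSize-1 earlier
firstShingle = anomStart - shingleSize + 1   % 1051

% The last shingle that contains an anomalous point starts at anomEnd
lastShingle = anomEnd                        % 1170

% Number of shingles that overlap the anomalous region
numAffected = lastShingle - firstShingle + 1 % 120
```

So a 21-point anomalous region influences 120 consecutive shingles, which is why the elevated scores extend well before observation 1150 in the shingle plot.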

Plot the unshingled data and highlight the introduced anomalous region. Circle the observation number of the first element in each shingle returned by the software as anomalous.

```matlab
figure
xlim([900,1200])
ylim([-1.5 2])
rectangle(Position=[1150 -1.5 20 3.5],FaceColor=[0.9 0.9 0.9], ...
    EdgeColor=[0.9 0.9 0.9])
hold on
scatter(a,b,".")
scatter(a(anomIdx),b(anomIdx),20,"or")
xlabel("Observation")
hold off
```

## Input Arguments

`forest`: Traditionally trained RRCF model for anomaly detection

`RobustRandomCutForest` model object

Traditionally trained RRCF model for anomaly detection, specified as a `RobustRandomCutForest` model object returned by `rrcforest`.

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

**Example:** `incrementalLearner(forest,ObservationRemoval="timedecaying",ScoreWarmupPeriod=500)` sets the observation removal method to `"timedecaying"` and specifies to process 500 observations before the incremental `fit` function returns scores and detects anomalies.

`NumObservationsToKeep`: Number of most recent observations relevant for anomaly detection

`forest.NumObservationsPerLearner` (default) | nonnegative integer

Number of the most recent observations relevant for anomaly detection, specified as a nonnegative integer.

**Example:** `NumObservationsToKeep=250`

**Data Types:** `single` | `double`

`ObservationRemoval`: Observation removal method

`"oldest"` (default) | `"timedecaying"` | `"random"`

Observation removal method, specified as `"oldest"`, `"timedecaying"`, or `"random"`. When the robust random cut trees reach their capacity, the software removes old observations to accommodate the most recent data.

| Value | Description |
|---|---|
| `"oldest"` | Oldest observations are removed first. |
| `"timedecaying"` | Observations are removed randomly in a weighted fashion. Older observations have a higher probability of being removed first. |
| `"random"` | Observations are removed in random order. |

**Data Types:** `string` | `char`
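The difference between the three removal methods can be pictured with a toy buffer. The following is a conceptual sketch only and does not reproduce the toolbox internals; in particular, the weights for the `"timedecaying"` case (proportional to age) are an assumption, as are all variable names.

```matlab
rng(0,"twister");     % For reproducibility
ages = [5 4 3 2 1];   % Hypothetical ages of stored observations; element 1 is the oldest

% "oldest": remove the observation with the greatest age
[~,removeOldest] = max(ages);

% "random": every stored observation is equally likely to be removed
removeRandom = randi(numel(ages));

% "timedecaying": weighted random removal that favors older observations.
% Weights proportional to age are an assumption made for this sketch.
edges = cumsum(ages);                               % Cumulative weights
removeDecay = find(rand*edges(end) <= edges,1);     % Inverse-CDF sampling

disp([removeOldest removeRandom removeDecay])
```

Under `"oldest"` the choice is deterministic, whereas the other two methods draw an index at random, with or without a bias toward older data.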

`ScoreWarmupPeriod`: Warm-up period before score computation and anomaly detection

`0` (default) | nonnegative integer

Warm-up period before score computation and anomaly detection, specified as a nonnegative integer. This option specifies the number of observations used by the incremental `fit` function to train the model and estimate the score threshold.

**Note**

When processing observations during the score warm-up period, the software ignores observations that contain missing values for all predictors.

**Example:** `ScoreWarmupPeriod=200`

**Data Types:** `single` | `double`

`ScoreWindowSize`: Running window size used to estimate score threshold

`1000` (default) | positive integer

Running window size used to estimate the score threshold (`ScoreThreshold`), specified as a positive integer. The default `ScoreWindowSize` value is `1000`.

If `ScoreWindowSize` is greater than the number of observations in the training data, the software determines `ScoreThreshold` by subsampling from the training data. Otherwise, `ScoreThreshold` is set to `forest.ScoreThreshold`.

**Example:** `ScoreWindowSize=100`

**Data Types:** `single` | `double`
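To compare the two cases described for `ScoreWindowSize`, the following sketch converts the same forest twice, once with a window smaller than the training set and once with a larger one. The synthetic data and the names `mdlSmall` and `mdlLarge` are assumptions for illustration; the sketch only displays the thresholds and makes no claim about their values.

```matlab
rng(0,"twister");                         % For reproducibility
Xtrain = randn(500,3);                    % 500 synthetic training observations (assumption)
forest = rrcforest(Xtrain,ContaminationFraction=0.01);

% Window no larger than the training set (100 <= 500)
mdlSmall = incrementalLearner(forest,ScoreWindowSize=100);

% Window larger than the training set (1000 > 500)
mdlLarge = incrementalLearner(forest,ScoreWindowSize=1000);

% Display the three thresholds side by side
disp([forest.ScoreThreshold mdlSmall.ScoreThreshold mdlLarge.ScoreThreshold])
```

According to the rule stated above, the first conversion is expected to reuse `forest.ScoreThreshold`, while the second re-estimates the threshold by subsampling.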

`UseParallel`: Flag to run in parallel

`false` or `0` (default) | `true` or `1`

Flag to run in parallel, specified as a numeric or logical 1 (`true`) or 0 (`false`). If you specify `UseParallel=true`, the `incrementalLearner` function executes `for`-loop iterations by using `parfor`. The loop runs in parallel when you have Parallel Computing Toolbox™.

**Example:** `UseParallel=true`

**Data Types:** `logical`

## Output Arguments

`IncrementalForest`: RRCF model for incremental anomaly detection

`incrementalRobustRandomCutForest` model object

RRCF model for incremental anomaly detection, returned as an `incrementalRobustRandomCutForest` model object.

To initialize `IncrementalForest` for incremental anomaly detection, `incrementalLearner` passes the values of the following properties of `forest` to the corresponding properties of `IncrementalForest`.

| Property | Description |
|---|---|
| `CategoricalPredictors` | Categorical predictor indices, a vector of positive integers |
| `ContaminationFraction` | Fraction of anomalies in the training data, a numeric scalar in the range `[0,1]` |
| `Mu` | Predictor means of the training data, a numeric vector |
| `NumLearners` | Number of robust random cut trees, a positive integer scalar |
| `NumObservationsPerLearner` | Number of observations for each robust random cut tree, a nonnegative integer |
| `PredictorNames` | Predictor variable names, a cell array of character vectors |
| `ScoreThreshold` | Threshold score for anomalies in the training data, a numeric scalar in the range `[0,Inf)`. If `ScoreWindowSize` is greater than the number of observations used to train `forest`, then `incrementalLearner` approximates `ScoreThreshold` by subsampling from the training data. Otherwise, `incrementalLearner` passes `forest.ScoreThreshold` to `IncrementalForest.ScoreThreshold`. |
| `Sigma` | Predictor standard deviations of the training data, a numeric vector |
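One way to confirm the pass-through of properties is to compare a few of them before and after conversion. The sketch below uses synthetic data and checks only a subset of the listed properties; `props` and `matches` are names introduced for this illustration. `ScoreThreshold` is left out because it can be re-estimated when `ScoreWindowSize` exceeds the training set size.

```matlab
rng(0,"twister");              % For reproducibility
Xtrain = randn(500,3);         % Synthetic training data (assumption)
forest = rrcforest(Xtrain);
IncrementalForest = incrementalLearner(forest);

% Subset of the properties listed in the table
props = ["ContaminationFraction","NumLearners", ...
    "NumObservationsPerLearner","PredictorNames"];

% Compare each property of the two models by dynamic property access
matches = arrayfun(@(p) isequal(forest.(p),IncrementalForest.(p)),props);

disp(table(props',matches',VariableNames=["Property","Matches"]))
```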

## More About

### Incremental Learning for Anomaly Detection

*Incremental learning*, or *online learning*, is a branch of machine learning concerned with processing incoming data from a data stream, possibly given little to no knowledge of the distribution of the predictor variables, aspects of the prediction or objective function (including tuning parameter values), or whether the observations contain anomalies. Incremental learning differs from traditional machine learning, where enough data is available to fit a model, perform cross-validation to tune hyperparameters, and infer the predictor distribution.

Anomaly detection is used to identify unexpected events and departures from normal
behavior. In situations where the full data set is not immediately available, or new data is
arriving, you can use *incremental learning for anomaly detection* to
incrementally train a model so it adjusts to the characteristics of the incoming
data.

Given incoming observations, an incremental learning model for anomaly detection does the following:

- Computes anomaly scores
- Updates the anomaly score threshold
- Detects data points above the score threshold as anomalies
- Fits the model to the incoming observations
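For an RRCF model, a single call to `fit` covers the steps listed above. The following sketch shows where each result appears; the synthetic data, the deliberately distant last row of `chunk`, and the variable names are assumptions for illustration.

```matlab
rng(0,"twister");                    % For reproducibility
forest = rrcforest(randn(500,2));    % Traditionally trained model on synthetic data
Mdl = incrementalLearner(forest);    % Warm incremental model

% Incoming chunk: 99 typical points plus one point far from the rest
chunk = [randn(99,2); 8 8];

[Mdl,tf,scores] = fit(Mdl,chunk);

% scores             : anomaly score for each incoming observation
% Mdl.ScoreThreshold : score threshold after processing the chunk
% tf                 : logical flags for observations detected as anomalies
% Mdl                : model updated with the incoming observations
fprintf("Threshold: %g, flagged: %d of %d\n", ...
    Mdl.ScoreThreshold,sum(tf),numel(tf));
```

To score new data without updating the model, use `isanomaly` instead, as in the shingled-data example.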

For more information, see Incremental Anomaly Detection with MATLAB.

## References

[1] Guha, Sudipto, N. Mishra, G. Roy, and O. Schrijvers. "Robust Random Cut Forest Based Anomaly Detection on Streams," *Proceedings of The 33rd International Conference on Machine Learning* 48 (June 2016): 2712–21.

[2] Bartos, Matthew D., A. Mullapudi, and S. C. Troutman. "rrcf: Implementation of the Robust Random Cut Forest Algorithm for Anomaly Detection on Streams." *Journal of Open Source Software* 4, no. 35 (2019): 1336.

## Version History

**Introduced in R2023b**
