incrementalLearner

Convert robust random cut forest model to incremental learner

Since R2023b

Syntax


IncrementalForest = incrementalLearner(forest)
IncrementalForest = incrementalLearner(forest,Name=Value)

Description

IncrementalForest = incrementalLearner(forest) returns a robust random cut forest (RRCF) model IncrementalForest for anomaly detection, initialized using the parameters provided in the RRCF model forest. Because its property values reflect the knowledge gained from forest, IncrementalForest can detect anomalies given new observations, and it is warm, meaning that the incremental fit function can return scores and detect anomalies.

IncrementalForest = incrementalLearner(forest,Name=Value) specifies additional options using one or more name-value arguments. For example, ScoreWarmupPeriod=500 specifies to process 500 observations before score computation and anomaly detection.

Examples

Perform Incremental RRCF Anomaly Detection with Categorical Predictor Data

Train an incremental robust random cut forest (RRCF) model and perform anomaly detection on a data set with categorical predictors.

Load Data

Load census1994.mat. The data set consists of demographic data from the US Census Bureau.

load census1994.mat

incrementalRobustRandomCutForest does not use observations with missing values. Remove missing values in the data to reduce memory consumption and speed up training. Keep only the first 1000 observations in the training data set and the first 2000 observations in the test data set.

adultdata = rmmissing(adultdata);
adulttest = rmmissing(adulttest);
Xtrain = adultdata(1:1000,:);
Xstream = adulttest(1:2000,:);

Train RRCF Model

Fit an RRCF model to the training data. Specify an anomaly contamination fraction of 0.001.

rng(0,"twister"); % For reproducibility
TTforest = rrcforest(Xtrain,ContaminationFraction=0.001);
details(TTforest)

  RobustRandomCutForest with properties:

        CollusiveDisplacement: 'maximal'
                  NumLearners: 100
    NumObservationsPerLearner: 256
                           Mu: []
                        Sigma: []
        CategoricalPredictors: [2 4 6 7 8 9 10 14 15]
        ContaminationFraction: 1.0000e-03
               ScoreThreshold: 55.5745
               PredictorNames: {'age'  'workClass'  'fnlwgt'  'education'  'education_num'  'marital_status'  'occupation'  'relationship'  'race'  'sex'  'capital_gain'  'capital_loss'  'hours_per_week'  'native_country'  'salary'}

  Methods, Superclasses

TTforest is a RobustRandomCutForest model object representing a traditionally trained RRCF model. The software identifies nine variables in the data as categorical predictors because they contain string arrays.

Convert Trained Model

Convert the traditionally trained RRCF model to an RRCF model for incremental learning.

Incrementalforest = incrementalLearner(TTforest);

Incrementalforest is an incrementalRobustRandomCutForest model object that is ready for incremental learning and anomaly detection.

Fit Incremental Model and Detect Anomalies

Perform incremental learning on the Xstream data by using the fit function. To simulate a data stream, fit the model in chunks of 100 observations at a time. At each iteration:

Process 100 observations.
Overwrite the previous incremental model with a new one fitted to the incoming observations.
Store medianscore, the median score value of the data chunk, to see how it evolves during incremental learning.
Store threshold, the score threshold value for anomalies, to see how it evolves during incremental learning.
Store numAnom, the number of detected anomalies in the chunk, to see how it evolves during incremental learning.

n = numel(Xstream(:,1));
numObsPerChunk = 100;
nchunk = floor(n/numObsPerChunk);
medianscore = zeros(nchunk,1);
numAnom = zeros(nchunk,1);
threshold = zeros(nchunk,1);

% Incremental fitting
for j = 1:nchunk
    ibegin = min(n,numObsPerChunk*(j-1) + 1);
    iend = min(n,numObsPerChunk*j);
    idx = ibegin:iend;    
    [Incrementalforest,tf,scores] = fit(Incrementalforest,Xstream(idx,:));
    medianscore(j) = median(scores);
    numAnom(j) = sum(tf);
    threshold(j) = Incrementalforest.ScoreThreshold;
end

Analyze Incremental Model During Training

To see how the median score, score threshold, and number of detected anomalies per chunk evolve during training, plot them on separate tiles.

tiledlayout(3,1);
nexttile
plot(medianscore)
ylabel("Median Score")
xlabel("Iteration")
xlim([0 nchunk])
nexttile
plot(threshold)
ylabel("Score Threshold")
xlabel("Iteration")
xlim([0 nchunk])
nexttile
plot(numAnom,"+")
ylabel("Anomalies")
xlabel("Iteration")
xlim([0 nchunk])
ylim([0 max(numAnom)+0.2])

Figure contains 3 axes objects. Axes object 1 with xlabel Iteration, ylabel Median Score contains an object of type line. Axes object 2 with xlabel Iteration, ylabel Score Threshold contains an object of type line. Axes object 3 with xlabel Iteration, ylabel Anomalies contains a line object which displays its values using only markers.

totalanomalies=sum(numAnom)

totalanomalies = 
1

anomfrac= totalanomalies/n

anomfrac = 
5.0000e-04

fit updates the model and returns the observation scores and an indicator that flags observations with scores above the score threshold value as anomalies. A high score value indicates an anomaly, and a low value indicates a normal observation. The median score fluctuates between approximately 230 and 270. The score threshold rises from a value of 260 after the first iteration and steadily approaches 285 after 12 iterations. The software detected 1 anomaly in the Xstream data, yielding a total contamination fraction of 0.0005.

Incrementally Train RRCF Model on Shingled Data

Train a robust random cut forest (RRCF) model on a simulated, noisy, periodic shingled time series containing no anomalies by using rrcforest. Convert the trained model to an incremental learner object, and then incrementally fit the time series and detect anomalies.

Create Simulated Data Stream

Create a simulated data stream of observations representing a noisy sinusoid signal.

rng(0,"twister"); % For reproducibility
period = 100;
n = 2001+period;
sigma = 0.04;
a = linspace(1,n,n)';
b = sin(2*pi*(a-1)/period)+sigma*randn(n,1);

Introduce an anomalous region into the data stream. Plot the data stream portion that contains the anomalous region, and circle the anomalous data points.

c = 2*(sin(2*pi*(a-35)/period)+sigma*randn(n,1));
b(1150:1170) = c(1150:1170);
scatter(a,b,".")
xlim([900,1200])
xlabel("Observation")
hold on
scatter(a(1150:1170),b(1150:1170),"r")
hold off

Figure contains an axes object. The axes object with xlabel Observation contains 2 objects of type scatter.

Convert the single-featured data set b into a multi-featured data set by shingling [1] with a shingle size equal to the period of the signal. The $i$th shingled observation is a vector of $k$ features with values $b_{i}, b_{i+1}, \ldots, b_{i+k-1}$, where $k$ is the shingle size.

X = [];
shingleSize = period;
for i = 1:n-shingleSize
    X = [X;b(i:i+shingleSize-1)'];
end
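
The loop above grows X one row at a time. As an alternative sketch (not part of the original example), a shingled matrix of the same shape can be built in one step from an index matrix. The toy signal and the variable names numShingles and idx are illustrative assumptions; the row count n-shingleSize mirrors the loop bound used in this example.

```matlab
% Toy signal standing in for b (illustrative assumption)
b = (1:10)';
n = numel(b);
shingleSize = 4;

% Row i of idx holds the indices i, i+1, ..., i+shingleSize-1
numShingles = n - shingleSize;                 % Same bound as the loop 1:n-shingleSize
idx = (1:numShingles)' + (0:shingleSize-1);    % Implicit expansion builds the index matrix
X = b(idx);                                    % numShingles-by-shingleSize matrix

disp(X(1,:))    % First shingle: b(1) through b(4)
disp(X(end,:))  % Last shingle: b(6) through b(9)
```

Indexing a column vector with a matrix returns an array with the shape of the index matrix, so no preallocation or concatenation is needed.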

Train Model and Perform Incremental Anomaly Detection

Fit a robust random cut forest model to the first 1000 shingled observations, specifying a contamination fraction of 0. Convert the model to an incrementalRobustRandomCutForest model object. Specify to keep the 100 most recent observations relevant for anomaly detection.

Mdl = rrcforest(X(1:1000,:),ContaminationFraction=0);
IncrementalMdl = incrementalLearner(Mdl,NumObservationsToKeep=100);

To simulate a data stream, process the full shingled data set in chunks of 100 observations at a time. At each iteration:

Process 100 observations.
Calculate scores and detect anomalies using the isanomaly function.
Store anomIdx, the indices of shingled observations marked as anomalies.
If the chunk contains fewer than three anomalies, fit and update the previous incremental model.

n = numel(X(:,1));
numObsPerChunk = 100;
nchunk = floor(n/numObsPerChunk);
anomIdx = [];
allscores = [];

% Incremental fitting
rng("default"); % For reproducibility
for j = 1:nchunk
    ibegin = min(n,numObsPerChunk*(j-1) + 1);
    iend = min(n,numObsPerChunk*j);
    idx = ibegin:iend;
    [isanom,scores] = isanomaly(IncrementalMdl,X(idx,:));
    allscores = [allscores;scores];
    anomIdx = [anomIdx;find(isanom)+ibegin-1];
    if (sum(isanom) < 3)
        IncrementalMdl = fit(IncrementalMdl,X(idx,:));
    end
end

Analyze Incremental Model During Training

At each iteration, the software calculates a score value for each observation in the data chunk. A small positive score value indicates a normal observation, and a large positive value indicates an anomaly. Plot the anomaly score for the observations in the vicinity of the anomaly. Circle the scores of shingles that the software returns as anomalous.

figure
scatter(a(1:2000),allscores,".")
hold on
scatter(a(anomIdx),allscores(anomIdx),20,"or")
xlim([900,1200])
xlabel("Shingle")
ylabel("Score")
hold off

Figure contains an axes object. The axes object with xlabel Shingle, ylabel Score contains 2 objects of type scatter.

Because the introduced anomalous region begins at observation 1150, and the shingle size is 100, shingle 1051 is the first to show a high anomaly score. Some shingles between 1050 and 1170 have scores lying just below the anomaly score threshold, due to the noise in the sinusoidal signal. The shingle size affects the performance of the model by defining how many subsequent consecutive data points in the original time series the software uses to calculate the anomaly score for each shingle.
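
The shingle range quoted above follows from simple index arithmetic. The following sketch, which assumes that shingle i spans observations i through i+shingleSize-1 as in this example, computes which shingles overlap the anomalous region. The variable names are illustrative.

```matlab
shingleSize = 100;
anomalyStart = 1150;   % First anomalous observation
anomalyEnd = 1170;     % Last anomalous observation

% Shingle i spans observations i through i+shingleSize-1
firstShingle = anomalyStart - shingleSize + 1;  % Earliest shingle that contains observation 1150
lastShingle = anomalyEnd;                       % Latest shingle that contains observation 1170
fprintf("Shingles %d through %d overlap the anomalous region.\n",firstShingle,lastShingle)
```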

Plot the unshingled data and highlight the introduced anomalous region. Circle the observation number of the first element in each shingle returned by the software as anomalous.

figure
xlim([900,1200])
ylim([-1.5 2])
rectangle(Position=[1150 -1.5 20 3.5],FaceColor=[0.9 0.9 0.9], ...
    EdgeColor=[0.9 0.9 0.9])
hold on
scatter(a,b,".")
scatter(a(anomIdx),b(anomIdx),20,"or")
xlabel("Observation")
hold off

Figure contains an axes object. The axes object with xlabel Observation contains 3 objects of type rectangle, scatter.

Input Arguments

`forest` — Traditionally trained RRCF model for anomaly detection
`RobustRandomCutForest` model object

Traditionally trained RRCF model for anomaly detection, specified as a RobustRandomCutForest model object returned by rrcforest.
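
As a minimal sketch of the expected input, the following code trains an RRCF model on hypothetical numeric data and converts it. The random data matrix X is an assumption made for illustration only; the function calls are the ones described on this page.

```matlab
rng(0,"twister")            % For reproducibility
X = randn(500,3);           % Hypothetical training data: 500 observations, 3 predictors
forest = rrcforest(X);      % Traditionally trained RRCF model
IncrementalForest = incrementalLearner(forest);
class(IncrementalForest)    % Display the class of the converted model
```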

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: incrementalLearner(forest,ObservationRemoval="timedecaying",ScoreWarmupPeriod=500) sets the observation removal method to "timedecaying" and specifies to process 500 observations before the incremental fit function returns scores and detects anomalies.

`NumObservationsToKeep` — Number of most recent observations relevant for anomaly detection
`forest.NumObservationsPerLearner` (default) | nonnegative integer

Number of the most recent observations relevant for anomaly detection, specified as a nonnegative integer.

Example: NumObservationsToKeep=250

Data Types: single | double

`ObservationRemoval` — Observation removal method
`"oldest"` (default) | `"timedecaying"` | `"random"`

Observation removal method, specified as "oldest", "timedecaying", or "random". When the robust random cut trees reach their capacity, the software removes old observations to accommodate the most recent data.

Value	Description
`"oldest"`	Oldest observations are removed first.
`"timedecaying"`	Observations are removed randomly in a weighted fashion. Older observations have a higher probability of being removed first.
`"random"`	Observations are removed in random order.

Data Types: string | char
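
A possible way to set the removal method is sketched below. The training data Xtrain and the streaming chunk Xnew are hypothetical random matrices, so the resulting scores carry no meaning beyond showing the call pattern.

```matlab
rng(0,"twister")           % For reproducibility
Xtrain = randn(500,3);     % Hypothetical training data
Xnew = randn(100,3);       % Hypothetical chunk of streaming data

forest = rrcforest(Xtrain);
IncrementalForest = incrementalLearner(forest,ObservationRemoval="timedecaying");

% Fit the chunk; the trees discard older observations as needed to make room
[IncrementalForest,tf,scores] = fit(IncrementalForest,Xnew);
fprintf("Scored %d observations, flagged %d as anomalies.\n",numel(scores),sum(tf))
```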

`Options` — Options for computing in parallel and setting random streams
structure

Options for computing in parallel and setting random streams, specified as a structure. Create the Options structure using statset. This table lists the option fields and their values.

Field Name	Value	Default
`UseParallel`	Set this value to `true` to run computations in parallel.	`false`
`UseSubstreams`	Set this value to `true` to run computations in a reproducible manner. To compute reproducibly, set `Streams` to a type that allows substreams: `"mlfg6331_64"` or `"mrg32k3a"`.	`false`
`Streams`	Specify this value as a `RandStream` object or cell array of such objects. Use a single object except when the `UseParallel` value is `true` and the `UseSubstreams` value is `false`. In that case, use a cell array that has the same size as the parallel pool.	If you do not specify `Streams`, then `incrementalLearner` uses the default stream or streams.

Note

You need Parallel Computing Toolbox™ to run computations in parallel.

Example: Options=statset(UseParallel=true,UseSubstreams=true,Streams=RandStream("mlfg6331_64"))

Data Types: struct
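
The following sketch shows one way to request reproducible computation without a parallel pool, assuming a single RandStream object of a type that supports substreams. The data matrix X and the variable names s and opts are illustrative assumptions.

```matlab
rng(0,"twister")                 % For reproducibility of the hypothetical data
X = randn(500,3);                % Hypothetical training data
forest = rrcforest(X);

s = RandStream("mlfg6331_64");   % Stream type that allows substreams
opts = statset(UseSubstreams=true,Streams=s);
IncrementalForest = incrementalLearner(forest,Options=opts);
```

To run in parallel as well, add UseParallel=true to the statset call. That setting requires Parallel Computing Toolbox.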

`ScoreWarmupPeriod` — Warm-up period before score computation and anomaly detection
`0` (default) | nonnegative integer

Warm-up period before score computation and anomaly detection, specified as a nonnegative integer. This option specifies the number of observations used by the incremental fit function to train the model and estimate the score threshold.

Note

When processing observations during the score warm-up period, the software ignores observations that contain missing values for all predictors.

Example: ScoreWarmupPeriod=200

Data Types: single | double

`ScoreWindowSize` — Running window size used to estimate score threshold
`1000` (default) | positive integer

Running window size used to estimate the score threshold (ScoreThreshold), specified as a positive integer. The default ScoreWindowSize value is 1000.

If ScoreWindowSize is greater than the number of observations in the training data, the software determines ScoreThreshold by subsampling from the training data. Otherwise, ScoreThreshold is set to forest.ScoreThreshold.

Example: ScoreWindowSize=100

Data Types: single | double
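
To see how the window size interacts with the training set size, the sketch below converts the same forest twice. It assumes 500 hypothetical training observations, so a window of 100 falls under the "otherwise" rule above and a window of 1000 falls under the subsampling rule. The displayed values depend on the data and are not reproduced here.

```matlab
rng(0,"twister")                                   % For reproducibility
X = randn(500,3);                                  % 500 hypothetical training observations
forest = rrcforest(X,ContaminationFraction=0.01);

% Window not greater than the training set size
MdlSmall = incrementalLearner(forest,ScoreWindowSize=100);
% Window greater than the training set size
MdlLarge = incrementalLearner(forest,ScoreWindowSize=1000);

% Compare the original threshold with the two converted thresholds
disp([forest.ScoreThreshold MdlSmall.ScoreThreshold MdlLarge.ScoreThreshold])
```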

Output Arguments

`IncrementalForest` — RRCF model for incremental anomaly detection
`incrementalRobustRandomCutForest` model object

RRCF model for incremental anomaly detection, returned as an incrementalRobustRandomCutForest model object.

To initialize IncrementalForest for incremental anomaly detection, incrementalLearner passes the values of the following properties of forest to the corresponding properties of IncrementalForest.

Property	Description
`CategoricalPredictors`	Categorical predictor indices, a vector of positive integers
`ContaminationFraction`	Fraction of anomalies in the training data, a numeric scalar in the range `[0,1]`
`Mu`	Predictor means of the training data, a numeric vector
`NumLearners`	Number of robust random cut trees, a positive integer scalar
`NumObservationsPerLearner`	Number of observations for each robust random cut tree, a nonnegative integer
`PredictorNames`	Predictor variable names, a cell array of character vectors
`ScoreThreshold`	Threshold score for anomalies in the training data, a numeric scalar in the range [0,`Inf`). If `ScoreWindowSize` is greater than the number of observations used to train `forest`, then `incrementalLearner` approximates `ScoreThreshold` by subsampling from the training data. Otherwise, `incrementalLearner` passes `forest.ScoreThreshold` to `IncrementalForest.ScoreThreshold`.
`Sigma`	Predictor standard deviations of the training data, a numeric vector
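
One way to check the property transfer described in this table is to compare a few properties of the two models directly. This is a sketch on hypothetical random data; the list in props is an illustrative subset of the properties named above.

```matlab
rng(0,"twister")                                   % For reproducibility
X = randn(500,3);                                  % Hypothetical training data
forest = rrcforest(X,ContaminationFraction=0.01);
IncrementalForest = incrementalLearner(forest);

% Compare selected properties of the original and converted models
props = ["NumLearners","NumObservationsPerLearner","ContaminationFraction","PredictorNames"];
for p = props
    fprintf("%s matches: %d\n",p,isequal(forest.(p),IncrementalForest.(p)));
end
```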

More About

Incremental Learning for Anomaly Detection

Incremental learning, or online learning, is a branch of machine learning concerned with processing incoming data from a data stream, possibly given little to no knowledge of the distribution of the predictor variables, aspects of the prediction or objective function (including tuning parameter values), or whether the observations contain anomalies. Incremental learning differs from traditional machine learning, where enough data is available to fit a model, perform cross-validation to tune hyperparameters, and infer the predictor distribution.

Anomaly detection is used to identify unexpected events and departures from normal behavior. In situations where the full data set is not immediately available, or new data is arriving, you can use incremental learning for anomaly detection to incrementally train a model so it adjusts to the characteristics of the incoming data.

Given incoming observations, an incremental learning model for anomaly detection does the following:

Computes anomaly scores
Updates the anomaly score threshold
Detects data points above the score threshold as anomalies
Fits the model to the incoming observations

For more information, see Incremental Anomaly Detection with MATLAB.
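
The four steps listed above can be sketched with a deliberately simple stand-in detector. The code below is not RRCF and does not use any toolbox function: it scores observations by their distance from a running mean, sets the threshold from a running window of scores, flags observations above the threshold, and then updates the running mean and spread. The data stream, the injected outlier, the window size, and the update weights are all assumptions chosen for illustration.

```matlab
rng(0,"twister")                 % For reproducibility
stream = randn(1000,1);          % Hypothetical one-dimensional data stream
stream(950) = 12;                % Inject one obvious outlier
contaminationFraction = 0.01;    % Assumed fraction of anomalies
windowSize = 200;                % Assumed running window for the threshold
chunkSize = 100;

mu = 0;                          % Assumed initial model: location
sigma = 1;                       % Assumed initial model: scale
recentScores = [];
anomIdx = [];

for ibegin = 1:chunkSize:numel(stream)
    idx = (ibegin:ibegin+chunkSize-1)';
    x = stream(idx);

    % 1. Compute anomaly scores with the current model
    scores = abs(x - mu)/sigma;

    % 2. Update the score threshold from a running window of scores
    recentScores = [recentScores; scores];
    recentScores = recentScores(max(1,end-windowSize+1):end);
    sortedScores = sort(recentScores);
    threshold = sortedScores(ceil((1-contaminationFraction)*numel(sortedScores)));

    % 3. Flag observations with scores above the threshold
    isanom = scores > threshold;
    anomIdx = [anomIdx; idx(isanom)];

    % 4. Fit the model to the incoming observations not flagged as anomalies
    mu = 0.9*mu + 0.1*mean(x(~isanom));
    sigma = 0.9*sigma + 0.1*std(x(~isanom));
end
disp(anomIdx')                   % Indices flagged as anomalies
```

Because the threshold is a quantile of recent scores, this toy detector flags roughly the assumed contamination fraction of each window, including the injected outlier at observation 950.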

References

[1] Guha, Sudipto, N. Mishra, G. Roy, and O. Schrijvers. "Robust Random Cut Forest Based Anomaly Detection on Streams," Proceedings of The 33rd International Conference on Machine Learning 48 (June 2016): 2712–21.

[2] Bartos, Matthew D., A. Mullapudi, and S. C. Troutman. "rrcf: Implementation of the Robust Random Cut Forest Algorithm for Anomaly Detection on Streams." Journal of Open Source Software 4, no. 35 (2019): 1336.

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

To run in parallel, specify the Options name-value argument in the call to this function and set the UseParallel field of the options structure to true using statset:

Options=statset(UseParallel=true)

For more information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).

Version History

Introduced in R2023b
