fit

Fit principal component analysis model to streaming data

Since R2024a

    Description

    The incremental fit function fits an incremental principal component analysis (PCA) object (incrementalPCA) to streaming data.

    IncrementalMdl = fit(IncrementalMdl,X) returns an incremental PCA model IncrementalMdl, which represents the input incremental PCA model IncrementalMdl fit using the predictor data X. Specifically, the incremental fit function fits the model to the incoming data and stores the updated PCA properties in the output model IncrementalMdl.

    IncrementalMdl = fit(IncrementalMdl,X,Weights=weights) also specifies the observation weights, weights, for the observations in X.

    [IncrementalMdl,Xtransformed] = fit(IncrementalMdl,X) additionally returns the principal component scores Xtransformed.
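    The three syntaxes above can be sketched on placeholder data as follows. This is a minimal illustration, not a recommended workflow: the random chunk X, the unit weights w, and the variable name Mdl are assumptions made for the example, and the model is created with default settings, so it is still inside its estimation period when the calls finish.

```matlab
rng(0)                              % reproducible placeholder data
X = randn(200,5);                   % one chunk: 200 observations, 5 variables
w = ones(200,1);                    % unit observation weights (the documented default)

Mdl = incrementalPCA;               % cold model with default settings
Mdl = fit(Mdl,X);                   % update the model only
Mdl = fit(Mdl,X,Weights=w);         % update the model with explicit weights
[Mdl,Xtransformed] = fit(Mdl,X);    % also request the principal component scores

% The model is not warm yet, so the returned scores are all NaN.
disp(Mdl.IsWarm)
disp(all(isnan(Xtransformed(:))))
```

    Checking IsWarm before using Xtransformed avoids passing NaN scores to downstream code.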

    Examples

    Perform principal component analysis (PCA) on an initial data chunk, and then create an incremental PCA model that incorporates the results of the analysis. Fit the incremental model to streaming data and analyze how the model evolves during training.

    Load and Preprocess Data

    Load the human activity data set.

    load humanactivity

    For details on the human activity data set, enter Description at the command line.

    The data set includes observations containing 60 variables. To simulate streaming data, split the data set into an initial chunk of 1000 observations and a second chunk of 10,000 observations.

    Xinitial = feat(1:1000,:);
    Xstream = feat(1001:11000,:);

    Perform Initial PCA

    Perform PCA on the initial data chunk by using the pca function. Specify to center the data and keep 10 principal components. Return the principal component coefficients (coeff), principal component variances (latent), and estimated means of the variables (mu).

    [coeff,~,latent,~,~,mu] = pca(Xinitial,Centered=true,NumComponents=10);

    Create Incremental PCA Model

    Create a model for incremental PCA that incorporates the PCA results from the initial data chunk.

    IncrementalMdl = incrementalPCA(Coefficients=coeff,Latent=latent, ...
        Means=mu,NumObservations=1000);
    details(IncrementalMdl)
      incrementalPCA with properties:
    
                         IsWarm: 1
        NumTrainingObservations: 0
                   WarmupPeriod: 0
                             Mu: [0.7764 0.4931 -0.3407 0.1108 0.0707 0.0485 0.3931 -1.1100 0.0646 0.1703 -1.1020 0.0283 0.0836 -1.0797 0.0139 0.9328 1.2892 1.6731 2.0729 2.5181 2.9511 0.0128 0.0062 0.0039 0.0027 0.0020 0.0016 0.9322 ... ] (1x60 double)
                          Sigma: []
              ExplainedVariance: [10x1 double]
               EstimationPeriod: 0
                         Latent: [10x1 double]
                   Coefficients: [60x10 double]
                VariableWeights: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
                  NumComponents: 10
                  NumPredictors: 60
    

    IncrementalMdl is an incrementalPCA model object. All its properties are read-only. Because Coefficients and Latent are specified, the model is warm, meaning that the fit function returns transformed observations.
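    As a self-contained sketch of the same warm-start pattern, the following code seeds an incremental model from a batch PCA and then checks that the first call to fit already returns scores. The random matrices X0 and Xnew and the choice of 3 components are placeholders assumed for illustration; they stand in for the human activity data.

```matlab
rng(1)                                  % reproducible placeholder data
X0   = randn(1000,5);                   % stand-in for the initial data chunk
Xnew = randn(100,5);                    % stand-in for one incoming chunk

% Batch PCA on the initial chunk, then seed the incremental model with it.
[coeff,~,latent,~,~,mu] = pca(X0,Centered=true,NumComponents=3);
Mdl = incrementalPCA(Coefficients=coeff,Latent=latent, ...
    Means=mu,NumObservations=size(X0,1));

% The seeded model is warm, so fit returns scores for the new chunk.
[Mdl,Xtr] = fit(Mdl,Xnew);
disp(Mdl.IsWarm)
disp(size(Xtr))
```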

    Fit Incremental Model

    Fit the incremental model IncrementalMdl to the data by using the fit function. To simulate a data stream, fit the model in chunks of 100 observations at a time. At each iteration:

    • Process 100 observations.

    • Overwrite the previous incremental model with a new one fitted to the incoming observations.

    • Store topEV, the explained variance value of the component with the highest variance, to see how it evolves during incremental fitting.

    n = numel(Xstream(:,1));
    numObsPerChunk = 100;
    nchunk = floor(n/numObsPerChunk);
    topEV = zeros(nchunk,1);
    
    % Incremental fitting
    for j = 1:nchunk
        ibegin = min(n,numObsPerChunk*(j-1) + 1);
        iend = min(n,numObsPerChunk*j);
        IncrementalMdl = fit(IncrementalMdl,Xstream(ibegin:iend,:));
        topEV(j) = IncrementalMdl.ExplainedVariance(1);
    end

    IncrementalMdl is an incrementalPCA model object fitted to all the data in the stream. At each iteration, the fit function fits the model to the current data chunk and updates the model properties.
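    The chunk indexing in the loop can be checked in isolation. The sketch below uses a hypothetical stream length of 1050 observations (not taken from the data set) to show that floor discards an incomplete final chunk, whereas ceil combined with the min clamp on iend keeps it. In the example above, n is an exact multiple of the chunk size, so both choices give the same chunks.

```matlab
% Hypothetical sizes chosen for illustration only.
n = 1050;                % total number of observations in the stream
numObsPerChunk = 100;    % chunk size

nFull = floor(n/numObsPerChunk);   % number of complete chunks
nAll  = ceil(n/numObsPerChunk);    % number of chunks including a partial last one

bounds = zeros(nAll,2);            % [ibegin iend] for each chunk
for j = 1:nAll
    ibegin = numObsPerChunk*(j-1) + 1;
    iend   = min(n,numObsPerChunk*j);   % clamp the last chunk to n
    bounds(j,:) = [ibegin iend];
end

disp(bounds(end,:))                % bounds of the partial final chunk
```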

    Analyze Incremental Model During Training

    Plot the explained variance value of the component with the highest variance to see how it evolves during training.

    figure
    plot(topEV,".-")
    ylabel("topEV")
    xlabel("Iteration")
    xlim([0 nchunk])

    [Figure: line plot of topEV versus Iteration.]

    The highest explained variance value is 33% after the first iteration, and rapidly rises to 80% after five iterations. The value then gradually approaches 97%.

    Create a model for incremental principal component analysis (PCA) and specify to standardize the data.

    IncrementalMdl = incrementalPCA(StandardizeData=true);
    details(IncrementalMdl)
      incrementalPCA with properties:
    
                         IsWarm: 0
        NumTrainingObservations: 0
                   WarmupPeriod: 1000
                             Mu: []
                          Sigma: []
              ExplainedVariance: [0x1 double]
               EstimationPeriod: 1000
                         Latent: [0x1 double]
                   Coefficients: []
                VariableWeights: [1x0 double]
                  NumComponents: 0
                  NumPredictors: 0
    

    IncrementalMdl is an incrementalPCA model object. All its properties are read-only. By default, the software sets the hyperparameter estimation period and the warm-up period to 1000 observations. The model must be warm before the incremental fit function outputs transformed data.

    Load and Preprocess Data

    Load the NYCHousing2015 sample data set.

    load NYCHousing2015

    The data set includes 10 variables with information on the sales of properties in New York City in 2015.

    Preprocess the data set. Remove the categorical variables BOROUGH, NEIGHBORHOOD, and BUILDINGCLASSCATEGORY. Convert the datetime array (SALEDATE) to month numbers, and change zeros in LANDSQUAREFEET, GROSSSQUAREFEET, SALEPRICE, and YEARBUILT to NaNs.

    NYCHousing2015 = removevars(NYCHousing2015,["BOROUGH", ...
        "NEIGHBORHOOD","BUILDINGCLASSCATEGORY"]);
    NYCHousing2015.SALEDATE = month(NYCHousing2015.SALEDATE);
    NYCHousing2015.LANDSQUAREFEET(NYCHousing2015.LANDSQUAREFEET == 0) = NaN; 
    NYCHousing2015.GROSSSQUAREFEET(NYCHousing2015.GROSSSQUAREFEET == 0) = NaN; 
    NYCHousing2015.SALEPRICE(NYCHousing2015.SALEPRICE == 0) = NaN; 
    NYCHousing2015.YEARBUILT(NYCHousing2015.YEARBUILT == 0) = NaN; 

    The fit function of incrementalPCA does not use observations that contain a missing value. Remove these observations from the data set.

    NYCHousing2015 = rmmissing(NYCHousing2015);

    The incrementalPCA functions do not accept data in table format. Convert the data set to array format and keep only the first 5000 observations.

    streamingData = table2array(NYCHousing2015);
    streamingData = streamingData(1:5000,:);

    Fit Incremental Model

    Fit the incremental model IncrementalMdl to the data using the fit function. To simulate a data stream, fit the model in chunks of 100 observations at a time. At each iteration:

    • Process 100 observations.

    • Overwrite the previous incremental model with a new one fitted to the incoming observations.

    • Store isWarm, the IsWarm property of IncrementalMdl, to see how it evolves during incremental fitting.

    • Store topEV, the explained variance value of the component with the highest variance, to see how it evolves during incremental fitting.

    • Store meanXtr, the mean of the transformed data output by the fit function, to see how it evolves during incremental fitting.

    n = numel(streamingData(:,1));
    numObsPerChunk = 100;
    nchunk = floor(n/numObsPerChunk);
    topEV = zeros(nchunk,1);
    meanXtr = zeros(nchunk,1);
    isWarm = zeros(nchunk,1);
    
    % Incremental fitting
    for j = 1:nchunk
        ibegin = min(n,numObsPerChunk*(j-1) + 1);
        iend = min(n,numObsPerChunk*j);
        [IncrementalMdl,Xtr] = fit(IncrementalMdl,streamingData(ibegin:iend,:));
        isWarm(j) = IncrementalMdl.IsWarm;
        topEV(j) = IncrementalMdl.ExplainedVariance(1);
        meanXtr(j) = mean(Xtr(:));
    end

    IncrementalMdl is an incrementalPCA model object fitted to all the data in the stream. At each iteration, the fit function fits the model to the current data chunk and outputs the transformed input data.

    Analyze Incremental Model During Training

    To see how the IsWarm indicator, the explained variance value of the component with the highest variance, and the mean of the transformed input data per chunk evolve during training, plot them on separate tiles.

    figure
    tiledlayout(3,1);
    nexttile
    plot(isWarm,".-")
    ylabel("IsWarm")
    xlabel("Iteration")
    xlim([0 nchunk])
    nexttile
    plot(topEV,".-")
    ylabel("Top EV")
    xlabel("Iteration")
    xlim([0 nchunk])
    nexttile
    plot(meanXtr,".-")
    ylabel("Mean of Transformed Data")
    xlabel("Iteration")
    xlim([0 nchunk])

    [Figure: three stacked line plots versus Iteration, showing IsWarm, Top EV, and Mean of Transformed Data.]

    Because EstimationPeriod = 1000, fit processes 1000 observations to determine hyperparameters before updating the PCA properties of IncrementalMdl. After the estimation period, the top explained variance value initially fluctuates between 58% and 85%, and then gradually approaches 50%. Because WarmupPeriod = 1000, fit processes an additional 1000 observations after the estimation period before IncrementalMdl becomes warm and outputs transformed data. The mean of the transformed data fluctuates between –0.3 and 0.08.
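    The timing described above reduces to simple arithmetic, sketched below. The period and chunk-size values come from this example; the code assumes that the model becomes warm once EstimationPeriod + WarmupPeriod observations have been passed to fit, which is one reading of the description, and that no observations are skipped because of missing values.

```matlab
% Values taken from the example above.
estimationPeriod = 1000;
warmupPeriod     = 1000;
numObsPerChunk   = 100;

% Assumption: the model turns warm after both periods have elapsed.
obsUntilWarm  = estimationPeriod + warmupPeriod;
firstWarmIter = ceil(obsUntilWarm/numObsPerChunk);

fprintf("Observations before warm: %d (loop iteration %d)\n", ...
    obsUntilWarm,firstWarmIter);
```

    Under that assumption, fit returns NaN scores for the earlier iterations, so the mean of the transformed data is defined only for the later ones.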

    Input Arguments

    IncrementalMdl

    Incremental PCA model, specified as an incrementalPCA model object. You can create IncrementalMdl by calling incrementalPCA directly.

    X

    Chunk of predictor data, specified as a floating-point matrix of n observations and IncrementalMdl.NumPredictors variables. The rows of X correspond to observations, and the columns correspond to variables. The software ignores observations that contain at least one missing value.

    Note

    • If IncrementalMdl.NumPredictors = 0, fit infers the number of predictors from X and sets the corresponding property of the output model. Otherwise, if the number of predictor variables in the streaming data differs from IncrementalMdl.NumPredictors, fit issues an error.

    • fit supports only numeric input predictor data. If your input data includes categorical data, you must prepare an encoded version of the categorical data. Use dummyvar to convert each categorical variable to a numeric matrix of dummy variables. Then, concatenate all dummy variable matrices and any other numeric predictors. For more details, see Dummy Variables.

    Data Types: single | double
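    The dummy-variable encoding mentioned in the note can be sketched as follows. The predictor names sqft and btype and their values are hypothetical and serve only to show how a numeric column and an encoded categorical column combine into one matrix that fit accepts.

```matlab
% Hypothetical predictors for illustration only.
sqft  = [850; 1200; 640; 980];                          % numeric predictor
btype = categorical(["condo";"house";"condo";"coop"]);  % categorical predictor

D = dummyvar(btype);       % one indicator column per category
X = [sqft D];              % numeric matrix with 1 + 3 columns

disp(categories(btype)')   % column order of D follows the category order
disp(size(X))
```

    Each row of D contains exactly one 1, which marks the category of that observation.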

    Weights

    Chunk of observation weights, specified as a floating-point vector of positive values. fit weights the observations in X with the corresponding values in weights. The length of weights must equal n, the number of observations in X.

    By default, weights is ones(n,1).

    Data Types: single | double
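    A weight vector can be checked against these requirements before it is passed to fit. The sketch below builds weights on a placeholder chunk; doubling the weight of the last 10 rows is a hypothetical choice made only to produce non-default values.

```matlab
X = rand(50,4);          % placeholder chunk with n = 50 observations
n = size(X,1);

w = ones(n,1);           % same values as the documented default
w(end-9:end) = 2;        % hypothetical: emphasize the last 10 observations

% Requirements stated above: floating-point vector, length n, positive values.
isValid = isfloat(w) && isvector(w) && numel(w) == n && all(w > 0);
disp(isValid)
```

    A valid vector is then passed as IncrementalMdl = fit(IncrementalMdl,X,Weights=w).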

    Output Arguments

    IncrementalMdl

    Updated incremental PCA model, returned as an incrementalPCA model object.

    Xtransformed

    Principal component scores, returned as a floating-point matrix. The rows of Xtransformed correspond to observations, and the columns correspond to components. If IncrementalMdl is not warm (IsWarm=false), all values of Xtransformed are returned as NaN. The data type of Xtransformed is the same as that of X.

    Version History

    Introduced in R2024a