
ClassificationPartitionedGAM

Cross-validated generalized additive model (GAM) for classification

Since R2021a

    Description

    ClassificationPartitionedGAM is a set of generalized additive models trained on cross-validated folds. Estimate the quality of the cross-validated classification by using one or more kfold functions: kfoldPredict, kfoldLoss, kfoldMargin, kfoldEdge, and kfoldfun.

    Every kfold object function uses models trained on training-fold (in-fold) observations to predict the response for validation-fold (out-of-fold) observations. For example, suppose you cross-validate using five folds. The software randomly assigns each observation to one of five groups of roughly equal size. The training fold contains four of the groups (roughly 4/5 of the data), and the validation fold contains the remaining group (roughly 1/5 of the data). In this case, cross-validation proceeds as follows:

    1. The software trains the first model (stored in CVMdl.Trained{1}) by using the observations in the last four groups, and reserves the observations in the first group for validation.

    2. The software trains the second model (stored in CVMdl.Trained{2}) by using the observations in the first group and the last three groups. The software reserves the observations in the second group for validation.

    3. The software proceeds in a similar manner for the third, fourth, and fifth models.

    If you validate by using kfoldPredict, the software computes predictions for the observations in group i by using the ith model. In short, the software estimates a response for every observation by using the model trained without that observation.
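    For illustration, this minimal sketch reproduces that behavior by hand for the first fold, assuming CVMdl is a cross-validated GAM:

    testIdx = test(CVMdl.Partition,1);                      % validation-fold observations for fold 1
    label1 = predict(CVMdl.Trained{1},CVMdl.X(testIdx,:));  % predictions from the model trained without them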

    Creation

    You can create a ClassificationPartitionedGAM model in two ways (see the sketch after this list):

    • Create a cross-validated model from a GAM object ClassificationGAM by using the crossval object function.

    • Create a cross-validated model by using the fitcgam function and specifying one of the name-value arguments 'CrossVal', 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.
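    For example, this sketch shows both approaches, assuming predictor data X and class labels Y exist in the workspace:

    Mdl = fitcgam(X,Y);               % train a full GAM
    CVMdl1 = crossval(Mdl);           % cross-validate it (10 folds by default)
    CVMdl2 = fitcgam(X,Y,'KFold',5);  % or cross-validate directly during fitting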

    Properties


    Cross-Validation Properties

    CrossValidatedModel

    This property is read-only.

    Cross-validated model name, specified as 'GAM'.

    KFold

    This property is read-only.

    Number of cross-validated folds, specified as a positive integer.

    Data Types: double

    ModelParameters

    This property is read-only.

    Cross-validation parameter values, specified as an object. The parameter values correspond to the values of the name-value arguments used to cross-validate the generalized additive model. ModelParameters does not contain estimated parameters.

    You can access the properties of ModelParameters using dot notation.
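    For example, assuming CVMdl is a cross-validated GAM, this line displays the stored parameter values:

    CVMdl.ModelParameters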

    Partition

    This property is read-only.

    Data partition indicating how the software splits the data into cross-validation folds, specified as a cvpartition model.

    Trained

    This property is read-only.

    Compact classifiers trained on cross-validation folds, specified as a cell array of CompactClassificationGAM model objects. Trained has k cells, where k is the number of folds.

    Data Types: cell
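    For example, this sketch extracts the first fold's compact model and uses it to classify new observations, where XNew is a hypothetical data set with the same predictors as the training data:

    Mdl1 = CVMdl.Trained{1};        % compact model trained with fold 1 held out
    labelNew = predict(Mdl1,XNew);  % classify new observations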

    Other Classification Properties

    CategoricalPredictors

    This property is read-only.

    Categorical predictor indices, specified as a vector of positive integers. CategoricalPredictors contains index values indicating that the corresponding predictors are categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If none of the predictors are categorical, then this property is empty ([]).

    Data Types: double

    ClassNames

    This property is read-only.

    Unique class labels used in training, specified as a categorical or character array, logical or numeric vector, or cell array of character vectors. ClassNames has the same data type as the class labels Y. (The software treats string arrays as cell arrays of character vectors.) ClassNames also determines the class order.

    Data Types: single | double | logical | char | cell | categorical

    Cost

    Misclassification costs, specified as a 2-by-2 numeric matrix.

    Cost(i,j) is the cost of classifying a point into class j if its true class is i. The order of the rows and columns of Cost corresponds to the order of the classes in ClassNames.

    The software uses the Cost value for prediction, but not training. You can change the value by using dot notation.

    Example: Mdl.Cost = C;

    Data Types: double
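    For instance, this hypothetical sketch makes misclassifying an observation whose true class is the first class in ClassNames twice as costly as the reverse error:

    CVMdl.Cost = [0 2; 1 0];  % Cost(1,2) = 2: true class 1 predicted as class 2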

    NumObservations

    This property is read-only.

    Number of observations in the training data stored in X and Y, specified as a numeric scalar.

    Data Types: double

    PredictorNames

    This property is read-only.

    Predictor variable names, specified as a cell array of character vectors. The order of the elements in PredictorNames corresponds to the order in which the predictor names appear in the training data.

    Data Types: cell

    Prior

    This property is read-only.

    Prior class probabilities, specified as a numeric vector with two elements. The order of the elements corresponds to the order of the elements in ClassNames.

    Data Types: double

    ResponseName

    This property is read-only.

    Response variable name, specified as a character vector.

    Data Types: char

    ScoreTransform

    Score transformation, specified as a character vector or function handle. ScoreTransform represents a built-in transformation function or a function handle for transforming predicted classification scores.

    To change the score transformation function, use dot notation. In the following, function is a placeholder for the transformation you choose.

    • For a built-in function, enter a character vector.

      Mdl.ScoreTransform = 'function';

      This table describes the available built-in functions.

      Value                   Description
      'doublelogit'           1/(1 + e^(–2x))
      'invlogit'              log(x / (1 – x))
      'ismax'                 Sets the score for the class with the largest score to 1, and sets the scores for all other classes to 0
      'logit'                 1/(1 + e^(–x))
      'none' or 'identity'    x (no transformation)
      'sign'                  –1 for x < 0, 0 for x = 0, 1 for x > 0
      'symmetric'             2x – 1
      'symmetricismax'        Sets the score for the class with the largest score to 1, and sets the scores for all other classes to –1
      'symmetriclogit'        2/(1 + e^(–x)) – 1

    • For a MATLAB® function or a function that you define, enter its function handle.

      Mdl.ScoreTransform = @function;

      function must accept a matrix (the original scores) and return a matrix of the same size (the transformed scores).

    This property determines the output score computation for object functions such as kfoldPredict, kfoldMargin, and kfoldEdge. Use 'logit' to compute posterior probabilities, and use 'none' to compute the logit of posterior probabilities.
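    For example, this sketch (assuming CVMdl is a cross-validated GAM) switches between the two:

    CVMdl.ScoreTransform = 'logit';         % kfold scores are posterior probabilities
    [~,posterior] = kfoldPredict(CVMdl);
    CVMdl.ScoreTransform = 'none';          % kfold scores are logits of posterior probabilities
    [~,logitScores] = kfoldPredict(CVMdl);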

    Data Types: char | function_handle

    W

    This property is read-only.

    Observation weights used to train the model, specified as an n-by-1 numeric vector. n is the number of observations (NumObservations).

    The software normalizes the observation weights specified in the 'Weights' name-value argument so that the elements of W within a particular class sum up to the prior probability of that class.

    Data Types: double
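    As an illustrative check (a sketch assuming the class labels in CVMdl.Y are stored as a cell array of character vectors), the weights in each class sum to that class's prior probability:

    for k = 1:numel(CVMdl.ClassNames)
        inClass = strcmp(CVMdl.Y,CVMdl.ClassNames{k});  % observations in class k
        fprintf('%s: %g (Prior: %g)\n',CVMdl.ClassNames{k}, ...
            sum(CVMdl.W(inClass)),CVMdl.Prior(k))
    end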

    X

    This property is read-only.

    Predictors used to cross-validate the model, specified as a numeric matrix or table.

    Each row of X corresponds to one observation, and each column corresponds to one variable.

    Data Types: single | double | table

    Y

    This property is read-only.

    Class labels used to cross-validate the model, specified as a categorical or character array, logical or numeric vector, or cell array of character vectors. Y has the same data type as the response variable used to train the model. (The software treats string arrays as cell arrays of character vectors.)

    Each row of Y represents the observed classification of the corresponding row of X.

    Data Types: single | double | logical | char | cell | categorical

    Object Functions

    kfoldPredict    Classify observations in cross-validated classification model
    kfoldLoss       Classification loss for cross-validated classification model
    kfoldMargin     Classification margins for cross-validated classification model
    kfoldEdge       Classification edge for cross-validated classification model
    kfoldfun        Cross-validate function for classification

    Examples


    Train a cross-validated GAM with 10 folds, which is the default cross-validation option, by using fitcgam. Then, use kfoldPredict to predict class labels for validation-fold observations using a model trained on training-fold observations.

    Load the ionosphere data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad ('b') or good ('g').

    load ionosphere

    Create a cross-validated GAM by using the default cross-validation option. Specify the 'CrossVal' name-value argument as 'on'.

    rng('default') % For reproducibility
    CVMdl = fitcgam(X,Y,'CrossVal','on')
    CVMdl = 
      ClassificationPartitionedGAM
        CrossValidatedModel: 'GAM'
             PredictorNames: {'x1'  'x2'  'x3'  'x4'  'x5'  'x6'  'x7'  'x8'  'x9'  'x10'  'x11'  'x12'  'x13'  'x14'  'x15'  'x16'  'x17'  'x18'  'x19'  'x20'  'x21'  'x22'  'x23'  'x24'  'x25'  'x26'  'x27'  'x28'  'x29'  'x30'  'x31'  'x32'  'x33'  'x34'}
               ResponseName: 'Y'
            NumObservations: 351
                      KFold: 10
                  Partition: [1x1 cvpartition]
          NumTrainedPerFold: [1x1 struct]
                 ClassNames: {'b'  'g'}
             ScoreTransform: 'logit'
    
    
    

    The fitcgam function creates a ClassificationPartitionedGAM model object CVMdl with 10 folds. During cross-validation, the software completes these steps:

    1. Randomly partition the data into 10 sets.

    2. For each set, reserve the set as validation data, and train the model using the other 9 sets.

    3. Store the 10 compact, trained models in a 10-by-1 cell vector in the Trained property of the cross-validated model object ClassificationPartitionedGAM.

    You can override the default cross-validation setting by using the 'CVPartition', 'Holdout', 'KFold', or 'Leaveout' name-value argument.
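    For example, this sketch uses five folds instead:

    CVMdl5 = fitcgam(X,Y,'KFold',5);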

    Classify the observations in X by using kfoldPredict. The function predicts class labels for every observation using the model trained without that observation.

    label = kfoldPredict(CVMdl);

    Create a confusion matrix to compare the true classes of the observations to their predicted labels.

    C = confusionchart(Y,label);

    Compute the classification error.

    L = kfoldLoss(CVMdl)
    L = 0.0712
    

    The average misclassification rate over 10 folds is about 7%.

    Train a GAM by using fitcgam, and create a cross-validated GAM by using crossval and the holdout option. Then, use kfoldPredict to predict responses for validation-fold observations using a model trained on training-fold observations.

    Load the 1994 census data stored in census1994.mat. The data set consists of demographic data from the US Census Bureau to predict whether an individual makes over $50,000 per year. The classification task is to fit a model that predicts the salary category of people given their age, working class, education level, marital status, race, and so on.

    load census1994

    census1994 contains the training data set adultdata and the test data set adulttest. To reduce the running time for this example, subsample 500 training observations from adultdata by using the datasample function.

    rng('default')
    NumSamples = 5e2;
    adultdata = datasample(adultdata,NumSamples,'Replace',false);

    Train a GAM that contains both linear and interaction terms for predictors. Specify to include all available interaction terms whose p-values are not greater than 0.05.

    Mdl = fitcgam(adultdata,'salary','Interactions','all','MaxPValue',0.05);

    Mdl is a ClassificationGAM model object.

    Cross-validate the model by specifying a 30% holdout sample.

    CVMdl = crossval(Mdl,'Holdout',0.3)
    CVMdl = 
      ClassificationPartitionedGAM
          CrossValidatedModel: 'GAM'
               PredictorNames: {'age'  'workClass'  'fnlwgt'  'education'  'education_num'  'marital_status'  'occupation'  'relationship'  'race'  'sex'  'capital_gain'  'capital_loss'  'hours_per_week'  'native_country'}
        CategoricalPredictors: [2 4 6 7 8 9 10 14]
                 ResponseName: 'salary'
              NumObservations: 500
                        KFold: 1
                    Partition: [1x1 cvpartition]
            NumTrainedPerFold: [1x1 struct]
                   ClassNames: [<=50K    >50K]
               ScoreTransform: 'logit'
    
    
    

    The crossval function creates a ClassificationPartitionedGAM model object CVMdl with the holdout option. During cross-validation, the software completes these steps:

    1. Randomly select and reserve 30% of the data as validation data, and train the model using the rest of the data.

    2. Store the compact, trained model in the Trained property of the cross-validated model object ClassificationPartitionedGAM.

    You can choose a different cross-validation setting by using the 'CrossVal', 'CVPartition', 'KFold', or 'Leaveout' name-value argument.

    Classify the validation-fold observations by using kfoldPredict. The function predicts class labels for the validation-fold observations by using the model trained on the training-fold observations. The function assigns the most frequently predicted label to the training-fold observations.

    [labels,scores] = kfoldPredict(CVMdl);

    Find the validation-fold observations. kfoldPredict returns 0 scores for both classes for the training-fold observations. Therefore, you can identify the validation-fold observations by finding the observations whose scores are all zeros.

    idx = find(sum(abs(scores),2)~=0);

    Create a confusion matrix to compare the true classes of the observations to their predicted labels, and compute the classification error for the validation-fold observations.

    C = confusionchart(adultdata.salary(idx),labels(idx));

    L = kfoldLoss(CVMdl)
    L = 0.1800
    

    Train a cross-validated generalized additive model (GAM) with 10 folds. Then, use kfoldLoss to compute cumulative cross-validation classification errors (misclassification rate in decimal). Use the errors to determine the optimal number of trees per predictor (linear term for predictor) and the optimal number of trees per interaction term.

    Alternatively, you can find optimal values of fitcgam name-value arguments by using the OptimizeHyperparameters name-value argument. For an example, see Optimize GAM Using OptimizeHyperparameters.

    Load the ionosphere data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad ('b') or good ('g').

    load ionosphere

    Create a cross-validated GAM by using the default cross-validation option. Specify the 'CrossVal' name-value argument as 'on'. Specify to include all available interaction terms whose p-values are not greater than 0.05.

    rng('default') % For reproducibility
    CVMdl = fitcgam(X,Y,'CrossVal','on','Interactions','all','MaxPValue',0.05);

    If you specify 'Mode' as 'cumulative' for kfoldLoss, then the function returns cumulative errors, which are the average errors across all folds obtained using the same number of trees for each fold. Display the number of trees for each fold.

    CVMdl.NumTrainedPerFold 
    ans = struct with fields:
          PredictorTrees: [65 64 59 61 60 66 65 62 64 61]
        InteractionTrees: [1 2 2 2 2 1 2 2 2 2]
    
    

    kfoldLoss can compute cumulative errors using up to 59 predictor trees and one interaction tree. These are the smallest per-fold counts, because the cumulative error for a given number of trees requires every fold to contain at least that many trees.

    Plot the cumulative, 10-fold cross-validated, classification error (misclassification rate in decimal). Specify 'IncludeInteractions' as false to exclude interaction terms from the computation.

    L_noInteractions = kfoldLoss(CVMdl,'Mode','cumulative','IncludeInteractions',false);
    figure
    plot(0:min(CVMdl.NumTrainedPerFold.PredictorTrees),L_noInteractions)

    The first element of L_noInteractions is the average error over all folds obtained using only the intercept (constant) term. The (J+1)th element of L_noInteractions is the average error obtained using the intercept term and the first J predictor trees per linear term. Plotting the cumulative loss allows you to monitor how the error changes as the number of predictor trees in the GAM increases.

    Find the minimum error and the number of predictor trees used to achieve the minimum error.

    [M,I] = min(L_noInteractions)
    M = 0.0655
    
    I = 23
    

    The GAM achieves the minimum error when it includes 22 predictor trees.

    Compute the cumulative classification error using both linear terms and interaction terms.

    L = kfoldLoss(CVMdl,'Mode','cumulative')
    L = 2×1
    
        0.0712
        0.0712
    
    

    The first element of L is the average error over all folds obtained using the intercept (constant) term and all predictor trees per linear term. The second element of L is the average error obtained using the intercept term, all predictor trees per linear term, and one interaction tree per interaction term. The error does not decrease when interaction terms are added.

    If you are satisfied with the error when the number of predictor trees is 22, you can create a predictive model by training the univariate GAM again and specifying 'NumTreesPerPredictor',22 without cross-validation.
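    A sketch of that final training step, using the same predictor data:

    MdlFinal = fitcgam(X,Y,'NumTreesPerPredictor',22);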

    Version History

    Introduced in R2021a