
crossval

Loss estimate using cross-validation

Syntax

```
vals = crossval(fun,X)
vals = crossval(fun,X,Y,...)
mse = crossval('mse',X,y,'Predfun',predfun)
mcr = crossval('mcr',X,y,'Predfun',predfun)
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)
vals = crossval(...,'name',value)
```

Description

`vals = crossval(fun,X)` performs 10-fold cross-validation for the function `fun`, applied to the data in `X`.

`fun` is a function handle to a function with two inputs, the training subset of `X`, `XTRAIN`, and the test subset of `X`, `XTEST`, as follows:

`testval = fun(XTRAIN,XTEST)`

Each time it is called, `fun` should use `XTRAIN` to fit a model, then return some criterion `testval` computed on `XTEST` using that fitted model.

`X` can be a column vector or a matrix. Rows of `X` correspond to observations; columns correspond to variables or features. Each row of `vals` contains the result of applying `fun` to one test set. If `testval` is not a scalar, `crossval` converts it to a row vector using linear indexing and stores it in one row of `vals`.
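As a minimal sketch of this form, `fun` can ignore the training subset and simply return a statistic of each test subset (the data and function handle here are illustrative, not part of the `crossval` interface):

```
% Hypothetical example: 10-fold cross-validation of column means.
% fun receives a training and a test subset of X; here the training
% subset is unused and the test-set column means are returned.
X = randn(100,3);                      % 100 observations, 3 variables
fun = @(XTRAIN,XTEST) mean(XTEST,1);   % testval is a 1-by-3 row vector
vals = crossval(fun,X);                % vals is 10-by-3, one row per fold
```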

`vals = crossval(fun,X,Y,...)` is used when data are stored in separate variables `X`, `Y`, ... . All variables (column vectors, matrices, or arrays) must have the same number of rows. `fun` is called with the training subsets of `X`, `Y`, ... , followed by the test subsets of `X`, `Y`, ... , as follows:

`testvals = fun(XTRAIN,YTRAIN,...,XTEST,YTEST,...)`

`mse = crossval('mse',X,y,'Predfun',predfun)` returns `mse`, a scalar containing a 10-fold cross-validation estimate of mean-squared error for the function `predfun`. `X` can be a column vector, matrix, or array of predictors. `y` is a column vector of response values. `X` and `y` must have the same number of rows.

`predfun` is a function handle called with the training subset of `X`, the training subset of `y`, and the test subset of `X` as follows:

`yfit = predfun(XTRAIN,ytrain,XTEST)`

Each time it is called, `predfun` should use `XTRAIN` and `ytrain` to fit a regression model and then return fitted values in a column vector `yfit`. Each row of `yfit` contains the predicted values for the corresponding row of `XTEST`. `crossval` computes the squared errors between `yfit` and the corresponding response test set, and returns the overall mean across all test sets.

`mcr = crossval('mcr',X,y,'Predfun',predfun)` returns `mcr`, a scalar containing a 10-fold cross-validation estimate of the misclassification rate (the proportion of misclassified samples) for the function `predfun`. The matrix `X` contains predictor values and the vector `y` contains class labels. `predfun` should use `XTRAIN` and `ytrain` to fit a classification model and return `yfit` as the predicted class labels for `XTEST`. `crossval` counts the misclassifications between `yfit` and the corresponding response test set, and returns the overall misclassification rate across all test sets.

`val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)`, where `criterion` is `'mse'` or `'mcr'`, returns a cross-validation estimate of mean-squared error (for a regression model) or misclassification rate (for a classification model) with predictor values in `X1`, `X2`, ... and, respectively, response values or class labels in `y`. `X1`, `X2`, ... and `y` must have the same number of rows. `predfun` is a function handle called with the training subsets of `X1`, `X2`, ..., the training subset of `y`, and the test subsets of `X1`, `X2`, ..., as follows:

`yfit = predfun(X1TRAIN,X2TRAIN,...,ytrain,X1TEST,X2TEST,...)`

`yfit` should be a column vector containing the fitted values.
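As an illustrative sketch of this multi-predictor form (variable names and the split of the Fisher iris measurements into `X1` and `X2` are chosen for illustration; `regress` is assumed available from the same toolbox):

```
% Hypothetical sketch: 'mse' with predictors split across X1 and X2.
load('fisheriris');
y  = meas(:,1);
X1 = meas(:,2);
X2 = meas(:,3:4);
% Fit a linear regression on the training subsets, predict on the test subsets.
predfun = @(X1TRAIN,X2TRAIN,ytrain,X1TEST,X2TEST) ...
    [ones(size(X1TEST,1),1),X1TEST,X2TEST] * ...
    regress(ytrain,[ones(size(X1TRAIN,1),1),X1TRAIN,X2TRAIN]);
mse = crossval('mse',X1,X2,y,'Predfun',predfun);
```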

`vals = crossval(...,'name',value)` specifies one or more optional parameter name/value pairs from the following table. Specify `name` inside single quotes.

`holdout`

A scalar specifying the ratio or the number of observations `p` for holdout cross-validation. When `0 < p < 1`, approximately `p*n` observations are randomly selected for the test set, where `n` is the number of observations. When `p` is an integer, `p` observations are randomly selected for the test set.

`kfold`

An integer greater than 1 specifying the number of folds `k` for `k`-fold cross-validation.

`leaveout`

Specifies leave-one-out cross-validation. The value must be `1`.

`mcreps`

A positive integer specifying the number of Monte-Carlo repetitions for validation. If the first input of `crossval` is `'mse'` or `'mcr'`, `crossval` returns the mean of the mean-squared errors or misclassification rates across all of the Monte-Carlo repetitions. Otherwise, `crossval` concatenates the values `vals` from all of the Monte-Carlo repetitions along the first dimension.

`partition`

An object `c` of the `cvpartition` class, specifying the cross-validation type and partition.

`stratify`

A column vector `group` specifying groups for stratification. Both training and test sets have roughly the same class proportions as in `group`. `NaN`s, empty character vectors, empty strings, `<missing>` values, and `<undefined>` values in `group` are treated as missing data values, and the corresponding rows of the data are ignored.

`options`

A structure that specifies whether to run in parallel, and specifies the random stream or streams. Create the `options` structure with `statset`. Option fields:

• `UseParallel` — Set to `true` to compute in parallel. Default is `false`.

You need Parallel Computing Toolbox™ for parallel computation.

• `UseSubstreams` — Set to `true` to compute in parallel in a reproducible fashion. Default is `false`. To compute reproducibly, set `Streams` to a type allowing substreams: `'mlfg6331_64'` or `'mrg32k3a'`.

• `Streams` — A `RandStream` object or cell array consisting of one such object. If you do not specify `Streams`, `crossval` uses the default stream.

Only one of `kfold`, `holdout`, `leaveout`, or `partition` can be specified, and `partition` cannot be specified with `stratify`. If both `partition` and `mcreps` are specified, the first Monte-Carlo repetition uses the partition information in the `cvpartition` object, and the `repartition` method is called to generate new partitions for each of the remaining repetitions. If no cross-validation type is specified, the default is 10-fold cross-validation.
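As a sketch of the name/value options, the following requests repeated holdout validation (the 20% holdout ratio and 5 repetitions are chosen for illustration):

```
% Hypothetical sketch: hold out 20% of the observations for testing,
% repeated over 5 Monte-Carlo repetitions. The returned mse is the
% mean of the 5 holdout mean-squared errors.
load('fisheriris');
y = meas(:,1);
X = [ones(size(y,1),1),meas(:,2:4)];
regf = @(XTRAIN,ytrain,XTEST)(XTEST*regress(ytrain,XTRAIN));
mse = crossval('mse',X,y,'predfun',regf,'holdout',0.2,'mcreps',5);
```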

Note

When using cross-validation with classification algorithms, stratification is preferred. Otherwise, some test sets may not include observations from all classes.

Examples

Example 1

Compute mean-squared error for regression using 10-fold cross-validation:

```
load('fisheriris');
y = meas(:,1);
X = [ones(size(y,1),1),meas(:,2:4)];
regf = @(XTRAIN,ytrain,XTEST)(XTEST*regress(ytrain,XTRAIN));
cvMse = crossval('mse',X,y,'predfun',regf)

cvMse =

    0.1015
```

Example 2

Compute misclassification rate using stratified 10-fold cross-validation:

```
load('fisheriris');
y = species;
X = meas;
cp = cvpartition(y,'k',10);   % Stratified cross-validation
classf = @(XTRAIN,ytrain,XTEST)(classify(XTEST,XTRAIN,ytrain));
cvMCR = crossval('mcr',X,y,'predfun',classf,'partition',cp)

cvMCR =

    0.0200
```

Example 3

Compute the confusion matrix using stratified 10-fold cross-validation:

```
load('fisheriris');
y = species;
X = meas;
order = unique(y);            % Order of the group labels
cp = cvpartition(y,'k',10);   % Stratified cross-validation
f = @(xtr,ytr,xte,yte)confusionmat(yte,classify(xte,xtr,ytr),'order',order);
cfMat = crossval(f,X,y,'partition',cp);
cfMat = reshape(sum(cfMat),3,3)

cfMat =

    50     0     0
     0    48     2
     0     1    49
```

`cfMat` is the summation of 10 confusion matrices from 10 test sets.
