resubPredict

Predict resubstitution labels of k-nearest neighbor classifier

Syntax

`label = resubPredict(mdl)`

`[label,score] = resubPredict(mdl)`

`[label,score,cost] = resubPredict(mdl)`

Description

`label = resubPredict(mdl)` returns the labels that `mdl` predicts for the training data `mdl.X`. The output `label` contains the predictions of `mdl` on the data used by `fitcknn` to create `mdl`.

`[label,score] = resubPredict(mdl)` also returns the posterior class probabilities for the predictions.

`[label,score,cost] = resubPredict(mdl)` also returns the expected misclassification costs.

Examples

Examine the quality of a classifier by its resubstitution predictions.

Load the Fisher iris data set.

```
load fisheriris
X = meas;
Y = species;
```

Create a classifier for five nearest neighbors.

`mdl = fitcknn(X,Y,'NumNeighbors',5);`

Generate the resubstitution predictions.

`label = resubPredict(mdl);`

Calculate the number of differences between the predictions `label` and the original data `Y`.

```
mydiff = not(strcmp(Y,label)); % mydiff(i) = 1 means they differ
sum(mydiff) % Number of differences
```

```
ans = 5
```

A value of `1` in `mydiff` indicates that the observed label differs from the corresponding predicted label. This example has five misclassifications.
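The same comparison can be turned into a resubstitution error rate and a per-class breakdown. The following is a minimal sketch, assuming the Statistics and Machine Learning Toolbox function `confusionmat` is available; the variable names `errRate`, `C`, and `order` are illustrative and not part of this page.

```matlab
load fisheriris
X = meas;
Y = species;
mdl = fitcknn(X,Y,'NumNeighbors',5);
label = resubPredict(mdl);

% Fraction of training observations whose predicted label differs from Y
errRate = mean(~strcmp(Y,label));

% Confusion matrix: rows are true classes, columns are predicted classes;
% order lists the class that each row and column refers to
[C,order] = confusionmat(Y,label);
```

The off-diagonal entries of `C` count the misclassifications by class pair. Because resubstitution reuses the training data, `errRate` is generally an optimistic estimate of the error on new data.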

Input Arguments

`mdl`: k-nearest neighbor classifier model, specified as a `ClassificationKNN` object.

Output Arguments

`label`: Predicted class labels for the observations (rows) in the training data `X`, returned as a categorical array, character array, logical vector, numeric vector, or cell array of character vectors. `label` has length equal to the number of rows in `X`. The label is the class with minimal expected cost. See Predicted Class Label.

`score`: Predicted class scores or posterior probabilities, returned as a numeric matrix of size n-by-K. n is the number of observations (rows) in the training data `X`, and K is the number of classes (in `mdl.ClassNames`). `score(i,j)` is the posterior probability that observation `i` in `X` is of class `j` in `mdl.ClassNames`. See Posterior Probability.

`cost`: Expected classification costs, returned as a numeric matrix of size n-by-K. n is the number of observations (rows) in the training data `X`, and K is the number of classes (in `mdl.ClassNames`). `cost(i,j)` is the cost of classifying row `i` of `X` as class `j` in `mdl.ClassNames`. See Expected Cost.

Tips

• If you standardize the predictor data, that is, `mdl.Mu` and `mdl.Sigma` are not empty (`[]`), then `resubPredict` standardizes the predictor data before predicting labels.
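As an illustration of the tip above, the sketch below trains one model without standardization and one with the `'Standardize'` name-value argument of `fitcknn`, then checks whether `Mu` is empty in each. The variable names `mdlRaw`, `mdlStd`, and `labelStd` are illustrative assumptions.

```matlab
load fisheriris

% Model trained on the raw predictors
mdlRaw = fitcknn(meas,species,'NumNeighbors',5);

% Model trained with standardized predictors
mdlStd = fitcknn(meas,species,'NumNeighbors',5,'Standardize',true);

% Mu and Sigma are expected to be empty only for the unstandardized model
rawHasMu = ~isempty(mdlRaw.Mu);
stdHasMu = ~isempty(mdlStd.Mu);

% resubPredict takes the stored training data; no manual scaling is needed
labelStd = resubPredict(mdlStd);
```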

Algorithms

Predicted Class Label

`resubPredict` classifies by minimizing the expected classification cost:

`$\hat{y}=\underset{y=1,\ldots,K}{\arg\min}\sum_{j=1}^{K}\hat{P}\left(j|x\right)C\left(y|j\right),$`

where

• $\hat{y}$ is the predicted classification.

• K is the number of classes.

• $\hat{P}\left(j|x\right)$ is the posterior probability of class j for observation x.

• $C\left(y|j\right)$ is the cost of classifying an observation as y when its true class is j.
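The decision rule can be traced by hand for a single observation. The sketch below uses hypothetical posterior probabilities `P` and a hypothetical cost matrix `C` (neither comes from a fitted model) to show that the class with minimal expected cost can differ from the class with the largest posterior probability.

```matlab
% Hypothetical posterior probabilities for one observation, K = 3 classes
P = [0.5 0.3 0.2];

% Hypothetical cost matrix: C(i,j) is the cost of predicting class j
% when the true class is i (the convention of the 'Cost' argument)
C = [ 0 1 1;
      1 0 1;
     10 1 0];   % predicting class 1 for a true class-3 point is expensive

expCost = P * C;           % expected cost of predicting each class
[~,yhat] = min(expCost);   % class with minimal expected cost
[~,ymap] = max(P);         % class with maximal posterior, for comparison
```

Here `expCost` is `[2.3 0.7 0.8]`, so `yhat` is 2 even though class 1 has the largest posterior probability (`ymap` is 1).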

Posterior Probability

Consider a vector (single query point) `xnew` and a model `mdl`.

• k is the number of nearest neighbors used in prediction, `mdl.NumNeighbors`.

• `nbd(mdl,xnew)` specifies the k nearest neighbors to `xnew` in `mdl.X`.

• `Y(nbd)` specifies the classifications of the points in `nbd(mdl,xnew)`, namely `mdl.Y(nbd)`.

• `W(nbd)` specifies the weights of the points in `nbd(mdl,xnew)`.

• `prior` specifies the priors of the classes in `mdl.Y`.

If the model contains a vector of prior probabilities, then the observation weights `W` are normalized by class to sum to the priors. This process might involve a calculation for the point `xnew`, because weights can depend on the distance from `xnew` to the points in `mdl.X`.

The posterior probability p(j|`xnew`) is

`$p\left(j|\text{xnew}\right)=\frac{\sum_{i\in \text{nbd}}W\left(i\right){1}_{Y\left(X\left(i\right)\right)=j}}{\sum_{i\in \text{nbd}}W\left(i\right)}.$`

Here, ${1}_{Y\left(X\left(i\right)\right)=j}$ is `1` when `mdl.Y(i) = j`, and `0` otherwise.
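The weighted vote in this formula can be sketched for one query point. The neighbor classes `nbdClass` and weights `W` below are hypothetical values chosen for illustration, not the output of a neighbor search.

```matlab
K = 3;                    % number of classes
nbdClass = [1 1 2 3 1];   % hypothetical class indices of the k = 5 neighbors
W = [1 1 2 1 1];          % hypothetical weights of those neighbors

% Sum the weights of the neighbors in each class, then normalize
classWeight = accumarray(nbdClass(:),W(:),[K 1]).';
posterior = classWeight / sum(W);
```

With these values, `classWeight` is `[3 2 1]` and `posterior` is `[3 2 1]/6`. When all weights are equal, the posterior probability reduces to the fraction of the k neighbors that belong to each class.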

True Misclassification Cost

Two costs are associated with KNN classification: the true misclassification cost per class and the expected misclassification cost per observation.

You can set the true misclassification cost per class by using the `'Cost'` name-value pair argument when you run `fitcknn`. The value `Cost(i,j)` is the cost of classifying an observation into class `j` if its true class is `i`. By default, `Cost(i,j) = 1` if `i ~= j`, and `Cost(i,j) = 0` if `i = j`. In other words, the cost is `0` for correct classification and `1` for incorrect classification.
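As a small illustration, the default cost matrix described above can be built directly, and a custom matrix can be derived from it. The number of classes and the penalty value are hypothetical.

```matlab
K = 3;                            % number of classes (illustrative)
defaultCost = ones(K) - eye(K);   % 0 on the diagonal, 1 elsewhere

% Hypothetical custom cost: classifying a true class-3 observation
% as class 1 costs five times as much as any other error
customCost = defaultCost;
customCost(3,1) = 5;
```

A matrix such as `customCost` could then be supplied through the `'Cost'` name-value pair argument, for example `fitcknn(X,Y,'Cost',customCost)`, with rows and columns following the class order of the model.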

Expected Cost

The third output of `resubPredict` is the expected misclassification cost per observation, which is distinct from the true misclassification cost per class that you set with `'Cost'`.

Suppose you have `Nobs` observations that you classified with a trained classifier `mdl`, and you have `K` classes. The command

`[label,score,cost] = resubPredict(mdl)`

returns a matrix `cost` of size `Nobs`-by-`K`, among other outputs. Each row of the `cost` matrix contains the expected (average) cost of classifying the observation into each of the `K` classes. `cost(n,j)` is

`$\sum_{i=1}^{K}\hat{P}\left(i|X\left(n\right)\right)C\left(j|i\right),$`

where

• K is the number of classes.

• $\hat{P}\left(i|X\left(n\right)\right)$ is the posterior probability of class i for observation X(n).

• $C\left(j|i\right)$ is the true misclassification cost of classifying an observation as j when its true class is i.
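In matrix form, the sum above is the product of the posterior matrix and the cost matrix. The following sketch checks this on the Fisher iris data, assuming that the `Cost` property of the model stores the true misclassification costs with rows as true classes; `costCheck` and `maxDiff` are illustrative names.

```matlab
load fisheriris
mdl = fitcknn(meas,species,'NumNeighbors',5);
[~,score,cost] = resubPredict(mdl);

% Expected cost rebuilt from the posteriors and the model's cost matrix
costCheck = score * mdl.Cost;

% Largest absolute discrepancy; near zero if the formula above applies
maxDiff = max(abs(cost(:) - costCheck(:)));
```

With the default cost matrix, each entry of `costCheck` equals one minus the corresponding posterior probability, so the class with minimal expected cost is also the class with maximal posterior probability.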