# evalclusters

Evaluate clustering solutions

## Syntax

``eva = evalclusters(x,clust,criterion)``
``eva = evalclusters(x,clust,criterion,Name,Value)``

## Description


`eva = evalclusters(x,clust,criterion)` creates a clustering evaluation object containing data used to evaluate the optimal number of data clusters.

`eva = evalclusters(x,clust,criterion,Name,Value)` creates a clustering evaluation object using additional options specified by one or more name-value pair arguments.

## Examples


Evaluate the optimal number of clusters using the Calinski-Harabasz clustering evaluation criterion.

`load fisheriris`

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion. Cluster the data using `kmeans`.

```
rng('default') % For reproducibility
eva = evalclusters(meas,'kmeans','CalinskiHarabasz','KList',1:6)
```

```
eva = 
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3
```

The `OptimalK` value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

Use an input matrix of proposed clustering solutions to evaluate the optimal number of clusters.

`load fisheriris;`

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Use `kmeans` to create an input matrix of proposed clustering solutions for the measurements in `meas`, using 1 through 6 clusters.

```
clust = zeros(size(meas,1),6);
for i = 1:6
    clust(:,i) = kmeans(meas,i,'EmptyAction','singleton', ...
        'Replicates',5);
end
```

Each row of `clust` corresponds to one observation (one row of `meas`). Each of the six columns corresponds to a clustering solution containing 1 to 6 clusters.

Evaluate the optimal number of clusters using the Calinski-Harabasz criterion.

`eva = evalclusters(meas,clust,'CalinskiHarabasz')`
```
eva = 
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3
```

The `OptimalK` value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

Use a function handle to specify the clustering algorithm, then evaluate the optimal number of clusters.

`load fisheriris;`

The data contains length and width measurements from the sepals and petals of three species of iris flowers.

Use a function handle to specify the clustering algorithm.

```
myfunc = @(X,K)(kmeans(X,K,'EmptyAction','singleton', ...
    'Replicates',5));
```

Evaluate the optimal number of clusters for the data in `meas` using the Calinski-Harabasz criterion.

```
eva = evalclusters(meas,myfunc,'CalinskiHarabasz', ...
    'KList',1:6)
```

```
eva = 
  CalinskiHarabaszEvaluation with properties:

    NumObservations: 150
         InspectedK: [1 2 3 4 5 6]
    CriterionValues: [NaN 513.9245 561.6278 530.4871 456.1279 469.5068]
           OptimalK: 3
```

The `OptimalK` value indicates that, based on the Calinski-Harabasz criterion, the optimal number of clusters is three.

## Input Arguments


Input data `x`, specified as an N-by-P matrix. N is the number of observations, and P is the number of variables.

Data Types: `single` | `double`

Clustering algorithm `clust`, specified as one of the following.

| Value | Description |
| --- | --- |
| `'kmeans'` | Cluster the data in `x` using the `kmeans` clustering algorithm, with `'EmptyAction'` set to `'singleton'` and `'Replicates'` set to `5`. |
| `'linkage'` | Cluster the data in `x` using the `clusterdata` agglomerative clustering algorithm, with `'Linkage'` set to `'ward'`. |
| `'gmdistribution'` | Cluster the data in `x` using the `gmdistribution` Gaussian mixture distribution algorithm, with `'SharedCov'` set to `true` and `'Replicates'` set to `5`. |

If `criterion` is `'CalinskiHarabasz'`, `'DaviesBouldin'`, or `'silhouette'`, you can specify a clustering algorithm using a function handle. The function must be of the form `C = clustfun(DATA,K)`, where `DATA` is the data to be clustered, and `K` is the number of clusters. The output of `clustfun` must be one of the following:

• A vector of integers representing the cluster index for each observation in `DATA`. There must be `K` unique values in this vector.

• A numeric N-by-K matrix of scores for N observations and K classes. In this case, the cluster index for each observation is the column holding the largest score value in that observation's row.

If `criterion` is `'CalinskiHarabasz'`, `'DaviesBouldin'`, or `'silhouette'`, you can also specify `clust` as an N-by-K matrix containing the proposed clustering solutions. N is the number of observations in the sample data, and K is the number of proposed clustering solutions. Column j contains the cluster indices for each of the N points in the jth clustering solution.

Data Types: `single` | `double` | `char` | `string` | `function_handle`
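As a hedged illustration of the function-handle and score-matrix forms described above, the sketch below builds a handle from `linkage` and `cluster` (the name `wardfun` is hypothetical) and, separately, shows how a score matrix maps to cluster indices through a row-wise maximum. The `scores` matrix is made-up data, and the sketch assumes the `fisheriris` sample data is available.

```matlab
load fisheriris

% Hypothetical handle of the form C = clustfun(DATA,K):
% Ward-linkage hierarchical clustering, cut into at most K clusters.
wardfun = @(X,K) cluster(linkage(X,'ward'),'maxclust',K);
eva = evalclusters(meas,wardfun,'CalinskiHarabasz','KList',2:6);

% Score-matrix form: the cluster index of each observation is the
% column holding the largest score in that row (made-up scores).
scores = [0.1 0.7 0.2; 0.8 0.1 0.1; 0.2 0.3 0.5];
[~,idx] = max(scores,[],2);   % idx is [2; 1; 3]
```

A handle like this keeps the clustering settings in one place, so the same settings are applied for every value in `KList`.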

Clustering evaluation criterion `criterion`, specified as one of the following.

| Value | Description |
| --- | --- |
| `'CalinskiHarabasz'` | Create a `CalinskiHarabaszEvaluation` clustering evaluation object containing Calinski-Harabasz index values. For more information, see Calinski-Harabasz Criterion. |
| `'DaviesBouldin'` | Create a `DaviesBouldinEvaluation` cluster evaluation object containing Davies-Bouldin index values. For more information, see Davies-Bouldin Criterion. |
| `'gap'` | Create a `GapEvaluation` cluster evaluation object containing gap criterion values. For more information, see Gap Value. |
| `'silhouette'` | Create a `SilhouetteEvaluation` cluster evaluation object containing silhouette values. For more information, see Silhouette Value and Criterion. |

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `'KList',[1:5],'Distance','cityblock'` specifies to test 1, 2, 3, 4, and 5 clusters using the city block distance metric.
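As a small sketch of the two syntaxes, assuming the `fisheriris` sample data used in the examples above, the following calls are expected to request the same evaluation. The first uses the `Name=Value` form (R2021a or later), and the second uses the quoted, comma-separated form. The variable names `evaNew` and `evaOld` are illustrative.

```matlab
load fisheriris

rng('default') % Same seed before each call so kmeans starts identically
evaNew = evalclusters(meas,'kmeans','silhouette', ...
    KList=1:5,Distance='cityblock');

rng('default')
evaOld = evalclusters(meas,'kmeans','silhouette', ...
    'KList',1:5,'Distance','cityblock');
```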

For All Criteria


List of number of clusters to evaluate, specified as the comma-separated pair consisting of `'KList'` and a vector of positive integer values. You must specify `KList` when `clust` is a clustering algorithm name or a function handle. When `criterion` is `'gap'`, `clust` must be a character vector, a string scalar, or a function handle, and you must specify `KList`.

Example: `'KList',[1:6]`

Data Types: `single` | `double`

For Silhouette and Gap


Distance metric used for computing the criterion values, specified as the comma-separated pair consisting of `'Distance'` and one of the following.

| Value | Description |
| --- | --- |
| `'sqEuclidean'` | Squared Euclidean distance |
| `'Euclidean'` | Euclidean distance. This option is not valid for the `kmeans` clustering algorithm. |
| `'cityblock'` | Sum of absolute differences |
| `'cosine'` | One minus the cosine of the included angle between points (treated as vectors) |
| `'correlation'` | One minus the sample correlation between points (treated as sequences of values) |
| `'Hamming'` | Percentage of coordinates that differ. This option is only valid for the `'silhouette'` criterion. |
| `'Jaccard'` | Percentage of nonzero coordinates that differ. This option is only valid for the `'silhouette'` criterion. |

For detailed information about each distance metric, see `pdist`.

You can also specify a function for the distance metric using a function handle. The distance function must be of the form `d2 = distfun(XI,XJ)`, where `XI` is a 1-by-n vector corresponding to a single row of the input matrix `X`, and `XJ` is an m2-by-n matrix corresponding to multiple rows of `X`. `distfun` must return an m2-by-1 vector of distances `d2`, whose kth element is the distance between `XI` and `XJ(k,:)`.

`Distance` accepts a function handle only if the clustering algorithm `clust` accepts a function handle as the distance metric. For example, the `kmeans` clustering algorithm does not accept a function handle as the distance metric. Therefore, if you use the `kmeans` algorithm and specify a function handle for `Distance`, the software returns an error.

• If `criterion` is `'silhouette'`, you can also specify `Distance` as the output vector created by the function `pdist`.

• When `clust` is `'kmeans'` or `'gmdistribution'`, `evalclusters` uses the distance metric specified for `Distance` to cluster the data.

• If `clust` is `'linkage'`, and `Distance` is either `'sqEuclidean'` or `'Euclidean'`, then the clustering algorithm uses the Euclidean distance and Ward linkage.

• If `clust` is `'linkage'` and `Distance` is any other metric, then the clustering algorithm uses the specified distance metric and average linkage.

• In all other cases, the distance metric specified for `Distance` must match the distance metric used in the clustering algorithm to obtain meaningful results.

Example: `'Distance','Euclidean'`

Data Types: `single` | `double` | `char` | `string` | `function_handle`
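To make the required shape of a custom distance function concrete, here is a minimal sketch that writes the city block metric in the `d2 = distfun(XI,XJ)` form and checks it against the built-in metric using `pdist`. The handle name `distfun` and the matrix `X` are illustrative, not part of the documented interface.

```matlab
% Hypothetical distance function in the form d2 = distfun(XI,XJ).
% XI is 1-by-n, XJ is m2-by-n, and the result is m2-by-1.
distfun = @(XI,XJ) sum(abs(bsxfun(@minus,XJ,XI)),2);

X = [0 0; 1 2; 3 1];               % Made-up data: 3 points in 2-D
dCustom  = pdist(X,distfun);       % Pairs (1,2), (1,3), (2,3)
dBuiltin = pdist(X,'cityblock');   % Both return [3 4 3]
```

Checking a handle against a built-in metric in this way is a cheap test of the input and output shapes before passing it as `Distance`.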

For Silhouette Only


Prior probabilities for each cluster, specified as the comma-separated pair consisting of `'ClusterPriors'` and one of the following.

| Value | Description |
| --- | --- |
| `'empirical'` | Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points. Each cluster contributes to the overall silhouette value proportionally to its size. |
| `'equal'` | Compute the overall silhouette value for the clustering solution by averaging the silhouette values for all points within each cluster, and then averaging those values across all clusters. Each cluster contributes equally to the overall silhouette value, regardless of its size. |

Example: `'ClusterPriors','empirical'`
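The difference between the two averaging schemes can be sketched directly from per-point silhouette values. The snippet below is an illustration of the arithmetic described above, not the internal implementation; it assumes the `fisheriris` data and a three-cluster `kmeans` solution, and the variable names are hypothetical.

```matlab
load fisheriris
rng('default') % For reproducibility
idx = kmeans(meas,3,'Replicates',5);   % Cluster index per observation
s = silhouette(meas,idx);              % Per-point silhouette values

% 'empirical': average over all points, so larger clusters weigh more.
empiricalValue = mean(s);

% 'equal': average within each cluster, then across clusters.
clusterMeans = accumarray(idx,s,[],@mean);
equalValue = mean(clusterMeans);
```

The two values coincide when all clusters have the same size and diverge as the cluster sizes become unbalanced.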

For Gap Only


Number of reference data sets generated from the reference distribution `ReferenceDistribution`, specified as the comma-separated pair consisting of `'B'` and a positive integer value.

Example: `'B',150`

Data Types: `single` | `double`

Reference data generation method, specified as the comma-separated pair consisting of `'ReferenceDistribution'` and one of the following.

| Value | Description |
| --- | --- |
| `'PCA'` | Generate reference data from a uniform distribution over a box aligned with the principal components of the data matrix `x`. |
| `'uniform'` | Generate reference data uniformly over the range of each feature in the data matrix `x`. |

Example: `'ReferenceDistribution','uniform'`
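A short sketch of a gap evaluation that sets both the number of reference data sets and the reference distribution, assuming the `fisheriris` sample data. The value `50` for `'B'` is an arbitrary choice made to keep the run time short.

```matlab
load fisheriris
rng('default') % For reproducibility of kmeans and the reference data

eva = evalclusters(meas,'kmeans','gap','KList',1:6, ...
    'B',50,'ReferenceDistribution','uniform');
```

A larger `'B'` generally gives more stable gap estimates at the cost of more clustering runs, since each reference data set is clustered for every value in `KList`.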

Method for selecting the optimal number of clusters, specified as the comma-separated pair consisting of `'SearchMethod'` and one of the following.

| Value | Description |
| --- | --- |
| `'globalMaxSE'` | Evaluate each proposed number of clusters in `KList` and select the smallest number of clusters satisfying $\text{Gap}(K) \ge \text{GAPMAX} - \text{SE}(\text{GAPMAX})$, where K is the number of clusters, Gap(K) is the gap value for the clustering solution with K clusters, GAPMAX is the largest gap value, and SE(GAPMAX) is the standard error corresponding to the largest gap value. |
| `'firstMaxSE'` | Evaluate each proposed number of clusters in `KList` and select the smallest number of clusters satisfying $\text{Gap}(K) \ge \text{Gap}(K+1) - \text{SE}(K+1)$, where K is the number of clusters, Gap(K) is the gap value for the clustering solution with K clusters, and SE(K + 1) is the standard error of the clustering solution with K + 1 clusters. |

Example: `'SearchMethod','globalMaxSE'`
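The two selection rules can be sketched on plain vectors. The gap and standard error values below are made up for illustration, and the fallback used when no K satisfies the `'firstMaxSE'` inequality is an assumption of this sketch, not documented behavior.

```matlab
klist = 1:6;
gap = [0.05 0.30 0.40 0.38 0.60 0.58];   % Made-up gap values
se  = [0.02 0.03 0.04 0.04 0.05 0.05];   % Made-up standard errors

% 'globalMaxSE': smallest K with Gap(K) >= GAPMAX - SE(GAPMAX)
[gapMax,iMax] = max(gap);
kGlobal = klist(find(gap >= gapMax - se(iMax),1));

% 'firstMaxSE': smallest K with Gap(K) >= Gap(K+1) - SE(K+1)
iFirst = find(gap(1:end-1) >= gap(2:end) - se(2:end),1);
if isempty(iFirst)
    iFirst = numel(klist);   % Assumption: fall back to the largest K
end
kFirst = klist(iFirst);
```

With these values, `kGlobal` is 5 and `kFirst` is 3: the first rule waits for a gap value within one standard error of the global maximum, while the second stops at the first local plateau.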

## Output Arguments


Clustering evaluation data `eva`, returned as a clustering evaluation object.

## Version History

Introduced in R2013b