# Prediction of tumor class from gene expression data using bagged decision trees

Small, round blue-cell tumors (SRBCTs) belong to four distinct diagnostic categories. The categories have widely differing prognoses and treatment options, making it extremely important that doctors are able to classify the tumor category quickly and accurately.

SRBCTs are difficult to distinguish by light microscopy, and are currently diagnosed using a combination of immunohistochemistry, cytogenetics, interphase fluorescence *in situ* hybridisation, and RT-PCR. Currently no single test can precisely distinguish these tumors.

Khan *et al.* [1] showed that gene expression data collected using cDNA microarrays holds promise for SRBCT diagnosis as, in contrast to the other techniques mentioned, it allows the simultaneous measurement of multiple markers.

Gene expression studies, however, give rise to very high-dimensional datasets, requiring the use of multivariate statistical methods for analysis.

The MathWorks tools provide a very wide variety of such methods (including the neural networks used by Khan *et al.*), allowing users to choose the most appropriate method for their particular application.

Here we apply an ensemble decision tree approach. Ensembles of decision trees have been described as "the best off-the-shelf classifier in the world" [2] for their combination of performance, interpretability, and ease-of-use.

This demo requires MATLAB®, Statistics Toolbox™, Curve Fitting Toolbox™ and Parallel Computing Toolbox™.

## Contents

- Set Random Number Generator and Seed
- Load Data
- Examine the data
- Examine SRBCT classes in the training set
- Examine SRBCT classes in the test set
- Build a first model, using bagged decision trees.
- Display the model
- Display individual trees
- Evaluate the first model
- Compare actual classes to predicted classes
- Examine the confidence in the class predictions
- Examine out-of-bag classifications
- Examine confidence of out-of-bag classifications
- Calculate proximities and perform multidimensional scaling
- Examine importance of MDS coordinates
- Display proximities
- Examine the importance of individual variables
- Select contributing variables
- Select top variables
- Display top variables
- Select top twenty-five variables
- Train final model
- Evaluate the model on the test dataset
- Display classifications for SRBCT samples
- Display classifications for all samples
- Display prediction scores
- Classification thresholds
- Performance curves
- Apply costs
- Reference

## Set Random Number Generator and Seed

Because we'll be using methods later on that run MATLAB in parallel, we'll set the RNG to use a generator that supports parallel substreams, allowing all parallel MATLAB workers to draw from the same random number stream.

Although the analysis is very robust to small changes, we also fix the random seed to ensure the demo runs exactly the same each time. The need to ensure replicability of results is common in regulatory environments.

```
RandStream.setDefaultStream(RandStream('mlfg6331_64','seed',27));
```

## Load Data

The data is stored for convenience in a .mat file, but could be loaded in from other sources, such as a text or .csv file, a Microsoft® Excel® file or, with Bioinformatics Toolbox™, directly from Affymetrix® files or other supported microarray file formats.

```
load experimentdata
```

## Examine the data

We have 88 samples, which have been divided into a training set of 63 samples, which we will build our model with; and a test set of 25 samples, which we will use to evaluate our model. Each sample is supplied with expression profiles of 2308 genes, collected using cDNA microarrays. The class of tumor to which the samples belong is also provided.

```
whos
```

```
  Name              Size        Bytes     Class     Attributes

  geneexpTest       25x2308      461600   double
  geneexpTrain      63x2308     1163232   double
  tumortypeTest     25x1            1600   cell
  tumortypeTrain    63x1            4032   cell
```

## Examine SRBCT classes in the training set

SRBCTs fall into four distinct classes: neuroblastoma (NB), rhabdomyosarcoma (RM), Ewing family of tumors (EW), and non-Hodgkin lymphoma, of which Burkitt lymphoma (BL) is a subset.

```
disp(unique(tumortypeTrain)')
```

```
    'BL'    'EW'    'NB'    'RM'
```

## Examine SRBCT classes in the test set

Note that in the training set, all samples are from the tumor types EW, BL, NB and RM. In the test set, additional samples are included that belong to none of the training classes. These are labelled NA.

```
disp(unique(tumortypeTest)')
```

```
    'BL'    'EW'    'NA'    'NB'    'RM'
```

## Build a first model, using bagged decision trees.

We'll use 100 trees to start with. For our final model we will use a lot more than this, but our aim at the moment is to firstly get a feel for how good a classification we might get, and secondly to see which variables are most important; we can then build our final model with more trees, but using only the most relevant variables.

```
model1 = TreeBagger(100,...
    geneexpTrain,...
    tumortypeTrain,...
    'nprint',10);
```

```
Tree 10 done.
Tree 20 done.
Tree 30 done.
Tree 40 done.
Tree 50 done.
Tree 60 done.
Tree 70 done.
Tree 80 done.
Tree 90 done.
Tree 100 done.
```

## Display the model

We can display overall information about the model we have built. For example, we can see how many trees it consists of, how many variables there are, the class names, and various options used in its construction.

```
disp(model1)
```

```
Ensemble with 100 bagged decision trees:
             Training X: [63x2308]
             Training Y: [63x1]
                 Method: classification
                  Nvars: 2308
           NVarToSample: 49
                MinLeaf: 1
                  FBoot: 1
  SampleWithReplacement: 1
   ComputeOOBPrediction: 0
       ComputeOOBVarImp: 0
              Proximity: []
             ClassNames: 'EW'  'BL'  'NB'  'RM'
```

## Display individual trees

We can also extract individual trees from the model, and display them both at the command line:

```
disp(model1.Trees{1})
```

```
Decision tree for classification
 1  if x1888<0.74995 then node 2 else node 3
 2  if x1011<0.2436 then node 4 else node 5
 3  if x2157<1.90815 then node 6 else node 7
 4  if x1453<0.6056 then node 8 else node 9
 5  class = EW
 6  class = RM
 7  class = NB
 8  if x261<1.7618 then node 10 else node 11
 9  class = BL
10  class = EW
11  class = RM
```

and in the decision tree view tool:

```
view(model1.Trees{1})
```

## Evaluate the first model

Before we approach the test set, we'll see how well the model has classified the training set. This is not a good estimate of the accuracy the model would have on a test set; it is biased because the classification is being done on the same samples used for training. However, it is an indicator of whether a successful classification of test data may be possible.

```
[tumortypeTrainPred, tumortypeTrainPredScores] = predict(model1,geneexpTrain);
```

## Compare actual classes to predicted classes

We display a confusion matrix, which contains a count of the samples, broken down by their actual and predicted classes. The rows represent the samples' actual class, and the columns the predicted class. Numbers on the diagonal indicate a correct prediction; off-diagonal entries indicate a misclassification.

The model is successfully classifying all samples in the training set, which is good but not unexpected.

```
[conf, classorder] = confusionmat(tumortypeTrain,tumortypeTrainPred);
disp(dataset({conf,classorder{:}},'obsnames',classorder));
```

```
          EW    BL    NB    RM
    EW    23     0     0     0
    BL     0     8     0     0
    NB     0     0    12     0
    RM     0     0     0    20
```

## Examine the confidence in the class predictions

As well as the predicted classes, the `predict` function outputs a second result, the *scores*. The *score* of an observation on a class is the probability that it comes from that class, as indicated by the model. These probabilities are used by the model when making classifications; the class with the highest probability is the one predicted.

The scores can be used to give a measure of the confidence the model has in a particular prediction: if a sample has a much higher probability of being one class than any other, then the prediction is fairly solid. But if the predicted class is only a little more probable than the others, the prediction is shakier.
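In a bagged ensemble, a sample's score for each class is (roughly speaking) the fraction of the trees that vote for that class. A toy sketch of the idea, written here in Python purely for illustration (TreeBagger computes this internally):

```python
from collections import Counter

def ensemble_scores(tree_votes, classes):
    """Score for each class = fraction of trees voting for it."""
    counts = Counter(tree_votes)
    return [counts[c] / len(tree_votes) for c in classes]

classes = ['EW', 'BL', 'NB', 'RM']
votes = ['EW'] * 85 + ['RM'] * 10 + ['NB'] * 5   # one sample, 100 trees' votes
print(ensemble_scores(votes, classes))   # [0.85, 0.0, 0.05, 0.1]
```

The predicted class is simply the one with the highest score, here EW with probability 0.85.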

The scores can be conveniently displayed in a parallel coordinates plot. The four values on the x-axis are the four classes into which samples can be classified. Each sample is represented by a line, and the height of that line at each x-axis value gives the probability that the sample comes from that class. It is clear that all the samples are being predicted with high confidence (but again, because they are training, not test samples, this is to be expected).

```
parallelcoords(tumortypeTrainPredScores,'group',tumortypeTrain,...
    'labels',{'EW','BL','NB','RM'})
xlabel('Predicted Class')
ylabel('Probability')
```

## Examine out-of-bag classifications

As well as examining how well the model classifies the training samples, the bagging process allows us to get a sneak peek at how well the model is likely to perform on a test set. We can achieve this by applying the model to classify the *out-of-bag* samples.

What are out-of-bag samples? Remember that this model has been built by creating 100 resampled datasets, and building a decision tree model on each. In each of those resampled datasets, some samples will not have been selected for model-building; these are the out-of-bag samples, and can be used like a mini-test set for each submodel. So by classifying the out-of-bag samples, we can get an idea of what the test performance of the classifier will be like.
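As an aside, each bootstrap resample of n samples leaves out, on average, about a third of them: the expected out-of-bag fraction tends to 1/e ≈ 37% as n grows. A quick simulation, sketched in Python for illustration:

```python
import random

def oob_fraction(n, n_trees=200, seed=0):
    """Simulate bootstrap resampling: average fraction of the n samples
    that are never drawn (i.e. out-of-bag) across n_trees resamples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        in_bag = {rng.randrange(n) for _ in range(n)}   # draw n with replacement
        total += (n - len(in_bag)) / n                  # fraction left out
    return total / n_trees

print(round(oob_fraction(63), 2))   # close to 1 - 1/e, about 0.37
```

So with 100 trees, each training sample is out-of-bag for roughly 37 of them, giving a reasonable basis for the out-of-bag error estimate.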

We need to rebuild the model with the `oobpred` option, which records which samples were out-of-bag. It takes longer to do this, so only set this option if you want to perform out-of-bag testing.

```
model2 = TreeBagger(100,...
    geneexpTrain,...
    tumortypeTrain,...
    'oobpred','on',...
    'nprint',10);
```

```
Tree 10 done.
Tree 20 done.
Tree 30 done.
Tree 40 done.
Tree 50 done.
Tree 60 done.
Tree 70 done.
Tree 80 done.
Tree 90 done.
Tree 100 done.
```

The model also correctly classifies all out-of-bag samples.

```
[tumortypeOOB, tumortypeOOBScores] = oobPredict(model2);
[conf,classorder] = confusionmat(tumortypeTrain,tumortypeOOB);
disp(dataset({conf,classorder{:}},'obsnames',classorder));
```

```
          EW    BL    NB    RM
    EW    23     0     0     0
    BL     0     8     0     0
    NB     0     0    12     0
    RM     0     0     0    20
```

## Examine confidence of out-of-bag classifications

However, the predictions are being made with a lot less confidence, as you'd expect. Some samples have less than 50% probability of being in their predicted class, and are close to being incorrectly classified.

```
parallelcoords(tumortypeOOBScores,'group',tumortypeTrain,...
    'labels',{'EW','BL','NB','RM'})
xlabel('Predicted Class')
ylabel('Probability')
```

## Calculate proximities and perform multidimensional scaling

Another view on how well the model is classifying samples is to examine the *proximity matrix*. The *proximity* of two samples is the proportion of trees in the model that classify them in the same way. If the model is classifying well, samples of the same class are likely to have high proximities, and samples of different classes are likely to have low proximities.
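To make that definition concrete: proximities are typically computed from terminal nodes, so the proximity of two samples is the fraction of trees in which they land in the same leaf. A minimal Python sketch for illustration, assuming we already have each tree's leaf assignments:

```python
def proximity_matrix(leaves):
    """leaves[i][t] = index of the leaf that tree t assigns to sample i.
    Returns prox[i][j] = fraction of trees placing i and j in the same leaf."""
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same = sum(leaves[i][t] == leaves[j][t] for t in range(n_trees))
            prox[i][j] = same / n_trees
    return prox

# Toy example: 3 samples, 4 trees
leaves = [[1, 2, 1, 3],
          [1, 2, 2, 3],   # shares a leaf with sample 0 in 3 of 4 trees
          [5, 6, 7, 8]]   # never shares a leaf with the others
prox = proximity_matrix(leaves)
print(prox[0][1])   # 0.75
```

High proximity (like samples 0 and 1 here) suggests the model treats the two samples as the same kind of thing; low proximity (sample 2) suggests the opposite.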

The proximity matrix can be conveniently displayed using multidimensional scaling.

```
model2 = fillProximities(model2);
[mdsScores,ev] = mdsProx(model2);
```

## Examine importance of MDS coordinates

It's clear that the first three coordinates are much more important than the others.

```
bar(ev(1:10))
xlabel('MDS coordinate')
ylabel('Importance')
```

## Display proximities

We therefore display the proximities in a 3D scatter plot. The classes are very well separated.

```
h = scatter3(mdsScores(:,1),mdsScores(:,2),mdsScores(:,3),...
    5,grp2idx(tumortypeTrain));
xlabel('MDS 1')
ylabel('MDS 2')
zlabel('MDS 3')
h2 = get(h,'Children');
legend(h2([1,24,35,44]),{'EW','BL','NB','RM'})
```

## Examine the importance of individual variables

We would like to evaluate the importance of individual variables to the model. If we can find a subset of variables that are more important than the rest, then we can build a reduced model that will be easier to understand than a full model, whilst still predicting accurately.

There are a couple of different ways to do this. The first is relatively quick to calculate, and measures the amount each variable improves the *split criterion*; this is the quantity that the decision trees try to maximize, when they select variables to put at nodes in the tree.
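To make "improvement in split criterion" concrete: for classification trees the split criterion is typically the Gini diversity index, and a split's contribution is the drop in impurity it achieves. A toy sketch in Python (illustrative, not the Statistics Toolbox implementation):

```python
def gini(labels):
    """Gini diversity index of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_improvement(labels, values, threshold):
    """Decrease in (weighted) Gini impurity from splitting at `threshold`."""
    left  = [l for l, v in zip(labels, values) if v <  threshold]
    right = [l for l, v in zip(labels, values) if v >= threshold]
    n = len(labels)
    return gini(labels) - (len(left)/n)*gini(left) - (len(right)/n)*gini(right)

labels = ['EW', 'EW', 'EW', 'RM', 'RM', 'RM']
values = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]       # expression of one gene
print(split_improvement(labels, values, 0.5))  # 0.5: a perfect split
```

A gene whose expression cleanly separates the classes, as in this toy example, yields a large improvement; an uninformative gene yields an improvement near zero.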

If we calculate these contributions, and plot them in order, we can see that only a few hundred of the 2308 variables are making any sort of contribution to the model.

```
[sortedDeltaCritDecisionSplit, sortedVars] = ...
    sort(model2.DeltaCritDecisionSplit, 'descend');
bar(sortedDeltaCritDecisionSplit)
set(gca, 'XLim', [0,size(geneexpTrain,2)])
xlabel('Variables')
ylabel('Improvement in split criterion')
```

## Select contributing variables

We arbitrarily choose a cutoff, and select the 200 variables with the highest contribution to continue with.

```
topvars = sortedVars(1:200);
```

## Select top variables

The second method for estimating the importance of variables takes a lot longer, but gives a direct measure of how much each variable affects the classification error, rather than the indirect measure of its contribution to the split criterion.

The method works by randomly permuting each variable's values on the out-of-bag samples, and seeing how much the prediction error increases compared to the real predictions. If the prediction error is not much worse when the variable is randomly permuted, then the variable is not likely to be very important to prediction.
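The idea can be sketched in a few lines of Python (using a trivial stand-in model, not the TreeBagger implementation): shuffle one column, re-predict, and see how much the error rate rises.

```python
import random

def permutation_importance(predict, X, y, col, n_repeats=20, seed=0):
    """Average increase in error rate when column `col` of X is shuffled."""
    rng = random.Random(seed)
    base_err = sum(predict(row) != yi for row, yi in zip(X, y)) / len(X)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        Xp = [row[:col] + [v] + row[col+1:] for row, v in zip(X, shuffled)]
        err = sum(predict(row) != yi for row, yi in zip(Xp, y)) / len(X)
        increases.append(err - base_err)
    return sum(increases) / n_repeats

# Toy "model": the class depends only on feature 0; feature 1 is noise
predict = lambda row: row[0] > 0.5
X = [[i / 10, i % 2] for i in range(10)]
y = [row[0] > 0.5 for row in X]
print(permutation_importance(predict, X, y, 0) >
      permutation_importance(predict, X, y, 1))   # True
```

Shuffling the informative feature degrades the predictions; shuffling the noise feature changes nothing, so its importance is zero.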

To carry this out, we retrain with the `oobvarimp` option. Because the permutation test takes quite a long time, we'll speed things up by making use of the integration between Statistics Toolbox and Parallel Computing Toolbox. Prior to this demonstration, I've opened up two MATLAB workers using the `matlabpool` command. If we switch on the `UseParallel` option, the tree building will be done in parallel, spread across these workers. I have a dual-core machine, so I'm running only two workers for a 2x speedup (notice that the Task Manager goes to 100% during this operation, not just 50% like it has been so far). If you have a quad-core or 8-core machine, or access to a cluster, you could get much greater speedups.

```
% matlabpool open 2
options = statset('UseParallel','always',...
    'Streams',RandStream.getDefaultStream,...
    'UseSubStreams','always');
model3 = TreeBagger(50,...
    geneexpTrain(:,topvars),...
    tumortypeTrain,...
    'oobvarimp','on',...
    'options',options,...
    'nprint',10);
```

```
10 Trees done on worker 1.
10 Trees done on worker 2.
20 Trees done on worker 2.
20 Trees done on worker 1.
```

We'll do a quick check to ensure that the rebuilt model is not predicting any worse, despite being built with fewer trees. It doesn't seem to be.

```
[tumortypeTrainPred, tumortypeTrainPredScores] = predict(model3,geneexpTrain(:,topvars));
[conf,classorder] = confusionmat(tumortypeTrain,tumortypeTrainPred);
disp(dataset({conf,classorder{:}},'obsnames',classorder));
```

EW BL NB RM EW 23 0 0 0 BL 0 8 0 0 NB 0 0 12 0 RM 0 0 0 20

## Display top variables

We again display the variables, sorted by the effect they have on the prediction error.

```
[sortedOOBPermutedVarDeltaError, sortedVars2] = ...
    sort(model3.OOBPermutedVarDeltaError, 'descend');
bar(sortedOOBPermutedVarDeltaError)
set(gca, 'XLim', [0,200])
xlabel('Variables')
ylabel('Effect on prediction error')
```

## Select top twenty-five variables

It's not entirely clear from this plot how many variables we should select, so we arbitrarily choose the top 25.

Note that this is a small enough number that, if we continue to get good predictive accuracy, opens up new diagnostic possibilities; for example, rather than continuing to measure samples with a cDNA microarray, we could consider creating a custom assay to measure these 25 genes specifically, and more accurately.

```
bestvars = topvars(sortedVars2(1:25));
```

## Train final model

We build a final model with just the top 25 variables. Because this is the final model, we'll build it with 500 trees.

```
model4 = TreeBagger(500,...
    geneexpTrain(:,bestvars),...
    tumortypeTrain,...
    'nprint',100);
```

```
Tree 100 done.
Tree 200 done.
Tree 300 done.
Tree 400 done.
Tree 500 done.
```

## Evaluate the model on the test dataset

Finally we use the test dataset to evaluate our model.

```
[tumortypeTestPred, tumortypeTestPredScores] = predict(model4,geneexpTest(:,bestvars));
```

## Display classifications for SRBCT samples

Recall that the test dataset includes both SRBCT samples, and additional samples that are entirely different. First we'll examine the predictions on the SRBCT samples. All but one are correctly classified.

```
isSRBCT = ~strcmp(tumortypeTest,'NA');
[conf,classorder] = confusionmat(tumortypeTest(isSRBCT),tumortypeTestPred(isSRBCT));
disp(dataset({conf,classorder{:}},'obsnames',classorder));
```

```
          NB    EW    RM    BL
    NB     6     0     0     0
    EW     0     5     1     0
    RM     0     0     5     0
    BL     0     0     0     3
```

## Display classifications for all samples

The model is only capable of predicting classes that it has seen, so the non-SRBCT samples are of course classified to one of the SRBCT classes. Class NA is the middle row/column in the table.

```
[conf,classorder] = confusionmat(tumortypeTest,tumortypeTestPred);
disp(dataset({conf,classorder{:}},'obsnames',classorder));
```

```
          NB    EW    NA    RM    BL
    NB     6     0     0     0     0
    EW     0     5     0     1     0
    NA     1     1     0     2     1
    RM     0     0     0     5     0
    BL     0     0     0     0     3
```

## Display prediction scores

However, if we examine the plot of the prediction scores, it is clear that while most of the predictions for the SRBCT samples are being made with fairly high confidence, the predictions for the non-SRBCT samples are being made much less confidently - the probabilities for each class are much more evenly spread.

For ease of visualisation, only the median values for each class are displayed.

```
parallelcoords(tumortypeTestPredScores,'group',tumortypeTest,...
    'labels',{'EW','BL','NB','RM'},'Quantile',0.5)
xlabel('Predicted Class')
ylabel('Probability')
```

## Classification thresholds

At the moment we are simply picking the class with the highest probability, and predicting the sample to be of that class. But this leads the NA samples to be classified to one of the SRBCT classes: some class always has the highest probability, even if that probability is low.

This leads us to consider whether we should apply a threshold on probability when we assign a sample to a class; for example, we could say that only samples with higher than 70% probability of being EW would be predicted as EW. But what threshold to choose?

Unfortunately there is no straightforward answer to this. Let's take a look at the full scores plot to see more detail.

For BL, we could choose any threshold between about 38% and 67%, and we would get perfect classification (into BL or not-BL). Similarly for NB we could pick any threshold between about 30% and 64% and get perfect prediction (of NB vs. not-NB).

But for EW and RM, there is no threshold that we can choose that will not misclassify at least one sample. Note the samples highlighted, and focus on the left-hand side of the plot, where an NA and an RM sample are higher than an EW sample. If we pick a classification threshold for EW above 32%, we will misclassify the EW sample as not-EW; but if we pick a threshold below 44%, we will misclassify the RM sample and possibly the NA sample as EW.

```
parallelcoords(tumortypeTestPredScores,'group',tumortypeTest,...
    'labels',{'EW','BL','NB','RM'})
h = get(gca,'Children');
set(h([7,13,15]),'LineWidth',3);
xlabel('Predicted Class')
ylabel('Probability')
```

## Performance curves

Our choice of threshold must therefore be guided by the relative importance we place on false negative vs. false positive classifications. Except in fortunate cases such as BL and NB, it is common for this tradeoff between false negative and false positive predictions to be necessary. In medical contexts, this tradeoff is usually expressed as a tradeoff between the *sensitivity* and the *specificity* of a test.
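For concreteness, sensitivity and specificity at a given threshold can be computed directly from the class scores. A small Python sketch with made-up scores (hypothetical values, not the demo's data):

```python
def sens_spec(scores, is_positive, threshold):
    """Sensitivity and specificity of the rule 'score >= threshold -> positive'."""
    tp = sum(s >= threshold and p for s, p in zip(scores, is_positive))
    fn = sum(s < threshold and p for s, p in zip(scores, is_positive))
    tn = sum(s < threshold and not p for s, p in zip(scores, is_positive))
    fp = sum(s >= threshold and not p for s, p in zip(scores, is_positive))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical scores for "probability of EW" on six samples
scores      = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
is_positive = [True, True, True, False, True, False]

print(sens_spec(scores, is_positive, 0.5))   # (0.75, 1.0)
print(sens_spec(scores, is_positive, 0.2))   # (1.0, 0.5)
```

Lowering the threshold from 0.5 to 0.2 catches the low-scoring positive sample (sensitivity rises from 0.75 to 1.0) at the price of a false positive (specificity falls from 1.0 to 0.5), which is exactly the tradeoff the performance curve traces out.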

The choice is best visualised using a performance curve plot. We can see that as we decrease the threshold, the predictions become more sensitive (correctly predict more EW samples as EW) but less specific (incorrectly predict more non-EW samples to be EW).

```
[fpr,tpr] = perfcurve(tumortypeTest,tumortypeTestPredScores(:,1),'EW');
plot(fpr,tpr)
set(gca,'XLim',[-0.02,1],'YLim',[0,1.02])
xlabel('1 - Specificity')
ylabel('Sensitivity')
title('Performance curve for EW classification')
```

## Apply costs

If we can somehow place a specific value on the relative costs of a false negative prediction and a false positive prediction, then we can make a choice of classification threshold that minimises this cost.

Not being a medical doctor, I have no idea what this cost might be; but to demonstrate the principle, let's assume that a false negative (predicting an EW sample as non-EW) is twice as bad as a false positive (predicting a non-EW sample as EW); perhaps because treatment for EW samples might require more urgency than other SRBCT diagnoses (or indeed an NA).
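With those assumed costs, the expected cost at each candidate threshold is just a weighted count of false negatives and false positives, and the best threshold is the one that minimises it. A toy Python sketch (hypothetical scores, not the demo's data):

```python
def expected_cost(scores, is_positive, threshold, cost_fn=2.0, cost_fp=1.0):
    """Average cost of the rule 'score >= threshold -> positive',
    with a false negative assumed to cost twice a false positive."""
    fn = sum(s < threshold and p for s, p in zip(scores, is_positive))
    fp = sum(s >= threshold and not p for s, p in zip(scores, is_positive))
    return (cost_fn * fn + cost_fp * fp) / len(scores)

# Hypothetical scores for "probability of EW" on six samples
scores      = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
is_positive = [True, True, True, False, True, False]

thresholds = [0.05, 0.2, 0.35, 0.5, 0.65]
costs = {t: expected_cost(scores, is_positive, t) for t in thresholds}
print(min(costs, key=costs.get))   # 0.2
```

Because false negatives are penalised more heavily, the minimum-cost threshold is pulled downward, low enough to catch the positive sample scoring only 0.3.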

Note that this is a sensible approach if we do not have a strong idea ahead of time what the costs are likely to be. If we did, we could provide them directly to the bagged decision tree algorithm via the `costs` option, and it will then minimise the expected cost of the model, rather than the expected error. Additionally, if we know ahead of time the relative abundances of the tumor categories, the algorithm can take these into account via the `priors` option.

```
cost = [0,2;1,0];
[fpr,ecost,thre] = perfcurve(tumortypeTest,tumortypeTestPredScores(:,1),'EW',...
    'ycrit','ecost','cost',cost);
plot(thre,ecost)
xlabel('Threshold for classification')
ylabel('Expected Cost')
title('Performance curve for EW classification')
hold on;
smoothedcost = fit(thre,ecost,fittype('poly4'));
h = plot(0:0.01:1,smoothedcost(0:0.01:1),'k:');
[mincost, minindex] = min(smoothedcost(0:0.01:1));
h(2) = plot((minindex-1)/100,mincost,'r*');
legend(h,{'Smoothed Cost','Best Threshold'})
```

## Reference

[1] Khan J *et al.*, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. *Nature Medicine* 7(6):673-9, 2001

[2] Breiman L, Arcing classifiers. *Ann Stat* 26:801–49, 1998