
Train R-CNN deep learning object detector



detector = trainRCNNObjectDetector(trainingData,network,options) trains an R-CNN (regions with convolutional neural networks) based object detector. The function uses deep learning to train the detector to detect multiple object classes.

This implementation of R-CNN does not train an SVM classifier for each object class.

This function requires that you have Deep Learning Toolbox™ and Statistics and Machine Learning Toolbox™. It is recommended that you also have Parallel Computing Toolbox™ to use with a CUDA®-enabled NVIDIA® GPU. For information about the supported compute capabilities, see GPU Computing Requirements (Parallel Computing Toolbox).

detector = trainRCNNObjectDetector(___,Name,Value) returns a detector object with optional input properties specified by one or more Name,Value pair arguments.

detector = trainRCNNObjectDetector(___,'RegionProposalFcn',proposalFcn) optionally trains an R-CNN detector using a custom region proposal function.

[detector,info] = trainRCNNObjectDetector(___) also returns information on the training progress, such as training loss and accuracy, for each iteration.

detector = trainRCNNObjectDetector(___,Name=Value) uses additional options specified by one or more name-value pair arguments and any of the previous inputs.


Load training data and network layers.

load('rcnnStopSigns.mat', 'stopSigns', 'layers')

Add the image directory to the MATLAB path.

imDir = fullfile(matlabroot, 'toolbox', 'vision', 'visiondata',...

Set network training options to use a mini-batch size of 32 to reduce GPU memory usage. Lower the InitialLearnRate to reduce the rate at which network parameters change. A low learning rate is beneficial when fine-tuning a pretrained network because it prevents the network from changing too rapidly.

options = trainingOptions('sgdm', ...
  'MiniBatchSize', 32, ...
  'InitialLearnRate', 1e-6, ...
  'MaxEpochs', 10);

Train the R-CNN detector. Training can take a few minutes to complete.

rcnn = trainRCNNObjectDetector(stopSigns, layers, options, 'NegativeOverlapRange', [0 0.3]);
Training an R-CNN Object Detector for the following object classes:

* stopSign

--> Extracting region proposals from 27 training images...done.

--> Training a neural network to classify objects in training data...

Training on single CPU.
Initializing input data normalization.
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |   Accuracy   |     Loss     |      Rate       |
|       1 |           1 |       00:00:01 |       96.88% |       0.1651 |      1.0000e-06 |
|       2 |          50 |       00:00:15 |       96.88% |       0.0807 |      1.0000e-06 |
|       3 |         100 |       00:00:27 |       96.88% |       0.1340 |      1.0000e-06 |
|       5 |         150 |       00:00:38 |       96.88% |       0.0225 |      1.0000e-06 |
|       6 |         200 |       00:00:49 |       93.75% |       0.6584 |      1.0000e-06 |
|       8 |         250 |       00:01:00 |       93.75% |       0.5233 |      1.0000e-06 |
|       9 |         300 |       00:01:11 |      100.00% |   2.9456e-05 |      1.0000e-06 |
|      10 |         350 |       00:01:23 |      100.00% |       0.0009 |      1.0000e-06 |
Training finished: Max epochs completed.

Network training complete.

--> Training bounding box regression models for each object class...100.00%...done.

Detector training complete.

Test the R-CNN detector on a test image.

img = imread('stopSignTest.jpg');

[bbox, score, label] = detect(rcnn, img, MiniBatchSize=32);

Display strongest detection result.

[score, idx] = max(score);

bbox = bbox(idx, :);
annotation = sprintf('%s: (Confidence = %f)', label(idx), score);

detectedImg = insertObjectAnnotation(img, 'rectangle', bbox, annotation);


Remove the image directory from the path.


Resume training an R-CNN object detector using additional data. To illustrate this procedure, a subset of the ground truth data is used to initially train the detector. Training then resumes using all the data.

Load training data and initialize training options.

load('rcnnStopSigns.mat', 'stopSigns', 'layers')

stopSigns.imageFilename = fullfile(toolboxdir('vision'),'visiondata', ...

options = trainingOptions('sgdm', ...
    'MiniBatchSize', 32, ...
    'InitialLearnRate', 1e-6, ...
    'MaxEpochs', 10, ...
    'Verbose', false);

Train the R-CNN detector with a portion of the ground truth.

rcnn = trainRCNNObjectDetector(stopSigns(1:10,:), layers, options, 'NegativeOverlapRange', [0 0.3]);

Get the trained network layers from the detector. When you pass in an array of network layers to trainRCNNObjectDetector, they are used as-is to continue training.

network = rcnn.Network;
layers = network.Layers;

Resume training using all the training data.

rcnnFinal = trainRCNNObjectDetector(stopSigns, layers, options);

Create an R-CNN object detector for two object classes: dogs and cats.

objectClasses = {'dogs','cats'};

The network must be able to classify dogs, cats, and a "background" class in order to be trained using trainRCNNObjectDetector. In this example, one is added to the number of object classes to account for the background.

numClassesPlusBackground = numel(objectClasses) + 1;

The final fully connected layer of a network defines the number of classes that the network can classify. Set the final fully connected layer to have an output size equal to the number of classes plus a background class.

layers = [ ...
    imageInputLayer([28 28 1])

These network layers can now be used to train an R-CNN two-class object detector.
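As a rough sketch of the sizing rule described above, the following hypothetical layer array ends in a fully connected layer with numClassesPlusBackground outputs. The intermediate layers and their sizes are illustrative assumptions, not necessarily the layers used in this example, and the code requires Deep Learning Toolbox.

```matlab
objectClasses = {'dogs','cats'};
numClassesPlusBackground = numel(objectClasses) + 1;   % 2 object classes + background

% Illustrative layers only; the filter size and filter count are assumed values.
layers = [ ...
    imageInputLayer([28 28 1])
    convolution2dLayer(5, 20)
    reluLayer
    fullyConnectedLayer(numClassesPlusBackground)   % one output per class, plus background
    softmaxLayer
    classificationLayer];

% The final fully connected layer is third from the end of the array.
fcOutputSize = layers(end-2).OutputSize;
```

Checking fcOutputSize against numel(objectClasses) + 1 before training is a cheap way to catch a mismatch between the network and the label columns.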

Create an R-CNN object detector and set it up to use a saved network checkpoint. A network checkpoint is saved every epoch during network training when the trainingOptions 'CheckpointPath' parameter is set. Network checkpoints are useful in case your training session terminates unexpectedly.

Load the stop sign training data.

load('rcnnStopSigns.mat', 'stopSigns', 'layers')

Add full path to image files.

stopSigns.imageFilename = fullfile(toolboxdir('vision'),'visiondata', ...

Set the 'CheckpointPath' using the trainingOptions function.

checkpointLocation = tempdir;
options = trainingOptions('sgdm','Verbose',false, ...

Train the R-CNN object detector with a few images.

rcnn = trainRCNNObjectDetector(stopSigns(1:3,:),layers,options);

Load a saved network checkpoint.

wildcardFilePath = fullfile(checkpointLocation,'convnet_checkpoint__*.mat');
contents = dir(wildcardFilePath);

Load one of the checkpoint networks.

filepath = fullfile(contents(1).folder,contents(1).name);
checkpoint = load(filepath);
ans = 

  SeriesNetwork with properties:

    Layers: [15×1 nnet.cnn.layer.Layer]
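
The example loads the first file that dir returns. If the most recent checkpoint is wanted instead, one possible approach is to pick the file with the latest modification time. The sketch below assumes that modification time is a good proxy for training progress, and it uses a temporary folder with two dummy files (hypothetical names) in place of real checkpoints.

```matlab
% Stand-in checkpoint folder holding two dummy files (hypothetical names).
checkpointLocation = tempname;
mkdir(checkpointLocation);
placeholder = 0;
save(fullfile(checkpointLocation, 'convnet_checkpoint__1__demo.mat'), 'placeholder');
pause(2);   % make sure the second file gets a later timestamp
save(fullfile(checkpointLocation, 'convnet_checkpoint__2__demo.mat'), 'placeholder');

% List the checkpoint files and stop early if none exist.
contents = dir(fullfile(checkpointLocation, 'convnet_checkpoint__*.mat'));
if isempty(contents)
    error('No checkpoint files found in %s.', checkpointLocation);
end

% Pick the file with the latest modification time.
[~, newestIdx] = max([contents.datenum]);
filepath = fullfile(contents(newestIdx).folder, contents(newestIdx).name);
```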

Create a new R-CNN object detector and set it up to use the saved network.

rcnnCheckPoint = rcnnObjectDetector();
rcnnCheckPoint.RegionProposalFcn = @rcnnObjectDetector.proposeRegions;

Set the Network to the saved network checkpoint.

rcnnCheckPoint.Network =
rcnnCheckPoint = 

  rcnnObjectDetector with properties:

              Network: [1×1 SeriesNetwork]
           ClassNames: {'stopSign'  'Background'}
    RegionProposalFcn: @rcnnObjectDetector.proposeRegions

Input Arguments

Labeled ground truth images (trainingData), specified as a table with two or more columns.

The first column of the table must contain image file names with paths. The images must be grayscale or truecolor (RGB), and they can be in any format supported by imread. Each of the remaining columns must be a cell vector that contains M-by-4 matrices that represent a single object class, such as vehicle, flower, or stop sign. Each matrix holds M bounding boxes as 4-element double rows in the format [x,y,width,height], which specifies the upper-left corner location and size of each bounding box in the corresponding image.

The table variable name defines the object class name. To create a ground truth table, you can use the Image Labeler app or Video Labeler app. To create a table of training data from the generated ground truth, use the objectDetectorTrainingData function. Boxes smaller than 32-by-32 are not used for training.
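
To make the expected table layout concrete, here is a minimal sketch that builds a two-column table by hand. The file names and box coordinates are hypothetical. The size check assumes that "smaller than 32-by-32" means a width or height below 32 pixels, which is an interpretation and not something this page states.

```matlab
% Hypothetical image paths and boxes, for illustration only.
imageFilename = {'images/img001.jpg'; 'images/img002.jpg'};

% One cell per image; each cell holds an M-by-4 matrix of [x y width height] rows.
stopSign = {[120 85 64 64]; [30 40 50 60; 200 150 20 52]};

% The variable name "stopSign" becomes the object class name.
trainingData = table(imageFilename, stopSign);

% Flag boxes that meet the assumed 32-pixel minimum in both dimensions.
minBoxSize = 32;
isUsable = cellfun(@(b) all(b(:,3:4) >= minBoxSize, 2), ...
    trainingData.stopSign, 'UniformOutput', false);
```

In this sketch the second box of the second image has a width of 20 pixels, so it is flagged as unusable.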

Network, specified as a SeriesNetwork (Deep Learning Toolbox), an array of Layer (Deep Learning Toolbox) objects, a layerGraph (Deep Learning Toolbox) object, or by the network name. The network is trained to classify the object classes defined in the trainingData table. The SeriesNetwork (Deep Learning Toolbox), Layer (Deep Learning Toolbox), and layerGraph (Deep Learning Toolbox) objects are available in the Deep Learning Toolbox.

  • When you specify the network as a SeriesNetwork, an array of Layer objects, or by the network name, the network is automatically transformed into an R-CNN network by adding new classification and regression layers to support object detection.

  • The array of Layer (Deep Learning Toolbox) objects must contain a classification layer that supports the number of object classes, plus a background class. Use this input type to customize the learning rates of each layer. An example of an array of Layer (Deep Learning Toolbox) objects:

    layers = [imageInputLayer([28 28 3])
            convolution2dLayer([5 5],10)

  • When you specify the network as a SeriesNetwork, a Layer array, or by the network name, the weights for convolution and fully connected layers are initialized to 'narrow-normal'.

  • The network name must be one of the following valid network names. You must also install the corresponding add-on.

  • The LayerGraph object must be a valid R-CNN object detection network. You can also use a LayerGraph object to train a custom R-CNN network.

See Getting Started with R-CNN, Fast R-CNN, and Faster R-CNN to learn more about how to create an R-CNN network.

Training options (options), specified as the object returned by the trainingOptions (Deep Learning Toolbox) function. To specify the solver and other options for network training, use trainingOptions.


trainRCNNObjectDetector does not support these training options:

  • The ValidationData, ValidationFrequency, or ValidationPatience options

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: PositiveOverlapRange=[0.5 1]

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: "PositiveOverlapRange",[0.5 1]

Bounding box overlap ratios for positive training samples (PositiveOverlapRange), specified as a two-element vector. The vector contains values in the range [0,1]. Region proposals that overlap with ground truth bounding boxes within the specified range are used as positive training samples.

The overlap ratio used for both the PositiveOverlapRange and NegativeOverlapRange is defined as:

$$\frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$

A and B are bounding boxes.
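
As a worked example, the sketch below computes the overlap ratio for two boxes in [x,y,width,height] format, assuming the usual intersection-over-union definition (area of intersection divided by area of union). The box values are made up for illustration.

```matlab
% Two hypothetical boxes in [x y width height] format.
A = [10 10 40 40];
B = [30 30 40 40];

% Width and height of the intersection rectangle (zero if the boxes do not overlap).
ix = max(0, min(A(1) + A(3), B(1) + B(3)) - max(A(1), B(1)));
iy = max(0, min(A(2) + A(4), B(2) + B(4)) - max(A(2), B(2)));

intersectionArea = ix * iy;
unionArea = A(3) * A(4) + B(3) * B(4) - intersectionArea;
overlapRatio = intersectionArea / unionArea;

% With NegativeOverlapRange = [0 0.3], this proposal would fall in the negative range.
isNegative = overlapRatio >= 0 && overlapRatio <= 0.3;
```

Here the intersection is 20-by-20 pixels and the union covers 2800 pixels, so the ratio is 400/2800, or about 0.14. The Computer Vision Toolbox function bboxOverlapRatio computes the same kind of ratio for whole sets of boxes.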

Bounding box overlap ratios for negative training samples (NegativeOverlapRange), specified as a two-element vector. The vector contains values in the range [0,1]. Region proposals that overlap with the ground truth bounding boxes within the specified range are used as negative training samples.

Maximum number of strongest region proposals to use for generating training samples, specified as an integer. Reduce this value to speed up processing time, although doing so decreases training accuracy. To use all region proposals, set this value to inf.
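
Conceptually, keeping the strongest proposals amounts to sorting by score and truncating. The sketch below illustrates that idea with made-up boxes and scores. It is an assumption about the general behavior, not a description of how the function selects regions internally.

```matlab
% Hypothetical proposals in [x y width height] format with one score each.
bboxes = [10 10 50 50; 40 30 64 64; 5 80 32 48; 100 20 70 40];
scores = [0.20; 0.90; 0.50; 0.70];
numStrongestRegions = 2;   % set to Inf to keep every proposal

% Sort by descending score and keep at most numStrongestRegions rows.
[~, order] = sort(scores, 'descend');
numToKeep = min(numStrongestRegions, numel(order));
keep = order(1:numToKeep);
strongestBoxes = bboxes(keep, :);
```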

Custom region proposal function handle, specified as a function handle. If you do not specify a custom region proposal function, the default variant of the Edge Boxes algorithm [3], set in rcnnObjectDetector, is used. A custom proposalFcn must have the following functional form:

 [bboxes,scores] = proposalFcn(I)

The input, I, is an image defined in the trainingData table. The function must return rectangular bounding boxes in an M-by-4 array. Each row of bboxes contains a four-element vector, [x,y,width,height], that specifies the upper-left corner and size of a bounding box in pixels. The function must also return a score for each bounding box in an M-by-1 vector. Higher scores indicate that the bounding box is more likely to contain an object. The scores are used to select the strongest regions, which you can specify in NumStrongestRegions.
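
The following is a minimal sketch of a function with the required form. The name slidingWindowProposals, the window size, the stride, and the variance-based score are all hypothetical choices made for illustration. A real proposal function would normally use a stronger objectness measure.

```matlab
function [bboxes, scores] = slidingWindowProposals(I)
% slidingWindowProposals  Hypothetical region proposal function (illustration only).
% Tiles the image with fixed-size windows and scores each window by the
% standard deviation of its pixel intensities.

winSize = 64;   % window width and height in pixels (assumed value)
stride  = 32;   % step between neighboring windows in pixels (assumed value)

I = double(I);
if size(I, 3) > 1
    I = mean(I, 3);   % collapse color channels into one intensity plane
end
[h, w] = size(I);

% Upper-left corners of every window that fits inside the image.
xs = 1:stride:(w - winSize + 1);
ys = 1:stride:(h - winSize + 1);
[X, Y] = meshgrid(xs, ys);
numBoxes = numel(X);

bboxes = [X(:), Y(:), repmat(winSize, numBoxes, 2)];   % M-by-4, [x y width height]
scores = zeros(numBoxes, 1);                            % M-by-1
for k = 1:numBoxes
    rows = bboxes(k, 2):(bboxes(k, 2) + winSize - 1);
    cols = bboxes(k, 1):(bboxes(k, 1) + winSize - 1);
    patch = I(rows, cols);
    scores(k) = std(patch(:));   % higher contrast gives a higher score
end
end
```

With the function saved on the MATLAB path, it could be passed to training as RegionProposalFcn=@slidingWindowProposals.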

Box regression layer name, specified as a character vector. Valid values are 'auto' or the name of a layer in the input network. The output activations of this layer are used as features to train a regression model for refining the detected bounding boxes.

If the name is 'auto', then trainRCNNObjectDetector automatically selects a layer from the input network based on the type of input network:

  • If the input network is a SeriesNetwork or an array of Layer objects, then the function selects the last convolution layer.

  • If the input network is a LayerGraph, then the function selects the source of the last fully connected layer.

Detector training experiment monitoring, specified as an experiments.Monitor (Deep Learning Toolbox) object for use with the Experiment Manager (Deep Learning Toolbox) app. You can use this object to track the progress of training, update information fields in the training results table, record values of the metrics used by the training, and to produce training plots.

Information monitored during training:

  • Training loss at each iteration.

  • Training accuracy at each iteration.

  • Training root mean square error (RMSE) for the box regression layer.

  • Learning rate at each iteration.

Output Arguments

Trained R-CNN-based object detector, returned as an rcnnObjectDetector object. You can train an R-CNN detector to detect multiple object classes.

Training information, returned as a structure with the following fields. Each field is a numeric vector with one element per training iteration. Values that have not been calculated at a specific iteration are represented by NaN.

  • TrainingLoss — Training loss at each iteration. This is the combination of the classification and regression loss used to train the R-CNN network.

  • TrainingAccuracy — Training set accuracy at each iteration

  • BaseLearnRate — Learning rate at each iteration
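
Because some entries can be NaN, summaries of these fields usually need to skip missing values. The sketch below uses a hand-built stand-in for info with made-up numbers. Only the field names come from the list above.

```matlab
% Stand-in for the info output; the numbers are invented for illustration.
info = struct( ...
    'TrainingLoss',     [0.92 NaN 0.41 0.18], ...
    'TrainingAccuracy', [71.9 NaN 90.6 96.9], ...
    'BaseLearnRate',    [1e-6 1e-6 1e-6 1e-6]);

% Find the iterations where a loss value was actually recorded.
computed = ~isnan(info.TrainingLoss);
iterations = find(computed);

% Most recent recorded loss, and mean accuracy over the recorded iterations.
lastLoss = info.TrainingLoss(iterations(end));
meanAccuracy = mean(info.TrainingAccuracy, 'omitnan');
```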



  • To accelerate data preprocessing for training, trainRCNNObjectDetector automatically creates and uses a parallel pool based on your parallel preference settings. This requires Parallel Computing Toolbox.

  • VGG-16, VGG-19, ResNet-101, and Inception-ResNet-v2 are large models. Training with large images may produce "Out of Memory" errors. To mitigate these errors, manually resize the images along with the bounding box ground truth data before calling trainRCNNObjectDetector.

  • This function supports transfer learning. When a network is input by name, such as 'resnet50', then the software automatically transforms the network into a valid R-CNN network model based on the pretrained resnet50 (Deep Learning Toolbox) model. Alternatively, manually specify a custom R-CNN network using the LayerGraph (Deep Learning Toolbox) extracted from a pretrained DAG network. See Create R-CNN Object Detection Network.

  • Use the trainingOptions (Deep Learning Toolbox) function to enable or disable verbose printing.
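
For the resizing tip above, the bounding boxes have to be scaled by the same factor as the images. The sketch below shows a simple approximation that multiplies the boxes by the scale factor and rounds. The scale factor and box values are hypothetical, and the rounding ignores pixel-center conventions. The toolbox functions imresize and bboxresize are likely the more precise route for the images and boxes, respectively.

```matlab
% Hypothetical scale factor and ground truth boxes in [x y width height] format.
scale = 0.5;
bboxes = [101 51 200 100; 11 21 64 80];

% Scale positions and sizes together, then keep every value at 1 or more.
scaledBoxes = max(round(bboxes * scale), 1);

% Scaled boxes under 32-by-32 would not be used for training, so check sizes.
stillUsable = all(scaledBoxes(:, 3:4) >= 32, 2);
```

Checking the scaled box sizes matters because an aggressive downscale can push small objects below the minimum box size used for training.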


[1] Girshick, R., J. Donahue, T. Darrell, and J. Malik. “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.

[2] Girshick, R. “Fast R-CNN.” Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1440–1448.

[3] Zitnick, C. Lawrence, and P. Dollar. “Edge Boxes: Locating Object Proposals from Edges.” Computer Vision-ECCV, Springer International Publishing. 2014, pp. 391–405.

Version History

Introduced in R2016b