Automate Ground Truth Labeling for Object Tracking and Re-Identification
This example shows how to create an automation algorithm to automatically label data for object tracking and for object re-identification.
Overview
The Image Labeler (Computer Vision Toolbox), Video Labeler (Computer Vision Toolbox), and Ground Truth Labeler (Automated Driving Toolbox) apps provide a convenient interface to interactively label data for various computer vision tasks. These apps include built-in automation algorithms to accelerate the labeling and also let you specify your own custom labeling algorithm. For an introduction to creating your own automation algorithm, see the Create Automation Algorithm (Computer Vision Toolbox) example.
This example extends the Automate Ground Truth Labeling for Object Detection (Computer Vision Toolbox) example by incorporating a multi-object tracker from the Sensor Fusion and Tracking Toolbox (SFTT). The multi-object tracker automatically assigns identifiers to objects that can be used to evaluate and verify multi-object tracking systems, as well as train and evaluate object re-identification (ReID) networks.
To learn how to perform object tracking and re-identification of objects across multiple frames, see the Reidentify People Throughout a Video Sequence Using ReID Network (Computer Vision Toolbox) example.
Define Automation Algorithm
Implement a track-by-detection algorithm using a pretrained object detector combined with a Global Nearest Neighbor multi-object tracker. Use a pretrained yolov4ObjectDetector (Computer Vision Toolbox) detector to detect pedestrians in a video. You can use other object detectors, such as yoloxObjectDetector (Computer Vision Toolbox) or ssdObjectDetector (Computer Vision Toolbox), depending on the type of objects that you need to track. To detect only people, you can also use the pretrained people detector returned by the peopleDetectorACF (Computer Vision Toolbox) function. To learn more about multi-object trackers, see the Implement Simple Online and Realtime Tracking example.
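Before wiring the detector into the automation class, you can verify it on a single image. This is a minimal sketch, assuming the Computer Vision Toolbox Model for YOLO v4 Object Detection support package is installed; visionteam1.jpg is a sample image included with Computer Vision Toolbox, and you can substitute any image that contains people.
% Detect COCO object classes in a sample image.
detector = yolov4ObjectDetector("csp-darknet53-coco");
I = imread("visionteam1.jpg");
[bboxes,scores,labels] = detect(detector,I,Threshold=0.5);

% Keep only the person detections and display them.
isPerson = labels == "person";
annotated = insertObjectAnnotation(I,"rectangle",bboxes(isPerson,:),scores(isPerson));
figure
imshow(annotated)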
Define the automation algorithm using the initialize and run functions in the helperObjectDetectorTracker class. This class inherits from vision.labeler.AutomationAlgorithm and vision.labeler.mixin.Temporal. Define the inherited properties Name, Description, and UserDirections, as well as the following algorithm properties:
classdef helperObjectDetectorTracker < vision.labeler.AutomationAlgorithm & vision.labeler.mixin.Temporal
    properties
        Detector                                        % Object detector
        Tracker                                         % Multi-object tracker
        OverlapThreshold = 0.5                          % Threshold for non-maximum suppression
        ScoreThreshold = 0.5                            % Score threshold applied before tracking
        FrameRate = 1                                   % Frame rate of the video
        ProcessNoiseIntensity = [1e-4 1e-4 1e-5 1e-5]   % Process noise of the tracker's velocity model
    end
end
In the initialize function, create the YOLO v4 object detector using the yolov4ObjectDetector (Computer Vision Toolbox) object. Create the multi-object tracker using the trackerGNN function with the following options:
FilterInitializationFcn: Use the initvisionbboxkf function to initialize a linear Kalman filter with a constant-velocity bounding box state definition. Specify the video frame rate and the frame width and height using name-value arguments. Call the internal function initFcn of the helperObjectDetectorTracker class to apply initvisionbboxkf and set the state estimation error covariance.
ConfirmationThreshold: Use [2 2] to confirm the existence of a true object if it is detected and assigned in 2 consecutive frames.
DeletionThreshold: Use [5 5] to delete object tracks after 5 consecutive missed frames.
function initialize(this, frame, ~)
    % Frame height and width (ignore the color channel dimension)
    [height,width,~] = size(frame);

    % Initialize YOLO v4 object detector
    this.Detector = yolov4ObjectDetector("csp-darknet53-coco");

    % Initialize tracker
    noiseIntensity = this.ProcessNoiseIntensity;
    this.Tracker = trackerGNN(FilterInitializationFcn=@(x) initFcn(x, this.FrameRate, width, height, noiseIntensity), ...
        ConfirmationThreshold=[2 2], ...
        DeletionThreshold=[5 5], ...
        AssignmentThreshold=35);
end

function filter = initFcn(detection, framerate, width, height, noiseIntensity)
    filter = initvisionbboxkf(detection, FrameRate=framerate, FrameSize=[width height], NoiseIntensity=noiseIntensity);
    filter.StateCovariance = diag([25 100 25 100 25 100 25 100]);
end
In the run function, implement the following procedure:
Detect people using the YOLO v4 object detector.
Apply non-maximum suppression (NMS) to reduce the number of candidate regions of interest (ROIs) and select only the strongest bounding boxes.
Filter the bounding boxes based on their scores, using the ScoreThreshold property to eliminate weaker detections.
Update the tracker with the filtered bounding boxes of the current frame.
Create new automated labels for the frame based on the updated tracks.
function automatedLabels = run(this,frame)
    % Detect people using YOLO v4 object detector
    [bboxes,scores,labels] = detect(this.Detector, frame, ...
        SelectStrongest=false, ...
        MaxSize=round([size(frame,1)/2, size(frame,2)/5]));

    % Apply non-maximum suppression to select the strongest bounding boxes.
    [selectedBboxes,selectedScores,selectedLabels] = selectStrongestBboxMulticlass(bboxes,scores,labels, ...
        RatioType='Min', ...
        OverlapThreshold=this.OverlapThreshold);

    isSelectedClass = selectedLabels == lower(this.SelectedLabelDefinitions.Name);

    % Consider only detections that meet the specified score threshold
    % and are of the selected class label
    selectedBboxes = selectedBboxes(isSelectedClass & selectedScores > this.ScoreThreshold, :);
    selectedScores = selectedScores(isSelectedClass & selectedScores > this.ScoreThreshold);

    % Initialize track lists in case the tracker is not updated on this frame
    tracks = objectTrack.empty;
    alltracks = objectTrack.empty;
    if isLocked(this.Tracker) || ~isempty(selectedBboxes)
        % Convert to objectDetection
        detections = repmat(objectDetection(this.CurrentTime,[0 0 0 0]),1,size(selectedBboxes,1));
        for i = 1:numel(detections)
            detections(i).Measurement = selectedBboxes(i,:);
            detections(i).MeasurementNoise = (1/selectedScores(i))*25*eye(4);
        end
        [tracks, ~, alltracks, info] = this.Tracker(detections, this.CurrentTime);
    end

    if this.CurrentTime == this.StartTime && ~isempty(alltracks)
        % On the first frame, use tentative tracks
        states = [alltracks.State];
        automatedLabels = struct( ...
            'Type', labelType.Rectangle, ...
            'Name', this.SelectedLabelDefinitions.Name, ...
            'Position', wrapPositionToFrame(states, frame), ...
            'Attributes', struct('ID',num2cell([alltracks.TrackID])));
    elseif ~isempty(tracks)
        states = [tracks.State];
        automatedLabels = struct( ...
            'Type', labelType.Rectangle, ...
            'Name', this.SelectedLabelDefinitions.Name, ...
            'Position', wrapPositionToFrame(states, frame), ...
            'Attributes', struct('ID',num2cell([tracks.TrackID])));
    else
        automatedLabels = [];
    end
end
Specify the MeasurementNoise property of each object detection to capture the uncertainty of each measurement. The tracker models each bounding box using Gaussian probability densities. While you can derive a more accurate measurement noise from the statistics of the object detector, a variance of 25 square pixels is a good default. In addition, use the score of each bounding box to scale the noise variance up or down. A high-score detection is more precise and should have a smaller noise value than a low-score detection.
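For example, this minimal sketch shows how the run function scales the default noise variance by the detection score; the score value here is a hypothetical example.
% Minimal sketch: scale the default measurement noise by the detection score.
% A score close to 1 keeps the variance near the 25 square-pixel default,
% while a low-score detection gets a larger, less certain variance.
score = 0.8;                          % hypothetical detection score
baseVariance = 25;                    % default variance in square pixels
measurementNoise = (1/score)*baseVariance*eye(4)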
Open Video in Video Labeler
Download the video and open it using the Video Labeler app.
helperDownloadLabelVideo();
Downloading Pedestrian Tracking Video (90 MB)
videoLabeler("PedestrianLabelingVideo.avi");
Perform these steps to create a rectangular ROI label named Person with a numeric ID attribute.
Click Add Label in the Label Definition section of the app toolstrip.
Select the Rectangle ROI type.
Under Label Name, type Person. Choose a preferred color and click OK.
Select the Person ROI in the left ROI Labels panel and click on Attribute in the Label Definition section of the app toolstrip.
Select Numeric from the drop-down list of attributes.
Under Attribute Name, type ID, and click OK.
Expand the Person ROI in the left ROI Labels panel to display the fields shown below.
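Alternatively, you can define the same Person label and ID attribute programmatically with the labelDefinitionCreator (Computer Vision Toolbox) object. This is a minimal sketch; the default attribute value of 0 is an assumed placeholder.
% Define a rectangle Person label with a numeric ID attribute.
ldc = labelDefinitionCreator;
addLabel(ldc,"Person",labelType.Rectangle);
addAttribute(ldc,"Person","ID",attributeType.Numeric,0);   % 0 is an assumed default value

% Create the label definitions table, for example to build a
% groundTruth object programmatically.
labelDefs = create(ldc);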
Import Automation Algorithm
Next, open the Select Algorithm drop-down menu in the Automate Labeling section of the app toolstrip. The app detects the helperObjectDetectorTracker file located in the current example directory under +vision/+labeler. Click Refresh and select the ObjectDetectorTracker option. This image shows the Automate Labeling section of the toolstrip, which displays the name of the custom algorithm.
Run Automation Algorithm
Click the Automate button in the Automate Labeling section of the toolstrip. Once in automation mode, click Run to run the ObjectDetectorTracker automation algorithm, and visualize the automated labeling run frame by frame. Once the algorithm has processed the entire video, verify the generated bounding box labels as well as their ID attributes. Use this labeling workflow for object tracking or object re-identification to obtain unique and consistent identities for each person throughout the video.
The first frame does not contain any confirmed tracks, because the tracking algorithm configured in this example requires two frames to confirm a track, as illustrated by the sketch after this paragraph. When a detection is not assigned to an existing track, the algorithm uses the detection to initialize a new tentative track. The new tentative track becomes a confirmed track only if a detection can be assigned to it in the next frame. So that the first frame does not require manual labeling, the run function uses the initialized tentative tracks to obtain labels for the first frame. In subsequent frames, because the tracking algorithm requires two frames to confirm a new track, a person entering the field of view of the camera is not immediately labeled.
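This minimal sketch illustrates the two-frame confirmation behavior in isolation, assuming a single person detected at nearly the same location in two consecutive frames. The bounding box values, frame rate, and frame size are assumptions for illustration only.
% Configure a tracker similar to the one in helperObjectDetectorTracker.
% FrameRate and FrameSize values here are assumed for illustration.
tracker = trackerGNN( ...
    FilterInitializationFcn=@(d) initvisionbboxkf(d,FrameRate=1,FrameSize=[1280 720]), ...
    ConfirmationThreshold=[2 2], ...
    DeletionThreshold=[5 5]);

% Two detections of the same person, one second apart ([x y w h] in pixels).
det1 = objectDetection(0,[100 200 50 120]);
det2 = objectDetection(1,[102 201 50 120]);

confirmed1 = tracker(det1,0)   % empty: the track is still tentative
confirmed2 = tracker(det2,1)   % the track is confirmed on the second frame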
Verify and Refine Automation Results
Once the ObjectDetectorTracker automation algorithm has completed running, review the quality of the ROI labels and confirm that each person has a single unique ID. First, return to the beginning of the video using the navigation pane below the video display. Zoom in on the group of standing individuals. Verify that each label instance has a unique ID. For example, the leftmost person has an ID of 4.
Verify the automation algorithm results for correctness. The algorithm might have missed objects or assigned incorrect IDs due to one or more of the following:
Objects are occluded or outside the image frame.
Objects have too low a resolution for the detector to identify them.
Objects are too closely spaced together.
Objects exhibit rapid changes in direction.
The detection and tracking algorithms generate bounding boxes with unique IDs across the video. However, bounding boxes can sometimes be missing entirely (false negatives). In rarer cases, the tracker maintains bounding boxes that contain no visible person (false positives). Therefore, the labeling requires some manual refinement. To address these issues, add bounding boxes where false negatives exist and delete bounding boxes where false positives exist. Repair the unique IDs for identity switches and track fragmentations.
For example, the frame below, at the 13-second mark, shows instances of:
Two false negatives: the person in the dark brown pants, as well as the person in the white pants.
Two identity switches: the selected ROI has an ID of 1 (previously assigned to the person in the dark brown pants), and one of the boxes around the leftmost individual has an ID of 6 (previously assigned to the second rightmost person in the frame).
Two false positives: the two boxes that do not contain people in the frame.
Use the interactive capabilities of the Video Labeler app to add a new bounding box for each of the two false negatives. When you add a missing ROI, consider its ID value. Because, in most cases, the person is tracked in a previous or future frame, use the existing ID for this person. In the frame above, assign an ID of 6 to the rightmost individual, because that is their ID in the first frame.
You can repair identity switches in several ways. To minimize the number of steps, consider the entire video. Some individuals, such as the two leftmost people in the frame above, have an ROI that remains mostly consistent throughout the video. In cases like this, keep the original ID value and repair only a few frames for these individuals. However, as shown earlier, an ID switch occurred with the selected ROI in the image above between frame 1 and frame 2. Despite the initial difference in the first frame, keep the person in the blue sweater as ID 5 for efficiency, because this ID remains on the same person for the majority of the remaining frames of the video.
To help accelerate the repair process for pedestrians who remain stationary from frame to frame, or whose size is fairly constant throughout frames, copy and paste the ROI across frames. This approach is particularly useful for frames where the YOLO v4 object detector does not properly detect a person. In the image frames below, some pedestrians are occluded by others. In some frames where a person should be labeled, the detector misses them, and therefore they are not labeled.
To address this issue, copy and paste the missing person's ROI into the frame from the last frame where the person was correctly detected with a bounding box. These partial occlusion ground truth images are valuable for training a robust ReID network. To learn more about addressing occlusions during training, see the Generate Training Data with Synthetic Object Occlusions section of the Reidentify People Throughout a Video Sequence Using ReID Network (Computer Vision Toolbox) example.
Export Ground Truth
Once the labeling is complete and each person is tracked across the entire video sequence with a unique identifier, export the ground truth to the MATLAB workspace. This section also shows you how to map all IDs to a contiguous sequence of integers after exporting.
Once all refinements are done, click Accept in the Close section of the automation toolstrip. Next, export the ground truth: click Export > To Workspace in the app toolstrip.
The ground truth MAT-file contains the results of the tracking automation algorithm and the additional post-processing. You may skip the manual correction steps described above and directly load the ground truth to continue this example.
load("groundTruth.mat","gTruth");
Note that you can import the groundTruth.mat file back into the labeler by clicking Import > Labels > From Workspace, and then selecting the gTruth variable in the dialog box.
The IDs assigned during automation are the track IDs generated by the trackerGNN tracker. These track IDs can skip values, because tentative tracks that are never confirmed still consume IDs. The loaded groundTruth object has IDs 1 through 7, then jumps to 9, then 13, followed by additional varying numerical steps. Because a sequential set of IDs is a convenient way to organize ground truth data, use the helperSequentiallyRenumberIDs function to renumber all of the ground truth IDs sequentially.
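To see the gaps before renumbering, you can inspect the set of IDs directly. This sketch assumes the Person label data is stored as per-frame structure arrays with an ID field, as produced by the automation algorithm in this example.
% List the unique track IDs present in the exported ground truth.
allLabels = vertcat(gTruth.LabelData.Person{:});
unique([allLabels.ID])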
gTruth = helperSequentiallyRenumberIDs(gTruth);
Next Steps
After sequentially ordering the ground truth, you can convert it for object tracking or for training a re-identification network. To learn more about how to convert ground truth for object tracking or training a ReID network, see the Convert Ground Truth Labeling Data for Object Tracking (Computer Vision Toolbox) and Convert Ground Truth Labeling Data for Object Re-Identification (Computer Vision Toolbox) examples.
Outlook
Labeling ground truth data for object tracking and re-identification is a challenging task. When the resolution of an object is high enough and there are minimal occlusions, you can accelerate labeling with automation algorithms such as ObjectDetectorTracker. However, the automation results often need additional refinement, especially when poor resolution or object occlusion is present, so you must verify and correct the results manually.
To improve the results of the automation algorithm, more robust multi-object tracking algorithms such as DeepSORT can be used. To learn more about multi-object tracking using DeepSORT, see the Multi-Object Tracking with DeepSORT example.
Supporting Functions
helperDownloadLabelVideo
Download the pedestrian labeling video.
function helperDownloadLabelVideo
    videoURL = "https://ssd.mathworks.com/supportfiles/vision/data/PedestrianLabelingVideo.avi";
    if ~exist("PedestrianLabelingVideo.avi","file")
        disp("Downloading Pedestrian Tracking Video (90 MB)")
        websave("PedestrianLabelingVideo.avi",videoURL);
    end
end
helperSequentiallyRenumberIDs
Renumber each ID in the ground truth data to progress in a contiguous and sequential order.
function gTruth = helperSequentiallyRenumberIDs(gTruth)
    % Collect all Person labels across frames and find the unique IDs.
    allLabels = struct2table(vertcat(gTruth.LabelData.Person{:}));
    oldIDs = unique(allLabels.ID);
    newIDs = cast(1:numel(oldIDs),'like',oldIDs);

    % Replace each old ID with its sequential counterpart, frame by frame.
    data = gTruth.LabelData.Person;
    for i = 1:numel(data)
        for id = 1:numel(oldIDs)
            oldID = oldIDs(id);
            ind = find([data{i}.ID] == oldID);
            if ~isempty(ind)
                if length(ind) > 1
                    error(['ID ' num2str(oldID) ' in video frame ' num2str(i) ' is not a unique ID.']);
                end
                data{i}(ind).ID = newIDs(id);
            end
        end
    end

    % Rebuild the groundTruth object with the renumbered IDs.
    labelData = gTruth.LabelData;
    labelData.Person = data;
    gTruth = groundTruth(gTruth.DataSource, gTruth.LabelDefinitions, labelData);
end