Human Pose Estimation by Using Segmentation DAG Network Deployed to FPGA
This example shows how to create, compile, and deploy a dlhdl.Workflow object by using the Deep Learning HDL Toolbox™ Support Package for Xilinx® FPGA and SoC. The Workflow object has a custom trained human pose estimation network as the network object. The network detects and outputs the poses of people present in an input image of size 256-by-192. To train the network, see Estimate Body Pose Using Deep Learning.
The goal of body pose estimation is to identify the location of people in an image and the orientation of their body parts. When multiple people are present in a scene, pose estimation can be more difficult because of occlusion, body contact, and proximity of similar body parts. Rapidly prototype and verify the accuracy and performance of your custom trained human pose estimation network by using Deep Learning HDL Toolbox™ to deploy the network to your target FPGA board and using MATLAB® to retrieve the prediction results.
Prerequisites
Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit
Deep Learning HDL Toolbox™ Support Package for Xilinx® FPGA and SoC
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Load Pretrained Pose Estimation Network
To load the pretrained Directed Acyclic Graph (DAG) network, enter:
net = getPoseEstimationNetwork
Fetching PoseEstimationNetwork.zip (55 MB)...
net =
  DAGNetwork with properties:

         Layers: [75×1 nnet.cnn.layer.Layer]
    Connections: [82×2 table]
     InputNames: {'data'}
    OutputNames: {'RegressionLayer_conv15_fwd'}
Use the analyzeNetwork function to obtain information about the 75 layers in the DAG network.
analyzeNetwork(net)
Create Target Object
Use the dlhdl.Target class to create a target object that has a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG (default) and Ethernet. To use JTAG, install Xilinx® Vivado® Design Suite 2022.1. To set the Xilinx Vivado tool path, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2022.1\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx', Interface = 'Ethernet');
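Alternatively, if you prefer the default JTAG interface, omit the Interface argument. A minimal sketch, assuming the Vivado tool path has already been set as shown above:

```matlab
% Sketch: target object using the default JTAG interface instead of Ethernet.
% JTAG requires Xilinx Vivado Design Suite 2022.1 and a set tool path.
hTarget = dlhdl.Target('Xilinx');
```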
Create Workflow Object
Create an object of the dlhdl.Workflow class. Specify the saved pretrained pose estimation network, net, as the network object. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses the single data type.
hW = dlhdl.Workflow(Network = net, Bitstream = 'zcu102_single', Target = hTarget);
Compile Workflow Object
To compile the pose estimation network, run the compile function of the dlhdl.Workflow object.
dn = compile(hW);
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### The network includes the following layers:
     1   'data'                         Image Input              256×192×3 images with 'zscore' normalization                                  (SW Layer)
     2   'conv1'                        Convolution              64 7×7×3 convolutions with stride [2 2] and padding [3 3 3 3]                 (HW Layer)
     3   'bn_conv1'                     Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
     4   'conv1_relu'                   ReLU                     ReLU                                                                          (HW Layer)
     5   'pool1'                        Max Pooling              3×3 max pooling with stride [2 2] and padding [1 1 1 1]                       (HW Layer)
     6   'res2a_branch2a'               Convolution              64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]                (HW Layer)
     7   'bn2a_branch2a'                Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
     8   'res2a_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
     9   'res2a_branch2b'               Convolution              64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]                (HW Layer)
    10   'bn2a_branch2b'                Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
    11   'res2a'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    12   'res2a_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    13   'res2b_branch2a'               Convolution              64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]                (HW Layer)
    14   'bn2b_branch2a'                Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
    15   'res2b_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    16   'res2b_branch2b'               Convolution              64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]                (HW Layer)
    17   'bn2b_branch2b'                Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
    18   'res2b'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    19   'res2b_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    20   'res3a_branch2a'               Convolution              128 3×3×64 convolutions with stride [2 2] and padding [1 1 1 1]               (HW Layer)
    21   'bn3a_branch2a'                Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    22   'res3a_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    23   'res3a_branch2b'               Convolution              128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    24   'bn3a_branch2b'                Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    25   'res3a'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    26   'res3a_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    27   'res3a_branch1'                Convolution              128 1×1×64 convolutions with stride [2 2] and padding [0 0 0 0]               (HW Layer)
    28   'bn3a_branch1'                 Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    29   'res3b_branch2a'               Convolution              128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    30   'bn3b_branch2a'                Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    31   'res3b_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    32   'res3b_branch2b'               Convolution              128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    33   'bn3b_branch2b'                Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    34   'res3b'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    35   'res3b_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    36   'res4a_branch2a'               Convolution              256 3×3×128 convolutions with stride [2 2] and padding [1 1 1 1]              (HW Layer)
    37   'bn4a_branch2a'                Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    38   'res4a_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    39   'res4a_branch2b'               Convolution              256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    40   'bn4a_branch2b'                Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    41   'res4a'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    42   'res4a_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    43   'res4a_branch1'                Convolution              256 1×1×128 convolutions with stride [2 2] and padding [0 0 0 0]              (HW Layer)
    44   'bn4a_branch1'                 Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    45   'res4b_branch2a'               Convolution              256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    46   'bn4b_branch2a'                Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    47   'res4b_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    48   'res4b_branch2b'               Convolution              256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    49   'bn4b_branch2b'                Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    50   'res4b'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    51   'res4b_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    52   'res5a_branch2a'               Convolution              512 3×3×256 convolutions with stride [2 2] and padding [1 1 1 1]              (HW Layer)
    53   'bn5a_branch2a'                Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    54   'res5a_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    55   'res5a_branch2b'               Convolution              512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    56   'bn5a_branch2b'                Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    57   'res5a'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    58   'res5a_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    59   'res5a_branch1'                Convolution              512 1×1×256 convolutions with stride [2 2] and padding [0 0 0 0]              (HW Layer)
    60   'bn5a_branch1'                 Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    61   'res5b_branch2a'               Convolution              512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    62   'bn5b_branch2a'                Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    63   'res5b_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    64   'res5b_branch2b'               Convolution              512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    65   'bn5b_branch2b'                Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    66   'res5b'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    67   'res5b_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    68   'transposed-conv_1'            Transposed Convolution   256 4×4×512 transposed convolutions with stride [2 2] and cropping 'same'     (HW Layer)
    69   'relu_1'                       ReLU                     ReLU                                                                          (HW Layer)
    70   'transposed-conv_2'            Transposed Convolution   256 4×4×256 transposed convolutions with stride [2 2] and cropping 'same'     (HW Layer)
    71   'relu_2'                       ReLU                     ReLU                                                                          (HW Layer)
    72   'transposed-conv_3'            Transposed Convolution   256 4×4×256 transposed convolutions with stride [2 2] and cropping 'same'     (HW Layer)
    73   'relu_3'                       ReLU                     ReLU                                                                          (HW Layer)
    74   'conv2d_final'                 Convolution              17 1×1×256 convolutions with stride [1 1] and padding [0 0 0 0]               (HW Layer)
    75   'RegressionLayer_conv15_fwd'   Regression Output        mean-squared-error                                                            (SW Layer)
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'transposed-conv_1' of type 'nnet.cnn.layer.TransposedConvolution2DLayer' is split into 'transposed-conv_1_insertZeros' and 'transposed-conv_1'.
### Notice: The layer 'transposed-conv_2' of type 'nnet.cnn.layer.TransposedConvolution2DLayer' is split into 'transposed-conv_2_insertZeros' and 'transposed-conv_2'.
### Notice: The layer 'transposed-conv_3' of type 'nnet.cnn.layer.TransposedConvolution2DLayer' is split into 'transposed-conv_3_insertZeros' and 'transposed-conv_3'.
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### Notice: The layer 'RegressionLayer_conv15_fwd' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: conv1>>pool1 ...
### Compiling layer group: conv1>>pool1 ... complete.
### Compiling layer group: res2a_branch2a>>res2a_branch2b ...
### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.
### Compiling layer group: res2b_branch2a>>res2b_branch2b ...
### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.
### Compiling layer group: res3a_branch1 ...
### Compiling layer group: res3a_branch1 ... complete.
### Compiling layer group: res3a_branch2a>>res3a_branch2b ...
### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.
### Compiling layer group: res3b_branch2a>>res3b_branch2b ...
### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.
### Compiling layer group: res4a_branch1 ...
### Compiling layer group: res4a_branch1 ... complete.
### Compiling layer group: res4a_branch2a>>res4a_branch2b ...
### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.
### Compiling layer group: res4b_branch2a>>res4b_branch2b ...
### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.
### Compiling layer group: res5a_branch1 ...
### Compiling layer group: res5a_branch1 ... complete.
### Compiling layer group: res5a_branch2a>>res5a_branch2b ...
### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.
### Compiling layer group: res5b_branch2a>>res5b_branch2b ...
### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.
### Compiling layer group: transposed-conv_1_insertZeros ...
### Compiling layer group: transposed-conv_1_insertZeros ... complete.
### Compiling layer group: transposed-conv_1>>relu_1 ...
### Compiling layer group: transposed-conv_1>>relu_1 ... complete.
### Compiling layer group: transposed-conv_2_insertZeros ...
### Compiling layer group: transposed-conv_2_insertZeros ... complete.
### Compiling layer group: transposed-conv_2>>relu_2 ...
### Compiling layer group: transposed-conv_2>>relu_2 ... complete.
### Compiling layer group: transposed-conv_3_insertZeros ...
### Compiling layer group: transposed-conv_3_insertZeros ... complete.
### Compiling layer group: transposed-conv_3>>conv2d_final ...
### Compiling layer group: transposed-conv_3>>conv2d_final ... complete.
### Allocating external memory buffers:

          offset_name          offset_address    allocated_space
    _______________________    ______________    _________________

    "InputDataOffset"           "0x00000000"     "24.0 MB"
    "OutputResultOffset"        "0x01800000"     "8.0 MB"
    "SchedulerDataOffset"       "0x02000000"     "8.0 MB"
    "SystemBufferOffset"        "0x02800000"     "28.0 MB"
    "InstructionDataOffset"     "0x04400000"     "8.0 MB"
    "ConvWeightDataOffset"      "0x04c00000"     "220.0 MB"
    "EndOffset"                 "0x12800000"     "Total: 296.0 MB"

### Network compilation complete.
Program Bitstream onto FPGA and Download Network Weights
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file, and also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages along with the time it takes to deploy the network.
deploy(hW)
### Programming FPGA Bitstream using Ethernet...
Downloading target FPGA device configuration over Ethernet to SD card ...
# Copied /tmp/hdlcoder_rd to /mnt/hdlcoder_rd
# Copying Bitstream hdlcoder_system.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/hdlcoder_system.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot for persistent changes to take effect.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 19-Jan-2022 20:13:32
Load Test Image
Read a test image, then crop an image of a person and resize it to the network input size:
I = imread('visionteam1.jpg');
bbox = [182 74 303 404];
Iin = imresize(imcrop(I, bbox), [256, 192]);
Run Prediction for One Image
Execute the predict function of the dlhdl.Workflow object.
[prediction, speed] = predict(hW, single(Iin), Profile = 'on');
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                              LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum      Total Latency     Frames/s
                                    -------------             -------------            ---------        ---------        ---------
Network                               106379104               0.48354                      1          106382160              2.1
    data_norm_add                        344327               0.00157
    data_norm                            344408               0.00157
    conv1                               2193504               0.00997
    pool1                                518554               0.00236
    res2a_branch2a                       961197               0.00437
    res2a_branch2b                       960769               0.00437
    res2a                                366754               0.00167
    res2b_branch2a                       961107               0.00437
    res2b_branch2b                       960940               0.00437
    res2b                                366715               0.00167
    res3a_branch1                        549086               0.00250
    res3a_branch2a                       542269               0.00246
    res3a_branch2b                       894520               0.00407
    res3a                                183362               0.00083
    res3b_branch2a                       894609               0.00407
    res3b_branch2b                       894473               0.00407
    res3b                                183403               0.00083
    res4a_branch1                        485003               0.00220
    res4a_branch2a                       485309               0.00221
    res4a_branch2b                       877978               0.00399
    res4a                                 91703               0.00042
    res4b_branch2a                       878002               0.00399
    res4b_branch2b                       878177               0.00399
    res4b                                 91743               0.00042
    res5a_branch1                       1063237               0.00483
    res5a_branch2a                      1063292               0.00483
    res5a_branch2b                      2064743               0.00939
    res5a                                 45904               0.00021
    res5b_branch2a                      2064047               0.00938
    res5b_branch2b                      2064894               0.00939
    res5b                                 45894               0.00021
    transposed-conv_1_insertZeros        219876               0.00100
    transposed-conv_1                   6587071               0.02994
    transposed-conv_2_insertZeros        261960               0.00119
    transposed-conv_2                  16585251               0.07539
    transposed-conv_3_insertZeros       1058301               0.00481
    transposed-conv_3                  55919081               0.25418
    conv2d_final                        1427387               0.00649
 * The clock frequency of the DL processor is: 220MHz
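As a sanity check, the whole-network latency that the profiler reports follows directly from the cycle count and the 220 MHz clock frequency of the deep learning processor:

```matlab
% Cross-check the profiler's reported latency: cycles / clock frequency.
cycles  = 106379104;      % LastFrameLatency(cycles) for the whole network
clkHz   = 220e6;          % deep learning processor clock frequency
latency = cycles / clkHz  % about 0.4835 s, which gives roughly 2.1 frames/s
```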
The output data has 17 channels. Each channel corresponds to a heatmap for a unique body part. To obtain keypoints from the heatmaps, use the heatmaps2Keypoints helper function.
To visualize the results, superimpose the detected keypoints on the original image by using the visualizeKeyPoints helper function. Both helper functions are attached to the example as supporting files.
keypoints = heatmaps2Keypoints(prediction);
J = visualizeKeyPoints(Iin, keypoints);
imshow(J);
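The actual conversion lives in the heatmaps2Keypoints supporting file; a minimal sketch of the underlying idea is a per-channel argmax, assuming the heatmaps are at one quarter of the input resolution (64-by-48 for a 256-by-192 input). The scale factor and variable names here are illustrative, not the supporting file's implementation:

```matlab
% Sketch (not the actual supporting file): take the location of each
% channel's peak response as that body part's keypoint, then scale it
% back to input-image coordinates (assumed stride of 4).
[h, w, numParts] = size(prediction);   % e.g. 64×48×17
keypoints = zeros(numParts, 2);
for k = 1:numParts
    [~, idx] = max(prediction(:, :, k), [], 'all', 'linear');
    [row, col] = ind2sub([h w], idx);
    keypoints(k, :) = [col row] * 4;   % [x y] in 256×192 image coordinates
end
```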
See Also
dlhdl.Target
| dlhdl.Workflow
| compile
| deploy
| predict
| classify