Human Pose Estimation by Using Segmentation DAG Network Deployed to FPGA
This example shows how to create, compile, and deploy a dlhdl.Workflow object by using the Deep Learning HDL Toolbox™ Support Package for Xilinx® FPGA and SoC. The Workflow object has a custom trained human pose estimation network as the network object. The network detects and outputs the poses of people present in an input image of size 256-by-192. To train the network, see Estimate Body Pose Using Deep Learning.
The goal of body pose estimation is to identify the location of people in an image and the orientation of their body parts. When multiple people are present in a scene, pose estimation can be more difficult because of occlusion, body contact, and proximity of similar body parts. Rapidly prototype and verify the accuracy and performance of your custom trained human pose estimation network by using Deep Learning HDL Toolbox™ to deploy the network to your target FPGA board and using MATLAB® to retrieve the prediction results.
Prerequisites
Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit
Deep Learning HDL Toolbox™ Support Package for Xilinx® FPGA and SoC
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Load Pretrained Pose Estimation Network
To load the pretrained Directed Acyclic Graph (DAG) network, enter:
net = getPoseEstimationNetwork
Fetching PoseEstimationNetwork.zip (55 MB)...
net =
  DAGNetwork with properties:

         Layers: [75×1 nnet.cnn.layer.Layer]
    Connections: [82×2 table]
     InputNames: {'data'}
    OutputNames: {'RegressionLayer_conv15_fwd'}
Use the analyzeNetwork function to obtain information about the 75 layers in the DAG network.
analyzeNetwork(net)
Create Target Object
Use the dlhdl.Target class to create a target object that has a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG (default) and Ethernet. To use JTAG, install Xilinx® Vivado® Design Suite 2022.1. To set the Xilinx Vivado tool path, enter:
% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2022.1\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx', Interface = 'Ethernet');
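Alternatively, if you prefer the default JTAG interface, omit the Interface argument. A minimal sketch, assuming the Vivado tool path has already been set as shown above:

```matlab
% Sketch: target object using the default JTAG interface instead of Ethernet.
% JTAG requires Xilinx Vivado Design Suite 2022.1 and a set tool path.
hTarget = dlhdl.Target('Xilinx');
```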
Create Workflow Object
Create an object of the dlhdl.Workflow class. Specify the saved pretrained pose estimation network, net, as the network object. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses the single data type.
hW = dlhdl.Workflow(Network = net, Bitstream = 'zcu102_single', Target = hTarget);
Compile Workflow Object
To compile the pose estimation network, run the compile function of the dlhdl.Workflow object.
dn = compile(hW);
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### The network includes the following layers:
     1   'data'                         Image Input              256×192×3 images with 'zscore' normalization                                  (SW Layer)
     2   'conv1'                        Convolution              64 7×7×3 convolutions with stride [2 2] and padding [3 3 3 3]                 (HW Layer)
     3   'bn_conv1'                     Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
     4   'conv1_relu'                   ReLU                     ReLU                                                                          (HW Layer)
     5   'pool1'                        Max Pooling              3×3 max pooling with stride [2 2] and padding [1 1 1 1]                       (HW Layer)
     6   'res2a_branch2a'               Convolution              64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]                (HW Layer)
     7   'bn2a_branch2a'                Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
     8   'res2a_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
     9   'res2a_branch2b'               Convolution              64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]                (HW Layer)
    10   'bn2a_branch2b'                Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
    11   'res2a'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    12   'res2a_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    13   'res2b_branch2a'               Convolution              64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]                (HW Layer)
    14   'bn2b_branch2a'                Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
    15   'res2b_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    16   'res2b_branch2b'               Convolution              64 3×3×64 convolutions with stride [1 1] and padding [1 1 1 1]                (HW Layer)
    17   'bn2b_branch2b'                Batch Normalization      Batch normalization with 64 channels                                          (HW Layer)
    18   'res2b'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    19   'res2b_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    20   'res3a_branch2a'               Convolution              128 3×3×64 convolutions with stride [2 2] and padding [1 1 1 1]               (HW Layer)
    21   'bn3a_branch2a'                Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    22   'res3a_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    23   'res3a_branch2b'               Convolution              128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    24   'bn3a_branch2b'                Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    25   'res3a'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    26   'res3a_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    27   'res3a_branch1'                Convolution              128 1×1×64 convolutions with stride [2 2] and padding [0 0 0 0]               (HW Layer)
    28   'bn3a_branch1'                 Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    29   'res3b_branch2a'               Convolution              128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    30   'bn3b_branch2a'                Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    31   'res3b_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    32   'res3b_branch2b'               Convolution              128 3×3×128 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    33   'bn3b_branch2b'                Batch Normalization      Batch normalization with 128 channels                                         (HW Layer)
    34   'res3b'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    35   'res3b_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    36   'res4a_branch2a'               Convolution              256 3×3×128 convolutions with stride [2 2] and padding [1 1 1 1]              (HW Layer)
    37   'bn4a_branch2a'                Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    38   'res4a_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    39   'res4a_branch2b'               Convolution              256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    40   'bn4a_branch2b'                Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    41   'res4a'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    42   'res4a_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    43   'res4a_branch1'                Convolution              256 1×1×128 convolutions with stride [2 2] and padding [0 0 0 0]              (HW Layer)
    44   'bn4a_branch1'                 Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    45   'res4b_branch2a'               Convolution              256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    46   'bn4b_branch2a'                Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    47   'res4b_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    48   'res4b_branch2b'               Convolution              256 3×3×256 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    49   'bn4b_branch2b'                Batch Normalization      Batch normalization with 256 channels                                         (HW Layer)
    50   'res4b'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    51   'res4b_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    52   'res5a_branch2a'               Convolution              512 3×3×256 convolutions with stride [2 2] and padding [1 1 1 1]              (HW Layer)
    53   'bn5a_branch2a'                Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    54   'res5a_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    55   'res5a_branch2b'               Convolution              512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    56   'bn5a_branch2b'                Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    57   'res5a'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    58   'res5a_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    59   'res5a_branch1'                Convolution              512 1×1×256 convolutions with stride [2 2] and padding [0 0 0 0]              (HW Layer)
    60   'bn5a_branch1'                 Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    61   'res5b_branch2a'               Convolution              512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    62   'bn5b_branch2a'                Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    63   'res5b_branch2a_relu'          ReLU                     ReLU                                                                          (HW Layer)
    64   'res5b_branch2b'               Convolution              512 3×3×512 convolutions with stride [1 1] and padding [1 1 1 1]              (HW Layer)
    65   'bn5b_branch2b'                Batch Normalization      Batch normalization with 512 channels                                         (HW Layer)
    66   'res5b'                        Addition                 Element-wise addition of 2 inputs                                             (HW Layer)
    67   'res5b_relu'                   ReLU                     ReLU                                                                          (HW Layer)
    68   'transposed-conv_1'            Transposed Convolution   256 4×4×512 transposed convolutions with stride [2 2] and cropping 'same'     (HW Layer)
    69   'relu_1'                       ReLU                     ReLU                                                                          (HW Layer)
    70   'transposed-conv_2'            Transposed Convolution   256 4×4×256 transposed convolutions with stride [2 2] and cropping 'same'     (HW Layer)
    71   'relu_2'                       ReLU                     ReLU                                                                          (HW Layer)
    72   'transposed-conv_3'            Transposed Convolution   256 4×4×256 transposed convolutions with stride [2 2] and cropping 'same'     (HW Layer)
    73   'relu_3'                       ReLU                     ReLU                                                                          (HW Layer)
    74   'conv2d_final'                 Convolution              17 1×1×256 convolutions with stride [1 1] and padding [0 0 0 0]               (HW Layer)
    75   'RegressionLayer_conv15_fwd'   Regression Output        mean-squared-error                                                            (SW Layer)
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'transposed-conv_1' of type 'nnet.cnn.layer.TransposedConvolution2DLayer' is split into 'transposed-conv_1_insertZeros' and 'transposed-conv_1'.
### Notice: The layer 'transposed-conv_2' of type 'nnet.cnn.layer.TransposedConvolution2DLayer' is split into 'transposed-conv_2_insertZeros' and 'transposed-conv_2'.
### Notice: The layer 'transposed-conv_3' of type 'nnet.cnn.layer.TransposedConvolution2DLayer' is split into 'transposed-conv_3_insertZeros' and 'transposed-conv_3'.
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### Notice: The layer 'RegressionLayer_conv15_fwd' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: conv1>>pool1 ...
### Compiling layer group: conv1>>pool1 ... complete.
### Compiling layer group: res2a_branch2a>>res2a_branch2b ...
### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.
### Compiling layer group: res2b_branch2a>>res2b_branch2b ...
### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.
### Compiling layer group: res3a_branch1 ...
### Compiling layer group: res3a_branch1 ... complete.
### Compiling layer group: res3a_branch2a>>res3a_branch2b ...
### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.
### Compiling layer group: res3b_branch2a>>res3b_branch2b ...
### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.
### Compiling layer group: res4a_branch1 ...
### Compiling layer group: res4a_branch1 ... complete.
### Compiling layer group: res4a_branch2a>>res4a_branch2b ...
### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.
### Compiling layer group: res4b_branch2a>>res4b_branch2b ...
### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.
### Compiling layer group: res5a_branch1 ...
### Compiling layer group: res5a_branch1 ... complete.
### Compiling layer group: res5a_branch2a>>res5a_branch2b ...
### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.
### Compiling layer group: res5b_branch2a>>res5b_branch2b ...
### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.
### Compiling layer group: transposed-conv_1_insertZeros ...
### Compiling layer group: transposed-conv_1_insertZeros ... complete.
### Compiling layer group: transposed-conv_1>>relu_1 ...
### Compiling layer group: transposed-conv_1>>relu_1 ... complete.
### Compiling layer group: transposed-conv_2_insertZeros ...
### Compiling layer group: transposed-conv_2_insertZeros ... complete.
### Compiling layer group: transposed-conv_2>>relu_2 ...
### Compiling layer group: transposed-conv_2>>relu_2 ... complete.
### Compiling layer group: transposed-conv_3_insertZeros ...
### Compiling layer group: transposed-conv_3_insertZeros ... complete.
### Compiling layer group: transposed-conv_3>>conv2d_final ...
### Compiling layer group: transposed-conv_3>>conv2d_final ... complete.
### Allocating external memory buffers:

          offset_name          offset_address    allocated_space
    _______________________    ______________    _________________

    "InputDataOffset"           "0x00000000"     "24.0 MB"
    "OutputResultOffset"        "0x01800000"     "8.0 MB"
    "SchedulerDataOffset"       "0x02000000"     "8.0 MB"
    "SystemBufferOffset"        "0x02800000"     "28.0 MB"
    "InstructionDataOffset"     "0x04400000"     "8.0 MB"
    "ConvWeightDataOffset"      "0x04c00000"     "220.0 MB"
    "EndOffset"                 "0x12800000"     "Total: 296.0 MB"

### Network compilation complete.
Program Bitstream onto FPGA and Download Network Weights
To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file, and also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages along with the time it takes to deploy the network.
deploy(hW)
### Programming FPGA Bitstream using Ethernet...
Downloading target FPGA device configuration over Ethernet to SD card ...
# Copied /tmp/hdlcoder_rd to /mnt/hdlcoder_rd
# Copying Bitstream hdlcoder_system.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/hdlcoder_system.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
Downloading target FPGA device configuration over Ethernet to SD card done. The system will now reboot for persistent changes to take effect.
System is rebooting . . . . . .
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 19-Jan-2022 20:13:32
Load Test Image
Read a test image, then crop an image of a person and resize it to the network input size:
I = imread('visionteam1.jpg');
bbox = [182 74 303 404];
Iin = imresize(imcrop(I, bbox), [256, 192]);
Run Prediction for One Image
Execute the predict function of the dlhdl.Workflow object.
[prediction, speed] = predict(hW, single(Iin), Profile = 'on');
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                              LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum      Total Latency     Frames/s
                                    -------------             -------------            ---------        ---------        ---------
Network                               106379104               0.48354                      1          106382160              2.1
    data_norm_add                        344327               0.00157
    data_norm                            344408               0.00157
    conv1                               2193504               0.00997
    pool1                                518554               0.00236
    res2a_branch2a                       961197               0.00437
    res2a_branch2b                       960769               0.00437
    res2a                                366754               0.00167
    res2b_branch2a                       961107               0.00437
    res2b_branch2b                       960940               0.00437
    res2b                                366715               0.00167
    res3a_branch1                        549086               0.00250
    res3a_branch2a                       542269               0.00246
    res3a_branch2b                       894520               0.00407
    res3a                                183362               0.00083
    res3b_branch2a                       894609               0.00407
    res3b_branch2b                       894473               0.00407
    res3b                                183403               0.00083
    res4a_branch1                        485003               0.00220
    res4a_branch2a                       485309               0.00221
    res4a_branch2b                       877978               0.00399
    res4a                                 91703               0.00042
    res4b_branch2a                       878002               0.00399
    res4b_branch2b                       878177               0.00399
    res4b                                 91743               0.00042
    res5a_branch1                       1063237               0.00483
    res5a_branch2a                      1063292               0.00483
    res5a_branch2b                      2064743               0.00939
    res5a                                 45904               0.00021
    res5b_branch2a                      2064047               0.00938
    res5b_branch2b                      2064894               0.00939
    res5b                                 45894               0.00021
    transposed-conv_1_insertZeros        219876               0.00100
    transposed-conv_1                   6587071               0.02994
    transposed-conv_2_insertZeros        261960               0.00119
    transposed-conv_2                  16585251               0.07539
    transposed-conv_3_insertZeros       1058301               0.00481
    transposed-conv_3                  55919081               0.25418
    conv2d_final                        1427387               0.00649
 * The clock frequency of the DL processor is: 220MHz
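As a sanity check, the whole-network latency that the profiler reports follows directly from the cycle count and the 220 MHz clock frequency of the deep learning processor:

```matlab
% Cross-check the profiler's reported latency: cycles / clock frequency.
cycles  = 106379104;      % LastFrameLatency(cycles) for the whole network
clkHz   = 220e6;          % deep learning processor clock frequency
latency = cycles / clkHz  % about 0.4835 s, which gives roughly 2.1 frames/s
```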
The output data has 17 channels. Each channel corresponds to a heatmap for a unique body part. To obtain keypoints from the heatmaps, use the heatmaps2Keypoints helper function.
To visualize the results, superimpose the detected keypoints on the original image by using the visualizeKeyPoints helper function. Both helper functions are attached to the example as supporting files.
keypoints = heatmaps2Keypoints(prediction);
J = visualizeKeyPoints(Iin, keypoints);
imshow(J);
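The actual conversion lives in the heatmaps2Keypoints supporting file; a minimal sketch of the underlying idea is a per-channel argmax, assuming the heatmaps are at one quarter of the input resolution (64-by-48 for a 256-by-192 input). The scale factor and variable names here are illustrative, not the supporting file's implementation:

```matlab
% Sketch (not the actual supporting file): take the location of each
% channel's peak response as that body part's keypoint, then scale it
% back to input-image coordinates (assumed stride of 4).
[h, w, numParts] = size(prediction);   % e.g. 64×48×17
keypoints = zeros(numParts, 2);
for k = 1:numParts
    [~, idx] = max(prediction(:, :, k), [], 'all', 'linear');
    [row, col] = ind2sub([h w], idx);
    keypoints(k, :) = [col row] * 4;   % [x y] in 256×192 image coordinates
end
```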
See Also
dlhdl.Target
| dlhdl.Workflow
| compile
| deploy
| predict
| classify