Deploy Image Recognition Network on FPGA With and Without Pruning
This example shows you how to deploy an image recognition network to an FPGA with and without convolutional filter pruning. Filter pruning is a compression technique that uses a criterion to identify and remove the least important filters in a network, reducing the overall memory footprint of the network without significantly reducing its accuracy.
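For intuition, the following minimal sketch ranks the filters of one convolution layer by the L1-norm of their weights, a simple magnitude criterion. The weight array here is a random stand-in, and the pruned network used later in this example was produced with the Taylor-score criterion instead (see Prune Image Classification Network Using Taylor Scores).
% Magnitude-based filter ranking (illustrative sketch; W is a random
% stand-in for a 3-by-3-by-3-by-16 convolution weight array).
W = rand(3,3,3,16);
filterScores = squeeze(sum(abs(W), [1 2 3])); % one L1 score per filter
[~, ranking] = sort(filterScores);            % ascending: least important first
filtersToPrune = ranking(1:4)                 % candidate filters to remove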
Load Unpruned Network
Load the trained, unpruned network. For more information on network training, see Train Residual Network for Image Classification.
load("trainedYOLONet.mat");
Test Network
Load a test image. The test image is part of the CIFAR-10 data set [1]. To download the data set, see the Prepare Data section in Train Residual Network for Image Classification.
load("testImage.mat");
Use the runOnHW function to:
Prepare the network for deployment.
Compile the network to generate weights, biases, and instructions.
Deploy the network to the FPGA board.
Retrieve the prediction results using MATLAB®.
To view the code for this function, see Helper Functions.
[~, speedInitial] = runOnHW(trainedNet,testImage,'zcu102_single');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input' of type 'ImageInputLayer' is split into an image input layer 'input' and an addition layer 'input_norm' for normalization on hardware.
### The network includes the following layers:
     1   'input'         Image Input             32×32×3 images with 'zerocenter' normalization                   (SW Layer)
     2   'convInp'       2-D Convolution         16 3×3×3 convolutions with stride [1 1] and padding 'same'       (HW Layer)
     3   'reluInp'       ReLU                    ReLU                                                             (HW Layer)
     4   'S1U1_conv1'    2-D Convolution         16 3×3×16 convolutions with stride [1 1] and padding 'same'      (HW Layer)
     5   'S1U1_relu1'    ReLU                    ReLU                                                             (HW Layer)
     6   'S1U1_conv2'    2-D Convolution         16 3×3×16 convolutions with stride [1 1] and padding 'same'      (HW Layer)
     7   'add11'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
     8   'relu11'        ReLU                    ReLU                                                             (HW Layer)
     9   'S1U2_conv1'    2-D Convolution         16 3×3×16 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    10   'S1U2_relu1'    ReLU                    ReLU                                                             (HW Layer)
    11   'S1U2_conv2'    2-D Convolution         16 3×3×16 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    12   'add12'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
    13   'relu12'        ReLU                    ReLU                                                             (HW Layer)
    14   'S1U3_conv1'    2-D Convolution         16 3×3×16 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    15   'S1U3_relu1'    ReLU                    ReLU                                                             (HW Layer)
    16   'S1U3_conv2'    2-D Convolution         16 3×3×16 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    17   'add13'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
    18   'relu13'        ReLU                    ReLU                                                             (HW Layer)
    19   'S2U1_conv1'    2-D Convolution         32 3×3×16 convolutions with stride [2 2] and padding 'same'      (HW Layer)
    20   'S2U1_relu1'    ReLU                    ReLU                                                             (HW Layer)
    21   'S2U1_conv2'    2-D Convolution         32 3×3×32 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    22   'skipConv1'     2-D Convolution         32 1×1×16 convolutions with stride [2 2] and padding [0 0 0 0]   (HW Layer)
    23   'add21'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
    24   'relu21'        ReLU                    ReLU                                                             (HW Layer)
    25   'S2U2_conv1'    2-D Convolution         32 3×3×32 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    26   'S2U2_relu1'    ReLU                    ReLU                                                             (HW Layer)
    27   'S2U2_conv2'    2-D Convolution         32 3×3×32 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    28   'add22'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
    29   'relu22'        ReLU                    ReLU                                                             (HW Layer)
    30   'S2U3_conv1'    2-D Convolution         32 3×3×32 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    31   'S2U3_relu1'    ReLU                    ReLU                                                             (HW Layer)
    32   'S2U3_conv2'    2-D Convolution         32 3×3×32 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    33   'add23'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
    34   'relu23'        ReLU                    ReLU                                                             (HW Layer)
    35   'S3U1_conv1'    2-D Convolution         64 3×3×32 convolutions with stride [2 2] and padding 'same'      (HW Layer)
    36   'S3U1_relu1'    ReLU                    ReLU                                                             (HW Layer)
    37   'S3U1_conv2'    2-D Convolution         64 3×3×64 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    38   'skipConv2'     2-D Convolution         64 1×1×32 convolutions with stride [2 2] and padding [0 0 0 0]   (HW Layer)
    39   'add31'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
    40   'relu31'        ReLU                    ReLU                                                             (HW Layer)
    41   'S3U2_conv1'    2-D Convolution         64 3×3×64 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    42   'S3U2_relu1'    ReLU                    ReLU                                                             (HW Layer)
    43   'S3U2_conv2'    2-D Convolution         64 3×3×64 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    44   'add32'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
    45   'relu32'        ReLU                    ReLU                                                             (HW Layer)
    46   'S3U3_conv1'    2-D Convolution         64 3×3×64 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    47   'S3U3_relu1'    ReLU                    ReLU                                                             (HW Layer)
    48   'S3U3_conv2'    2-D Convolution         64 3×3×64 convolutions with stride [1 1] and padding 'same'      (HW Layer)
    49   'add33'         Addition                Element-wise addition of 2 inputs                                (HW Layer)
    50   'relu33'        ReLU                    ReLU                                                             (HW Layer)
    51   'globalPool'    2-D Average Pooling     8×8 average pooling with stride [1 1] and padding [0 0 0 0]      (HW Layer)
    52   'fcFinal'       Fully Connected         10 fully connected layer                                         (HW Layer)
    53   'softmax'       Softmax                 softmax                                                          (SW Layer)
    54   'classoutput'   Classification Output   crossentropyex with 'airplane' and 9 other classes               (SW Layer)
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"
    "OutputResultOffset"        "0x00400000"     "4.0 MB"
    "SchedulerDataOffset"       "0x00800000"     "4.0 MB"
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"
    "FCWeightDataOffset"        "0x03000000"     "4.0 MB"
    "EndOffset"                 "0x03400000"     "Total: 52.0 MB"

### Network compilation complete.

### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 07-Mar-2023 11:23:24
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 07-Mar-2023 11:23:24
### Finished writing input activations.
### Running single input activation.

              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------              -------------         ---------     ---------     ---------
Network                     820446                  0.00373                    1          823069         267.3
    input_norm                7334                  0.00003
    convInp                  14042                  0.00006
    S1U1_conv1               32046                  0.00015
    S1U1_conv2               32198                  0.00015
    add11                    30643                  0.00014
    S1U2_conv1               32428                  0.00015
    S1U2_conv2               32212                  0.00015
    add12                    30553                  0.00014
    S1U3_conv1               32074                  0.00015
    S1U3_conv2               32289                  0.00015
    add13                    30553                  0.00014
    skipConv1                20674                  0.00009
    S2U1_conv1               21193                  0.00010
    S2U1_conv2               26334                  0.00012
    add21                    15373                  0.00007
    S2U2_conv1               26655                  0.00012
    S2U2_conv2               26481                  0.00012
    add22                    15353                  0.00007
    S2U3_conv1               26614                  0.00012
    S2U3_conv2               26584                  0.00012
    add23                    15313                  0.00007
    skipConv2                25361                  0.00012
    S3U1_conv1               24950                  0.00011
    S3U1_conv2               41437                  0.00019
    add31                     7714                  0.00004
    S3U2_conv1               41695                  0.00019
    S3U2_conv2               41679                  0.00019
    add32                     7827                  0.00004
    S3U3_conv1               41513                  0.00019
    S3U3_conv2               42203                  0.00019
    add33                     7764                  0.00004
    globalPool               10197                  0.00005
    fcFinal                    973                  0.00000
 * The clock frequency of the DL processor is: 220MHz
Load Pruned Network
Load the trained, pruned network. For more information on network training, see Prune Image Classification Network Using Taylor Scores.
load("prunedDAGNet.mat");
Test Network
Load a test image. The test image is part of the CIFAR-10 data set [1]. To download the data set, see the Prepare Data section in Train Residual Network for Image Classification.
load("testImage.mat");
Use the runOnHW function to:
Prepare the network for deployment.
Compile the network to generate weights, biases, and instructions.
Deploy the network to the FPGA board.
Retrieve the prediction results using MATLAB®.
To view the code for this function, see Helper Functions.
[~, speedPruned] = runOnHW(prunedDAGNet,testImage,'zcu102_single');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input' of type 'ImageInputLayer' is split into an image input layer 'input' and an addition layer 'input_norm' for normalization on hardware.
### The network includes the following layers:
     1   'input'         Image Input             32×32×3 images with 'zerocenter' normalization                     (SW Layer)
     2   'convInp'       2-D Convolution         16 3×3×3 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
     3   'reluInp'       ReLU                    ReLU                                                               (HW Layer)
     4   'S1U1_conv1'    2-D Convolution         5 3×3×16 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
     5   'S1U1_relu1'    ReLU                    ReLU                                                               (HW Layer)
     6   'S1U1_conv2'    2-D Convolution         16 3×3×5 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
     7   'add11'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
     8   'relu11'        ReLU                    ReLU                                                               (HW Layer)
     9   'S1U2_conv1'    2-D Convolution         8 3×3×16 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
    10   'S1U2_relu1'    ReLU                    ReLU                                                               (HW Layer)
    11   'S1U2_conv2'    2-D Convolution         16 3×3×8 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
    12   'add12'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    13   'relu12'        ReLU                    ReLU                                                               (HW Layer)
    14   'S1U3_conv1'    2-D Convolution         14 3×3×16 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    15   'S1U3_relu1'    ReLU                    ReLU                                                               (HW Layer)
    16   'S1U3_conv2'    2-D Convolution         16 3×3×14 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    17   'add13'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    18   'relu13'        ReLU                    ReLU                                                               (HW Layer)
    19   'S2U1_conv1'    2-D Convolution         22 3×3×16 convolutions with stride [2 2] and padding [0 1 0 1]     (HW Layer)
    20   'S2U1_relu1'    ReLU                    ReLU                                                               (HW Layer)
    21   'S2U1_conv2'    2-D Convolution         27 3×3×22 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    22   'skipConv1'     2-D Convolution         27 1×1×16 convolutions with stride [2 2] and padding [0 0 0 0]     (HW Layer)
    23   'add21'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    24   'relu21'        ReLU                    ReLU                                                               (HW Layer)
    25   'S2U2_conv1'    2-D Convolution         30 3×3×27 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    26   'S2U2_relu1'    ReLU                    ReLU                                                               (HW Layer)
    27   'S2U2_conv2'    2-D Convolution         27 3×3×30 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    28   'add22'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    29   'relu22'        ReLU                    ReLU                                                               (HW Layer)
    30   'S2U3_conv1'    2-D Convolution         26 3×3×27 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    31   'S2U3_relu1'    ReLU                    ReLU                                                               (HW Layer)
    32   'S2U3_conv2'    2-D Convolution         27 3×3×26 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    33   'add23'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    34   'relu23'        ReLU                    ReLU                                                               (HW Layer)
    35   'S3U1_conv1'    2-D Convolution         37 3×3×27 convolutions with stride [2 2] and padding [0 1 0 1]     (HW Layer)
    36   'S3U1_relu1'    ReLU                    ReLU                                                               (HW Layer)
    37   'S3U1_conv2'    2-D Convolution         39 3×3×37 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    38   'skipConv2'     2-D Convolution         39 1×1×27 convolutions with stride [2 2] and padding [0 0 0 0]     (HW Layer)
    39   'add31'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    40   'relu31'        ReLU                    ReLU                                                               (HW Layer)
    41   'S3U2_conv1'    2-D Convolution         38 3×3×39 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    42   'S3U2_relu1'    ReLU                    ReLU                                                               (HW Layer)
    43   'S3U2_conv2'    2-D Convolution         39 3×3×38 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    44   'add32'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    45   'relu32'        ReLU                    ReLU                                                               (HW Layer)
    46   'S3U3_conv1'    2-D Convolution         36 3×3×39 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    47   'S3U3_relu1'    ReLU                    ReLU                                                               (HW Layer)
    48   'S3U3_conv2'    2-D Convolution         39 3×3×36 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    49   'add33'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    50   'relu33'        ReLU                    ReLU                                                               (HW Layer)
    51   'globalPool'    2-D Average Pooling     8×8 average pooling with stride [1 1] and padding [0 0 0 0]        (HW Layer)
    52   'fcFinal'       Fully Connected         10 fully connected layer                                           (HW Layer)
    53   'softmax'       Softmax                 softmax                                                            (SW Layer)
    54   'classoutput'   Classification Output   crossentropyex with 'airplane' and 9 other classes                 (SW Layer)
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"
    "OutputResultOffset"        "0x00400000"     "4.0 MB"
    "SchedulerDataOffset"       "0x00800000"     "4.0 MB"
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"
    "FCWeightDataOffset"        "0x03000000"     "4.0 MB"
    "EndOffset"                 "0x03400000"     "Total: 52.0 MB"

### Network compilation complete.

### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 07-Mar-2023 11:24:09
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 07-Mar-2023 11:24:09
### Finished writing input activations.
### Running single input activation.

              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------              -------------         ---------     ---------     ---------
Network                     587863                  0.00267                    1          590483         372.6
    input_norm                7266                  0.00003
    convInp                  14102                  0.00006
    S1U1_conv1               20170                  0.00009
    S1U1_conv2               20248                  0.00009
    add11                    30471                  0.00014
    S1U2_conv1               20486                  0.00009
    S1U2_conv2               20079                  0.00009
    add12                    30656                  0.00014
    S1U3_conv1               32404                  0.00015
    S1U3_conv2               31891                  0.00014
    add13                    30563                  0.00014
    skipConv1                19154                  0.00009
    S2U1_conv1               17965                  0.00008
    S2U1_conv2               18679                  0.00008
    add21                    13442                  0.00006
    S2U2_conv1               23890                  0.00011
    S2U2_conv2               24006                  0.00011
    add22                    13462                  0.00006
    S2U3_conv1               21638                  0.00010
    S2U3_conv2               21691                  0.00010
    add23                    13472                  0.00006
    skipConv2                15603                  0.00007
    S3U1_conv1               16138                  0.00007
    S3U1_conv2               18238                  0.00008
    add31                     4850                  0.00002
    S3U2_conv1               17971                  0.00008
    S3U2_conv2               18210                  0.00008
    add32                     4830                  0.00002
    S3U3_conv1               16631                  0.00008
    S3U3_conv2               17296                  0.00008
    add33                     4760                  0.00002
    globalPool                6576                  0.00003
    fcFinal                    838                  0.00000
 * The clock frequency of the DL processor is: 220MHz
Quantize Pruned Network
You can quantize the pruned network to further improve performance. Quantization stores the weights, biases, and activations as scaled 8-bit integers, which reduces the memory footprint and increases throughput on the FPGA.
Create an augmentedImageDatastore object to store the calibration data.
imds = augmentedImageDatastore([32,32],testImage);
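This example calibrates with the single test image for brevity. In practice, calibrate with a representative sample of the training data. The sketch below is hypothetical and assumes the CIFAR-10 training images from Train Residual Network for Image Classification are available in the array XTrain.
% XTrain is hypothetical: a 32-by-32-by-3-by-N array of training images.
imdsCal = augmentedImageDatastore([32 32], XTrain(:,:,:,1:500));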
Create a dlquantizer object.
dlqObj = dlquantizer(prunedDAGNet, ExecutionEnvironment="FPGA");
Calibrate the dlquantizer object using the images in the datastore.
calibrate(dlqObj,imds)
ans=100×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue MaxValue
______________________ __________________ ________________________ __________ _________
{'convInp_Weights' } {'convInp' } "Weights" -0.0060522 0.0076182
{'convInp_Bias' } {'convInp' } "Bias" -0.23065 0.79941
{'S1U1_conv1_Weights'} {'S1U1_conv1'} "Weights" -0.36637 0.37601
{'S1U1_conv1_Bias' } {'S1U1_conv1'} "Bias" 0.076761 0.79494
{'S1U1_conv2_Weights'} {'S1U1_conv2'} "Weights" -0.8197 0.54487
{'S1U1_conv2_Bias' } {'S1U1_conv2'} "Bias" -0.27783 0.85751
{'S1U2_conv1_Weights'} {'S1U2_conv1'} "Weights" -0.29579 0.27284
{'S1U2_conv1_Bias' } {'S1U2_conv1'} "Bias" -0.55448 0.85351
{'S1U2_conv2_Weights'} {'S1U2_conv2'} "Weights" -0.78735 0.52628
{'S1U2_conv2_Bias' } {'S1U2_conv2'} "Bias" -0.50762 0.56423
{'S1U3_conv1_Weights'} {'S1U3_conv1'} "Weights" -0.18651 0.12745
{'S1U3_conv1_Bias' } {'S1U3_conv1'} "Bias" -0.33809 0.73826
{'S1U3_conv2_Weights'} {'S1U3_conv2'} "Weights" -0.49925 0.55922
{'S1U3_conv2_Bias' } {'S1U3_conv2'} "Bias" -0.42145 0.64184
{'S2U1_conv1_Weights'} {'S2U1_conv1'} "Weights" -0.1328 0.121
{'S2U1_conv1_Bias' } {'S2U1_conv1'} "Bias" -0.097249 1.1291
⋮
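If you capture the output of calibrate, you can query the statistics table, for example to check the dynamic range recorded for a single layer. This is a sketch that assumes the variable names shown in the table above.
calResults = calibrate(dlqObj, imds);  % capture the calibration statistics table
calResults(strcmp(calResults.("Network Layer Name"), 'convInp'), :)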
Use the runOnHW function to:
Prepare the network for deployment.
Compile the network to generate weights, biases, and instructions.
Deploy the network to the FPGA board.
Retrieve the prediction results using MATLAB®.
To view the code for this function, see Helper Functions.
[~, speedQuantized] = runOnHW(dlqObj,testImage,'zcu102_int8');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
     1   'input'         Image Input             32×32×3 images with 'zerocenter' normalization                     (SW Layer)
     2   'convInp'       2-D Convolution         16 3×3×3 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
     3   'reluInp'       ReLU                    ReLU                                                               (HW Layer)
     4   'S1U1_conv1'    2-D Convolution         5 3×3×16 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
     5   'S1U1_relu1'    ReLU                    ReLU                                                               (HW Layer)
     6   'S1U1_conv2'    2-D Convolution         16 3×3×5 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
     7   'add11'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
     8   'relu11'        ReLU                    ReLU                                                               (HW Layer)
     9   'S1U2_conv1'    2-D Convolution         8 3×3×16 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
    10   'S1U2_relu1'    ReLU                    ReLU                                                               (HW Layer)
    11   'S1U2_conv2'    2-D Convolution         16 3×3×8 convolutions with stride [1 1] and padding [1 1 1 1]      (HW Layer)
    12   'add12'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    13   'relu12'        ReLU                    ReLU                                                               (HW Layer)
    14   'S1U3_conv1'    2-D Convolution         14 3×3×16 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    15   'S1U3_relu1'    ReLU                    ReLU                                                               (HW Layer)
    16   'S1U3_conv2'    2-D Convolution         16 3×3×14 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    17   'add13'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    18   'relu13'        ReLU                    ReLU                                                               (HW Layer)
    19   'S2U1_conv1'    2-D Convolution         22 3×3×16 convolutions with stride [2 2] and padding [0 1 0 1]     (HW Layer)
    20   'S2U1_relu1'    ReLU                    ReLU                                                               (HW Layer)
    21   'S2U1_conv2'    2-D Convolution         27 3×3×22 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    22   'skipConv1'     2-D Convolution         27 1×1×16 convolutions with stride [2 2] and padding [0 0 0 0]     (HW Layer)
    23   'add21'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    24   'relu21'        ReLU                    ReLU                                                               (HW Layer)
    25   'S2U2_conv1'    2-D Convolution         30 3×3×27 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    26   'S2U2_relu1'    ReLU                    ReLU                                                               (HW Layer)
    27   'S2U2_conv2'    2-D Convolution         27 3×3×30 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    28   'add22'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    29   'relu22'        ReLU                    ReLU                                                               (HW Layer)
    30   'S2U3_conv1'    2-D Convolution         26 3×3×27 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    31   'S2U3_relu1'    ReLU                    ReLU                                                               (HW Layer)
    32   'S2U3_conv2'    2-D Convolution         27 3×3×26 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    33   'add23'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    34   'relu23'        ReLU                    ReLU                                                               (HW Layer)
    35   'S3U1_conv1'    2-D Convolution         37 3×3×27 convolutions with stride [2 2] and padding [0 1 0 1]     (HW Layer)
    36   'S3U1_relu1'    ReLU                    ReLU                                                               (HW Layer)
    37   'S3U1_conv2'    2-D Convolution         39 3×3×37 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    38   'skipConv2'     2-D Convolution         39 1×1×27 convolutions with stride [2 2] and padding [0 0 0 0]     (HW Layer)
    39   'add31'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    40   'relu31'        ReLU                    ReLU                                                               (HW Layer)
    41   'S3U2_conv1'    2-D Convolution         38 3×3×39 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    42   'S3U2_relu1'    ReLU                    ReLU                                                               (HW Layer)
    43   'S3U2_conv2'    2-D Convolution         39 3×3×38 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    44   'add32'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    45   'relu32'        ReLU                    ReLU                                                               (HW Layer)
    46   'S3U3_conv1'    2-D Convolution         36 3×3×39 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    47   'S3U3_relu1'    ReLU                    ReLU                                                               (HW Layer)
    48   'S3U3_conv2'    2-D Convolution         39 3×3×36 convolutions with stride [1 1] and padding [1 1 1 1]     (HW Layer)
    49   'add33'         Addition                Element-wise addition of 2 inputs                                  (HW Layer)
    50   'relu33'        ReLU                    ReLU                                                               (HW Layer)
    51   'globalPool'    2-D Average Pooling     8×8 average pooling with stride [1 1] and padding [0 0 0 0]        (HW Layer)
    52   'fcFinal'       Fully Connected         10 fully connected layer                                           (HW Layer)
    53   'softmax'       Softmax                 softmax                                                            (SW Layer)
    54   'classoutput'   Classification Output   crossentropyex with 'airplane' and 9 other classes                 (SW Layer)
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "4.0 MB"
    "OutputResultOffset"        "0x00400000"     "4.0 MB"
    "SchedulerDataOffset"       "0x00800000"     "4.0 MB"
    "SystemBufferOffset"        "0x00c00000"     "28.0 MB"
    "InstructionDataOffset"     "0x02800000"     "4.0 MB"
    "ConvWeightDataOffset"      "0x02c00000"     "4.0 MB"
    "FCWeightDataOffset"        "0x03000000"     "4.0 MB"
    "EndOffset"                 "0x03400000"     "Total: 52.0 MB"

### Network compilation complete.

### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_int8.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_int8.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 07-Mar-2023 11:26:00
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 07-Mar-2023 11:26:00
### Finished writing input activations.
### Running single input activation.

              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------              -------------         ---------     ---------     ---------
Network                     210121                  0.00084                    1          212770        1175.0
    convInp                   7514                  0.00003
    S1U1_conv1                7043                  0.00003
    S1U1_conv2                7378                  0.00003
    add11                     9185                  0.00004
    S1U2_conv1                7543                  0.00003
    S1U2_conv2                7292                  0.00003
    add12                     8605                  0.00003
    S1U3_conv1               10908                  0.00004
    S1U3_conv2               11192                  0.00004
    add13                     8515                  0.00003
    skipConv1                 7147                  0.00003
    S2U1_conv1                6392                  0.00003
    S2U1_conv2                7332                  0.00003
    add21                     4344                  0.00002
    S2U2_conv1                8832                  0.00004
    S2U2_conv2                9117                  0.00004
    add22                     4484                  0.00002
    S2U3_conv1                9175                  0.00004
    S2U3_conv2                9136                  0.00004
    add23                     4614                  0.00002
    skipConv2                 6643                  0.00003
    S3U1_conv1                6525                  0.00003
    S3U1_conv2                6498                  0.00003
    add31                     1520                  0.00001
    S3U2_conv1                6273                  0.00003
    S3U2_conv2                6448                  0.00003
    add32                     1450                  0.00001
    S3U3_conv1                6255                  0.00003
    S3U3_conv2                6751                  0.00003
    add33                     1500                  0.00001
    globalPool                3605                  0.00001
    fcFinal                    718                  0.00000
 * The clock frequency of the DL processor is: 250MHz
Compare the Performance of the Original, Pruned, and Pruned and Quantized Networks
Determine the impact of pruning and quantization on performance. Pruning alone increases throughput from roughly 267 frames per second to 373 frames per second. Quantizing the pruned network to int8 increases throughput further, from 373 frames per second to 1175 frames per second.
fprintf('The performance achieved for the original network is %s frames per second. \n', speedInitial.("Frame/s")(1));
The performance achieved for the original network is 267.2923 frames per second.
fprintf('The performance achieved after pruning is %s frames per second. \n', speedPruned.("Frame/s")(1));
The performance achieved after pruning is 372.5763 frames per second.
fprintf('The performance achieved after pruning and quantizing the network to int8 fixed point is %s frames per second. \n', speedQuantized.("Frame/s")(1));
The performance achieved after pruning and quantizing the network to int8 fixed point is 1174.9777 frames per second.
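To express these results as speedups relative to the original network, divide the throughput values. In this sketch, the conversion through string makes the arithmetic work whether the profiler table stores the "Frame/s" values as numbers or as text.
toNum = @(v) double(string(v)); % robust numeric conversion for table entries
speedupPruned = toNum(speedPruned.("Frame/s")(1)) / toNum(speedInitial.("Frame/s")(1));
speedupQuantized = toNum(speedQuantized.("Frame/s")(1)) / toNum(speedInitial.("Frame/s")(1));
fprintf('Pruning alone: %.1fx speedup. Pruning plus int8 quantization: %.1fx speedup.\n', speedupPruned, speedupQuantized);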
References
[1] Krizhevsky, Alex. 2009. "Learning Multiple Layers of Features from Tiny Images." https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
Helper Functions
The runOnHW function prepares the network for deployment, compiles the network, deploys the network to the FPGA board, and retrieves the prediction results.
function [result, speed] = runOnHW(network, image, bitstream)
    % Create a workflow object for the network and the target bitstream.
    wfObj = dlhdl.Workflow(Network=network, Bitstream=bitstream);
    % Target a Xilinx board over an Ethernet connection.
    wfObj.Target = dlhdl.Target("xilinx", Interface="Ethernet");
    % Generate weights, biases, and instructions, then program the board.
    compile(wfObj);
    deploy(wfObj);
    % Run a single prediction and collect profiler results.
    [result, speed] = predict(wfObj, image, Profiler='on');
end
See Also
dlhdl.Target | dlhdl.Workflow | compile | deploy | predict | dlquantizer | calibrate