Get Started with Deep Learning FPGA Deployment on Xilinx ZC706 SoC
This example shows how to create, compile, and deploy a handwritten character detection series network to an FPGA and use MATLAB® to retrieve the prediction results.
Prerequisites
Xilinx® Zynq® ZC706 Evaluation Kit
Deep Learning HDL Toolbox™ Support Package for Xilinx® FPGA and SoC
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Load Pretrained Network
Load the pretrained network trained on the Modified National Institute of Standards and Technology (MNIST) database[1].
net = getDigitsNetwork;
View the layers of the pretrained network by using the Deep network Designer app.
deepNetworkDesigner(net)
Define FPGA Board Interface
Define the target FPGA board programming interface by using the dlhdl.Target
object. Create a programming interface with custom name for your target device and a JTAG interface to connect the target device to the host computer. To use JTAG, install Xilinx™ Vivado™ Design Suite 2023.1. Set the toolpath by using the hdlsetuptoolpath
function.
hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2023.1\bin\vivado.bat');
hTarget = dlhdl.Target('Xilinx');
Prepare Network for Deployment
Prepare the network for deployment by creating a dlhdl.Workflow
object. Specify the network and bitstream name. Ensure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx® Zynq® ZC706 Evaluation Kit and the bitstream uses the single data type.
hW = dlhdl.Workflow(Network=net,Bitstream='zc706_single',Target=hTarget);
Compile Network
Run the compile
method of the dlhdl.Workflow
object to compile the network and generate the instructions, weights, and biases for deployment.
dn = compile(hW)
### Compiling network for Deep Learning FPGA prototyping ... ### Targeting FPGA bitstream zc706_single. ### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network. ### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer' ### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware. ### The network includes the following layers: 1 'imageinput' Image Input 28×28×1 images with 'zerocenter' normalization (SW Layer) 2 'conv_1' 2-D Convolution 8 3×3×1 convolutions with stride [1 1] and padding 'same' (HW Layer) 3 'relu_1' ReLU ReLU (HW Layer) 4 'maxpool_1' 2-D Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer) 5 'conv_2' 2-D Convolution 16 3×3×8 convolutions with stride [1 1] and padding 'same' (HW Layer) 6 'relu_2' ReLU ReLU (HW Layer) 7 'maxpool_2' 2-D Max Pooling 2×2 max pooling with stride [2 2] and padding [0 0 0 0] (HW Layer) 8 'conv_3' 2-D Convolution 32 3×3×16 convolutions with stride [1 1] and padding 'same' (HW Layer) 9 'relu_3' ReLU ReLU (HW Layer) 10 'fc' Fully Connected 10 fully connected layer (HW Layer) 11 'softmax' Softmax softmax (SW Layer) 12 'Output1_softmax' Regression Output mean-squared-error (SW Layer) ### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software. ### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software. ### Compiling layer group: conv_1>>maxpool_2 ... ### Compiling layer group: conv_1>>maxpool_2 ... complete. ### Compiling layer group: conv_3>>relu_3 ... ### Compiling layer group: conv_3>>relu_3 ... complete. ### Compiling layer group: fc ... ### Compiling layer group: fc ... complete. ### Allocating external memory buffers: offset_name offset_address allocated_space _______________________ ______________ _________________ "InputDataOffset" "0x00000000" "184.0 kB" "OutputResultOffset" "0x0002e000" "4.0 kB" "SchedulerDataOffset" "0x0002f000" "204.0 kB" "SystemBufferOffset" "0x00062000" "76.0 kB" "InstructionDataOffset" "0x00075000" "52.0 kB" "ConvWeightDataOffset" "0x00082000" "28.0 kB" "FCWeightDataOffset" "0x00089000" "76.0 kB" "EndOffset" "0x0009c000" "Total: 624.0 kB" ### Network compilation complete.
dn = struct with fields:
weights: [1×1 struct]
instructions: [1×1 struct]
registers: [1×1 struct]
syncInstructions: [1×1 struct]
constantData: {{} [1×1568 single]}
ddrInfo: [1×1 struct]
resourceTable: [6×2 table]
Program Bitstream onto FPGA and Download Network Weights
To deploy the network on the Xilinx® Zynq® ZC706 hardware, run the deploy
method of the dlhdl.Workflow
object. This method programs the FPGA board using the output of the compile method and the programming file, downloads the network weights and biases, displays progress messages, and the time it takes to deploy the network.
deploy(hW)
Test Network
Load the example image.
inputImg = imread('five_28x28.pgm'); inputImg = dlarray(single(inputImg),'SSCB');
Classify the image on the FPGA by using the predict
method of the dlhdl.Workflow
object and display the results.
[prediction,speed] = hW.predict(single(inputImg),'Profile','on');
### Finished writing input activations. ### Running single input activation. Deep Learning Processor Profiler Performance Results LastFrameLatency(cycles) LastFrameLatency(seconds) FramesNum Total Latency Frames/s ------------- ------------- --------- --------- --------- Network 62340 0.00062 1 63218 1581.8 imageinput_norm 3527 0.00004 conv_1 10376 0.00010 maxpool_1 6774 0.00007 conv_2 11786 0.00012 maxpool_2 5450 0.00005 conv_3 17181 0.00017 fc 7214 0.00007 * The clock frequency of the DL processor is: 100MHz
[val,idx] = max(prediction);
fprintf('The prediction result is %d\n', idx-1);
The prediction result is 5
Bibliography
LeCun, Y., C. Cortes, and C. J. C. Burges. "The MNIST Database of Handwritten Digits." https://yann.lecun.com/exdb/mnist/.