Video length is 3:58

Generate HDL for a Deep Learning Processor

Implementing deep learning inference efficiently in edge applications requires collaboration between the design of the deep learning network and the deep learning processor.

Deep Learning HDL Toolbox™ enables FPGA prototyping of deep learning networks from within MATLAB®. To increase performance or target custom hardware, you can explore trade-offs in MATLAB and converge on a custom FPGA implementation of the deep learning processor. A single MATLAB function then drives HDL Coder™ to generate an IP core with target-independent synthesizable RTL and AXI interfaces, and can optionally run FPGA implementation to create a bitstream that programs the deep learning processor onto the device.

Published: 27 Aug 2020

Deep Learning HDL Toolbox delivers FPGA prototyping of deep learning inference from within MATLAB, so you can quickly iterate and converge on a network that delivers the performance your system requires while meeting your FPGA constraints.

But what if you want to customize the FPGA implementation, to improve performance or to target a custom board? For this, you can use MATLAB to configure the processor and to drive HDL Coder to generate an IP core with RTL and AXI interfaces.

This is all based on a deep learning processor architecture with generic convolution and fully connected modules that you can program for your custom network, plus logic that controls which layer is being run, along with its activation inputs and outputs. Since each layer’s parameters need to be stored in external DDR memory, the processor also includes high-bandwidth memory access.

You can customize this deep learning processor for your system requirements, which, coupled with the ability to customize the deep learning network, gives you a lot of options for optimizing the FPGA implementation for your application.

To illustrate, let’s look at an application that uses a series network that’s trained to classify logos. Let’s say we need to process 15 frames per second.

So we just load the trained network.

And we will set up a custom processor configuration with all default settings, running at 220 MHz. Note the data types and number of parallel threads for the convolution module and fully-connected module. This is set up by default to target a ZCU102 board, which is what we are using.
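In MATLAB, this setup looks roughly like the sketch below. The file name logoNet.mat and variable snet are placeholders for your own trained series network, and exact property names can vary between Deep Learning HDL Toolbox releases:

    % Load the trained series network (placeholder file and variable names)
    load('logoNet.mat', 'snet');

    % Custom processor configuration: default settings, 220 MHz target clock
    hPC = dlhdl.ProcessorConfig;
    hPC.TargetFrequency = 220;   % MHz

    % Display the configuration to inspect data types, parallel thread
    % counts, and the default ZCU102 target
    hPC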

Then we apply the processor config to a workflow object for the trained network.

Now we can estimate the performance of this custom processor before we deploy it. The result is the total latency here; at 220 MHz that works out to just under 6 frames per second, which is not going to meet our system requirements.
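As a sketch, using the snet and hPC variables from above (whether the workflow accepts the processor configuration directly as shown may depend on your release):

    % Apply the custom processor configuration to a workflow for the network
    hW = dlhdl.Workflow('Network', snet, 'ProcessorConfig', hPC);

    % Estimate latency and frame rate before deploying anything to hardware
    estimatePerformance(hW);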

This is where it’s important to collaborate because we have options. Let’s say we’re committed to this board. And our deep learning expert doesn’t think we can remove any layers and get the same accuracy, but we might be able to quantize to int8. Going from 32-bit to 8-bit word lengths gives us the resources to perform more multiply-accumulates in parallel.

So we’ll set up a new custom processor configuration object, with both the convolution and fully-connected modules set to int8, and increase the parallel thread count by 4x for each.
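A sketch of that second configuration follows. The thread counts shown are illustrative (4x typical defaults rather than values quoted in the video), and the int8 precision setting may be named or placed differently depending on your release:

    % Second processor configuration: int8 data types, 4x the parallel threads
    hPC8 = dlhdl.ProcessorConfig;
    hPC8.TargetFrequency = 220;         % MHz
    hPC8.ProcessorDataType = 'int8';    % precision setting; name may vary by release
    setModuleProperty(hPC8, 'conv', 'ConvThreadNumber', 64);   % illustrative value
    setModuleProperty(hPC8, 'fc',   'FCThreadNumber',   16);   % illustrative value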

Now, we need to quantize the network itself in order to estimate its performance on the deep learning processor. You can learn more about this process in the documentation. It takes a minute to run, and returns for each layer the numeric ranges for the given calibration data store. Normally we would run more calibration images and then validate with another set, but…
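A minimal sketch of the quantization step, assuming a calibration image datastore named calibrationImds (a placeholder) and a release in which the dlquantizer object can be passed to dlhdl.Workflow in place of the network:

    % Quantize the network for FPGA deployment and calibrate numeric ranges
    dlQuantObj = dlquantizer(snet, 'ExecutionEnvironment', 'FPGA');
    calibrate(dlQuantObj, calibrationImds);

    % Re-estimate with the quantized network and the int8 processor config
    hW8 = dlhdl.Workflow('Network', dlQuantObj, 'ProcessorConfig', hPC8);
    estimatePerformance(hW8);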

Let’s see the estimation results for this new processor configuration – now we’re up to 16 frames per second, which is good enough for our fictional requirements.

From here, the buildProcessor function does the rest. It calls HDL Coder to generate target-independent synthesizable RTL for the processor you’ve configured. If you have set up a reference design, it will generate an IP core with the AXI register mapping so it plugs right into implementation. And if you’ve defined an implementation workflow, it runs all the way through to generating a bitstream to program the device.
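The call itself is a single line, sketched below; the hdlsetuptoolpath call is only needed if your synthesis tool is not already on the system path, and the Vivado install path shown is a placeholder:

    % Point MATLAB at the synthesis tool (placeholder install path)
    hdlsetuptoolpath('ToolName', 'Xilinx Vivado', ...
        'ToolPath', 'C:\Xilinx\Vivado\2019.2\bin\vivado.bat');

    % Generate the custom deep learning processor RTL and IP core; with a
    % reference design and implementation workflow set up, this runs through
    % to a bitstream
    dlhdl.buildProcessor(hPC8);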

We can take a look at the implementation results, here in Vivado. We’re meeting timing at the 220 MHz target, with the resource usage shown here.

This shows how powerful it can be to collaborate between the design of the deep learning network and the implementation of the deep learning processor, and how easy it is to do right in MATLAB.
