Hardware/Software Partitioning | Developing Radio Applications for RFSoC with MATLAB & Simulink, Part 3
From the series: Developing Radio Applications for RFSoC with MATLAB and Simulink
Perform simulation and analysis of the SoC architecture of the Xilinx® RFSoC to investigate hardware/software partitioning of the range-Doppler radar algorithm.
In this third video in the series, learn how to develop a Simulink® model that serves as a reference for verifying implementation models.
See how to analyze the algorithm’s memory requirements to determine whether external DDR4 memory is required for hardware implementation. Then learn how to evaluate two candidate hardware/software partitioning alternatives by comparing the effects of performing the FFT operation in the quad-core Arm® Cortex®-A53 processor versus performing the FFT in programmable logic.
Explore how to model the DDR4 memory transactions using Memory Controller and Traffic Generator blocks of SoC Blockset™, and use simulation to determine the latency of memory write and read operations. Pre-characterized models for the Xilinx ZCU111 development board enable accurate evaluation of latency using simulation, without the need for hardware testing.
Then, using processor-in-the-loop (PIL) testing, you can perform on-device profiling and measure the latency of the algorithm running on the processor.
These techniques allow you to determine the latency and implementation complexity of each option so you can decide on an approach that best meets requirements.
In Part 4 of this video series, you will see how SoC Blockset drives the process of generating a complete hardware/software application and deploying it to the ZCU111 development board.
Published: 7 Jan 2021
In this video series, you'll learn how to use a model-based design approach to develop radio frequency applications for the Xilinx RFSoC platform. In Part 2, we looked at a real-world example application, developing a range-Doppler radar system with RFSoC. We saw how to take high-level specifications and use them to make system-level engineering decisions using modeling and simulation. In Part 3, we'll continue with the range-Doppler radar example and evaluate different hardware/software partitioning strategies using a mix of simulation and hardware-profiling-based analysis.
So after we've determined our system specifications and parameters, the next step in the process is to partition our algorithm between hardware and software, and the first step in that process is to understand exactly what the algorithm is doing and elaborate it. We need to know what's going on under the hood so that we can intelligently assign the different processing blocks between the FPGA and the processor. In the previous model I showed, there's a Range-Doppler Response block that was producing the plot we saw: our processed range-Doppler output.
The MATLAB source code is actually available for that block. If you double-click and bring up the block parameters, you'll find a source code link, so you can see exactly what MATLAB code is used to implement it under the hood. That's extremely helpful when elaborating your high-level functions into the low-level processing blocks you actually need to implement.
We can look at a model here that I've created as part of this algorithm elaboration step. It actually looks pretty similar to the previous model. We have our waveform coming in, linear FM, same as before. We feed it into the same radar target model; because it's a model reference, we can simply reuse it.
But this range-Doppler processing block is slightly different. What I've done is add a MATLAB Function block, which reimplements the Range-Doppler Response block in basic MATLAB code. Again, I looked under the hood of that block, and what you'll find is that there are really only two basic operations taking place.
We have a matched filter operation given by the filter command, where the coefficients are the waveform in reverse. Then we take the filter output and run it through an FFT. Filters and FFTs are very common processing blocks in DSP applications, and that's really all that's going on in the range-Doppler processing.
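As a rough sketch of those two operations in plain MATLAB (the sizes and the LFM stand-in waveform here are illustrative placeholders, not the exact code inside the block):

    % Reduced sizes for a quick run; the full frame is 4,096 x 512.
    numRange = 512;  numPulses = 128;
    t = (0:numRange-1).'/numRange;
    waveform = exp(1i*pi*numRange*t.^2);     % stand-in linear FM pulse
    rxPulses = complex(randn(numRange,numPulses), randn(numRange,numPulses));

    % Matched filter: FIR coefficients are the conjugated, time-reversed
    % transmit waveform, applied down each column (fast time).
    mfCoeff   = conj(flipud(waveform));
    rangeProc = filter(mfCoeff, 1, rxPulses);

    % Doppler processing: FFT across the pulse dimension (slow time).
    rangeDoppler = fftshift(fft(rangeProc, numPulses, 2), 2);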
We can then instantiate our MATLAB version of that block in parallel with the reference and simulate them together. Visually, the outputs look the same; here are our two targets. But more importantly, we can numerically compare the outputs and make sure they're exactly the same, because we want to confirm we've accounted for every detail and are doing exactly what the reference does under the hood.
For that, we use this assertion block, which verifies that the reference and my own implementation are indeed identical. This is a really valuable exercise to go through, just to understand things like the data sizes and types and the underlying operations you're going to need to implement.
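The numerical check itself is easy to express in MATLAB. In this sketch, respRef and respMine are stand-ins for the two logged simulation outputs:

    % Stand-ins for signals logged from the two branches of the model.
    respRef  = complex(randn(512,128), randn(512,128));
    respMine = respRef;   % in practice, the MATLAB Function block's output

    % Element-wise comparison, mirroring the assertion block in the model.
    maxErr = max(abs(respRef(:) - respMine(:)));
    assert(maxErr == 0, 'Outputs differ: max abs error = %g', maxErr);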
After that algorithm elaboration step, we can take our analysis a bit further and think about the way the data comes into our system. We have range samples for a given pulse, and we integrate multiple pulse intervals to form a matrix of samples. We saw that, on the range dimension, we're computing a matched filter, and one thing to glean here is that the range computation can be performed immediately as the data comes in from the ADC. We don't have to put it anywhere; we can stream those samples directly into a filter as they arrive.
On the pulse interval dimension, we're computing an FFT, and the thing to notice is that the entire frame needs to be present before we can start that FFT. So we need to put all those range samples somewhere in the meantime, until we've formed the whole matrix, and then we can start processing the FFT. With that in mind, let's lay out two possible approaches for partitioning this processing chain.
We have three basic blocks: the matched filter, the FFT, and then detection, which is a 2D CFAR. We'll leave detection out of this analysis for now and just consider the partitioning of the matched filter and FFT. Option A is to do both of those processing steps in the FPGA; Option B does just the matched filter on the FPGA and performs the FFT on the Arm.
And so let's look at some ways that we can analyze the various options here. So for the first option, again, that is the FFT taking place on the FPGA. Let's consider the size of our data matrix.
We have up to 4,096 range samples times 512 pulses per CPI at 24 bits per sample (12-bit I and 12-bit Q). That's 48 megabits of data to store the whole matrix.
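That sizing is simple arithmetic, reproduced here for reference:

    numRange      = 4096;                 % range samples per pulse
    numPulses     = 512;                  % pulse intervals per CPI
    bitsPerSample = 24;                   % 12-bit I + 12-bit Q
    totalBits     = numRange*numPulses*bitsPerSample;   % 50,331,648 bits
    totalMegabits = totalBits / 2^20                    % = 48 Mb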
A simple approach would be to store it all in block RAM on the device. That would make accessing it very easy; the data never has to leave the chip.
But looking at the product table for the ZU28DR device on the board we're targeting, we see we have 38 megabits of block RAM. Unfortunately, that's too much data to fit entirely on the chip, which means we have to store it temporarily in external DDR4. We'll write the data to DDR in order as it comes in after the matched filter, and later read it out from DDR in transposed order for the FFT across the pulse dimension. Storing the data off-chip introduces another level of analysis we need to perform: how long will it take to store the data in external memory and then read it out in transposed order?
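To see why the transposed read is the expensive part, here's a toy illustration of the two address patterns (pure MATLAB for illustration; on the FPGA this becomes address-generation logic):

    % Toy 4x4 stand-in for the 4,096x512 matrix; entries are DDR word addresses.
    numRange = 4;  numPulses = 4;
    addr = reshape(1:numRange*numPulses, numRange, numPulses);

    writeAddrs = addr(:).'               % 1 2 3 4 5 ... contiguous: large bursts
    readAddrs  = reshape(addr.', 1, [])  % 1 5 9 13 2 ... strided: word-at-a-time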
Now, this is something that would take a decent amount of engineering time to go and measure in hardware, but we can actually simulate it using the Memory Controller and Memory Traffic Generator blocks from SoC Blockset. My model here does exactly that: we have a write traffic generator, which simulates our in-order write operation, and a read traffic generator, which performs the transposed read operation.
After running the simulation, double-click on the Memory Controller block. The first thing to point out is that the hardware board is set to the ZCU111 kit. That means the memory controller on that specific board has been characterized and the behavioral parameters populated, so we're actually simulating the memory interface specific to that board. After running the simulation, you can bring up this performance plot.
I'm going to plot the bandwidth over simulation time for master 1 and master 2, which are my write traffic and read traffic, respectively. What we see is that the in-order write operation happens very quickly, reaching a maximum bandwidth of about 500 megabytes per second, but the read operation is very slow, topping out at about 25 megabytes per second. The reason is that the data is stored in memory out of order relative to the read pattern, so we have to read words one at a time, whereas the in-order write can be really fast because we can use large burst lengths. Putting that together, we get a total time estimate for this transpose operation through DDR of about 438 milliseconds. That's the metric to keep in mind as the result of analyzing Option A.
Now for Option B. Thinking again about the size of our data, we're dealing with a 4,096-by-512 matrix, so what are we actually doing? The FFT length is 512, and we have to perform it 4,096 times. One way to speed this up is to take advantage of the quad-core Arm processor inside the processing system and distribute the workload across the four cores, for a roughly four-times speedup versus doing it on one core. So we compute the FFT for range bins 1 through 1,024 on core one, and so on, divided among the four cores, and then finally gather the full output frame.
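A rough MATLAB sketch of that split, where each loop iteration stands in for one core's task on the target:

    numRange = 4096;  numPulses = 512;  numCores = 4;
    data = complex(randn(numRange,numPulses), randn(numRange,numPulses));

    chunk = numRange/numCores;               % 1,024 range bins per core
    out = complex(zeros(numRange,numPulses));
    for core = 1:numCores
        rows = (core-1)*chunk + (1:chunk);   % core 1 -> bins 1..1024, etc.
        out(rows,:) = fft(data(rows,:), numPulses, 2);
    end
    % On the target, the four slices run concurrently and are then gathered.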
To analyze the timing of this, there are different ways you could try to estimate, just from the algorithm, how long it would take. But really, the best thing to do is to measure it on the hardware itself. That's easy to do using a technique called processor-in-the-loop, which lets us profile on the device and measure the real numbers. It's the most accurate approach, of course, and it turns out to be really simple with Simulink.
So we open up our processor-in-the-loop model. You can see I have some test input here, a 1,024-by-512 matrix; again, that's one quarter of the full 4,096-by-512 matrix. Digging in, I've got my reference model doing the transpose, then a data type conversion from int16 to single-precision floating point so that we can take advantage of the floating-point NEON instructions. Then there's an FFT block; with Embedded Coder, this will generate target-optimized code for the Arm that exploits that floating-point engine.
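A rough MATLAB equivalent of one core's task in that PIL model, using random stand-in data for the logged input:

    % One core's slice: 1,024 x 512 of int16 matched-filter output (stand-in).
    xInt = int16(randi([-2048 2047], 1024, 512));

    xT = xInt.';            % transpose so the FFT runs down each range bin
    xF = single(xT);        % int16 -> single, to use the NEON floating-point unit
    y  = fft(xF, 512, 1);   % 512-point FFT per range bin; on the target,
                            % Embedded Coder emits optimized Cortex-A53 code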
It's a really simple model to build. I'm going to run it, and you'll note, from the PIL annotation, that it's going to run this model reference as generated code on the actual hardware target. So what it's doing now is compiling that code; then it will run on the board and return the timing results.
When that finishes running, we get this profiling report back, and we can dig into it. What we see is that, on average, it takes about 476 milliseconds to process that frame on one core. The FFT takes up most of that time, but we account for the memory transpose and the data type conversion as well.
Now we can compare some numbers between Option A and Option B. We saw, using a simulation-based estimate, that the FPGA FFT would take about 438 milliseconds to complete, whereas the software-based FFT we measured came in at about 476 milliseconds. But latency isn't the only factor to consider.
The FPGA-based FFT, we saw, would require additional complexity: actually writing the control logic to store the data in memory and read it out in transposed order. The software-based FFT, by contrast, was really simple to build in Simulink and already generates target-optimized C code. So we're going to go with Option B: we'll take a small latency hit in exchange for a simpler path forward for now. We can always revisit this later if we determine the latency numbers aren't acceptable, but based on our analysis, we're going with the software-based FFT.
Now we move into the design and implementation phase, where we integrate our hardware, software, and interfaces using the partitioning scheme we came up with. Here's a model that does that. We have our radar target model, the same one we've been using in the previous models.
Here's our FPGA subsystem, which contains the matched filter block, streaming data through our DMA memory channel. That channel places the data in shared DDR to be read out by the processor later. Inside the processor subsystem, you can see we've partitioned the processing into four separate tasks, which will run on the four cores of the Arm processor.
Let's go ahead and run the model; there are a couple of things to point out. We can bring in our displays, and one thing we see is the effect of quantization: I'm simulating the ADC and DAC quantization, and the effect it has in raising the noise floor.
We can also look at the memory controller. As we'd expect, master 1, which is the FPGA writing data into memory, achieves nearly the same 500-megabyte-per-second bandwidth that we saw in the model we used to test the performance of the transpose operation. Here, the FPGA is just streaming the filter output into memory, so we achieve a really high bandwidth.
Another thing we can look at, on the processor side, is that we can actually simulate the latency of the processing tasks by plugging in the timing information we gathered with processor-in-the-loop profiling. We'll see that here: I'm going to plot the runtime of each of my tasks.
Remember, I had four different tasks, so I'll plot those four separately. They all run for the same amount of time, and you can see that the result isn't available until they complete.
Whereas regular Simulink blocks execute with zero duration, here we actually incorporate the timing of your software into the system-level simulation, which lets you characterize the latency of your system as a whole by including those software timing effects. This concludes Part 3 of the video series. In Part 4, we'll show how to generate C and HDL code for the range-Doppler radar algorithm and automate the deployment of our prototype design to the Xilinx ZCU111 development kit.