Many references describe implementing fixed-point designs using fractional data types. These signed fractional data types represent values from -1 to just under +1.

As an example, consider the finite-impulse-response (FIR) filter shown in the image. I'd like to convert it to fixed-point using the fractional types that many reference materials recommend.

But the coefficients lie outside the interval [-1, +1), so how do I achieve this design using fractional types?

Andy Bartlett
on 6 May 2021

Edited: Andy Bartlett
on 6 May 2021

I recommend against trying to force a design to use fractional types. General fixed-point scaling has a significant advantage over restricting yourself to fractional scaling: the constants and variables show their values on scopes, displays, etc., in the same engineering units that were preferred in the original floating-point design. This makes testing and debugging much easier and less prone to mistakes.

In this answer, I'll show three ways to model the design. First, using general fixed-point scaling. Second, using raw integer modeling (also not recommended). Third, attempting to use fractional types. All the designs assume the input, in the original engineering units, ranges over the interval [-8, 8). In all three designs, the input and final output will be 8 bits.

General Fixed-Point Scaling

The general fixed-point design is shown here.

Notice that the coefficients of the FIR have the same real-world values as in the original floating-point design (to within quantization). The second display block shows the stored integer values that would be observed in the low-level C code generated from this model. The FIR uses the common pattern of full precision for intermediate calculations. For example, the product block has 8-bit inputs and a 16-bit output. Also notice that the inputs have scalings 2^-4 and 2^-2, while the output has the product of those two scalings, 2^-6. The reducing sum block, like its input, has 6 bits to the right of the binary point, so no precision is lost there either. The sum block's accumulator and output types are 32 bits, so the reducing addition of three inputs cannot overflow. The cast from the 32-bit accumulator type to the final 8-bit output does lose precision. The final output type, fixdt(1,8,-2), covers the range [-512, 508], which is just big enough to prevent any overflow.
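The ranges and scalings quoted above can be checked with a few lines of arithmetic. This is a minimal Python sketch, not Simulink code. The helper name `type_range` is hypothetical, and it assumes binary-point-only scaling, where a signed type with word length W and fraction length F stores integers from -2^(W-1) to 2^(W-1)-1 with slope 2^-F.

```python
def type_range(word_length, fraction_length):
    """Real-world (min, max) of a signed binary-point-scaled type.

    Hypothetical helper for illustration. fraction_length may be negative,
    as in fixdt(1,8,-2).
    """
    slope = 2.0 ** -fraction_length
    lowest = -(2 ** (word_length - 1)) * slope
    highest = (2 ** (word_length - 1) - 1) * slope
    return lowest, highest


input_range = type_range(8, 4)      # input, slope 2^-4
coeff_range = type_range(8, 2)      # coefficients, slope 2^-2
product_fraction_length = 4 + 2     # fraction lengths add under multiplication
output_range = type_range(8, -2)    # final output, fixdt(1,8,-2)

print(input_range)    # (-8.0, 7.9375)
print(coeff_range)    # (-32.0, 31.75)
print(output_range)   # (-512.0, 508.0)
```

The input type covers the assumed [-8, 8) input interval, and the product's 6 fractional bits match the 2^-6 scaling of the product block output.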

In the scope below, the first and second rows show the floating-point baseline and this general fixed-point design. Notice that the shapes of the response are the same. Also notice that the values have the same engineering units and both range from about -380 to +380.

Raw Integer Scaling

People with a background in coding in C or HDL often try to model the fixed-point math using raw integers. This takes extra care but can be done as shown in the following image.

Tip: notice the cast block on the left displays (SI) on the icon. In that mode, the cast block is ignoring the input scaling and is just reinterpreting the input as a raw integer. This allows the same test model to be used for all the designs.

Notice that the user must handle all the details of the raw integer design. For example, the constant block that holds the FIR coefficients needs raw integer values for its parameter. When raw integers are used for the coefficients, there is no automated way to see that their meaning agrees with the original real-world values.
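One way to reduce that risk is to compute the raw integers from the real-world values and convert them back as a check. A minimal Python sketch follows. The coefficient values are hypothetical placeholders (the actual coefficients appear only in the image), the helper names are made up for illustration, and the 2^-2 scaling is the coefficient scaling of the general fixed-point design.

```python
import math

FRACTION_LENGTH = 2  # coefficient scaling 2^-2, as in the general design


def to_stored_integer(value, fraction_length):
    """Quantize a real-world value to a stored integer (nearest, ties toward +inf)."""
    return math.floor(value * 2 ** fraction_length + 0.5)


def to_real_world(stored, fraction_length):
    """Convert a stored integer back to its real-world value."""
    return stored * 2.0 ** -fraction_length


coefficients = [12.5, 22.25, 12.5]  # hypothetical, NOT the values in the image
raw = [to_stored_integer(c, FRACTION_LENGTH) for c in coefficients]
recovered = [to_real_world(r, FRACTION_LENGTH) for r in raw]

print(raw)        # [50, 89, 50]
print(recovered)  # [12.5, 22.25, 12.5]
print(all(-128 <= r <= 127 for r in raw))  # True: each fits in a signed 8-bit word
```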

In the scope above, the first and third rows show the floating-point baseline and the raw integer design. Notice that the shapes of the responses are the same, but the units clearly are not. The original covers about -380 to +380, while the raw integers cover about -90 to +90. For testing and debugging, the user needs to take care to convert these to common units before comparison.
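A minimal sketch of that conversion, in Python: each raw integer is multiplied by the slope of the type it stands for. It assumes the raw output corresponds to the stored integers of the general design's output type, fixdt(1,8,-2), whose slope is 2^2. The function name and sample values are illustrative and are not read from the scope.

```python
OUTPUT_SLOPE = 2.0 ** 2  # slope of fixdt(1,8,-2), assumed to apply to the raw output


def raw_to_engineering(raw_values, slope=OUTPUT_SLOPE):
    """Convert raw integer samples to engineering units by applying the slope."""
    return [v * slope for v in raw_values]


print(raw_to_engineering([-95, 0, 90]))  # [-380.0, 0.0, 360.0]
```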

Signed Fractional Design

The third approach attempts to partially use signed fractional types as shown here.

The signed 8-bit input has been reinterpreted as signed fractional. The FIR coefficient values were also scaled down by a factor of 32 to fit in a signed fractional type. This scaling of the coefficients needs to be done manually (see the attached script). Because the FIR is linear, normalizing the input and the coefficients to the -1 to +1 range can be done independently; linearity guarantees that this changes the final output only by a constant gain factor. If the design had nonlinear elements, any attempt to normalize a signal would need to be carefully coordinated with every dependent signal and operation.
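The linearity argument can be checked numerically. The following Python sketch runs a floating-point FIR twice, once in engineering units and once with the input divided by 8 (mapping [-8, 8) to [-1, 1)) and the coefficients divided by 32. The coefficients and input samples are hypothetical, and `fir` is a plain direct-form implementation written for this illustration.

```python
def fir(coefficients, samples):
    """Direct-form FIR, y[n] = sum_k b[k] * x[n-k], with zero initial state."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, b in enumerate(coefficients):
            if n - k >= 0:
                acc += b * samples[n - k]
        out.append(acc)
    return out


coeffs = [12.5, 22.25, 12.5]        # hypothetical, NOT the values in the image
x = [-8.0, 7.5, 3.25, -1.0, 0.5]    # hypothetical input within [-8, 8)

y = fir(coeffs, x)
y_normalized = fir([b / 32 for b in coeffs], [v / 8 for v in x])

# The two outputs differ only by the constant gain 8 * 32 = 256.
print(all(yn * 256 == yo for yn, yo in zip(y_normalized, y)))  # True
```

The comparison is exact here because both scale factors are powers of two; with other factors a tolerance would be needed.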

The product and sum operations in this FIR were not changed to fractional types because those types are not a natural fit for the resulting math. The natural product of two signed 8-bit fractional types is signed 16 bits with fraction length 14 (not 15). Immediately forcing the product output to signed 16-bit fractional could lead to overflow (for the input case -1 times -1 = +1) and would shift in a least significant bit that is uselessly always zero. The sum uses the same scaling as the products, which provides extra range for a sum that could go from -3 to +3 for arbitrary coefficients. Clearly, -3 to +3 doesn't fit in a fractional type. The worst-case output for this FIR is roughly -1.3 to +1.3, which a fractional type cannot handle without overflow. Hence the final output type has range [-2, 2). If we wanted to present a fractional type to the next stage of the model, we could insert another (SI) cast to introduce another scaling reinterpretation.
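The -1 times -1 case can be worked through on stored integers. This short Python sketch is an illustration only; the variable names are made up, and Python integers are unbounded, so the overflow is detected by comparison rather than by wrapping.

```python
INT16_MAX = 2 ** 15 - 1        # 32767, largest signed 16-bit stored integer

minus_one = -128               # stored integer of -1.0 with fraction length 7
product = minus_one * minus_one           # natural fraction length 7 + 7 = 14

print(product)                 # 16384
print(product * 2.0 ** -14)    # 1.0, representable with fraction length 14
print(product <= INT16_MAX)    # True

# Forcing fraction length 15 means shifting left by one bit.
forced = product << 1
print(forced)                  # 32768
print(forced > INT16_MAX)      # True: +1.0 does not fit in 16-bit signed fractional
```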

In the scope above, the first and fourth rows show the floating-point baseline and the partial fractional design. Notice that the shapes of the responses are the same, but the units clearly are not. The original covers about -380 to +380, while the partial fractional design covers only about -1.3 to +1.3. For testing and debugging, the user needs to take care to convert these to common units before comparison.

Embedded Implementation is Identical

The generated C code for an ARM Cortex M target is identical for these three designs, as shown in the image below.

They all multiply the same raw integer input by the same integer constants to produce three signed 16-bit products. They all add their three 16-bit products in a 32-bit accumulator. They all shift the accumulation 8 bits to the right, then downcast to 8 bits.
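The steps just described can be mimicked in Python to see why the three designs coincide: the integer arithmetic is the same, and only the slope used to interpret the result differs. This is a sketch that follows the prose description, not a transcription of the generated code. The coefficient integers are hypothetical, and it assumes floor rounding (a plain arithmetic right shift) and a wrapping downcast.

```python
def wrap_int8(value):
    """Mimic a C cast to int8_t on a two's complement target."""
    return ((value + 128) & 0xFF) - 128


def fir_integer(coeff_ints, x_ints):
    """Multiply, accumulate, shift right 8 bits, downcast to 8 bits."""
    out = []
    for n in range(len(x_ints)):
        acc = 0  # stands in for the 32-bit accumulator
        for k, c in enumerate(coeff_ints):
            if n - k >= 0:
                acc += c * x_ints[n - k]    # int8 * int8 fits in 16 bits
        out.append(wrap_int8(acc >> 8))     # >> on Python ints floors, like an arithmetic shift
    return out


coeff_ints = [50, 89, 50]       # hypothetical stored integers, NOT from the image
x_ints = [-128, -128, -128]     # stored input held at its most negative value
y_ints = fir_integer(coeff_ints, x_ints)

print(y_ints)                            # [-25, -70, -95]  raw integer view
print([v * 2.0 ** 2 for v in y_ints])    # [-100.0, -280.0, -380.0]  general scaling, slope 2^2
print([v * 2.0 ** -6 for v in y_ints])   # [-0.390625, -1.09375, -1.484375]  output range [-2, 2)
```

The same list of stored integers reads as engineering units, raw counts, or near-fractional values depending only on the slope applied afterwards.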

Final Recommendation

Three ways to model the original FIR in fixed-point have been presented. They all produced identical generated C code in the end. The general fixed-point scaling approach required the least work from the user and retained the user's preferred engineering units from the original design, which is a major advantage for testing and debugging. The raw integer and fractional type approaches can be made to work but require much more effort and care from the user. Given identical generated code and an advantageous workflow, I recommend using the general fixed-point scaling approach.
