Data Types and Scaling in Digital Hardware

Fixed-Point Data Types

In digital hardware, numbers are stored in binary words. A binary word is a fixed-length sequence of bits (1s and 0s). How hardware components or software functions interpret this sequence of 1s and 0s is defined by the data type. Binary numbers are represented as either fixed-point or floating-point data types.

A fixed-point data type is characterized by the word length in bits, the position of the binary point, and whether it is signed or unsigned. The position of the binary point is the means by which fixed-point values are scaled and interpreted.

For example, a generalized fixed-point number (either signed or unsigned) has the following binary representation:

$b_{wl-1} \; b_{wl-2} \; \ldots \; b_5 \; b_4 \; . \; b_3 \; b_2 \; b_1 \; b_0$

where

$b_i$ is the $i$th binary digit.
$wl$ is the word length in bits.
$b_{wl-1}$ is the location of the most significant, or highest, bit (MSB).
b₀ is the location of the least significant, or lowest, bit (LSB).
The binary point is shown four places to the left of the LSB. In this example, the number is said to have four fractional bits, or a fraction length of four.
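
As a worked sketch of this layout, a word with four fractional bits that is read as unsigned has the value given by weighting each bit by a power of two offset by the fraction length. The 8-bit pattern 01101100 is an arbitrary illustration, not a value taken from the text above.

$\begin{array}{l} V = \sum_{i=0}^{wl-1} b_i \times 2^{i-4} \\ 0110.1100 = 2^{2} + 2^{1} + 2^{-1} + 2^{-2} = 6.75 \end{array}$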

Fixed-point data types can be either signed or unsigned. Whether a fixed-point value is signed or unsigned is usually not encoded explicitly within the binary word; that is, no bit in the word records the signedness of the data type. Instead, the sign information is implicitly defined within the computer architecture.

Signed binary fixed-point numbers are typically represented in computer hardware in one of three ways:

Sign/magnitude – One bit of a binary word is always the dedicated sign bit, while the remaining bits of the word encode the magnitude of the number. Negation using sign/magnitude representation consists of flipping the sign bit from 0 (positive) to 1 (negative), or from 1 to 0.
One's complement – Negating a binary number in one's complement requires a bitwise complement. That is, all 0s are flipped to 1s and all 1s are flipped to 0s. In one's complement notation there are two ways to represent zero. A binary word of all 0s represents "positive" zero, while a binary word of all 1s represents "negative" zero.
Two's complement – Negation using signed two's complement representation consists of a bit inversion (translation into one's complement) followed by the binary addition of a one. For example, the two's complement of 000101 is 111011.
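
The three negation rules above can be sketched in a few lines of Python. This is only an illustration: the 6-bit word length is an assumption chosen to match the 000101 example, and the function names are hypothetical.

```python
WORD_LENGTH = 6  # assumed word length, chosen to match the 000101 example
MASK = (1 << WORD_LENGTH) - 1


def negate_sign_magnitude(word: int) -> int:
    # Flip only the dedicated sign bit (the MSB).
    return word ^ (1 << (WORD_LENGTH - 1))


def negate_ones_complement(word: int) -> int:
    # Flip every bit in the word.
    return ~word & MASK


def negate_twos_complement(word: int) -> int:
    # Invert every bit, add one, and discard any carry out of the word.
    return (negate_ones_complement(word) + 1) & MASK


def to_bits(word: int) -> str:
    # Render the word as a zero-padded binary string, MSB first.
    return format(word, f"0{WORD_LENGTH}b")


five = 0b000101
print(to_bits(negate_sign_magnitude(five)))   # 100101
print(to_bits(negate_ones_complement(five)))  # 111010
print(to_bits(negate_twos_complement(five)))  # 111011
print(to_bits(negate_ones_complement(0)))     # 111111, the "negative" zero
```

Negating zero in two's complement returns zero, because the carry out of the word is discarded. That is why two's complement has a single representation of zero.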

Two's complement is the most common representation of signed fixed-point numbers and is the only representation used by Fixed-Point Designer™ documentation.

Binary Point Interpretation

The binary point is the means by which fixed-point numbers are scaled. It is usually the software that determines the binary point. When performing basic math functions such as addition or subtraction, the hardware uses the same logic circuits regardless of the value of the scale factor. In essence, the logic circuits have no knowledge of a scale factor. They are performing signed or unsigned fixed-point binary algebra as if the binary point is to the right of b₀.
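
To illustrate that the adder is unaware of the scale factor, the sketch below adds two stored integers once and then reads the same result under several fraction lengths. The helper name, the 8-bit word, the operand values, and the wrap-on-overflow behavior are all assumptions made for the illustration.

```python
def add_stored(a: int, b: int, word_length: int) -> int:
    """Unsigned integer add that wraps at word_length bits and ignores scaling."""
    return (a + b) & ((1 << word_length) - 1)


a, b = 0b00010100, 0b00001100  # stored integers 20 and 12
total = add_stored(a, b, 8)    # 32, computed with no knowledge of the binary point

# Read the same three stored integers under different assumed fraction lengths.
for fraction_length in (0, 4, 7):
    scale = 2.0 ** -fraction_length
    print(fraction_length, a * scale, b * scale, total * scale)
```

In every row the third value is the sum of the first two. Only the software's interpretation of the bits changes, not the addition itself.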

Fixed-Point Designer supports general binary point scaling V = Q ✕ 2^E, where V is the real-world value, Q is the stored integer value, and the fixed exponent E is equal to the negative of the fraction length. In other words, RealWorldValue = StoredInteger ✕ 2^{−FractionLength}.
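
The scaling formula can be checked with a short sketch. The function name is hypothetical, and `math.ldexp` is used here only as a convenient way to multiply by a power of two.

```python
import math


def real_world_value(stored_integer: int, fraction_length: int) -> float:
    """Apply V = Q * 2**E, where E = -fraction_length."""
    return math.ldexp(stored_integer, -fraction_length)


# The same stored integer, Q = 5 (binary 101), under several fraction lengths.
for fraction_length in (0, 1, 2, 3):
    print(fraction_length, real_world_value(5, fraction_length))
```

The four results are 5.0, 2.5, 1.25, and 0.625: each additional fractional bit halves the real-world value of the same stored integer.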

The fraction length defines the scaling of the stored integer value. The word length limits the values that the stored integer can take, but it does not limit the values that the fraction length can take. The software does not restrict the value of the exponent E based on the word length of the stored integer Q. Because E is equal to −FractionLength, restricting the binary point to being contiguous with the fraction is unnecessary; the fraction length can be negative or greater than the word length.

For example, a word consisting of three unsigned bits is usually represented in scientific notation in one of the following ways:

$\begin{array}{l} b b b . = b b b . \times 2^{0} \\ b b . b = b b b . \times 2^{- 1} \\ b . b b = b b b . \times 2^{- 2} \\ . b b b = b b b . \times 2^{- 3} \end{array}$

If the exponent were greater than 0 or less than -3, then the representation would involve additional zeros:

$\begin{array}{l} bbb00000. = bbb. \times 2^{5} \\ bbb00. = bbb. \times 2^{2} \\ .00bbb = bbb. \times 2^{-5} \\ .00000bbb = bbb. \times 2^{-8} \end{array}$

These extra zeros never change to ones, so they do not show up in the hardware. Unlike floating-point exponents, a fixed-point exponent never shows up in the hardware, so fixed-point exponents are not limited by a finite number of bits.

Consider a signed value with a word length of 8, a fraction length of 10, and a stored integer value of 5 (binary value 00000101). The real-world value is calculated using the formula RealWorldValue = StoredInteger ✕ 2^{−FractionLength}. In this case, RealWorldValue = 5 ✕ 2⁻¹⁰ = 0.0048828125. Because the fraction length is 2 bits longer than the word length, the real-world value in binary is x.xx00000101, where each x is a placeholder for an implicit zero. That is, the value is 0.0000000101 (binary), which is equivalent to 0.0048828125 (decimal). For an example using a fi object, see Fraction Length Greater Than Word Length.
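
The placement of implicit zeros can be sketched as a small formatting helper. This is an illustration only: the function name is hypothetical, and it handles non-negative stored integers, which is enough for the example above.

```python
def binary_point_string(stored_integer: int, word_length: int, fraction_length: int) -> str:
    """Render a non-negative stored integer with its binary point and implicit zeros."""
    if stored_integer < 0:
        raise ValueError("this sketch handles non-negative stored integers only")
    bits = format(stored_integer, f"0{word_length}b")
    if fraction_length <= 0:
        # Binary point lies at or to the right of the LSB: append implicit zeros.
        return bits + "0" * (-fraction_length) + "."
    if fraction_length >= word_length:
        # Binary point lies at or to the left of the MSB: prepend implicit zeros.
        return "." + "0" * (fraction_length - word_length) + bits
    # Binary point falls inside the word.
    split = word_length - fraction_length
    return bits[:split] + "." + bits[split:]


print(binary_point_string(5, 8, 10))      # .0000000101
print(5 * 2.0 ** -10)                     # 0.0048828125
print(binary_point_string(0b101, 3, -2))  # 10100.
print(binary_point_string(0b101, 3, 5))   # .00101
```

The implicit zeros appear only in the printed string. The stored word is still the original 8 (or 3) bits.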

Floating-Point Data Types

Floating-point data types are characterized by a sign bit, a fraction (or mantissa) field, and an exponent field. Fixed-Point Designer adheres to the IEEE® Standard 754-1985 for Binary Floating-Point Arithmetic (referred to simply as the IEEE Standard 754 throughout this guide) and supports half-, single- and double-precision data types.
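
As an illustration of the three fields, the sketch below splits a single-precision value into its sign bit, 8-bit biased exponent, and 23-bit fraction field using Python's standard `struct` module. The function name is hypothetical, and the example does not use Fixed-Point Designer.

```python
import struct


def float32_fields(value: float) -> tuple[int, int, int]:
    """Split a value, rounded to single precision, into (sign, exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    sign = word >> 31
    exponent = (word >> 23) & 0xFF  # 8-bit exponent, biased by 127
    fraction = word & 0x7FFFFF      # 23-bit fraction field
    return sign, exponent, fraction


print(float32_fields(-6.75))  # (1, 129, 5767168)
```

Here −6.75 = −1.6875 ✕ 2², so the biased exponent is 127 + 2 = 129 and the fraction field holds 0.6875 ✕ 2²³ = 5767168.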

When choosing a data type, you must consider these factors:

The numerical range of the result
The precision required of the result
The associated quantization error (i.e., the rounding mode)
The method for dealing with exceptional arithmetic conditions

These choices depend on your specific application, the computer architecture used, and the cost of development, among other factors.
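
For a fixed-point type with binary point scaling, the first two factors follow directly from the word length and fraction length. The sketch below computes them, assuming two's complement for signed types. The function name and the two example types are hypothetical.

```python
def fixed_point_limits(word_length: int, fraction_length: int, signed: bool = True):
    """Return (minimum, maximum, precision) of a binary-point-scaled type."""
    precision = 2.0 ** -fraction_length  # real-world weight of one LSB
    if signed:
        # Two's complement stored-integer range.
        lowest, highest = -(1 << (word_length - 1)), (1 << (word_length - 1)) - 1
    else:
        lowest, highest = 0, (1 << word_length) - 1
    return lowest * precision, highest * precision, precision


print(fixed_point_limits(16, 12))       # (-8.0, 7.999755859375, 0.000244140625)
print(fixed_point_limits(8, 4, False))  # (0.0, 15.9375, 0.0625)
```

For a fixed word length, each additional fractional bit halves both the precision step and the representable range.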

With Fixed-Point Designer, you can explore the relationship between data types, range, precision, and quantization error in the modeling of dynamic digital systems. With Simulink® Coder™, you can generate production code based on that model. With HDL Coder™, you can generate portable, synthesizable VHDL® and Verilog® code from Simulink models and Stateflow® charts.