Floating-point or fixed-point: a choice between precision and efficiency

Published on 18 June 2024
If you’re looking to optimize software that handles real values, there’s a good chance that switching your data to fixed-point arithmetic will save you runtime and energy.

 

Baptiste Daniel Lamazière
R&D Engineer, specialized in fixed-point arithmetic

_____________________________________________________________________________________________________

Representation of decimal numbers

Floating-point

In computing, the data type most commonly used to represent real numbers is float. This type is defined by the IEEE 754 standard, which specifies its encoding, its arithmetic, and its behavior. According to IEEE 754, a float represents a number in 3 parts:

  1. Its sign s 
  2. Its exponent e
  3. Its mantissa m

We then find the value of the number using the following formula given by the IEEE 754 [08] standard:

v = (−1)^s × 1.m × 2^(e − 127)

where, for a 32-bit float, the stored exponent e is biased by 127 and 1.m denotes the mantissa read with its implicit leading 1 (for normalized numbers).

The IEEE 754 standard allows numbers to be represented with different exponent values that adjust to the values being manipulated at runtime. Using the standard is relatively straightforward for developers: numbers can be written directly in their decimal form, and the standard specifies the behavior of arithmetic errors such as overflow or division by zero.
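To make the three fields concrete, here is a minimal C++ sketch (the variable names are ours) that extracts the sign, exponent, and mantissa of a 32-bit float from its bit pattern:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        float x = -6.25f;
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);  // reinterpret the float's bit pattern

        uint32_t s = bits >> 31;              // 1 sign bit
        uint32_t e = (bits >> 23) & 0xFFu;    // 8 exponent bits, stored with a bias of 127
        uint32_t m = bits & 0x7FFFFFu;        // 23 mantissa bits

        // v = (-1)^s * 1.m * 2^(e - 127) for normalized numbers
        std::printf("s = %u, e = %u (unbiased: %d), m = 0x%06X\n",
                    s, e, static_cast<int>(e) - 127, m);
        return 0;
    }

For x = −6.25, this prints s = 1 and an unbiased exponent of 2, since −6.25 = −1.5625 × 2^2.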

Floating-point arithmetic requires the implementation of specific arithmetic operations. Initially, these were implemented in software, forcing developers to program the operations themselves.

Beginning in the 1950s, however, processors began to include FPUs (Floating Point Units, coprocessors dedicated to floating-point arithmetic), which allowed floating-point calculations to be performed quickly and without developer intervention.

Advances in hardware FPUs have made it possible to use float types on many computers with very low computation times.

 

Fixed-point

Although the floating-point representation method is the most widely used, there are other formats, including fixed-point. Fixed-point is an older format than floating-point, and was used before the advent of the standards and electronic components that allow calculations on floating-point numbers. However, this format is still used in cases where the hardware target running the software does not have an FPU.

Fixed-point representation involves representing a real number in the same way as an integer. The integer part is represented as an integer, while the fractional part is represented as a sum of inverse powers of 2 (2^−1, 2^−2, etc.) [Yates]. Fixed-point representation therefore encompasses several formats. For example, on 32 bits, we can represent real numbers using a fixed-point representation known as Q0.31, i.e. with one sign bit, no integer-part bits and a 31-bit fractional part; we can also opt for a Q15.16 format, i.e. with one sign bit, an integer part coded on 15 bits and a fractional part coded on 16 bits.

To find the value of a fixed-point number, use the following formula:

v = (−1)^s × Σ b_i × 2^(i − n), summed over i = 0 … m+n−1

where

  • v is the value of the number;
  • s is the sign bit: 0 if positive and 1 if negative;
  • m is the size of the integer part;
  • n is the size of the fractional part of our representation;
  • finally, b_i is the value of the bit of index i in the fixed-point representation.
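As an illustration, here is a minimal C++ sketch for a Q15.16 value held in an int32_t (note that most implementations store the value in two's complement rather than the sign-magnitude form of the formula above, so dividing the raw integer by 2^n recovers the value in a single step):

    #include <cstdint>

    // Recover the real value of a Q15.16 number (n = 16 fractional bits)
    // stored in an int32_t. Two's complement handles the sign for us.
    double q15_16_to_double(int32_t raw) {
        return static_cast<double>(raw) / (1 << 16);  // divide by 2^n
    }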

And, to convert a decimal number into its fixed-point equivalent, we apply the following algorithm:

fxp = (int << n) + round(frac × 2^n)

where

  • fxp is the variable storing the fixed-point representation,
  • int is the integer part of the decimal number,
  • and frac is the fractional part of the decimal number.
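A matching sketch for the conversion in the other direction (again assuming Q15.16, with a function name of our own choosing): multiplying by 2^n shifts the integer part into the high bits and packs the rounded fractional part into the low n bits, which is the algorithm above expressed as a single scaling:

    #include <cstdint>
    #include <cmath>

    // Convert a double to Q15.16. Note: no saturation is performed here,
    // so inputs outside [-32768, 32768) silently overflow.
    int32_t double_to_q15_16(double x) {
        return static_cast<int32_t>(std::lround(x * (1 << 16)));  // scale by 2^n, round
    }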

To understand why floating-point representation is often preferred to fixed-point representation, we also need to compare the range of values that can be represented with a single variable of each representation.

 

Dynamic range

To measure the range of values that a variable can represent, we use a quantity called dynamic range. Dynamic range is the ratio, expressed in decibels, of the largest attainable value to the smallest representable value. The formula for dynamic range is therefore as follows, where XMAX is the highest-amplitude value that can be represented, and XMIN is the lowest-amplitude value:

DR(dB) = 20 × log10(XMAX / XMIN)

The value of the dynamic range therefore depends on the size of the exponent in the representation format. If we take the example of the float data type, which represents real numbers on 32 bits with 1 sign bit, 8 exponent bits and 23 mantissa bits, we can estimate the dynamic range as follows: with 8 bits to represent the exponent, the largest possible exponent value is 2^(8−1) − 1 = 127 and the smallest is −127. The XMAX / XMIN ratio is therefore 2^(2×127+1) = 2^255.

The dynamic range of the float data type (IEEE 754) is then about 1535 dB (not bad, not bad! Well, it depends on what you’re comparing it with).

If we compare it with the dynamic range of fixed point, we realize that the dynamic range grows only linearly with the number of bits used to encode the fixed-point variable. If we write NMAX for the index of the most significant bit and NMIN for the index of the least significant bit, we obtain the following formula:

DR(dB) = 20 × (NMAX − NMIN) × log10(2) ≈ 6.02 × (NMAX − NMIN)

For 32 bits, this gives a dynamic range of about 186 dB. A far cry from the 1535 dB of the float format.
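Both figures are easy to check numerically; the short C++ sketch below simply recomputes them from the two formulas above:

    #include <cmath>
    #include <cstdio>

    int main() {
        // float: XMAX / XMIN ~ 2^255, so DR = 20 * 255 * log10(2)
        double dr_float = 20.0 * 255.0 * std::log10(2.0);
        // 32-bit fixed point: NMAX - NMIN = 31
        double dr_fixed = 20.0 * 31.0 * std::log10(2.0);
        std::printf("float: %.0f dB, fixed point: %.1f dB\n", dr_float, dr_fixed);
        // Prints: float: 1535 dB, fixed point: 186.6 dB
        return 0;
    }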

 

Advantages of fixed-point

However, even though fixed point has a smaller dynamic range, its operations can be carried out with ordinary binary and integer instructions, which often consume less energy than the equivalent float operations.

The following table details these performance differences for addition and multiplication, on the ac_int and ac_float types at 32 and 64 bits, on an ASIC target (a 28 nm Fully Depleted Silicon On Insulator, FDSOI). For each operation on the different types, we compare:

  1. Surface area used by the operation
  2. Total power used by the design
  3. The critical path (i.e. latency) of the operation
  4. Power-Delay Product (PDP), the energy required to perform the operation
Operation          Area (µm²)   Total power (mW)   Critical path (ns)   Power-Delay Product (fJ)
32-bit float ADD   653          4.39E-4            2.42                 1.06E-3
64-bit float ADD   1453         1.12E-3            4.02                 4.50E-3
32-bit int ADD     189          3.66E-5            1.06                 3.88E-5
64-bit int ADD     373          7.14E-5            2.10                 1.50E-4
32-bit float MUL   1543         8.94E-4            2.09                 1.87E-3
64-bit float MUL   6464         6.56E-3            4.70                 3.08E-2
32-bit int MUL     2289         6.53E-5            2.38                 1.55E-4
64-bit int MUL     8841         1.84E-4            4.52                 8.31E-4

Table 1: Comparison of costs for floating-point and integer operations [Barrois]

The results in this table show that switching to fixed-point arithmetic reduces the latency of 32- and 64-bit additions and of 64-bit multiplications. What’s more, switching to fixed-point arithmetic reduces the power consumption of addition and multiplication operations regardless of data size.

 

In summary, by using fixed-point arithmetic, some arithmetic instructions (such as addition) become shorter in number of cycles than their floating-point equivalents. However, this is not the case for all operations: 32-bit multiplication takes slightly longer in fixed-point arithmetic than in floating-point arithmetic.

Also, fixed-point arithmetic operations require less power to perform than floating-point arithmetic operations (Table 1).

 

How do you convert code from floating-point to fixed-point?

When converting code to fixed point, the representation cannot adapt at runtime to the orders of magnitude of the values a variable takes, as floating point does. So, before executing the code, you need to know what values each variable will take.

To do so, WedoLow performs a dynamic analysis of the floating-point variables in the program you wish to optimize: we instrument these variables by replacing their type with one that records all the values taken by the variable during program execution.

There are many ways to find out the order of magnitude of the values variables take when the code is executed, but it is best to choose a method that can be automated (especially if you want to convert a large number of variables) and that gives a readable, understandable, and exploitable result (displaying every value a variable takes in the console is not very kind to your eyes).
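WedoLow’s actual instrumentation is not reproduced here, but the idea can be sketched in a few lines of C++ (C++17): a hypothetical drop-in type that behaves like a float while recording the largest magnitude it ever holds. A real tool would keep one recorder per variable; this sketch uses a single shared one for brevity.

    #include <algorithm>
    #include <cmath>

    struct TracedFloat {
        float value;
        static inline float max_abs = 0.0f;  // largest magnitude seen so far

        TracedFloat(float v = 0.0f) : value(v) { record(v); }
        TracedFloat& operator=(float v) { value = v; record(v); return *this; }
        operator float() const { return value; }

        static void record(float v) { max_abs = std::max(max_abs, std::fabs(v)); }
    };

Replacing a variable’s float type with TracedFloat leaves the program’s behavior unchanged while max_abs accumulates exactly the order-of-magnitude information needed to pick a Q format.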

 

At the end of this dynamic analysis, we can define the format of the fixed-point representation that each variable will take. Be careful, because a representation format that is inappropriate for the values taken by the variable can lead to precision errors (i.e., loss of low-order bits because the integer part is larger than necessary) or overflow errors (i.e., loss of high-order bits because the fractional part is too large).

You should also be careful not to write operations between variables of different formats too often, as these require the addition of alignment shifts. These shifts generate additional instructions in the generated machine code, and thus additional execution time.

To avoid having too many of these shifts in our fixed-point code, we can group together variables that have operations in common and assign them the same representation format.
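The cost is easy to see in a small example (a hypothetical function, with Q formats chosen purely for illustration): adding a Q7.24 value to a Q15.16 value forces one operand to be realigned first.

    #include <cstdint>

    // Add a Q7.24 operand to a Q15.16 operand. The Q7.24 value must first
    // be realigned to 16 fractional bits: one extra shift instruction.
    int32_t add_mixed(int32_t a_q15_16, int32_t b_q7_24) {
        int32_t b_aligned = b_q7_24 >> 8;  // drop 8 fractional bits
        return a_q15_16 + b_aligned;       // result in Q15.16
    }

If both variables shared the same format, the shift would disappear entirely.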

The trade-off for each variable is between precision and reducing the number of instructions. 

For each variable, you can choose an optimal format according to the maximum order of magnitude the variable will take. But you can also choose to split a variable in two so as to have two variables represented more accurately, or conversely, you can choose to represent several variables with the same format to reduce execution time.
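As a sketch of the first option (a hypothetical helper, assuming a 32-bit signed container), the integer part only needs enough bits to hold the largest magnitude observed during the dynamic analysis, and every remaining bit goes to precision:

    #include <cmath>
    #include <cstdio>

    // Suggest a Qm.n format (1 sign bit + m + n = 32 bits) from the
    // largest magnitude a variable was observed to take.
    void suggest_format(double max_abs) {
        int int_bits  = static_cast<int>(std::ceil(std::log2(max_abs + 1.0)));
        int frac_bits = 31 - int_bits;  // the rest of the 31 value bits
        std::printf("suggested format: Q%d.%d\n", int_bits, frac_bits);
    }

For a variable whose values stay below 1000 in magnitude, for instance, this suggests Q10.21.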

 

Verifications

Once optimization is complete, it is important to verify that the optimized code still produces the same result as the original version. Therefore, we perform a non-regression test with the same input data and quantify the difference between the output values to calculate the error caused by the change in representation format. 
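A minimal sketch of such a check (a hypothetical function; the error metric here is the maximum absolute difference, but a relative error or an SNR would work just as well):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Compare the reference (floating-point) outputs with the outputs of
    // the fixed-point version and return the worst absolute deviation.
    double max_abs_error(const std::vector<double>& ref,
                         const std::vector<double>& fxp) {
        double err = 0.0;
        for (std::size_t i = 0; i < ref.size(); ++i)
            err = std::max(err, std::fabs(ref[i] - fxp[i]));
        return err;
    }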

Please note that fixed-point types do not necessarily have an error-handling mechanism, as IEEE 754 floats do. There is no NaN (Not a Number), division by zero is not handled, and it is up to the developer to add checks to avoid this kind of error.

Also note that in certain (extremely rare) cases, the fixed-point format may be more accurate than the floating-point format. In such cases, you must compare the resulting values against a more precise reference (double precision, or even multiple-precision types such as MPFR [MPFR] [MPFR2]).

To sum up

  1. Fixed-point optimization consists in changing the representation format of real numbers from a floating-point representation, which adapts to the magnitude of the values taken by a given variable, to a fixed-point representation, which cannot adapt to those values and therefore fixes a magnitude for its representation in advance.
  2. To define this order of magnitude, you need to know in advance what values the variables will take when the program runs.
  3. Once we can anticipate the magnitudes of the variables, we need to define the format of the fixed-point representations. For each variable, a trade-off is made between reducing execution time and the accuracy of the result.
  4. Switching to fixed point changes the nature of the assembly instructions needed to perform arithmetic operations on decimal numbers. These new instructions are generally shorter and consume less power (Table 1).
  5. Switching to fixed point has an impact on the results of arithmetic operations involving decimal numbers, so it is necessary to compare the behavior of the optimized code with that of the original code.

Sources

[Yates] YATES, Randy. Fixed-point arithmetic: An introduction. Digital Signal Labs, 2009, vol. 81, no 83, p. 198.

[MPFR] FOUSSE, Laurent, HANROT, Guillaume, LEFÈVRE, Vincent, et al. MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Transactions on Mathematical Software (TOMS), 2007, vol. 33, no 2, p. 13-es.

[MPFR2] GNU MPFR 4.2.1 manual, https://www.mpfr.org/mpfr-current/mpfr.html#Top

[Barrois] BARROIS, Benjamin. Methods to evaluate accuracy-energy trade-off in operator-level approximate computing. PhD thesis, Université de Rennes 1, 2017. ⟨NNT: 2017REN1S097⟩, ⟨tel-01665015v2⟩.

[08] “IEEE Standard for Floating-Point Arithmetic”. In: IEEE Std 754-2008 (2008), p. 1-70. doi: 10.1109/IEEESTD.2008.4610935.