Floating-point arithmetic – Part 3

Practical consequences

Expressions like

\(s = \sum_{i=1}^{n} x_i\)

are very common in scientific computing; they are known as accumulations. In exact arithmetic the order in which the elements are accumulated does not matter. For floating-point numbers the ordering does matter, as we saw in the previous section. The naive accumulation algorithm is

s = 0
for i = 1:n
    s = s + x(i);          % accumulation
end

Changing the order in which the elements \(x_i\) are accumulated may change the final result, if only slightly. For general data it is virtually impossible to know beforehand which ordering will produce the most accurate result. It is, however, possible to improve the overall numerical accuracy by using so-called compensated summation, first proposed by Kahan. The idea is to carry a (small) compensation term \(e\), capturing the part of each addition lost to rounding, and to add it back to the next element \(x_i\) before accumulating it into the partial sum \(s\).

s = 0
e = 0
for i = 1:n
    t = s;                % temporary storage
    y = x(i) + e;         % add correction
    s = t + y;            % accumulation
    e = (t - s) + y;      % new correction
end
In exact arithmetic the correction \(e\) is always zero, in which case the algorithm reduces to the naive one. Note that the parentheses in the expression for \(e\) are essential: they enforce the evaluation order that recovers the rounding error; simplified algebraically, the correction would always vanish. Kahan's algorithm does not make floating-point rounding errors go away, but its overall numerical behavior is much better than that of direct accumulation.
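As an illustration, here is a direct Python transcription of the two loops above (Python stands in for the pseudocode; the variable names mirror the listings). Summing one million copies of 0.1, a value that is not exactly representable in binary, makes the difference visible:

```python
from fractions import Fraction

def naive_sum(xs):
    s = 0.0
    for x in xs:
        s = s + x              # plain accumulation
    return s

def kahan_sum(xs):
    s = 0.0
    e = 0.0                    # compensation term
    for x in xs:
        t = s                  # temporary storage
        y = x + e              # add correction
        s = t + y              # accumulation
        e = (t - s) + y        # new correction
    return s

data = [0.1] * 1_000_000
exact = Fraction(0.1) * 1_000_000   # exact sum of the stored doubles

err_naive = abs(Fraction(naive_sum(data)) - exact)
err_kahan = abs(Fraction(kahan_sum(data)) - exact)
print(err_kahan < err_naive)        # the compensated sum is closer to exact
```

The naive loop drifts by roughly a microsecond-scale amount relative to the exact sum, while the compensated loop stays within an ulp or two of it.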

The Vancouver Stock Exchange index

The intricacies of floating-point arithmetic are not only of academic interest. We have so far not touched upon the rounding method; a detailed treatment is beyond the scope of this presentation. In our simplified example in a previous blog post we tacitly assumed rounding by truncation (or chopping). Another alternative is rounding to the nearest floating-point number. It is important to understand which method a particular system uses, since it affects the final result. The following real-world example shows what can happen. The initial value of the index of the Vancouver Stock Exchange was set to 1,000 in January 1982. Towards the end of 1983 the index had slipped to around 520, a figure that did not correspond to economic reality. The index was truncated after every update, and the truncation errors accumulated into a steady downward drift. When the index was recalculated with round-to-nearest it almost doubled in value, in much better agreement with the actual economic situation. The IEEE 754 standard prescribes two families of rounding modes:

  • Round-to-nearest: this is the default mode
  • Directed rounding: round towards zero, towards plus infinity, or towards minus infinity
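The systematic downward drift caused by truncation is easy to reproduce with Python's decimal module, which lets us choose the rounding mode explicitly. The four-digit precision and the 0.6 increment below are of course toy values, not the actual index computation:

```python
from decimal import Decimal, localcontext, ROUND_DOWN, ROUND_HALF_EVEN

def accumulate(rounding, n=10, increment=Decimal('0.6')):
    # Repeatedly add a small increment to a value kept at
    # four significant digits, rounding after every addition.
    with localcontext() as ctx:
        ctx.prec = 4
        ctx.rounding = rounding
        s = Decimal(1000)
        for _ in range(n):
            s = s + increment
        return s

print(accumulate(ROUND_DOWN))       # truncation: stuck at 1000
print(accumulate(ROUND_HALF_EVEN))  # round-to-nearest: reaches 1010
```

With truncation every fractional part is discarded and the sum never moves, a bias in one direction only, just like the Vancouver index. Round-to-nearest errs too (the exact sum is 1006), but its errors do not all point the same way.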

FMA enabled hardware

Accumulation algorithms often enjoy hardware support. Both Intel and AMD processors support the fused multiply-add, or FMA for short. The core operation is

\(d = \mathrm{round}(a \cdot b + c)\)

that is, the product and the addition are performed as one operation with a single rounding at the end, instead of one rounding after the multiplication and another after the addition. In the FMA3 instruction set this operation is carried out using three registers. Upgrading from one processor model to another – everything else left intact – can therefore change the result of floating-point arithmetic operations. The example below is taken from a real-world case.
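An FMA computes \(a \cdot b + c\) exactly and rounds once at the end. Python has no portable FMA primitive (math.fma only arrived in Python 3.13), but the single-rounding semantics can be emulated exactly with rationals. The value of a below is deliberately chosen so that the two-rounding and one-rounding results differ:

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    # Fused multiply-add: a*b + c computed exactly, then rounded ONCE
    # to the nearest double when converting back to float.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-27          # a*a = 1 + 2**-26 + 2**-54; the low bit
                            # is lost when the product is rounded

two_roundings = a * a - 1.0           # product rounded, then sum rounded
one_rounding = fma_emulated(a, a, -1.0)

print(two_roundings == one_rounding)  # False: the results differ
print(one_rounding - two_roundings)   # the recovered low-order part, 2**-54
```

This is exactly the kind of one-ulp-scale discrepancy that appears when a kernel is moved to FMA-capable hardware without any source change.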

A compute kernel had been running in Azure VMs based on the Gen 3 Sandy Bridge processor architecture (Intel E5-2660). This architecture does not support FMA3. The underlying HW infrastructure was then upgraded to the newer Gen 4 Haswell processor architecture (Intel E5-2673), which does support FMA3. The compute runtime and kernel were left intact; no changes were made. When the results from the two HW platforms were compared, one of the test cases showed a small difference. The “culprit” turned out to be the FMA3 support, which introduced a new – presumably more accurate – way of doing accumulations in hardware. The difference is not an error per se, but a consequence of the non-associative nature of floating-point arithmetic; the “error” is the difference itself. Once the root cause was identified, the solution was to turn off the FMA3 code paths on the new platform by rebuilding the binaries with a call to _set_FMA3_enable(0). This change resulted in identical numerical results on both processor architectures.

Compiler settings

We have seen in the previous examples that rearranging floating-point operations can cause undesirable effects. Porting serial code to a parallel version rearranges the order in which the arithmetic happens, especially in reductions such as summations and accumulations. One should therefore not expect bit-identical output when comparing results from the serial and the parallel versions of a code.
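The effect is already visible with three terms. A parallel reduction that groups the additions differently behaves exactly like the re-parenthesized expression below (plain Python doubles, round-to-nearest):

```python
left_to_right = (0.1 + 0.2) + 0.3   # the order a serial loop would use
regrouped = 0.1 + (0.2 + 0.3)       # a grouping a parallel reduction might use

print(left_to_right)                # 0.6000000000000001
print(regrouped)                    # 0.6
print(left_to_right == regrouped)   # False
```

Neither result is wrong; each is the correctly rounded outcome of its own sequence of operations. They simply are not the same sequence.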

Optimizing compilers pose another challenge. In order to produce fast code, some compiler settings (default or explicit) may rearrange the computation to achieve more efficient stack management, faster memory references, and so on. It is clear from the previous discussion that these optimizations may have unwanted side effects. To aid the programmer, there are compiler switches that help control this behavior. For instance, the Intel C compiler has the -fp-model switch. By default it allows value-unsafe optimizations, that is, rearrangements that may change the numerical result. Levels can be specified ranging from very unsafe (‘fast’, optimizing for speed) to ‘strict’. Strict mode implies the following:

  • Value-safe optimizations
  • Floating-point exception semantics
  • Disabled FMA
  • Disabled vectorization of reductions (OpenMP and MPI reductions are not impacted)

We have seen through a concrete example that re-association, \(x + (y + z) \longrightarrow (x + y) + z\), is value-unsafe. Other value-unsafe operations include:

  • Zero folding: \(x + 0 \longrightarrow x\), \(x*0 \longrightarrow 0\)
  • Multiplication by reciprocal: \(x / y \longrightarrow x * (1/y)\)
  • Distributive law: \(x * (y + z) \longrightarrow x * y + x * z\)
  • Flush-to-zero (FTZ): substitute denormalized floating-point numbers by zero. FTZ allows denormalized numbers to be flushed to zero, but does not guarantee that they are
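Two of these rewrites can be caught red-handed in a few lines of Python (ordinary IEEE doubles; the examples are illustrative and not tied to any particular compiler):

```python
import math

# Zero folding: x + 0 -> x is value-unsafe because it changes
# the sign of a negative zero.
x = -0.0
print(math.copysign(1.0, x + 0.0))  # +1.0: (-0.0) + 0.0 is +0.0
print(math.copysign(1.0, x))        # -1.0: the "folded" version keeps -0.0

# Multiplication by reciprocal: x / y -> x * (1/y) introduces
# a second rounding (once for 1/y, once for the product).
print(10.0 / 3.0 == 10.0 * (1.0 / 3.0))  # False: off by one ulp
```

Folding \(x \cdot 0 \longrightarrow 0\) is unsafe for similar reasons: it gives the wrong answer when \(x\) is \(-1\) (the true product is \(-0.0\)), infinite, or NaN.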


It is up to the programmer to use the compiler switches judiciously to strike the right balance between accuracy and performance. The compiler user guide has the details; see for instance the floating-point optimizations offered by the Visual C++ compiler.
