Floating-point arithmetic – Part 1

This three-part series covers some basic properties of floating-point arithmetic. Part 1 covers the basics of floating-point numbers: how they are represented, how they are stored and how certain special values are encoded. Part 2 presents a simplified floating-point system that illustrates one of the key issues with floating-point numbers, namely that the result of adding several floating-point numbers depends on the order in which the additions are carried out. Part 3 presents some real cases.

Floating-point number basics

Real numbers are ubiquitous in scientific computing. In general they cannot be represented exactly on a computer. Instead, they are approximated by floating-point numbers. In this presentation we will use the terminology and nomenclature set forth in the IEEE 754 Standard. The term ‘floating’ refers to the fact that the radix point can float: its position relative to the significant digits is set by an exponent, so the number of digits after the point can vary. This is in contrast to fixed-point numbers, which are sometimes used in accounting.
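As a small illustration of the claim that real numbers are generally not representable exactly, the following sketch prints the exact value that is stored when the decimal number 0.1 is written in Python. It assumes that Python's float is an IEEE 754 double precision number, which holds on mainstream platforms but is not guaranteed by the language.

```python
from decimal import Decimal
from fractions import Fraction

x = 0.1                      # the nearest double precision number to 1/10
exact = Fraction(x)          # exact rational value of the stored number

print(exact)                 # a fraction whose denominator is a power of two
print(Decimal(x))            # the same value written out in decimal
print(exact == Fraction(1, 10))   # False: 1/10 itself is not representable
```

The denominator is a power of two because every finite floating-point value is an integer multiple of a power of two, and 1/10 is not.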

Definition of floating-point numbers

Floating-point numbers represent real numbers of the following form:
\[x = (-1)^s2^E\sum_{i=0}^{p-1}b_i2^{-i} \equiv (-1)^s2^E(b_0\cdot b_1b_2\cdots b_{p-1})\]
where
\[\begin{array}{ll} s & = 0 \mbox{ or } 1 \\
E & = j, \mbox{  } E_{\min} \leq j \leq E_{\max} \\
b_i & = 0 \mbox{ or } 1, \mbox{  } i = 0, \ldots, p-1
\end{array}\]

It should be emphasized that the above floating-point definition states that \((-1)^s2^E(b_0\cdot b_1b_2\cdots b_{p-1})\) is the floating-point number whose value is given by
\[(-1)^s2^E\sum_{i=0}^{p-1}b_i2^{-i}\]

The value of a floating-point number is always a real number. We will use the same symbol to denote both the floating-point number and its corresponding value. It is important to keep in mind that floating-point numbers are subject to floating-point arithmetic, whereas values obey the rules of standard arithmetic.
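The distinction between floating-point arithmetic and the arithmetic of values can be made concrete with a minimal sketch. It assumes Python floats are IEEE 754 doubles (p = 53) with the default round-to-nearest mode, and it uses the standard fractions module to carry out exact arithmetic on the values.

```python
from fractions import Fraction

a = 1.0
b = 2.0 ** -53               # exactly representable, half the spacing of doubles at 1.0

float_sum = a + b            # floating-point addition: the result is rounded
exact_sum = Fraction(a) + Fraction(b)   # standard arithmetic on the values

print(float_sum == 1.0)      # True: the rounded sum is 1.0 again
print(exact_sum > 1)         # True: the exact sum of the two values exceeds 1
```

Both operands are floating-point numbers, but the exact sum of their values is not, so floating-point addition has to return a nearby floating-point number instead.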

Representation of floating-point numbers

From the previous definition it is evident that the space of floating-point numbers is completely determined by the three quantities \(p\), \(E_{\min}\) and \(E_{\max}\). For single and double precision floating-point numbers the values are given in the table below. Floating-point numbers could then be represented by storing \(s\), \(E\) and \(b_0\cdot b_1b_2\cdots b_{p-1}\) in 32-bit or 64-bit structures.

Parameter                 Single Precision    Double Precision
\(p\)                     24                  53
\(E_{\max}\)              127                 1023
\(E_{\min}\)              -126                -1022
Exponent width in bits    8                   11
Format width in bits      32                  64

In reality, however, the bit string \(b_0b_1 \cdots b_{p-1}\) (the “mantissa”) is normalized such that
\[b_0 = 1\]
Storing \(b_0\) is thus superfluous since it is always 1; only
\[f \equiv \cdot b_1b_2\cdots b_{p-1}\]
needs to be stored. The quantity \(f\) is known as the fraction. Furthermore, the exponent \(E\) is shifted so as to make it positive. This is accomplished by the introduction of a biased exponent
\[e \equiv E + \mbox{bias}\]
where \(\mbox{bias} = 127\) in case of single precision or \(\mbox{bias} = 1023\) for double precision.
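The stored quantities \(s\), \(e\) and \(f\) can be inspected directly. The sketch below assumes the usual single precision layout (1 sign bit, then 8 bits for \(e\), then 23 bits for \(f\), from most to least significant) and uses Python's standard struct module to obtain the 32 bits; the helper name fields32 is made up for this illustration.

```python
import struct

def fields32(x):
    """Round x to single precision and return its stored fields (s, e, f)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31             # sign bit
    e = (bits >> 23) & 0xFF    # 8-bit biased exponent, e = E + 127
    f = bits & 0x7FFFFF        # 23 stored fraction bits b_1 ... b_23
    return s, e, f

s, e, f = fields32(-6.5)       # -6.5 = -(1.101 in binary) * 2^2
print(s, e, e - 127, format(f, "023b"))
# 1 129 2 10100000000000000000000
```

The leading bit \(b_0 = 1\) of 1.101 does not appear in the output: only the bits after the binary point are stored, and the exponent \(E = 2\) appears as \(e = 129\).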

The format of single and double precision floating-point numbers is depicted in the figures below.


Floating-point number values

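The biased exponent \(e = E + \mbox{bias}\) of a normalized number takes the values 1 to 254 in single precision and 1 to 2046 in double precision, whereas the exponent field can hold the values 0 to 255 and 0 to 2047, respectively. Two values of \(e\) are therefore left over in each format: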

single precision:    \(e = 0\)    \(e = 255\)
double precision:    \(e = 0\)    \(e = 2047\)

These “extra” values are used to encode \(0\), \(\infty\), NaN (Not-a-Number) and denormalized floating-point numbers (\(b_0 = 0\)). The NaN value is used to signal invalid operations such as \(0/0\), \(0 \times \infty\) and \(\infty - \infty\), which are all undefined from a mathematical point of view.
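A short sketch of how NaN arises and behaves, assuming Python floats are IEEE 754 doubles. One caveat is specific to Python rather than to the standard: dividing a float by zero raises ZeroDivisionError instead of returning NaN or infinity, so \(0/0\) cannot be shown with the plain division operator.

```python
import math

inf = math.inf
nan = inf - inf              # invalid operation: the result is NaN

print(nan)                   # nan
print(0.0 * inf)             # nan
print(nan == nan)            # False: NaN compares unequal even to itself
print(math.isnan(nan))       # True: the reliable way to test for NaN

try:
    0.0 / 0.0
except ZeroDivisionError:
    print("Python raises ZeroDivisionError for 0.0 / 0.0")
```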

Let \(x\) be a single precision floating-point number.  Its value is then derived as

\[\begin{array}{ll}
x = \mbox{NaN} & \mbox{ if }e = 255, f \neq 0 \\
x = (-1)^s\infty & \mbox{ if }e = 255, f = 0 \\
x = (-1)^s2^{e-127}(1 + \sum^{23}_{i=1}f_i2^{-i}) & \mbox{ if }0 < e < 255 \\
x = (-1)^s2^{-126}(\sum^{23}_{i=1}f_i2^{-i}) & \mbox{ if }e = 0, f \neq 0 \\
x = 0 & \mbox{ if }e = 0, f = 0
\end{array}\]
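The case distinction above translates almost line by line into code. The following is a minimal sketch, not a reference implementation: value32 and hardware32 are names chosen for this illustration, the bit layout (sign, 8 exponent bits, 23 fraction bits) is assumed, and the results are held in Python floats, assumed to be IEEE 754 doubles, which can represent every single precision value exactly.

```python
import math
import struct

def value32(bits):
    """Value of the single precision number stored in the 32-bit integer bits."""
    s = bits >> 31               # sign bit
    e = (bits >> 23) & 0xFF      # biased exponent
    f = bits & 0x7FFFFF          # fraction bits f_1 ... f_23 as an integer
    sign = (-1.0) ** s
    frac = f / 2**23             # sum of f_i 2^-i, exact in double precision
    if e == 255:
        return math.nan if f != 0 else sign * math.inf
    if e == 0:
        return sign * 2.0**-126 * frac         # denormalized, or zero when f == 0
    return sign * 2.0**(e - 127) * (1 + frac)  # normalized

def hardware32(bits):
    """Let the struct module reinterpret the same 32 bits as a float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Compare the formula with the decoded value for a few bit patterns.
for bits in (0x3F800000, 0xC0D00000, 0x00000001, 0x7F7FFFFF, 0x7F800000):
    assert value32(bits) == hardware32(bits)

print(value32(0xC0D00000))       # -6.5
```

The NaN case is left out of the comparison loop because NaN never compares equal to anything; math.isnan has to be used for it instead.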

It follows immediately that the smallest and largest positive single precision floating-point numbers have the values
\[\begin{array}{l}
x_{\min} = 2^{-149} \\
x_{\max} = 2^{128}(1 - 2^{-24})
\end{array}\]
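For readers who want the intermediate steps, both values can be worked out from the case distinction for \(x\). The smallest positive number is the denormalized number with \(e = 0\) and only the last fraction bit set (\(f_{23} = 1\)):
\[x_{\min} = 2^{-126} \times 2^{-23} = 2^{-149}\]
The largest finite number has the largest exponent that is not reserved, \(e = 254\), and all fraction bits set (\(f_i = 1\) for \(i = 1, \ldots, 23\)). Using \(\sum_{i=1}^{23}2^{-i} = 1 - 2^{-23}\):
\[x_{\max} = 2^{254-127}(1 + \sum_{i=1}^{23}2^{-i}) = 2^{127}(2 - 2^{-23}) = 2^{128}(1 - 2^{-24})\]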

The value of a double precision floating-point number is computed analogously.
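As a hedged check of the double precision analogue, the same reasoning with 52 fraction bits and a bias of 1023 gives a smallest positive value of \(2^{-1074}\) and a largest finite value of \(2^{1024}(1 - 2^{-53})\). The sketch below assumes Python floats are IEEE 754 doubles; from_bits64 is a helper name invented here. The largest value is written as \(2^{1023}(2 - 2^{-52})\) in the code because \(2^{1024}\) itself is not representable.

```python
import math
import struct
import sys

def from_bits64(bits):
    """Reinterpret a 64-bit integer as a double precision number."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

x_min = from_bits64(0x0000000000000001)   # e = 0, only the last fraction bit set
x_max = from_bits64(0x7FEFFFFFFFFFFFFF)   # e = 2046, all 52 fraction bits set

print(x_min == math.ldexp(1.0, -1074))          # True: x_min = 2^-1074
print(x_max == 2.0**1023 * (2 - 2.0**-52))      # True: x_max = 2^1024 (1 - 2^-53)
print(x_max == sys.float_info.max)              # True
```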
