# Floating-point Numbers Aren't Real


Current revision (04:48, 8 August 2009)

## Current revision

Floating-point numbers are not "real numbers" in the mathematical sense, even though they are called *real* in some programming languages, such as Pascal and Fortran. Real numbers have infinite precision and are therefore continuous and non-lossy; floating-point numbers have limited precision, so they are finite, and they resemble "badly-behaved" integers, because they're not evenly spaced throughout their range.

To illustrate, assign 2147483647 (the largest signed 32-bit integer) to a 32-bit `float` variable (`x`, say), and print it. You'll see 2147483648. Now print `x - 64`. Still 2147483648. Now print `x - 65` and you'll get 2147483520! Why? Because the spacing between adjacent floats in that range is 128, and floating-point operations round to the nearest floating-point number.
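The experiment above can be reproduced with a minimal Python sketch. Python's built-in `float` is a 64-bit double, so this example assumes a small helper, `to_float32` (a name made up for illustration), that round-trips a value through the standard `struct` module to mimic a 32-bit `float`.

```python
import struct

def to_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    # Packing with format "f" stores the value as a 32-bit float;
    # unpacking returns that rounded value as a Python float.
    return struct.unpack("f", struct.pack("f", value))[0]

x = to_float32(2147483647)
print(int(x))                    # 2147483648
print(int(to_float32(x - 64)))   # 2147483648 (a tie, rounded to the even neighbor)
print(int(to_float32(x - 65)))   # 2147483520
```

Each subtraction is exact in double precision, so rounding the result once to single precision gives the same value a 32-bit `float` subtraction would.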

IEEE floating-point numbers are fixed-precision numbers based on base-two scientific notation: $1.d_1 d_2 \ldots d_{p-1} \times 2^e$, where *p* is the precision (24 for `float`, 53 for `double`). The spacing between two consecutive numbers is $2^{1-p+e}$, which can be safely approximated by $\varepsilon|x|$, where $\varepsilon$ is the *machine epsilon* ($2^{1-p}$).
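As a rough check of the spacing formula, the following Python sketch computes `2**(1 - p + e)` directly and compares it with `math.ulp`, which needs Python 3.9 or later. The function name `spacing` is an assumption made for this example, not a standard API.

```python
import math
import sys

def spacing(x, p=53):
    """Gap 2**(1 - p + e) between neighbors of x, where 2**e <= |x| < 2**(e + 1)."""
    _, exp = math.frexp(abs(x))   # abs(x) == m * 2**exp with 0.5 <= m < 1
    e = exp - 1                   # exponent for a significand in [1, 2)
    return 2.0 ** (1 - p + e)

eps = sys.float_info.epsilon      # 2**-52, that is 2**(1 - p) for p = 53

for x in (1.0, 1000.0, 2147483647.0, 1e100):
    gap = spacing(x)
    # The gap matches math.ulp(x) and never exceeds eps * |x|.
    print(x, gap, gap == math.ulp(x), gap <= eps * abs(x))

# With p = 24 the same formula gives the float spacing just below 2**31.
print(spacing(2147483520.0, p=24))   # 128.0
```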

Knowing the spacing in the neighborhood of a floating-point number can help you avoid classic numerical blunders. For example, if you're performing an iterative calculation, such as searching for the root of an equation, there's no sense in asking for greater precision than the number system can give in the neighborhood of the answer. Make sure that the tolerance you request is no smaller than the spacing there; otherwise you'll loop forever.
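One way to apply this advice is sketched below: a bisection search that never lets the requested tolerance drop below the local spacing. It is only an illustration under stated assumptions: `bisect_root` is a hypothetical helper, `f(lo)` and `f(hi)` must have opposite signs, and `math.ulp` requires Python 3.9 or later.

```python
import math

def bisect_root(f, lo, hi, tol=0.0):
    """Bisection on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while True:
        mid = lo + (hi - lo) / 2
        # The interval cannot usefully shrink below the spacing of doubles here.
        floor = 2 * max(math.ulp(lo), math.ulp(hi))
        if hi - lo <= max(tol, floor):
            return mid
        f_mid = f(mid)
        if f_mid == 0.0:
            return mid
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half

# An absurdly small tolerance still terminates because of the spacing floor.
root = bisect_root(lambda x: x * x - 2.0, 1.0, 2.0, tol=1e-300)
print(root, math.sqrt(2.0))
```

A loop written as `while hi - lo > tol` with the same tolerance would never exit, because once `lo` and `hi` are adjacent doubles the midpoint rounds to one of them and the interval stops shrinking.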

Since floating-point numbers are approximations of real numbers, there is inevitably a little error present. This error, called *roundoff*, can lead to surprising results. When you subtract nearly equal numbers, for example, the most significant digits cancel each other out, so what was the least significant digit (where the roundoff error resides) gets promoted to the most significant position in the floating-point result, essentially contaminating any further related computations (a phenomenon known as *smearing*). You need to look closely at your algorithms to prevent such *catastrophic cancellation*. To illustrate, consider solving the equation $x^2 - 100000x + 1 = 0$ with the quadratic formula. Here $b = -100000$, so $-b$ and $\sqrt{b^2 - 4}$ are nearly equal, and the subtraction $-b - \sqrt{b^2 - 4}$ cancels almost every significant digit. You can instead compute the root $r_1 = (-b + \sqrt{b^2 - 4})/2$, where the two operands are added, and then obtain $r_2 = 1/r_1$, since for any quadratic equation $ax^2 + bx + c = 0$ the roots satisfy $r_1 r_2 = c/a$.

Smearing can occur in even more subtle ways. Suppose a library naively computes $e^x$ by the formula $1 + x + x^2/2 + x^3/3! + \cdots$. This works fine for positive *x*, but consider what happens when *x* is a large negative number. The terms alternate in sign and grow very large before they shrink, so the sum depends on near-total cancellation among huge numbers. The problem here is that the roundoff in the large terms is in a digit position of much greater significance than the true answer, so the computed result can be wrong by many orders of magnitude. The solution here is also simple: for negative *x*, compute $e^x = 1/e^{|x|}$.

It should go without saying that you shouldn't use floating-point numbers for financial applications; that's what decimal classes in languages like Python and C# are for. Floating-point numbers are intended for efficient scientific computation. But efficiency is worthless without accuracy, so remember the source of rounding errors and code accordingly!
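The three cautions above (cancellation in the quadratic formula, smearing in a naive series for the exponential, and binary floats for money) can be sketched in one short Python script. The helper names `quadratic_roots`, `exp_series`, and `exp_stable` are made up for this illustration, and the code assumes real, nonzero roots; it is a sketch, not a production routine.

```python
import math
from decimal import Decimal

def quadratic_roots(a, b, c):
    """Roots of a*x**2 + b*x + c = 0 (assumes a != 0, c != 0, real roots)."""
    disc = math.sqrt(b * b - 4 * a * c)
    # Give the square root the sign of -b so the two operands are added.
    r1 = (-b + math.copysign(disc, -b)) / (2 * a)
    r2 = c / (a * r1)              # from r1 * r2 = c / a
    return r1, r2

def exp_series(x, terms=200):
    """Naive Taylor sum 1 + x + x**2/2! + ... (illustration only)."""
    total = term = 1.0
    for n in range(1, terms):
        term *= x / n              # next term: x**n / n!
        total += term
    return total

def exp_stable(x):
    """For negative x, sum the series at |x| and take the reciprocal."""
    return 1.0 / exp_series(-x) if x < 0 else exp_series(x)

a, b, c = 1.0, -100000.0, 1.0
naive_small = (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)  # cancels
big, small = quadratic_roots(a, b, c)
print(naive_small, small)          # the two values agree to only about 7 digits

print(exp_series(-30.0), exp_stable(-30.0), math.exp(-30.0))

print(0.1 + 0.2 == 0.3)                                      # False
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True
```

Comparing `exp_series(-30.0)` with `math.exp(-30.0)` shows how far the naive sum drifts, while `exp_stable(-30.0)` stays close to the library value.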

By Chuck Allison