## Introduction

In scientific computation we use floating point numbers a lot. This article is a guide to picking the *right* floating point representation for you. In most programming languages there are two built-in precisions to pick from: 32-bit (single-precision) and 64-bit (double-precision). In the C family of languages these are known as `float`

and `double`

, and those are the names I will use in this article. There are other precisions: `half`

, `quad`

etc. I won’t cover these here, but a lot of the discussion makes sense for `half`

vs `float`

or `double`

vs `quad`

too. So to be clear: I will only talk about 32-bit and 64-bit IEEE 754 here.

This article is also written for you who have a lot of data. If you need a few numbers here and there, just use `double`

and don’t think more about it!

The article is split into two separate (but related) discussions: what to use for *storing* you data, and what to use during *calculations* on your data. Sometimes you want to store your data with `float`

but do calculations using `double`

.

If you feel you need it, I have added a simple refresher on how floating point numbers work at the end of the article. Feel free to read that first, then come back here!

## Precision of data

32-bit floats has around 24 bits ≈ 7 digits of precision, while double has 53 bits ≈ 16 decimals of precision. How much is that? Here are some rough figures on the worst-case precision you would get if you used `float`

vs `double`

to measure things of different ranges:

Scale | `float` precision |
`double` precision |
---|---|---|

Size of a room | micrometer | radius of a proton |

Circumference of earth | 2.4 meters | nanometer |

Distance to the sun | 10 km | width of a human hair |

Length of a day | 5 milliseconds | picosecond |

Length of a century | three minutes | microsecond |

Time since Big Bang | millennium | minute |

(example: using a `double`

you can represent the time since the big bang with a precision of around one minute).

So: if you are measuring the size of your apartment `float`

will do fine. If you want to represent GPS coordinates with sub-meter precision, you will need `double`

.

## Why not always store everything with `double`

?

If you have plenty of RAM and neither execution speed or battery drainage is a problem for you – you can stop reading now and just use `double`

. Good bye and have a nice day!

If you are memory constrained, then the reason to use `float`

over `double`

might be as simple as “it takes half as much space”. But even if memory is not an issue, storing your data with `float`

may be substantially faster. As I said, `double`

takes twice the space over `float`

, so that means it will take twice as long to allocate, initialize and copy your data if you use `double`

. Moreover, if you are reading your data in an unpredictable fashion (random access) you will get increased cache misses with `double`

, which will make it around 40% slower (based on the O(√N) rule of thumb and confirmed with benchmarks).

## Perfomance impact of calculating using `float`

vs `double`

If you have a well-trimmed pipeline using SIMD, you will be able to do twice the number of FLOPS with `float`

vs `double`

. If not, the difference might be much smaller, but it is very dependent on your CPU. On Intel Haswell the difference between `float`

and `double`

is small, but on ARM Cortex-A9 the difference is big. A comprehensive benchmark can be found here.

Of course, if you store your data with `double`

it makes little sense to do computation on them using `float`

. After all, why store a lot of precision if you are not going to use it? The opposite is, however, not true: it can make a lot of sense to store your data with `float`

but perform some or all computation on them with `double`

.

## When to do computation with increased precision.

Even if you store your data with `float`

there are places you will want to use `double`

precision during actual calculations. Here’s a simple example in C:

float sum(float* values, long long count) { float sum = 0; for (long long i = 0; i < count; ++i) { sum += values[i]; } return sum; }

If you try this code on 10 floats you will likely not see any precision problems. But if you run it on a million numbers, you most definitely will. The reason is that you will lose precision when you add a large and a small number together, and after summing a million numbers, you will probably be doing that. The rule of thumb is this: if you aggregate *10^N* values, you will lose *N* digits of precision. So if you sum a thousand (*10^3*) numbers, you will lose 3 digits of precision. If you sum a million (*10^6*) numbers, you will lose 6 digits of precision (and float only has 7!). The solution is simple: perform the computation with `double`

instead:

float sum(float* values, long long count) { double sum = 0; for (long long i = 0; i < count; ++i) { sum += values[i]; } return (float)sum; }

Most likely, this will run almost as fast as the first case, but will do so without losing you any precision. Note that you do not need to store your numbers as `double`

to get the benefit of doing calculations on them with increased precision!

### Example

Let us say that you want to accurately measure some value, but your measuring device (which has some digital display) only gives you three significant digits. Measuring your variable ten times gives you this:

3.16, 3.15, 3.16, 3.18, 3.15, 3.11, 3.14, 3.11, 3.14, 3.15

To get a more accurate measure you decide to sum up your measurements to get an average. In this example I’ll use a base 10 floating point number with exactly seven digits of precision (similar to that of a 32-bit `float`

). With three significant digits, that gives us four digits of extra precision:

3.160000 + 3.150000 + 3.160000 + 3.180000 + 3.150000 + 3.110000 + 3.140000 + 3.110000 + 3.140000 + 3.150000 = 31.45000

The sum now has four significant digits, with three digits to spare. What if we sum up a hundred of these values? Then we would get something like this:

314.4300

Still two digits to spare. If we sum up a thousand digits?

3140.890

How about ten thousand?

31412.87

So far so good, but now we are using all digits of precision. Let’s keep adding numbers:

31412.87 + 3.11 = 31415.98

Notice how we shift down the smaller number to get the decimal points to align. We now have no extra precision in the numbers, and we are dangerously close to losing precision. What if we had summed a hundred thousand values? Then adding new values would look like this:

314155.6 + 3.12 = 314158.7

Notice how the last significant digit in the data (the *2* in *3.12*) is lost. This is now where the loss of precision is really bad, as we are now consistently ignoring the last digits of precision in our data. We can see that the problem started after summing ten thousands digits, but before a hundred thousand. We had seven digits of precision, and our measurements had three significant digits. That left four digits = four orders of magnitude as a “numerical buffer”. Thus we could safely add four orders of magnitude = *10000* of data points without loss of precision, but more is a problem. Thus the rule is:

If your floating points number has *P* digits (*7* for `float`

, *16* for `double`

) of precision and your data has *S* digits of significance, you can have *P-S* digits of wiggle room, and can sum *10^(P-S)* values without precision problems. So, had we used 16 digits of precision instead of 7, we could have summed *10^(16-3) = 10 000 000 000 000* values without any precision problems.

(There are numerically stable ways to sum many values. However, simply switching from `float`

to `double`

is much simpler, and probably faster).

## Conclusions

- Don’t use more precision than you need when storing data.
- If you sum a lot of data, switch to a higher precision accumulator.

## Appendix: What is a floating point number?

I find a lot of people don’t really grok floating point numbers, so it is worth just explaining them in brief. I will skip the nitty-gritty details of bits, infs, nans and subnormals and instead show a few examples of floating point numbers in base-10. Everything applies the same to binary.

Here are a few examples of floating point numbers, all with 7 decimals of precision (close to that of a 32-bit `float`

).

**1.875545** · 10^*-18* = 0.000 000 000 000 000 00**1 875 545**

**3.141593** · 10^*0* = **3.141593**

**2.997925** · 10^*8* = **299 792 5**00

**6.022141** · 10^*23* = **602 214 1**00 000 000 000 000 000

The part in **bold** is called the mantissa, and the part in *italics* is called the exponent. In short, the precision goes in the mantissa, and the magnitude goes in the exponent. So how do we do operations on them? Well, multiplication is simple: multiply the mantissas, and add the exponents:

**1.111111** · 10^*42* · **2.000000** · 10^*7*

= (**1.111111** · **2.000000**) · 10^*(42 + 7)*

= **2.222222** · 10^*49*

Addition is more tricky: to add two numbers of different magnitude, we first need to shift the smaller of the two numbers so that the decimal points match:

**3.141593** · 10^*0* + **1.111111** · 10^*-3* =

**3.141593** + 0.000**1111111** =

**3.141593** + **0.000111** =

**3.141704**

Notice how we shift out some of the significant decimals to make the points align. In other words, we lose precision when adding numbers of different magnitudes.

*Original source – www.ilikebigbits.com*