Float or double?


Introduction

In scientific computation we use floating point numbers a lot. This article is a guide to picking the right floating point representation for you. In most programming languages there are two built-in precisions to pick from: 32-bit (single-precision) and 64-bit (double-precision). In the C family of languages these are known as float and double, and those are the names I will use in this article. There are other precisions: half, quad etc. I won’t cover these here, but a lot of the discussion makes sense for half vs float or double vs quad too. So to be clear: I will only talk about 32-bit and 64-bit IEEE 754 here.

This article is also written for those of you who have a lot of data. If you only need a few numbers here and there, just use double and don’t think more about it!

The article is split into two separate (but related) discussions: what to use for storing your data, and what to use during calculations on your data. Sometimes you will want to store your data with float but do calculations using double.

If you feel you need it, I have added a simple refresher on how floating point numbers work at the end of the article. Feel free to read that first, then come back here!

Precision of data

A 32-bit float has around 24 bits ≈ 7 decimal digits of precision, while a 64-bit double has 53 bits ≈ 16 decimal digits of precision. How much is that? Here are some rough figures on the worst-case precision you would get if you used float vs double to measure things of different ranges:

Scale                     float precision    double precision
Size of a room            micrometer         radius of a proton
Circumference of earth    2.4 meters         nanometer
Distance to the sun       10 km              width of a human hair
Length of a day           5 milliseconds     picosecond
Length of a century       three minutes      microsecond
Time since Big Bang       millennium         minute

(Example: using a double you can represent the time since the Big Bang with a precision of around one minute.)

So: if you are measuring the size of your apartment float will do fine. If you want to represent GPS coordinates with sub-meter precision, you will need double.
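
If you want to sanity-check figures like these yourself, here is a minimal C sketch (link with -lm). It prints the gap between adjacent representable values at Earth-circumference scale; the 40 075 000 m figure is rounded and only there for illustration:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float  earth_f = 40075000.0f; // circumference of Earth in meters (rounded)
    double earth_d = 40075000.0;

    // The distance to the next representable number is the best
    // precision you can hope for at this magnitude.
    printf("float  gap: %.9f m\n", nextafterf(earth_f, INFINITY) - earth_f); // ~4 m at this magnitude
    printf("double gap: %.9f m\n", nextafter(earth_d, INFINITY) - earth_d);  // ~7e-9 m, a few nanometers
    return 0;
}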

Why not always store everything with double?

If you have plenty of RAM and neither execution speed nor battery drain is a problem for you, then you can stop reading now and just use double. Goodbye and have a nice day!

If you are memory constrained, the reason to use float over double may be as simple as “it takes half as much space”. But even when memory is not an issue, storing your data with float may be substantially faster. As I said, double takes twice the space of float, so it will take twice as long to allocate, initialize and copy your data if you use double. Moreover, if you are reading your data in an unpredictable fashion (random access) you will get more cache misses with double, which will make it around 40% slower (based on the O(√N) rule of thumb and confirmed with benchmarks).
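
As a rough illustration of the memory cost, here is a sketch, not a rigorous benchmark; absolute timings will vary by machine, and the 100 million element count is just an arbitrary example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Times a single memcpy of the given number of bytes.
static double copy_seconds(void* dst, const void* src, size_t bytes)
{
    clock_t t0 = clock();
    memcpy(dst, src, bytes);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    const size_t n = 100 * 1000 * 1000; // 100 million values
    printf("float:  %zu MB\n", n * sizeof(float)  / (1024 * 1024)); // ~381 MB
    printf("double: %zu MB\n", n * sizeof(double) / (1024 * 1024)); // ~762 MB

    float*  fa = calloc(n, sizeof(float));
    float*  fb = malloc(n * sizeof(float));
    double* da = calloc(n, sizeof(double));
    double* db = malloc(n * sizeof(double));
    if (!fa || !fb || !da || !db) return 1;

    // The double copy moves twice the bytes, so expect roughly twice the time.
    printf("copy float:  %.3f s\n", copy_seconds(fb, fa, n * sizeof(float)));
    printf("copy double: %.3f s\n", copy_seconds(db, da, n * sizeof(double)));

    free(fa); free(fb); free(da); free(db);
    return 0;
}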

Performance impact of calculating with float vs double

If you have a well-tuned pipeline using SIMD, you will be able to do twice the number of FLOPS with float vs double. If not, the difference may be much smaller, but it is very dependent on your CPU. On Intel Haswell the difference between float and double is small, but on ARM Cortex-A9 the difference is big. A comprehensive benchmark can be found here.
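
To make the SIMD point concrete, here is a minimal sketch assuming an x86 CPU with AVX (compile with -mavx): a 256-bit register holds eight floats but only four doubles, so each vector instruction does twice as many float operations:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256  f = _mm256_set1_ps(1.5f); // 8 float lanes in one register
    __m256d d = _mm256_set1_pd(1.5);  // 4 double lanes in one register

    f = _mm256_add_ps(f, f); // 8 additions in a single instruction
    d = _mm256_add_pd(d, d); // 4 additions in a single instruction

    float fout[8]; double dout[4];
    _mm256_storeu_ps(fout, f);
    _mm256_storeu_pd(dout, d);
    printf("float lanes: %zu, double lanes: %zu\n",
           sizeof(__m256) / sizeof(float), sizeof(__m256d) / sizeof(double));
    printf("%f %f\n", fout[0], dout[0]); // 3.0 3.0
    return 0;
}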

Of course, if you store your data with double it makes little sense to do computation on them using float. After all, why store a lot of precision if you are not going to use it? The opposite is, however, not true: it can make a lot of sense to store your data with float but perform some or all computation on them with double.

When to do computation with increased precision

Even if you store your data with float there are places you will want to use double precision during actual calculations. Here’s a simple example in C:

float sum(float* values, long long count)
{
    float sum = 0; // float accumulator: precision is lost as the sum grows
    for (long long i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum;
}

If you try this code on 10 floats you will likely not see any precision problems. But if you run it on a million numbers, you most definitely will. The reason is that you will lose precision when you add a large and a small number together, and after summing a million numbers, you will probably be doing that. The rule of thumb is this: if you aggregate 10^N values, you will lose N digits of precision. So if you sum a thousand (10^3) numbers, you will lose 3 digits of precision. If you sum a million (10^6) numbers, you will lose 6 digits of precision (and float only has 7!). The solution is simple: perform the computation with double instead:

float sum(float* values, long long count)
{
    double sum = 0; // accumulate in double to keep the extra precision...
    for (long long i = 0; i < count; ++i) {
        sum += values[i];
    }
    return (float)sum; // ...and round only once, at the end
}

Most likely, this will run almost as fast as the first version, but it will do so without losing any precision. Note that you do not need to store your numbers as double to get the benefit of doing calculations on them with increased precision!
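
You can see the failure mode directly. In this sketch, a float accumulator stops growing at 2^24 = 16777216, because 1 is smaller than the gap between adjacent floats at that magnitude, while a double accumulator gets the exact answer:

#include <stdio.h>

int main(void)
{
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (long long i = 0; i < 20000000; ++i) {
        fsum += 1.0f; // adding 1 has no effect once fsum reaches 2^24
        dsum += 1.0f;
    }
    printf("float accumulator:  %.1f\n", fsum); // 16777216.0
    printf("double accumulator: %.1f\n", dsum); // 20000000.0
    return 0;
}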

Example

Let us say that you want to accurately measure some value, but your measuring device (which has some digital display) only gives you three significant digits. Measuring your variable ten times gives you this:

3.16, 3.15, 3.16, 3.18, 3.15, 3.11, 3.14, 3.11, 3.14, 3.15

To get a more accurate measure you decide to sum up your measurements and take the average. In this example I’ll use base-10 floating point numbers with exactly seven digits of precision (similar to that of a 32-bit float). With three significant digits in the data, that gives us four digits of extra precision:

3.160000 +
3.150000 +
3.160000 +
3.180000 +
3.150000 +
3.110000 +
3.140000 +
3.110000 +
3.140000 +
3.150000 =
31.45000

The sum now has four significant digits, with three digits to spare. What if we sum up a hundred of these values? Then we would get something like this:

314.4300

Still two digits to spare. What if we sum up a thousand values?

3140.890

How about ten thousand?

31412.87

So far so good, but now we are using all digits of precision. Let’s keep adding numbers:

31412.87 +
    3.11 =
31415.98

Notice how we shift down the smaller number to get the decimal points to align. We now have no extra precision in the numbers, and we are dangerously close to losing precision. What if we had summed a hundred thousand values? Then adding new values would look like this:

314155.6  +
     3.12 =
314158.7

Notice how the last significant digit in the data (the 2 in 3.12) is lost. This is where the loss of precision gets really bad, as we are now consistently ignoring the last digit of precision in our data. We can see that the problem started somewhere after summing ten thousand values, but before a hundred thousand. We had seven digits of precision, and our measurements had three significant digits. That left four digits = four orders of magnitude as a “numerical buffer”. Thus we could safely sum four orders of magnitude = 10000 data points without loss of precision, but any more is a problem. Thus the rule is:

If your floating point numbers have P digits of precision (7 for float, 16 for double) and your data has S significant digits, you have P-S digits of wiggle room, and can sum 10^(P-S) values without precision problems. So, had we used 16 digits of precision instead of 7, we could have summed 10^(16-3) = 10 000 000 000 000 values without any precision problems.
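
Here is a small C sketch of the rule, using made-up measurements with three significant digits (values between 3.11 and 3.18, as in the example above). Summing 10^5 of them, one order of magnitude past the safe limit for float, typically gives a visible drift from the double result:

#include <stdio.h>

int main(void)
{
    const int n = 100000; // 10^5 values: past the 10^4 safe limit for float
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < n; ++i) {
        float v = 3.11f + 0.01f * (i % 8); // fake measurements, 3.11 .. 3.18
        fsum += v;
        dsum += v;
    }
    printf("float sum:  %.2f\n", fsum);
    printf("double sum: %.2f\n", dsum);
    printf("difference: %.2f\n", dsum - (double)fsum); // typically off by a few units
    return 0;
}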

(There are numerically stable ways to sum many values, such as Kahan summation. However, simply switching the accumulator from float to double is much simpler, and probably faster.)
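
For reference, here is a sketch of Kahan (compensated) summation, which carries the rounding error of each addition along in a separate compensation variable:

float kahan_sum(const float* values, long long count)
{
    float sum = 0.0f;
    float c   = 0.0f;            // running compensation for lost low-order bits
    for (long long i = 0; i < count; ++i) {
        float y = values[i] - c; // subtract the previously lost error
        float t = sum + y;       // low-order bits of y may be lost here...
        c = (t - sum) - y;       // ...but this algebraically recovers them
        sum = t;
    }
    return sum;
}

Note that aggressive compiler optimizations such as -ffast-math can rewrite the compensation away, so this technique requires some care; a double accumulator has no such caveat.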

Conclusions

  • Don’t use more precision than you need when storing data.
  • If you sum a lot of data, switch to a higher precision accumulator.

Appendix: What is a floating point number?

I find a lot of people don’t really grok floating point numbers, so it is worth explaining them in brief. I will skip the nitty-gritty details of bits, infinities, NaNs and subnormals, and instead show a few examples of floating point numbers in base-10. Everything works the same way in binary.

Here are a few examples of floating point numbers, all with 7 decimals of precision (close to that of a 32-bit float).

1.875545 · 10^-18 = 0.000 000 000 000 000 001 875 545

3.141593 · 10^0 = 3.141593

2.997925 · 10^8 = 299 792 500

6.022141 · 10^23 = 602 214 100 000 000 000 000 000

The first factor in each of these numbers (1.875545, 3.141593, …) is called the mantissa, and the power of ten (-18, 0, 8, 23) is called the exponent. In short, the precision goes in the mantissa, and the magnitude goes in the exponent. So how do we do operations on them? Well, multiplication is simple: multiply the mantissas, and add the exponents:

1.111111 · 10^42 · 2.000000 · 10^7

= (1.111111 · 2.000000) · 10^(42 + 7)

= 2.222222 · 10^49

Addition is more tricky: to add two numbers of different magnitude, we first need to shift the smaller of the two numbers so that the decimal points match:

3.141593 · 10^0 + 1.111111 · 10^-3 =

3.141593 + 0.0001111111 =

3.141593 + 0.000111 =

3.141704

Notice how we shift out some of the significant decimals to make the points align. In other words, we lose precision when adding numbers of different magnitudes.
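
The same thing happens with real binary floats. A tiny C sketch (the exact printed digits may vary slightly with how your libc rounds the decimal representation):

#include <stdio.h>

int main(void)
{
    float big   = 3.141593f;
    float small = 0.0001111111f;
    // The low-order digits of 'small' fall off the end of the 24-bit
    // mantissa when the two numbers are aligned for addition.
    printf("%.7f\n", big + small); // ~3.1417041
    return 0;
}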

Original source – www.ilikebigbits.com
