Undefined behavior is closer than you think

This time it’s hard to give an example from a real application. Nevertheless, I quite often see suspicious code fragments which can lead to the problems described below. This error is possible when working with large array sizes, so I don’t know exactly which project might have arrays of this size. We don’t really collect 64-bit errors, so today’s example is simply contrived.

Let’s have a look at a synthetic code example:

size_t Count = 1024*1024*1024; // 1 GB
if (is64bit)
  Count *= 5; // 5 GB
char *array = (char *)malloc(Count);
memset(array, 0, Count);

int index = 0;
for (size_t i = 0; i != Count; i++)
  array[index++] = char(i) | 1;

if (array[Count - 1] == 0)
  printf("The last array element contains 0.\n");

free(array);

Explanation

This code works correctly if you build a 32-bit version of the program; if we compile the 64-bit version, the situation will be more complicated.

A 64-bit program allocates a 5 GB buffer and initially fills it with zeros. The loop then modifies it, filling it with non-zero values: we use “| 1” to ensure this.
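
Why does “| 1” guarantee a non-zero byte? OR-ing with 1 always sets the lowest bit, so the result can't be 0 whatever char(i) holds. Here is a minimal sketch that checks it for all 256 byte values. The helper name all_bytes_nonzero is invented for this illustration and isn't part of the original program.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if char(i) | 1 is non-zero
// for every possible low byte of i.
bool all_bytes_nonzero()
{
  for (std::size_t i = 0; i != 256; i++)
  {
    char value = char(i) | 1;  // the lowest bit is always set
    if (value == 0)
      return false;
  }
  return true;
}
```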

Now try to guess how the code will behave when it is compiled in x64 mode with Visual Studio 2015. Have you got an answer? If so, let’s continue.

If you run a debug version of this program, it’ll crash because it’ll index out of bounds. At some point the index variable will overflow, and its value will become -2147483648 (INT_MIN).

Sounds logical, right? Nothing of the kind! This is undefined behavior, and anything can happen.
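
To make the distinction concrete: unsigned arithmetic is defined to wrap around, while incrementing an int that already holds INT_MAX has no defined result at all. The sketch below shows one possible way to stay on the defined side. Both function names (checked_increment, wrapped_increment) are made up for this illustration.

```cpp
#include <climits>

// Hypothetical helper: increments 'value' only if that cannot overflow.
// Returns false (and leaves 'value' untouched) when value == INT_MAX.
bool checked_increment(int &value)
{
  if (value == INT_MAX)
    return false;   // value + 1 would be undefined behavior
  ++value;
  return true;
}

// Unsigned arithmetic is well defined: UINT_MAX + 1 wraps around to 0.
unsigned wrapped_increment(unsigned value)
{
  return value + 1u;
}
```

The check happens before the increment, so the overflow never takes place. Testing the result afterwards would be too late, because by then the undefined operation has already been executed.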

To get more in-depth information, I suggest the following links:

An interesting thing – when I or somebody else says that this is an example of undefined behavior, people start grumbling. I don’t know why, but it feels like they assume that they know absolutely everything about C++, and how compilers work.

But in fact they aren’t really aware of it. If they were, they wouldn’t say something like this (group opinion):

This is some theoretical nonsense. Well, yes, formally the ‘int’ overflow leads to undefined behavior. But it’s nothing more than jabbering. In practice, we can always tell what we will get. If you add 1 to INT_MAX then we’ll have INT_MIN. Maybe somewhere in the universe there are some exotic architectures, but my Visual C++ / GCC compiler gives the expected result.

And now, without any magic, I will demonstrate UB using a simple example, and not on some fairy-tale architecture either, but in a Win64 program.

It is enough to build the example given above in Release mode and run it. The program will stop crashing, and the message “The last array element contains 0” won’t be printed.

The undefined behavior reveals itself in the following way. The array will be completely filled, in spite of the fact that the index variable of int type isn’t wide enough to index all the array elements. Those who still don’t believe me should have a look at the assembly code:

  int index = 0;
  for (size_t i = 0; i != Count; i++)
000000013F6D102D  xor         ecx,ecx  
000000013F6D102F  nop  
    array[index++] = char(i) | 1;
000000013F6D1030  movzx       edx,cl  
000000013F6D1033  or          dl,1  
000000013F6D1036  mov         byte ptr [rcx+rbx],dl  
000000013F6D1039  inc         rcx  
000000013F6D103C  cmp         rcx,rdi  
000000013F6D103F  jne         main+30h (013F6D1030h)

Here is the UB! And no exotic compilers were used, it’s just VS2015.
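
One way to read the listing: the compiler may assume that index never overflows, so index and i always hold the same value, and a single 64-bit register (rcx) serves as both. In source form, the Release loop behaves roughly like the sketch below. This is a reading of the disassembly, not a description of the optimizer's internals, and fill_merged is a made-up name.

```cpp
#include <cstddef>

// Roughly what the Release build executes: one 64-bit counter plays
// the role of both 'i' and 'index' from the original loop.
void fill_merged(char *array, std::size_t count)
{
  for (std::size_t i = 0; i != count; i++)
    array[i] = char(i) | 1;
}
```

Note that the 32-bit index variable is gone altogether, which is why the whole 5 GB buffer gets filled.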

If you replace int with unsigned, the undefined behavior will disappear. The array will only be partially filled, and at the end we will have a message – “the last array element contains 0”.

Assembly code with the unsigned:

  unsigned index = 0;
000000013F07102D  xor         r9d,r9d  
  for (size_t i = 0; i != Count; i++)
000000013F071030  mov         ecx,r9d  
000000013F071033  nop         dword ptr [rax]  
000000013F071037  nop         word ptr [rax+rax]  
    array[index++] = char(i) | 1;
000000013F071040  movzx       r8d,cl  
000000013F071044  mov         edx,r9d  
000000013F071047  or          r8b,1  
000000013F07104B  inc         r9d  
000000013F07104E  inc         rcx  
000000013F071051  mov         byte ptr [rdx+rbx],r8b  
000000013F071055  cmp         rcx,rdi  
000000013F071058  jne         main+40h (013F071040h)
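
Here r9d is a separate 32-bit counter that wraps around, which is exactly what the language requires for unsigned. To watch the same effect without allocating 5 GB, here is a scaled-down sketch: an 8-bit unsigned index over a 300-element buffer stands in for the 32-bit index over the big array. The function name and the sizes are chosen purely for illustration.

```cpp
#include <cstddef>
#include <vector>

// Scaled-down analogue of the unsigned version of the loop.
// Returns the number of elements that were never written (still zero).
std::size_t count_untouched(std::size_t count)
{
  std::vector<char> array(count, 0);
  unsigned char index = 0;               // wraps from 255 back to 0
  for (std::size_t i = 0; i != count; i++)
    array[index++] = char(i) | 1;

  std::size_t zeros = 0;
  for (char c : array)
    if (c == 0)
      zeros++;
  return zeros;
}
```

With count equal to 300, the index wraps after 256 writes and starts overwriting the beginning of the buffer, so the last 44 elements stay zero. That is the small-scale version of “the last array element contains 0”.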

Correct code

You must use proper data types for your programs to run properly. If you are going to work with large arrays, forget about int and unsigned. The proper types are ptrdiff_t, intptr_t, size_t, DWORD_PTR, std::vector::size_type and so on. In this case it is size_t:

size_t index = 0;
for (size_t i = 0; i != Count; i++)
  array[index++] = char(i) | 1;
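
For those who want to try the corrected loop on a small buffer, here is a self-contained sketch. The wrapper fill_and_check and the allocation check are additions for the sake of the example and don't appear in the original program.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical wrapper around the corrected loop.
// Returns true if the last element was overwritten with a non-zero value.
bool fill_and_check(std::size_t count)
{
  if (count == 0)
    return false;                 // nothing to fill
  char *array = static_cast<char *>(std::malloc(count));
  if (array == nullptr)
    return false;                 // allocation failed
  std::memset(array, 0, count);

  std::size_t index = 0;          // wide enough for any object size
  for (std::size_t i = 0; i != count; i++)
    array[index++] = char(i) | 1;

  bool last_is_set = array[count - 1] != 0;
  std::free(array);
  return last_is_set;
}
```

Since index and Count now have the same type, the index can reach every element the allocation can hold, in 32-bit and 64-bit builds alike.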

Recommendation

If the C/C++ language rules say that something is undefined behavior, don’t argue with them or try to predict how the code will behave in the future. Just don’t write such dangerous code.

There are a whole lot of stubborn programmers who don’t want to see anything suspicious in shifting negative numbers, comparing this with null or signed types overflowing.

Don’t be like that. The fact that the program is working now doesn’t mean that everything is fine. The way UB will reveal itself is impossible to predict. Expected program behavior is one of the variants of UB.

Written by Andrey Karpov.
This error was found with the PVS-Studio static analysis tool.
