Monday, April 25, 2016

Dangling pointers and undefined behavior

My previous blog post "C pointers are not hardware pointers" says that dangling pointers may invoke undefined behavior even if they are not dereferenced, as in the example
void foo(int *p, int *q)
{
    free(p);
    if (p == q)  // Undefined behavior!
        bar();
}

I have received many questions and comments of the form "this cannot be right — the comparison must be valid, as p is not modified by free(p)", so this blog post provides more details.

That the behavior is undefined follows from C11 6.2.4 "Storage durations of objects":
The lifetime of an object is the portion of program execution during which storage is guaranteed to be reserved for it. An object exists, has a constant address, and retains its last-stored value throughout its lifetime. If an object is referred to outside of its lifetime, the behavior is undefined. The value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime.
and from 7.22.3 "Memory management functions", which says that free ends the lifetime of allocated objects:
The lifetime of an allocated object extends from the allocation until the deallocation.

The reason the standard makes this undefined is that even simple operations on dangling pointers, such as assignment or comparison, may misbehave and result in exceptions or arbitrary values on some architectures.1 The C99 rationale has an example of how this can happen:
Consider a hypothetical segmented architecture on which pointers comprise a segment descriptor and an offset. Suppose that segments are relatively small so that large arrays are allocated in multiple segments. While the segments are valid (allocated, mapped to real memory), the hardware, operating system, or C implementation can make these multiple segments behave like a single object: pointer arithmetic and relational operators use the defined mapping to impose the proper order on the elements of the array. Once the memory is deallocated, the mapping is no longer guaranteed to exist. Use of the segment descriptor might now cause an exception, or the hardware addressing logic might return meaningless data.
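The way to stay out of trouble is simply to order the operations so that the comparison is done while both pointers still point to live objects. A minimal sketch (the function name is made up for illustration):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Compare first, free afterwards: the comparison is evaluated while
   both objects are still within their lifetime, so it is well defined. */
bool compare_then_free(int *p, int *q)
{
    bool same = (p == q);
    free(p);
    return same;
}
```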

1. At least on architectures that existed when the first version of the standard was written.

Sunday, April 10, 2016

Why volatile is hard to specify and implement

Volatile in C

The main use case for volatile in C is memory mapped hardware where reads and writes may change the hardware state (even reads may change the state — it is common that e.g. a read of an interrupt register resets the value). This means that the compiler must not eliminate, introduce, or reorder memory accesses to such memory.
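A driver typically wraps such a register access in a small helper. A minimal sketch — the register name is made up, and for real hardware the pointer would come from the device's memory map:

```c
#include <stdint.h>

/* Read an interrupt status register.  The volatile qualifier forces
   the compiler to emit exactly one real load here -- it must not be
   cached, eliminated, or reordered relative to other volatile accesses. */
uint32_t read_irq_status(volatile uint32_t *reg)
{
    return *reg;  /* on some devices this read also clears pending bits */
}
```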

Real-world hardware has restrictions that make it hard to specify access rules in a way that works on all hardware. A few examples:
• The original DEC Alpha processor could only do 32-bit load/store operations, so access to an 8-bit variable must be done by reading 32 bits, modifying the value, and writing back the result. This means that writing one element in
volatile char a[4];

will read and write all the elements.
• Unaligned load/store operations are split into two operations on the bus.
• The hardware may have restrictions on how read-modify-write instructions are used for memory mapped hardware, so e.g. incrementing a volatile variable as a++ may need to be done as a load, an addition, and a store, even if the CPU has an instruction that increments a value in memory.
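The Alpha case can be illustrated in portable C — a sketch (the function is hypothetical, for illustration only) simulating how a byte store must be widened to a word-sized read-modify-write on a machine with only 32-bit load/store:

```c
#include <stdint.h>

/* Simulate storing one byte on a CPU that only has 32-bit load/store:
   the whole word is loaded, the byte is merged in, and the whole word
   is stored back -- so the neighboring bytes are read and written too. */
uint32_t store_byte_via_word(uint32_t word, unsigned byte_index,
                             uint8_t value)
{
    uint32_t shift = byte_index * 8;
    uint32_t mask = (uint32_t)0xff << shift;
    return (word & ~mask) | ((uint32_t)value << shift);
}
```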
The standard solves these problems by saying that accesses to volatile objects need to be done as in the source code, but it is up to the implementation to define what constitutes an "access". The text of this is in C11 6.7.3 "Type qualifiers":
An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. What constitutes an access to an object that has volatile-qualified type is implementation-defined.
This means that an implementation can do essentially anything for volatile as long as it can cover this by a definition of "access", but even a reasonable definition may produce surprises. For example, the text above talks about "at every sequence point". This means that the compiler is allowed to optimize
volatile int *p = somevalue;
int s = *p + *p;

as if it said
volatile int *p = somevalue;
int tmp = *p;
int s = tmp + tmp;

as both reads of *p are between sequence points. A similar case is when the value of an assignment expression is used, such as
volatile int a, b;
a = b = 0;

The standard does not say whether b is read, i.e. the compiler is allowed to compile this as
b = 0; a = 0;
or
b = 0; a = b;
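If the intent really is that *p is read twice, splitting the expression into separate statements puts a sequence point between the reads, so the compiler must perform both. A minimal runnable sketch (the helper is made up for illustration):

```c
/* Each full statement ends at a sequence point, so the compiler
   must load *p twice -- it cannot merge the reads as in *p + *p. */
int read_twice(volatile int *p)
{
    int tmp1 = *p;
    int tmp2 = *p;
    return tmp1 + tmp2;
}
```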

The examples above are probably not that common, but even trivial cases may be surprising. Consider the load *p in
volatile int *p = somevalue;
*p;

C++ compilers must not generate a load for this case (because of the C++ rules regarding lvalue-to-rvalue conversion). Many developers argue that C compilers must generate a load, but I'm not completely convinced. And, for example, the GCC 5.3.0 manual describes this as if the compiler has a choice:
According to the C standard, such an expression is an rvalue whose type is the unqualified version of its original type, i.e. int. Whether GCC interprets this as a read of the volatile object being pointed to or only as a request to evaluate the expression for its side-effects depends on this type.
If it is a scalar type, or on most targets an aggregate type whose only member object is of a scalar type, or a union type whose member objects are of scalar types, the expression is interpreted by GCC as a read of the volatile object; in the other cases, the expression is only evaluated for its side-effects.
Anyway, many C compilers do not generate the load, so you should not rely on this construct, regardless of what the right interpretation of the standard is.
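A more portable way to force the read is to actually use the value, for example by assigning it to a local variable — an initialization is unambiguously a read of the volatile object. A sketch (the helper name is made up):

```c
/* Assigning the loaded value to a local makes the access part of an
   initialization, which is unambiguously a read of *p -- unlike the
   bare statement "*p;", whose meaning compilers disagree about. */
int force_volatile_read(volatile int *p)
{
    int tmp = *p;
    return tmp;
}
```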

Using volatile to limit optimization

The C standard specifies that volatile accesses must not be reordered or removed, but CPUs have things like write buffers that reorder and remove writes. That functionality is usually disabled for the memory regions that contain memory mapped hardware, but it is enabled for the stack. This always confuses me when I see local variables declared volatile (when not needed by longjmp) — why is it OK for the CPU to reorder memory operations when the compiler must not do it? What did the developer expect to achieve by adding volatile?

The idea is usually to use volatile to disable compiler optimizations. A typical example is cryptographic code that wants to clear sensitive data from memory. The "obvious" way of clearing memory does not work
void getPassword(void) {
    char pwd[64];
    /* Checking of password, secure operations, etc. */
    memset(pwd, 0, sizeof(pwd));
}

as the compiler may optimize away the memset, and the usual suggestion is to change the memset to something like
volatile char *p = pwd;
for (size_t i = 0; i < sizeof(pwd); ++i)
    p[i] = 0;

This may generate store instructions, but it is not guaranteed to clear the memory! Consider for example a hypothetical architecture that invalidates the cache for the stack frame when popping the stack, instead of wasting power and memory bandwidth by writing the now useless data back to memory — the cache will be invalidated before the newly written zeros have reached main memory.

One other problem is that the compiler may have placed sensitive data in "anonymous" temporary space allocated on the stack (e.g. by spilling), in registers etc., as is discussed in "Zeroing buffers is insufficient":
I am absolutely certain that there is software out there which inadvertantly keeps an AES key sitting in an XMM register long after it has been wiped from memory.
It may still be prudent to try to clear the memory (and it does not hurt), but be aware that it is not guaranteed to help either...
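One common mitigation is to call memset through a volatile function pointer: the compiler cannot see which function is actually called, so it cannot remove the call. A sketch (the names are made up, and it is subject to the same caveats about copies in registers and spill slots):

```c
#include <string.h>

/* The volatile function pointer may change in ways unknown to the
   compiler, so the call cannot be proven dead and optimized away.
   This is still best effort only, for the reasons discussed above. */
static void *(*const volatile memset_func)(void *, int, size_t) = memset;

void secure_clear(void *p, size_t n)
{
    memset_func(p, 0, n);
}
```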

Volatile and real world compilers

Compilers tend to contain a lot of code handling special cases. There are several reasons for this, such as:
• Different operations can be optimized in different ways
• General optimizations are often expensive, so it makes sense to handle cheap cases separately in order to reduce the load on the general optimization pass, or to be able to do the cheaper cases more often
• The compiler may need different optimization strategies depending on surrounding code (e.g. depending on register pressure)
• Different target CPUs need different optimization strategies
It is easy to forget to check for "volatile" in some obscure corner case, and it is hard to test in a way that finds all such problems, so it is not that uncommon to find cases where volatile accesses are reordered or removed.

One example of this is discussed in OpenSSL pull request #102:
#include <stddef.h>

static void memset_s(void * v, size_t n) {
    volatile unsigned char * p = (volatile unsigned char *)v;
    for (size_t i = 0; i < n; ++i) {
        p[i] = 0;
    }
}

void f1() {
    unsigned char x[4];
    memset_s(x, sizeof x);
}

void f2() {
    unsigned char x[7];
    memset_s(x, sizeof x);
}

Current GCC optimizes away the memset_s in f1, but keeps it in f2.

Both functions are optimized the same way by the high-level optimizations — the memset_s is inlined, and the loop unrolled:
f1 ()
{
  unsigned char x[4];

  <bb 2>:
  MEM[(volatile unsigned char *)&x] ={v} 0;
  MEM[(volatile unsigned char *)&x + 1B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 2B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 3B] ={v} 0;
  x ={v} {CLOBBER};
  return;
}

f2 ()
{
  unsigned char x[7];

  <bb 2>:
  MEM[(volatile unsigned char *)&x] ={v} 0;
  MEM[(volatile unsigned char *)&x + 1B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 2B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 3B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 4B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 5B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 6B] ={v} 0;
  x ={v} {CLOBBER};
  return;
}

The difference is introduced when the GIMPLE IR above is lowered to the representation used for instruction selection: the compiler sees that x[4] is an int-sized non-volatile array on the stack, where all accesses are using known indices, and this is a special case where it is profitable to place the array in a register (that is then eliminated as its value is never used). The problem is that this is not a valid transformation unless all accesses are non-volatile, and the code forgot to check for this...