Thursday, May 19, 2016

Type-based alias analysis in C

There was some concern that type-based alias analysis could cause problems when it was implemented in GCC, but the conclusion (as expressed in the GCC development list) was that all other compilers already did this optimization,1 so most code had already been fixed:
There's less and less of this code IMHO — modern compilers have been doing this kind of alias analysis for some time and as a result folks have been forced to fix their code. Of course this doesn't apply to Linux and some other free software projects since they only use gcc.
I guess that mail was a bit too optimistic — it was written in 1998, and I still see lots of broken code and complaints about -fstrict-aiasing...

What type-based alias analysis means

An informal description of the type-based alias analysis rules is that every memory location has a type, and you are only allowed to access the memory using the "correct" type (the compiler can, therefore, assume that two accesses with incompatible types do not alias). The C11 standard describes this in 6.5 "Expressions":
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
  • a type compatible with the effective type of the object,
  • a qualified version of a type compatible with the effective type of the object,
  • a type that is the signed or unsigned type corresponding to the effective type of the object,
  • a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
  • a character type.
As a concrete example, a variable of type int can only be accessed through an int* (including unsigned int*, const int*, etc.) or through a char* (including unsigned char*, const char*, etc.).

These restrictions allow the compiler to reorder operations when the types ensure that they access different objects. Consider the function
int i;

void foo(float *f)
{
    i = 23;
    *f = 0.0;
    i = i + 19;
}
*f cannot modify i as it has a different type, so the compiler is allowed to move the store to i over the store to *f, and the function is optimized to
void foo(float *f)
{
    *f = 0.0;
    i = 42;
}

Note that the type-based aliasing rules only talk about how to access objects, and not about pointer casting — it is allowed to cast pointers between incompatible types (as long as you follow the rules for pointer casts) such as
int i;
float *f = (float *)&i;
But accessing this as *f will now invoke undefined behavior, as the object pointed to is an int, and it is not allowed to access it by an expression having a float type.

type punning — union

There are cases where you must access data using an incorrect type, and there are a few ways to do this. The usual cases are where you need to get the bitwise representation of some value, and we will consider the example from the Wikipedia article on type punning that negates a floating point number by changing the most significant bit. The examples will assume 32-bit int and float.

The naive version of just casting the a pointer to int* does not work
bool is_negative(float x)
{
    unsigned int *ui = (unsigned int *)&x;
    return (*ui & 0x80000000u) != 0; // Undef behavior reading float as int
}
as it breaks the type-based aliasing rules. The best way to solve this is in most cases to use a union2
bool is_negative(float x)
{
    union
    {
        unsigned int ui;
        float f;
    } u;
    u.f = x;  
    return (u.ui & 0x80000000u) != 0;
}
Both GCC and LLVM are smart enough to generate as efficient code as you would have expected from the invalid version above.

The union trick requires that all accesses are done through the union — the result is not defined when accessing through a pointer, even if the pointer has the "correct" type
bool is_negative(float x)
{
    union
    {
        unsigned int ui;
        float f;
    } u;
    u.f = x;
    unsigned int *ui = &u.ui;
    return (*ui & 0x80000000u) != 0;  // Undefined behavior
}

type punning — character pointer

Character pointers can be used to access any type, so the is_negative function can be implemented as
bool is_negative(float x)
{
    unsigned char *p = (unsigned char *)&x;
    return (p[3] & 0x80) != 0;
}
assuming a little-endian architecture.

Note that int8_t is not guaranteed to be of character type. That is, the following function may be invalid
bool is_negative(float x)
{
    uint8_t *p = (uint8_t *)&x;
    return (p[3] & 0x80) != 0;  // Possible undefined behavior
}
Treating int8_t as a character type is the reasonable thing to do, and I would assume all compilers do this. But there are developers that think this is a bug — see the discussion in GCC bug 66110...

type punning — memcpy

A third way to do the type punning is using memcpy
bool is_negative(float x)
{
    unsigned int ui;
    memcpy(&ui, &x, 4);
    return (ui & 0x80000000u) != 0;
}
Both GCC and LLVM are smart enough to optimize away the memcpy, and generate similar code as the version using a union.

This type punning does only work if the destination is a variable — you cannot use malloc:ed memory for this. The reason is that memcpy copies the effective type from its source when writing to allocated memory
bool is_negative(float x)
{
    unsigned int *p = malloc(4);
    if (p == NULL)
        abort();
    memcpy(p, &x, 4);                // Effective type of *p is now float
    return (*p & 0x80000000u) != 0;  // Undef behavior reading float as int
}

allocated memory

Memory returned from malloc does not have a type, so each memory location gets an effective type when it is written. Subsequent reads must then be done according to the type-based aliasing rules as usual.

The type of the allocated memory can be updated by writing with a new type
void *p = malloc(4);
if (p == NULL)
    abort();
*(float *)p = 1.0;      // Effective type is float
do_something(p);
*(int *)p = 0;          // Effective type is now int
which allows the buffer being used for different things over its lifetime.

This may have some surprising effects, such as the examples in GCC bug 69776 and 70484. Consider the function
int f(int *pi, long *pl)
{
    *pi = 1;
    *pl = 0;
    return *(char *)pi;
}
The type-based aliasing rules say that int and long cannot alias, and this function can be optimized to
int f(int *pi, long *pl)
{
    *pi = 1;
    *pl = 0;
    return 1;
}
But it is possible for the pointers to alias if both point to the same malloc:ed memory, so the following will print a different value depending on if the optimization is done or not
int main(void)
{
    void *p = malloc(sizeof(long));
    if (p == NULL)
        abort();
    printf("%d\n", f(p, p));
    return 0;
}
It is a bit unclear exactly what the C standard requires in these cases, and you can argue that the optimization is invalid in this case. Recent versions of GCC are conservative and do not optimize this, but older versions are more aggressive, so it prudent to try to avoid playing tricks with type changes in allocated memory regardless of what the standard says...


1 Many developers seem to think that type-based aliasing was introduced by C99, but that is not true; C90 has essentially the same aliasing rules (although C99 contain some minor clarifications/improvements). I guess the reason for the belief is that GCC added -fstrict-aliasing at roughly the same time as it implemented C99.
2 Many discussions of type punning (such as the Wikipedia article) says that type punning through a union is a GCC extension. The background of this is that C90 said
[...] if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined.
I believe the committee intended it to work, but made it "implementation-defined" as the concrete result depends on the implementation (byte order, trap representations, etc.). But "implementation-defined" lets the compiler do whatever it wants, as long as the behavior is documented (and the original implementation in GCC did, in fact, have additional restrictions, although that was fixed before the release). GCC documents this works for C90 too, so it is in some sense a GCC extension...

9 comments:

  1. It is worth mentioning that in C++ accessing the non-active member of an union is undefined-behavior: http://stackoverflow.com/questions/11373203/accessing-inactive-union-member-undefined-behavior

    So alias analysis for C has to be different in this regard than alias analysis for C++.

    ReplyDelete
  2. Hello! Found your article via Reddit. Wonderful analysis, thank you!
    I have one question, though I'm not sure if it is related to the original topic. How about 'struct sockaddr*' casts and similar? How about pseudo-OOP in C, where we may put one structure into the other (which contains some magic which helps to determine the actual type so that we could provide an API that takes a pointer to an inner struct)? Does this sort of tricks invokes UB too?

    P.S. I know, we do have a brand new 'struct sockaddr_storage', which may include a union inside, whose fields we may use. But almost all books and resources give no care about it, just casting pointer to sockaddr_{ANYTHING} to good ol' pointer to sockaddr.

    And thank you for article again!

    ReplyDelete
  3. You are right that 'struct sockaddr*' and pseudo-OOP in C are problematic. It works as long as the structure only is accessed using memcpy etc., but nobody does that, so the code does in general invoke UB (but the compiler does in most cases not see more than one usage at a time, so it cannot do optimizations that takes advantage of this, and the code usually work as intended...)

    The standard is in addition a bit unclear about the finer details of struct aliasing, and there are some subtle (potential) problems. I expect to come back to this in a future blog post...

    ReplyDelete
  4. Krister,

    to be clear -- are you saying that even though the first element of a struct is guaranteed by the C standard to be at the same address as the struct itself, we can't up/downcast between them without invoking UB? (This is probably the "pseudo-OOP" you are referring to)

    I know of several frameworks and libraries that make good use of this fact (same address of struct & first member) since it's guaranteed by the standard.

    ReplyDelete
    Replies
    1. I'll use an example (of a somewhat simplified 'struct sockaddr*') to try to be clear what I mean (apologies for the formatting — I cannot get blogger to allow me to add relevant HTML tags...)

      The "pseudo-OOP" is that we sort of simulate inheritance, so we have a base class

      struct sockaddr {
      int sa_family; // address family, AF_INET, AF_INET6 etc.
      };

      and we "inherit" from this for the classes representing the protocols by letting them start with the fields from sockaddr:

      struct sockaddr_in {
      int sin_family; // AF_INET
      int sin_port;
      };
      struct sockaddr_in6 {
      int sin6_family; // AF_INET6
      struct in6_addr sin6_addr;
      };

      Our program calls a function foo(struct sockaddr *p) as

      struct sockaddr_in sa;
      sa.sin_family = AF_INET;
      sa.sin_port = 8001;
      foo(&sa);

      The following implementation of foo invokes UB:

      void foo(struct sockaddr *p) {
      if (p->sa_family == AF_INET) { // UB
      struct sockaddr_in *q = p;
      q->sin_port++;
      }

      My understanding of your question is as if the function was implemented as

      void foo(struct sockaddr *p) {
      int* tmp = p;
      if (*tmp == AF_INET) {
      struct sockaddr_in *q = p;
      q->sin_port++;
      }

      I'm not sure if this is allowed or not. The standard is unclear, and I believe there are some relevant defect reports on the standard discussing related issues. I am planning to study these issues in detail, and write about it in a future blog post. But it will probably take a while before I manages to ensure myself that I understand all the relevant details...

      Delete
    2. Hi again, and thank you for your answwer. As far as I understand, the cast between pointers to the following structures invokes UB:

      struct base {
      int tag;
      };
      struct child {
      int tag;
      long field;
      };



      I'm pretty sure though that the following is guaranteed to work:


      typedef struct {
      enum {
      INT,
      STRING,
      } value;
      } tag_t;
      typedef tag_t *any_t;


      typedef struct {
      tag_t tag;
      int integer;
      } int_t;

      typedef struct {
      tag_t tag;
      char const *string;
      } string_t;


      any_t mkint(void)
      {
      static int_t self = {{INT}, 42};
      return &self.tag;
      }

      any_t mkstring(void)
      {
      static string_t self = {{STRING}, "Hello, world!"};
      return &self.tag;
      }


      void print(any_t any)
      {
      union {
      tag tag;
      int_t integer;
      string_t string;
      } *self = (void*)(intptr_t)any;


      if (self->tag.value == INT) {
      printf("%d\n", self->integer.value);
      } else {
      printf("%s\n", self->string.value);
      }
      }


      int main(void)
      {
      print(mkint());
      print(mkstring());
      return 0;
      }


      Of course this example is artificial; in real life one would use malloc() inside mkint() and mkstring() and change the value based on the additional parameters of the mkint() and mkstring() functions. If you want I can share a link to my project where I can make use of this trick.

      Delete
    3. Ah, what a messy code. It'd be great if Blogspot supported a special formatting for the code in comments.

      Delete
    4. Yes, your example is fine. This is guaranteed by 6.5.2.3:

      One special guarantee is made in order to simplify the use of unions: if a union contains several structures that share a common initial sequence (see below), and if the union object currently contains one of these structures, it is permitted to inspect the common initial part of any of them anywhere that a declaration of the completed type of the union is visible. Two structures share a common initial sequence if corresponding members have compatible types (and, for bit-fields, the same widths) for a sequence of one or more initial members.

      Delete
  5. Until aliasing rules were added to the language, it would have been impractical for a compiler to honor the CIS rule when objects are part of a union without honoring it in all cases where a pointer of one struct type is used to access a CIS member of another. Describing the rule as applicable to pointers as well as unions would have been redundant even though the majority of code that benefits from the CIS rule uses it with structure pointers.

    Code which relied upon the CIS rule applying to structure pointers like struct sockaddr was from what I can tell non-controversial when C89 was written. Despite what you claim in footnote 1, I have seen no evidence that C89's rules were understood by the people approving them as treating the lvalue expression someStruct->someMember as something other than accessing an lvalue of type "someMember".

    I can see two plausible explanations for the C89 rules that have formed the basis of everything since:

    (1) The authors of C89 were intending to describe a language which was different from the language people were actually using (and probably shouldn't have been given the same name).

    (2) The authors of C89 wanted to allow compilers to optimize the kinds of code given in the rationale, observed that the authors of existing compilers that optimized such code exercised good judgment to avoid recklessly breaking existing code, and did not intend to particularly disturb the status quo.

    Because behaviors that would be reasonable for some target platforms and application fields would be unreasonable for others, the Standard leaves many decisions to implementers' judgment. The fact that the Standard does not forbid some particular behavior does not imply that such behavior would be appropriate for any particular platform or field, nor even that it would ever be appropriate in any quality implementation. I see no evidence that the authors of the Standard intended that compiler writers treat the fact that implementations are allowed to do something as implying that quality implementations should.

    BTW, C99's requirement that a complete union declaration be visible when using the CIS rule would not radically change the language from one that supported CIS access unconditionally if compilers writers interpreted it as saying that an access to a member of a struct type that appears within any union must be treated as a potential access to the corresponding CIS member of each struct type that has one and appears in the same union. Such an interpretation would make it possible to adapt almost all code which relies upon the ability to convert multiple kinds of struct pointers to a common type by defining after each individual structure type a union which contains that structure as well as the common type.

    I find it plausible that the authors of C99 intended to require that maintaining C99 compatibility for some existing programs would require adding some union type declarations. I find far less plausible the notion that the authors of C99 intended to invalidate a large corpus of existing code in ways that would be very difficult to repair.

    ReplyDelete

Note: Only a member of this blog may post a comment.