Monday, July 24, 2017

Phoronix SciMark benchmarking results

Phoronix recently published an article “Ryzen Compiler Performance: Clang 4/5 vs. GCC 6/7/8 Benchmarks”, and there are many results in that article that surprise me...

One such result is the SciMark score, which shows GCC generating much slower code than LLVM – there is a big difference in several tests, and the composite score is 25% lower. I do not have any Ryzen CPU to test on, but my testing on Broadwell shows very little difference between GCC and LLVM when SciMark is compiled with -O3 -march=x86-64 as in the article, and the Ryzen microarchitecture should not introduce that big a difference. And the reported numbers seem low...

The Phoronix testing also shows strange performance variations between different GCC versions that I don’t see in my testing – I see a performance increase for each newer version of the compiler.

The published test results are produced by running scripts available at OpenBenchmarking.org, and the build script for SciMark shows that it is built as
cc $CFLAGS -o scimark2 -O *.c -lm
Note the -O – this overrides the optimization level set by $CFLAGS and explains at least some of the discrepancies in the test results.1 GCC maps -O to the -O1 optimization level that is meant to be a good choice to use while developing – it optimizes the code, but focuses as much on fast compilation time and good debug information as on producing fast code. LLVM maps -O to -O2 that is a “real” optimization level that prioritizes performance, so it is not surprising that LLVM produces faster code in this case.
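One possible fix (a sketch – for both GCC and Clang the last -O option on the command line wins) is to place the hard-coded -O before $CFLAGS so that the benchmark’s own flags take effect
cc -O $CFLAGS -o scimark2 *.c -lm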

So the benchmarking result does not show what is intended, and both compilers can do better than what the test results show...


1. I get similar results as the article when I use -O, but my result for FFT is very different...

Thursday, July 20, 2017

A load/store performance corner case

I have recently seen a number of “is X faster than Y?” discussions where micro benchmarks are used to determine the truth. But performance measurement is hard and may depend on seemingly irrelevant details...

Consider for example this code calculating a histogram
#include <string.h>

int histogram[256];

void calculate_histogram(unsigned char *p, int len)
{
  memset(histogram, 0, sizeof(histogram));
  for (int i = 0; i < len; i++)
    histogram[p[i]]++;
}
The performance “should not” depend on the distribution of the values in the buffer p, but running this on a buffer with all bytes set to 0 and on a buffer with random values gives me the result (using the Google benchmark library and this code)
Benchmark                Time           CPU Iterations
------------------------------------------------------
BM_cleared/4096       7226 ns       7226 ns      96737
BM_random/4096        2049 ns       2049 ns     343001
That is, running on random data is 3.5x faster than running on all-zero data! The reason for this is that loads and stores are slow, and the CPU tries to improve performance by executing later instructions out of order. But it cannot proceed with a load before the previous store to that address has been done,1 which slows down progress when all loop iterations read and write the same memory address histogram[0].

This is usually not much of a problem for normal programs as they have more instructions that can be executed out of order, but it is easy to trigger this kind of CPU corner case when trying to measure the performance of small code fragments, with the result that the benchmark measures something other than what was intended. Do not trust benchmark results unless you can explain the performance and know how it applies to your use case...
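For reference, a harness along these lines reproduces the effect (a sketch only – the buffer setup and benchmark registration are my assumptions, not necessarily the code linked above)
#include <benchmark/benchmark.h>
#include <cstdlib>
#include <vector>

void calculate_histogram(unsigned char *p, int len);  // the function shown above

static void BM_cleared(benchmark::State& state) {
  std::vector<unsigned char> buf(state.range(0), 0);  // all bytes zero
  while (state.KeepRunning())
    calculate_histogram(buf.data(), buf.size());
}

static void BM_random(benchmark::State& state) {
  std::vector<unsigned char> buf(state.range(0));
  for (auto &b : buf)
    b = std::rand() & 0xff;                           // arbitrary byte values
  while (state.KeepRunning())
    calculate_histogram(buf.data(), buf.size());
}

BENCHMARK(BM_cleared)->Arg(4096);
BENCHMARK(BM_random)->Arg(4096);
BENCHMARK_MAIN();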


1. The CPU does “store to load forwarding” that saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory, but it still comes with a cost of a few cycles.

Tuesday, July 4, 2017

Strict aliasing in C90 vs. C99 – and how to read the C standard

I often see claims that the strict aliasing rules were introduced in C99, but that is not true – the relevant part of the standard is essentially the same for C90 and C99. Some compilers used the strict aliasing rules for optimization well before 1999, as was noted in this 1998 post to the GCC mailing list (which argues that enabling strict aliasing will not cause many problems, as most software has already fixed its strict aliasing bugs to work with those other compilers...)

C99 – 6.5 Expressions

The C standard does not talk about “strict aliasing rules”, but they follow from the text in “6.5 Expressions”:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:73
  • a type compatible with the effective type of the object,
  • a qualified version of a type compatible with the effective type of the object,
  • a type that is the signed or unsigned type corresponding to the effective type of the object,
  • a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
  • a character type.

73 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
Note the footnote that says that the intention of these rules is to let the compiler determine that objects are not aliased (and thus be able to optimize more aggressively).
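As a concrete illustration (my example, not from the standard), accessing a float object through an int lvalue is not allowed, while accessing it through a character type is
float f = 1.0f;

int read_as_int(void)
{
  return *(int *)&f;            /* not allowed: int is not compatible with float */
}

unsigned char read_first_byte(void)
{
  return *(unsigned char *)&f;  /* allowed: character type */
}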

C90 – 6.3 Expressions

The corresponding text in C90 is located in “6.3 Expressions”:
An object shall have its stored value accessed only by an lvalue that has one of the following types:36
  • the declared type of the object,
  • a qualified version of the declared type of the object,
  • a type that is the signed or unsigned type corresponding to the declared type of the object,
  • a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
  • a character type.

36 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
It is similar to the text in C99, and it even has the footnote that says it is meant to be used to determine if an object may be aliased or not, so C90 allows optimizations using the strict aliasing rules.

But standards have bugs, and those can be patched by publishing technical corrigenda, so it is not enough to read the published standard to see what is and is not allowed. There are two technical corrigenda published for C90 (ISO/IEC 9899 TCOR1 and ISO/IEC 9899 TCOR2), and TCOR1 updates the first two bullet points. The corrected version of the standard says
An object shall have its stored value accessed only by an lvalue that has one of the following types:36
  • a type compatible with the declared type of the object,
  • a qualified version of a type compatible with the declared type of the object,
  • a type that is the signed or unsigned type corresponding to the declared type of the object,
  • a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
  • a character type.

36 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
The only difference compared to C99 is that it does not talk about effective type, which makes it unclear how malloc:ed memory is handled as it does not have a declared type. This is discussed in the defect report DR 28 that asks if it is allowed to optimize
void f(int *x, double *y) {
  *x = 0;
  *y = 3.14;
  *x = *x + 2;
} 
to
void f(int *x, double *y) {
  *x = 0;
  *y = 3.14;
  *x = 2; /* *x known to be zero */
}
if x and y point to malloc:ed memory, and the committee answered (citing the bullet point list from 6.3)
We must take recourse to intent. The intent is clear from the above two citations and from Footnote 36 on page 38: The intent of this list is to specify those circumstances in which an object may or may not be aliased.
Therefore, this alias is not permitted and the optimization is allowed.
In summary, yes, the rules do apply to dynamically allocated objects.
That is, the allocated memory gets its declared type when written and the subsequent reads must be done following the rules in the bullet-point list, which is essentially the same as what C99 says.

One difference between C90 and C99

There is one difference between the C90 and C99 strict aliasing rules in how unions are handled – C99 allows type-punning using code such as
union a_union {
  int i;
  float f;
};

int f() {
  union a_union t;
  t.f = 3.0;
  return t.i;
}
while this is implementation-defined in C90 per 6.3.2.3
[...] if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined.

Reading the standard

Language lawyering is a popular sport on the internet, but it is a strange game where often the only winning move is not to play. Take for example DR 258 where the committee is asked about a special case in macro-expansion that is unclear. The committee answers
The standard does not clearly specify what happens in this case, so portable programs should not use these sorts of constructs.
That is, unclear parts of the standard should be avoided – not language-lawyered into saying what you want.

And the committee is pragmatic; DR 464 is a case where the defect report asks to add an example for a construct involving the #line directive that some compilers get wrong, but the committee thought it was better to make it unspecified behavior
Investigation during the meeting revealed that several (in fact all that were tested) compilers did not seem to follow the interpretation of the standard as given in N1842, and that it would be best to acknowledge this as unspecified behavior.
So just because the standard says something does not mean that it is the specified behavior. One other fun example of this is DR 476 where the standard does not make sense with respect to the behavior of volatile:
All implementors represented on the committee were polled and all confirmed that indeed, the intent, not the standard, is implemented. In addition to the linux experience documented in the paper, at least two committee members described discussions with systems engineers where this difference between the standard vs the implementation was discussed because the systems engineers clearly depended on the implementation of actual intent. The sense was that this was simply a well known discrepency.

Saturday, July 1, 2017

Hard-coded hardware addresses in C/C++

I read a blog post “reinterpret_cast vs. constant expression” that discusses how to get rid of C-style casts for code such as
#define FOO ((struct S*)0xdff000)
But there is no need to have hard-coded addresses in the code – it is better to declare a normal structure
extern struct S hw_s;
and tell the linker to place it at address 0xdff000 using an assembly file containing the lines
.global hw_s
hw_s = 0xdff000
FOO can now be defined without a cast
#define FOO &hw_s
although it is probably better to use hw_s directly...

It is good to get rid of hard-coded addresses in C/C++ code even if you do not care about ugly casts. One reason is that the compiler cannot know which objects the hard-coded addresses point to, which restricts the data flow analysis. Another reason is that hard-coded addresses interact badly with instruction selection in the backend. This is especially true for code accessing hardware registers that expands to assignments of the form
*(volatile int *)(0xdff008) = 0;
*(volatile int *)(0xdff010) = 10;
The best way of generating the code depends on the CPU architecture, but it usually involves loading a base address into a register and storing using a “base + offset” addressing mode, so the compiler needs to split and re-combine the addresses (which is complicated as there are often restrictions on which offsets are valid, the cost of the base depends on the value, etc.). The ARM backend is good at this, but I have seen many cases where much slower and larger code than necessary is generated for more obscure architectures. For example, GCC 7.1 for RISC-V compiles
void foo(void)
{
  *(volatile int *)(0xfffff00023400008) = 0;
  *(volatile int *)(0xfffff00023400010) = 10;
}
to
foo:
 lui a5,%hi(.LC0)
 ld a5,%lo(.LC0)(a5)
 li a4,10
 sw zero,0(a5)
 lui a5,%hi(.LC1)
 ld a5,%lo(.LC1)(a5)
 sw a4,0(a5)
 ret
.LC0:
 .dword -17591594647544
.LC1:
 .dword -17591594647536
instead of the smaller and faster
foo:
 lui a5,%hi(.LC0)
 ld a5,%lo(.LC0)(a5)
 li a4,10
 sw zero,8(a5)
 sw a4,16(a5)
 ret
.LC0:
 .dword -17591594647552
you get by writing through a normal structure.
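For comparison, the structure version could look something like this (a sketch – the register names and layout are made up, only the offsets match the example above)
struct hw_regs {
  volatile int r0;   /* offset 0x00 */
  volatile int r1;   /* offset 0x04 */
  volatile int r2;   /* offset 0x08 */
  volatile int r3;   /* offset 0x0c */
  volatile int r4;   /* offset 0x10 */
};

extern struct hw_regs hw;  /* placed at 0xfffff00023400000 using the linker trick above */

void foo(void)
{
  hw.r2 = 0;
  hw.r4 = 10;
}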

Monday, June 19, 2017

A look at range-v3 code generation

I recently saw a Stack Overflow post that compared the speed of std::find_if on a vector vec
auto accumulated_length = 0L;
auto found = std::find_if(vec.begin(), vec.end(),
                          [&](auto const &val) {
                              accumulated_length += val;
                              return to_find < accumulated_length;
                          });
auto const found_index = std::distance(vec.begin(), found);

and the equivalent code using the range-v3 library
auto const found_index = ranges::distance(vec
    | ranges::view::transform(ranges::convert_to<long>{})
    | ranges::view::partial_sum()
    | ranges::view::take_while([=](auto const i) {
          return !(to_find < i);
      }));
Measuring the performance on an Intel Broadwell CPU using the Google benchmark library and this code compiled with the options
-O3 -march=native -std=c++14 -DNDEBUG
gives me the result
Benchmark                Time           CPU Iterations
------------------------------------------------------
BM_std/1024            311 ns        311 ns    2248354
BM_range/1024         2102 ns       2102 ns     332711
for gcc 7.1.0 and
BM_std/1024            317 ns        317 ns    2208547
BM_range/1024          809 ns        809 ns     864328
for clang 4.0.0. There are two obvious questions
  • Why is range-v3 slower than the STL?
  • Why is the difference so much bigger for GCC than for LLVM?

I also wanted to see if the STL added overhead, so I tried a simple C-style for-loop
long i, acc = 0;
for (i = 0; i < len; i++) {
    acc += p[i];
    if (to_find < acc)
        break;
}
found_index = i;
This runs in 439 ns – 40% slower than the STL version! – which adds the question
  • Why is the for-loop slower than the STL version?

Why is the for-loop slower?

GCC is generating the obvious assembly for the for-loop
.L4:
        movslq  (%r8,%rax,4), %rcx
        addq    %rcx, %rdx
        cmpq    %rsi, %rdx
        jg      .L7
.L3:
        addq    $1, %rax
        cmpq    %rdi, %rax
        jl      .L4
.L7:
        ...
I had expected the compiler to generate similar code for std::find_if, and that is what happens if it is used with an input iterator, but libstdc++ has an overload for random-access iterators which partially unrolls the loop
template<typename _RandomAccessIterator, typename _Predicate>
  _RandomAccessIterator
  __find_if(_RandomAccessIterator __first, _RandomAccessIterator __last,
            _Predicate __pred, random_access_iterator_tag)
  {
    typename iterator_traits<_RandomAccessIterator>::difference_type
      __trip_count = (__last - __first) >> 2;

    for (; __trip_count > 0; --__trip_count)
      {
        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;

        if (__pred(__first))
          return __first;
        ++__first;
      }

    switch (__last - __first)
      {
      case 3:
        if (__pred(__first))
          return __first;
        ++__first;
      case 2:
        if (__pred(__first))
          return __first;
        ++__first;
      case 1:
        if (__pred(__first))
          return __first;
        ++__first;
      case 0:
      default:
        return __last;
      }
  }
This partial unrolling gets rid of a large fraction of the comparisons and branches, which makes a big difference for this kind of micro-benchmark.

Why does GCC generate slow code for range-v3?

The range-v3 code generated by GCC has a few objects placed on the stack, which adds some (useless) memory operations. The reason they are not optimized away has to do with how GCC optimizes structures and the order in which the optimization passes are run.

The GCC “Scalar Replacement of Aggregates” (SRA) optimization pass splits structures into their elements. That is,
struct S {
  int a, b, c;
};

struct S s;

s.a = s.b = s.c = 0;
...
is transformed to the equivalent of
int a, b, c;

a = b = c = 0;
...
and the variables are then optimized and placed in registers in the same way as normal non-structure variables.

The compiler cannot split structures that have their address taken as it would then need to do expensive pointer tracking to find how each element is used, so such structures are kept on the stack. The GCC SRA pass is conservative and does not split a structure if any part of it has been captured by a pointer, such as
struct S s;

s.a = s.b = s.c = 0;
int *p = &s.a;
...
that could be split into
int a, b, c;

a = b = c = 0;
int *p = &a;
...
but that is not done by GCC.

It is usually not a problem that address-taking limits SRA, as optimization passes such as constant propagation eliminate the use of pointers when they are only used locally in a function, so code of the form
struct S s;

int *p = &s.a;
...
*p = 0;
is transformed to
struct S s;

...
s.a = 0;
which can then be optimized by SRA. But this requires that all paths to the use of p pass through the same initialization, and that the compiler can see that they do – we cannot easily eliminate the pointers for code such as
struct S s;
int *p;

if (cond)
    p = &s.a;
...
if (cond)
    *p = 0;
that needs the compiler to track values to see that all executions of *p have p initialized to &s.a.

And that is what the range-v3 code looks like after the templates have been expanded and all functions inlined – the code does different initializations depending on whether the range is empty or not, and ends up with code segments of the form
if (begin != end) {
    // Initialize some variables
}

...

if (begin != end) {
    // Use the variables
}
I have a hard time following exactly what range-v3 is trying to do – the code expands to more than 700 functions, so I have only looked at the compiler’s IR after inlining and I do not know exactly how it looks in the C++ source code – but the result is that the compiler fails to propagate some addresses due to this issue, and three objects (one struct take_while_view and two struct basic_iterator) are still placed on the stack when the last SRA pass has been run.

GCC does eventually manage to simplify the code enough that SRA could eliminate all structures, but that happens later in the optimization pipeline, after the last SRA pass has been run. I tried adding an extra late SRA pass – this eliminates the memory operations, and the function runs in 709 ns. Much better, but still only half the speed of the STL version.

Why is range-v3 slower than the STL?

Both GCC and LLVM compile the range-v3 code to something of the form
static long foo(const int *begin, const int *end, long to_find)
{
    long result = 0;
    const int *p = begin;
    if (begin != end) {
        result = *begin;
        while (1) {
            if (p == end)
                break;
            if (to_find < result)
                break;

            p++;
            if (p != end)
                result += *p;
        }
    }
    return p - begin;
}
that does one extra comparison in the loop body compared to the for-loop version. This kind of code is supposed to be simplified by the loop optimizers, but they run relatively early in the optimization pipeline (partly so that later optimizations may take advantage of the improved loop structure, and partly because many optimizations make life harder for the loop optimizer), so they are limited by the same issues mentioned in the previous section – that is, I assume the redundant comparison would be eliminated if the range-v3 library improved its handling of empty ranges etc.

Sunday, June 4, 2017

-fipa-pta

My previous blog post had a minimal description of -fipa-pta, and I have received several questions about what it actually does. This blog post will try to give some more details...

Points-to analysis

Many optimizations need to know if two operations may access the same memory address. For example, the if-statement in
i = 5;
*p = -1;
if (i < 0)
  do_something();
can be optimized away if *p cannot modify i.

GCC tracks what the pointers may point to using the general ideas from the paper “Efficient Field-sensitive pointer analysis for C”. I will not describe the details – the first few pages of the paper do it better than I can here – but the principle is that each pointer is represented by a set of locations it may point to; the compiler generates set constraints representing each statement in the program, and then solves those constraints to get the actual set of locations the pointer may point to.
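As a tiny illustration of the idea (my example – not GCC’s actual representation), each pointer gets a set of possible targets
int a, b;

void example(void)
{
  int *p = &a;  /* points-to set of p: {a} */
  int *q = p;   /* constraint "q includes p", so points-to set of q: {a} */
  *q = 1;       /* cannot modify b, since b is not in the set for q */
}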

But this process is expensive, so GCC normally does this one function at a time and assumes that called functions may access any memory visible to them.

-fipa-pta

The -fipa-pta optimization takes the bodies of the called functions into account when doing the analysis, so compiling
void __attribute__((noinline))
bar(int *x, int *y)
{
  *x = *y;
}

int foo(void)
{
  int a, b = 5;
  bar(&a, &b);
  return b + 10;
}
with -fipa-pta makes the compiler see that bar does not modify b, and the compiler optimizes foo by changing b+10 to 15
int foo(void)
{
  int a, b = 5;
  bar(&a, &b);
  return 15;
}

A more relevant example is the “slow” code from the “Integer division is slow” blog post
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);

for (int i = 0; i < 1000000000; i++) {
  volatile auto r = theIntDist(randGenerator);
}
Compiling this with -fipa-pta makes the compiler see that theIntDist is not modified within the loop, and the inlined code can thus be constant-folded in the same way as the “fast” version – with the result that it runs four times faster.

Tuesday, May 30, 2017

Interprocedural optimization in GCC

Compilers can do a better job optimizing a function if they can use knowledge of other functions. The obvious case is inlining, but there are many more cases. This post lists the interprocedural optimizations implemented in GCC 7.

Many of the optimizations are only relevant for large functions (small functions are inlined into the caller!) or for helping other optimization passes. This makes it hard to give relevant examples, so the examples in this post just illustrate the principles.

Parameter passing

Parameter passing for functions where GCC can see all callers (such as functions that are local to a translation unit, or when the whole program is compiled using link-time optimization) is optimized as
  • Unused parameters are removed.
  • Parameters passed by reference may be changed to be passed by value. For example,
    static int foo(int *m)
    {
      return *m + 1;
    }
    
    int bar(void)
    {
      int i = 1;
      return foo(&i);
    }
    
    is changed to
    static int foo(int m)
    {
      return m + 1;
    }
    
    int bar(void)
    {
      int i = 1;
      return foo(i);
    }
    
    which makes it much easier for other optimization passes to reason about the variables.
  • A structure may be split into its elements. For example,
    struct bovid
    {
      float red;
      int green;
      void *blue;
    };
    
    static void ox(struct bovid *cow)
    {
      cow->red = cow->red + cow->green;
    }
    
    int main(void)
    {
      struct bovid cow;
    
      cow.red = 7.4;
      cow.green = 6;
      cow.blue = &cow;
    
      ox(&cow);
    
      return 0;
    }
    
    is changed to
    struct bovid
    {
      float red;
      int green;
      void *blue;
    };
    
    static void ox(float *t1, int t2)
    {
      *t1 = *t1 + t2;
    }
    
    int main(void)
    {
      struct bovid cow;
    
      cow.red = 7.4;
      cow.green = 6;
      cow.blue = &cow;
    
      ox(&cow.red, cow.green);
    
      return 0;
    }

These optimizations are enabled by -fipa-sra, which is enabled by default at -Os, -O2, and -O3.

Constant propagation

Functions where all callers pass the same constant can be optimized by propagating the constant into the function. That is,
static int foo(int a, int b)
{
  if (b > 0)
    return a + b;
  else
    return a * b;
}

int bar(int m, int n)
{
  return foo(m, 7) + foo(n, 7);  
}
is optimized to
static int foo(int a)
{
  return a + 7;
}

int bar(int m, int n)
{
  return foo(m) + foo(n);
}

The constants can be propagated bitwise, which is useful for flag parameters. For example
static int foo(int a, int b)
{
  if (b & 4)
    return a & (b & 1);
  else
    return a & (b & 2);
}

int bar(int m, int n)
{
  return foo(m, 9) | foo(n, 3);
}
is optimized to
static int foo(int a, int b)
{
  return a & (b & 2);
}

int bar(int m, int n)
{
  return foo(m, 9) | foo(n, 3);
}

The constants do not need to be the same in all function calls – GCC tracks ranges of possible values and optimizes as appropriate, so
static int foo(int a, int b)
{
  if (b > 0)
    return a + b;
  else
    return a * b;
}

int bar(int m, int n)
{
  return foo(m, 5) + foo(n, 7);  
}
is optimized to
static int foo(int a, int b)
{
  return a + b;
}

int bar(int m, int n)
{
  return foo(m, 5) + foo(n, 7);  
}
as both 5 and 7 are greater than 0.

These optimizations are enabled by -fipa-cp, -fipa-bit-cp, and -fipa-vrp, which are enabled by default at -Os, -O2, and -O3.

Constant propagation – cloning

It is often the case that only a few of the function calls pass constants as parameters, or that the constants are conflicting so they cannot be propagated into the called function. GCC handles this by cloning the called function to let each conflicting call get its own version. For example,
static int foo(int a, int b)
{
  if (b > 0)
    return a + b;
  else
    return a * b;
}

int bar(int m, int n)
{
  return foo(m, 5) + foo(m, n);  
}
creates one clone of foo and optimizes it using the constant 5 for the parameter b
static int foo(int a, int b)
{
  if (b > 0)
    return a + b;
  else
    return a * b;
}

static int foo_clone(int a)
{
  return a + 5;
}

int bar(int m, int n)
{
  return foo_clone(m) + foo(m, n);  
}

This optimization is enabled by -fipa-cp-clone, which is enabled by default at -O3.

Devirtualization

Devirtualization (converting calls to virtual functions to direct calls – see Jan Hubička's blog series on how devirtualization works in GCC) is helped by propagating type information in roughly the same way as the constants are propagated, and is implemented by the constant propagation pass.
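A minimal example of the kind of call that can be devirtualized (my illustration, not from the GCC documentation)
struct Base {
  virtual int f() { return 1; }
};

struct Derived : Base {
  int f() override { return 2; }
};

static int call_it(Base *b)
{
  return b->f();  // type propagation shows that *b is a Derived here,
                  // so the call can be turned into a direct call (and inlined)
}

int foo()
{
  Derived d;
  return call_it(&d);
}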

This is enabled by -fipa-cp and -fdevirtualize, which are enabled by default at -Os, -O2, and -O3.

Caller-saved registers

Caller-saved registers do not need to be saved if those registers are not used by the called function.

This optimization is enabled by -fipa-ra, which is enabled by default at -Os, -O2, and -O3.

Identical code folding

The “identical code folding” pass merges identical functions. The functions do not need to be identical in the source code – the merging is done halfway through the optimization pipeline, so it is enough that they have the same structure after simplification (variable names etc. do not matter).

Functions that may be used outside the compilation unit cannot be completely merged as the C and C++ standards require that functions have unique addresses. GCC solves this by adding wrappers for the exported symbols, so that
#include <stdio.h>

void foo(char *s)
{
    printf("Hello %s\n", s);
}

void bar(char *s)
{
    printf("Hello %s\n", s);
}
is generated as
.LC0:
        .string "Hello %s\n"

foo:
        mov     rsi, rdi
        xor     eax, eax
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf

bar:
        jmp     foo

This optimization is enabled by -fipa-icf, which is enabled by default at -Os, -O2, and -O3.

Profile propagation

Many optimizations have different heuristics depending on how much the code is executed. The compiler estimates branch frequencies and propagates this information between functions so that, for example, a function only called from “cold” code segments is treated as a “cold” function.

This is enabled by -fipa-profile, which is enabled by default at -O and higher.

Pure, const, and noexcept

GCC analyzes functions to determine if they access memory or may throw exceptions, propagates this information throughout the compilation unit, and annotates the functions with pure, const, and noexcept attributes when possible, which helps other optimizations.
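For example (a sketch of the kind of deduction done – the attribute names refer to the GCC function attributes)
static int square(int x)
{
  return x * x;   /* deduced "const": does not read or write global memory */
}

static int first(const int *p)
{
  return p[0];    /* deduced "pure": reads memory but has no side effects */
}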

This optimization is enabled by -fipa-pure-const, which is enabled by default at -O and higher.

Global variables

It is in general hard to optimize usage of global variables, but it is easy to improve usage of global variables that cannot escape the compilation unit and that do not have their address taken. There are three optimizations done on such variables
  • Removal of global variables that are never read.
  • A global variable that is used in only one function may be changed to a local variable in that function.
  • The compiler tracks which functions modify the variables so that loads and stores may be moved over function calls that do not touch the variable. For example, the function bar in
    static int g;
    
    void foo(void)
    {
      // Code not touching g
    }
    
    int bar(void)
    {
      g += 1;
      foo();
      g += 2;
    }
    
    is optimized to
    int bar(void)
    {
      foo();
      g += 3;
    }
    
These optimizations are enabled by -fipa-reference, which is enabled by default at -O and higher.

Pointer analysis

GCC can do interprocedural pointer analysis, which is enabled by -fipa-pta. This optimization is not enabled by default at any optimization level as it can cause excessive memory and compile-time usage on large compilation units.

Sunday, May 21, 2017

Seeding the std::mt19937 random number engine

A comment on Hacker News complained that the code in my previous blog post does not seed the std::mt19937 random number engine properly. The code was taken directly from a CppCon presentation, so I don’t want to take the blame, but the comment is right — the initialization code can be improved.

State size and seeding

The initialization in the blog post was done as
std::random_device rd;
std::mt19937 gen(rd());
which seeds the std::mt19937 random number engine with a random 32-bit value. The problem with this is that the Mersenne twister has 19968 bits of internal state, so it can generate \(2^{19968}\) streams of random values, but we can only reach \(2^{32}\) of those states when initializing with a 32-bit value.

This is not necessarily a problem. Let’s say the random numbers are used for generating input data in unit tests. The test suite is probably not run more than a few thousand times, so it does not matter that it can only create \(2^{32}\) different test runs. But there are use-cases where this is a problem.

The random number engine can be seeded with more data by using std::seed_seq, and the code below seeds the std::mt19937 with the same number of bits as are in the state
std::random_device rd;
std::array<int, std::mt19937::state_size> seed_data;
std::generate_n(seed_data.data(), seed_data.size(), std::ref(rd));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
std::mt19937 gen(seq);

std::random_device

One other potential problem is the quality of the seed values. The idea behind std::random_device is that it returns non-deterministic random numbers, but it is allowed to return deterministic values (e.g. if a non-deterministic source is not available to the implementation). I’m not a big fan of this functionality — it either does exactly what you want (generates non-deterministic values) or it does the opposite (generates deterministic values), and there is no way you can determine which.1

This is probably not a problem when developing for the big platforms, but there may be surprises when running the code in other environments — at least old versions of libstdc++ on MinGW always return the same sequence of values...
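The entropy estimate mentioned in the footnote below can at least detect the case where the implementation admits to being deterministic (a sketch – a zero estimate may also just be a conservative implementation)
std::random_device rd;
if (rd.entropy() == 0.0) {
  // may be deterministic, or the library may just be conservative
  // in its estimate (as libstdc++ is)
}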


1. The std::random_device can return an estimate of the entropy, and it is required to return 0 if the values are generated deterministically. But it is not required to return non-zero for the non-deterministic case, and e.g. libstdc++ is conservative and always estimates the entropy as 0, even when /dev/urandom or the x86 RDRND instruction are used.

Tuesday, May 16, 2017

Integer division is slow

The CppCon 2016 talk “I Just Wanted a Random Integer!” benchmarks randomization functionality from the C++ standard library (using GCC 5.1). There is one surprising result — the loop
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);

for (int i = 0; i < 1000000000; i++) {
  volatile auto r = theIntDist(randGenerator);
}
needs 23.4 seconds to run while
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());

for (int i = 0; i < 1000000000; i++) {
  std::uniform_int_distribution<int> theIntDist(0, 99);
  volatile auto r = theIntDist(randGenerator);
}
runs in 5.1 seconds. But the latter should intuitively be slower as it does more work in the loop...

Code expansion

The functionality in the standard library is implemented using template magic, but the compiler’s view of the code after inlining and basic simplification is that
std::uniform_int_distribution<int> theIntDist(0, 99);
is just defining and initializing a structure
struct {
  int a, b;
} theIntDist;
theIntDist.a = 0;
theIntDist.b = 99;
while the call
volatile auto r = theIntDist(randGenerator);
is expanded to the equivalent of
uint64_t ret;
uint64_t urange = theIntDist.b - theIntDist.a;
if (0xffffffff > urange) {
  const uint64_t uerange = urange + 1;
  const uint64_t scaling = 0xffffffff / uerange;
  const uint64_t past = uerange * scaling;
  do {
    ret = mersenne_twister_engine(randGenerator);
  } while (ret >= past);
  ret /= scaling;
} else {
  ...
  uniform_int_distribution(&theIntDist, randGenerator);
  ...
  ret = ...
}
volatile int r = ret + theIntDist.a;
where I have used ... for code that is not relevant for the rest of the discussion.

Optimization differences

It is now easy to see why the second case is faster — creating theIntDist in the loop makes it trivial for the compiler to determine that urange has the value 99, and the code simplifies to
uint64_t ret;
do {
  ret = mersenne_twister_engine(randGenerator);
} while (ret >= 4294967200);
ret /= 42949672;
volatile int r = ret;
This simplification is not possible when theIntDist is created outside of the loop — the compiler sees that the loop calls uniform_int_distribution with a reference to theIntDist, so it must assume that the values of theIntDist.a and theIntDist.b may change during the execution and can therefore not do the constant folding. The function does not, however, modify theIntDist, so both versions of the program do the same work, but the slow version needs to do one extra comparison/branch and a few extra arithmetic instructions for each loop iteration.

The cost of division

The mersenne_twister_engine is not a big function, but it is not trivial — it executes about 40 instructions — so it is surprising that adding a few instructions to the loop makes the program four times slower. I described a similar case in a previous blog post where the problem was due to branch mis-prediction, but the branch is perfectly predicted in this example.

The reason here is that the slow loop needs to do an integer division instruction when calculating scaling, and integer division is expensive — Agner Fog’s instruction tables say that the 64-bit division may need up to 103 cycles on the Broadwell microarchitecture! This usually does not matter too much for normal programs, as the compiler tries to move the division instructions so that they have as much time as possible to execute before the result is needed, and the CPU can in general continue executing other instructions out of order while waiting for the result of the division. But it does make a big difference in this kind of micro-benchmark, as the compiler cannot move the division earlier, and the CPU runs out of work to do out of order because the mersenne_twister_engine function executes much faster than the division.

Monday, April 10, 2017

Building GCC with support for NVIDIA PTX offloading

GCC can offload C, C++, and Fortran code to an accelerator when using OpenACC or OpenMP, where the code to offload is marked by #pragma statements (or magic comments for Fortran), such as
#pragma acc kernels
for (int j = 1; j < n-1; j++) {
  for (int i = 1; i < m-1; i++) {
    Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
    error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
  }
}
This blog post describes what I needed to do in order to build a GCC trunk compiler with support for offloading to NVIDIA GPUs on Ubuntu 16.10.

The first step is to install the NVIDIA CUDA toolkit. Googling shows lots of strange suggestions about what you need to do in order to get this to work (blacklisting drivers, adding the PCI address of your video card to config files, etc.), but it worked fine for me to just download the “deb (local)” file, and install it as
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
The toolkit is installed in /usr/local/cuda, and /usr/local/cuda/bin must be added to PATH so that GCC may find the ptxas tool.

The script below fetches the source code and builds the compiler and tools

Add $install_dir/lib64 to LD_LIBRARY_PATH, and the compiler can now be used to offload OpenACC code by compiling as
$install_dir/bin/gcc -O3 -fopenacc test.c
or OpenMP as
$install_dir/bin/gcc -O3 -fopenmp test.c
You may need to pass -foffload=-lm to the compiler if the code you offload contains math functions that cannot be directly generated as PTX instructions.

Saturday, March 25, 2017

pre-decrement vs. post-decrement, etc.

A recent talk at the OpenIoT Summit NA 2017, “Optimizing C for Microcontrollers — Best Practices”, had three examples illustrating the effect of different code constructs
  • Array subscript vs. pointer access
  • Loops (increment vs. decrement)
  • Loops (post-decrement vs. pre-decrement)
as compiled using GCC 6.x on ARM and the -Os optimization level. This blog post will look a bit closer at those examples, and discuss why the conclusions are not always valid.

Array subscript vs. pointer access

The first example is meant to illustrate the difference between array subscripts and pointer access with the two functions
int a[5];

int foo1(void)
{
    int i;
    int res = 0;
    for (i = 0; i < 5; i++)
        res += a[i];
    return res;
}
and
int a[5];

int foo2(void)
{
    int *p;
    int i;
    int res = 0;
    for (p = a, i = 0; i < 5; i++, p++)
        res += *p;
    return res;
}
The first function is generated in a natural way
foo1:
    movs  r0, #0
    mov   r3, r0
    ldr   r1, .L5
.L3:
    ldr   r2, [r1, r3, lsl #2]
    adds  r3, r3, #1
    cmp   r3, #5
    add   r0, r0, r2
    bne   .L3
    bx    lr
while the second function has its loop unrolled
foo2:
    ldr   r3, .L3
    ldm   r3, {r0, r2}
    add   r0, r0, r2
    ldr   r2, [r3, #8]
    add   r0, r0, r2
    ldr   r2, [r3, #12]
    ldr   r3, [r3, #16]
    add   r0, r0, r2
    add   r0, r0, r3
    bx    lr
The reason for this difference is that compiling with -Os should not unroll loops if unrolling increases the code size. But it is hard to estimate the resulting code size, as later optimization passes should be able to take advantage of the unrolling and be able to remove redundant code, so the compiler is using a rather imprecise heuristic. These loops are really close to the threshold (unrolling increases the code size by 4 bytes) and the minor difference between how the loops look when passed to the unroller makes the heuristic estimate that unrolling foo1 will increase the size by one instruction while foo2 will get the same size after unrolling.

This does, however, not illustrate any fundamental difference in the compiler’s understanding of array subscripts compared to pointer access — any difference in the code could affect a heuristic and have a similar effect (I have worked on compilers that generate different code if you rename variables or even add a comment!).1

Loops (increment vs. decrement)

The second example uses the two functions
void foo1(void)
{
    int x = 0;
    do {
        printk("X = %d\n", x);
        x++;
    } while (x < 100);
}
and
void foo2(void)
{
    int x = 100;
    do {
        printk("X = %d\n", x);
        x--;
    } while (x);
}
to illustrate that it is better to write loops decrementing the iteration variable, as the CPU can do the end of loop check for free as
subs  r4, r4, #1
bne   .L3 
instead of
adds  r4, r4, #1
cmp   r4, #100
bne   .L3
That is true, but the compiler can in many cases transform the loop to change the iteration order, so the iteration order in the generated program depends more on what the loop does than on how it iterates in the source code.

Note that the two functions do not do the same thing — foo1 outputs the numbers in increasing order and foo2 outputs them in decreasing order. Modifying foo2 to do the same thing as foo1, by changing the function call to
printk("X = %d\n", 100 - x);
makes it generate code identical to foo1 (as the compiler decides that it is better to iterate using increments in order to eliminate the subtraction) even though the function was written using decrements.

Loops (post-decrement vs. pre-decrement)

The third example considers pre- vs. post-decrement using the examples
void foo1(void)
{
    unsigned int x = 10;
    do {
        if (--x) {
            printk("X = %d\n", x);
        } else {
            printk("X = %d\n", x);
            x = 10;
        }
    } while (1);
}
and
void foo2(void)
{
    unsigned int x = 9;
    do {
        if (x--) {
            printk("X = %d\n", x);
        } else {
            printk("X = %d\n", x);
            x = 9;
        }
    } while (1);
}
The example is meant to illustrate that --x is better, as it can get the comparison as a side effect of the subtraction in the same way as the previous example
subs  r4, r4, #1
bne   .L3 
but whether this is beneficial or not depends much on the microarchitecture. Many microarchitectures can do compare and branch efficiently,2 so a compare and a branch are not necessarily slower than branching on the status code from the subtraction. The problem with --x is that it adds a data dependency — you must do the subtraction before you can evaluate the if-statement. With x-- you can evaluate the if-statement and the subtraction in parallel, with the result that
if (--x)
needs one extra cycle to execute compared to
if (x--)
for superscalar CPUs having efficient compare and branch.


1. This typically happens when the compiler has different equivalent choices (for example, should it spill variable a or b to the stack), and it just chooses the first alternative. The first alternative is found by iterating over some kind of container, and this container may be an associative array using pointer values as the key...
2. For example, x86 CPUs tend to fuse cmp and jne so that they execute as one instruction.

Sunday, March 5, 2017

The cost of conditional moves and branches

The previous blog post contained an example where branching was much more expensive than using a conditional move, but it is easy to find cases where conditional moves reduce performance noticeably. One such case is in this stack overflow question (and GCC bug 56309) discussing the performance of a function implementing a naive bignum multiplication
static void inline
single_mult(const std::vector<ull>::iterator& data,
            const std::vector<ull>::const_iterator& rbegin,
            const std::vector<ull>::const_iterator& rend,
            const ull x)
{
    ull tmp=0, carry=0, i=0;
    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it)
    {
        tmp = x * (*rhs_it) + data[i] + carry;
        if (tmp >= imax) {
            carry = tmp >> numbits;
            tmp &= imax - 1;
        } else { 
            carry = 0;
        }
        data[i++] = tmp;
    }
    data[i] += carry;
}

void
naive(std::vector<ull>::iterator data, 
      std::vector<ull>::const_iterator cbegin,
      std::vector<ull>::const_iterator cend,
      std::vector<ull>::const_iterator rbegin,
      std::vector<ull>::const_iterator rend)
{
    for (auto data_it = cbegin; data_it != cend; ++data_it)
    {
        if (*data_it != 0) {
            single_mult(data, rbegin, rend, *data_it);
        }
        ++data;
    }
}
Minor changes to the source code made the compiler use conditional moves instead of a branch, and this reduced the performance by 25%.

The difference between branches and conditional moves can be illustrated by
a = a + b;
if (c > 0)
    a = -a;
a = a + 1;
It is not possible to calculate the number of clock cycles for a code segment when working with reasonably complex CPUs, but it is often easy to get a good estimate (see e.g. this example for how to use such estimates when optimizing assembly code). The CPU converts the original instructions to micro-ops, and it can dispatch several micro-ops per cycle (e.g. 8 for Broadwell). The details are somewhat complicated,1 but most instructions in this blog post are translated to one micro-op that can be executed without any restrictions.

An assembly version using a branch looks like (assuming that the variables are placed in registers)
    addl    %edx, %eax
    testl   %ecx, %ecx
    jle     .L2
    negl    %eax
.L2:
    addl    $1, %eax
The CPU combines the testl and jle instructions to one micro-op by what is called “macro-fusion”, so both the addition and the test/branch instructions can be dispatched in the first cycle. It takes a while for the compare and branch to execute, but branch prediction means that the CPU can speculatively start executing the next instruction in the following cycle, so the final addl or the negl can be dispatched in the second cycle (depending on if the branch is predicted as taken or not). The result is that the code segment is done in 2 or 3 cycles, provided that the branch prediction was correct — a mispredict must discard the speculated instructions and restart execution, which typically adds 15–20 cycles.

Generating a version using a conditional move produces something like
    addl    %edx, %eax
    movl    %eax, %edx
    negl    %edx
    testl   %ecx, %ecx
    cmovg   %edx, %eax
    addl    $1, %eax
The first cycle will execute the first addition and the test instruction, and the following cycles will only be able to execute one instruction at a time as all of them depend on the previous instruction. The result is that this needs 5 cycles to execute.2

So the version with conditional moves takes twice the time to execute compared to the version using a branch, which is noticeable in the kind of short loops from single_mult. In addition, pipeline restrictions on how instructions can be dispatched (such as that only one division instruction can be dispatched each cycle) make it hard for the CPU to schedule long dependency chains efficiently, which may be a problem for more complex code.


1. See “Intel 64 and IA-32 Architectures Optimization Reference Manual” and Agner Fog’s optimization manuals for the details.
2. This assumes that the cmovg instruction is one micro-op. That is true for some CPUs such as Broadwell, while others split it into two micro-ops.

Wednesday, February 22, 2017

Branch prediction

Branch misprediction is expensive: an example

The SciMark 2.0 Monte Carlo benchmark calculates \(\pi\) by generating random points \(\{(x,y) \mid x,y \in [0,1]\}\) and calculating the ratio of points that are located within the quarter circle \(\sqrt{x^2 + y^2} \le 1\). The square root can be avoided by squaring both sides, and the benchmark is implemented as
double MonteCarlo_integrate(int Num_samples)
{
    Random R = new_Random_seed(SEED);
    int under_curve = 0;

    for (int count = 0; count < Num_samples; count++)
    {
        double x = Random_nextDouble(R);
        double y = Random_nextDouble(R);
        if (x*x + y*y <= 1.0)
            under_curve++;
    }

    Random_delete(R);
    return ((double) under_curve / Num_samples) * 4.0;
}
GCC used to generate a conditional move for this if-statement, but a recent change made this generate a normal branch which caused a 30% performance reduction for the benchmark due to the branch being mispredicted (bug 79389).

The randomization function is not inlined as it is compiled in a separate file, and it contains a non-trivial number of loads, stores, and branches
typedef struct
{
    int m[17];
    int seed, i, j, haveRange;
    double left, right, width;
} Random_struct, *Random;

#define MDIG 32
#define ONE 1
static const int m1 = (ONE << (MDIG-2)) + ((ONE << (MDIG-2)) - ONE);
static const int m2 = ONE << MDIG/2;
static double dm1;

double Random_nextDouble(Random R) 
{
    int I = R->i;
    int J = R->j;
    int *m = R->m;

    int k = m[I] - m[J];
    if (k < 0)
        k += m1;
    R->m[J] = k;

    if (I == 0) 
        I = 16;
    else
        I--;
    R->i = I;

    if (J == 0) 
        J = 16;
    else
        J--;
    R->j = J;

    if (R->haveRange) 
        return  R->left +  dm1 * (double) k * R->width;
    else
        return dm1 * (double) k;
}
so I had expected the two calls to this function to dominate the running time, and that the cost of the branch would not affect the benchmark too much. But I should have known better — x86 CPUs can have more than 100 instructions in flight (192 micro-ops for Broadwell), and a mispredict needs to throw away all that work and restart from the actual branch target.


Branch overhead and branch prediction

The cost of branch instructions differs between CPU implementations, and the compiler needs to take that into account when optimizing and generating branches.

Simple processors with a 3-stage pipeline fetch the next instruction while the previous two instructions are decoded and executed, but branches introduce a problem: the next instruction cannot be fetched before the address is calculated by executing the branch instruction. This makes branches expensive, as they introduce bubbles in the pipeline. The cost can be reduced for conditional branches by speculatively fetching and decoding the instructions after the branch — this improves performance if the branch was not taken, but taken branches need to discard the speculated work, and restart from the actual branch target.

Some CPUs have instructions that can execute conditionally depending on a condition, and this can be used to avoid branches. For example
if (c)
    a += 3;
else
    b -= 2;
can be compiled to the following straight line code on ARM (assuming that a, b, and c are placed in r1, r2, and r0 respectively)
cmp     r0, #0
addne   r1, r1, #3
subeq   r2, r2, #2
The cmp instruction sets the Z flag in the status register, and the addne instruction is treated as an addition if Z is 0, and as a nop instruction if Z is 1. subeq is similarly treated as a subtraction if Z is 1 and as a nop if Z is 0. The instruction takes time to execute, even when treated as a nop, but this is still much faster than executing branches.

This means that the compiler should structure the generated code so that branches are minimized (using conditional execution when possible), and conditional branches should be generated so that the most common case is not taking the branch.

Taken branches become more expensive as the CPUs get deeper pipelines, and this is especially annoying as loops must branch to the top of the loop for each iteration. This can be solved by adding more hardware to let the fetch unit calculate the target address of the conditional branch, and the taken branch can now be the cheap case.

It is, however, nice to have the “not taken” case be the cheap case, as the alternative often introduces contrived control flow that fragments the instruction cache and needs extra “useless” (and expensive) unconditional branches. The way most CPUs solve this is to predict that forward branches are unlikely (and thus speculatively fetch from the following instructions), and that backward branches are likely (and thus speculatively fetch from the branch target).

The compiler should do similar work as for the simpler CPU, but structure the code so that conditional branches branching forward are not taken in the common case, and conditional branches branching backward are taken in the common case.

There are many branches that the compiler cannot predict, so the next step up in complexity is adding branch prediction to the CPU. The basic idea is that the CPU keeps a cache of previous branch decisions and use this to predict the current branch. High-end branch predictors look at the history of code flow, and can correctly predict repetitive patterns in how the branch behaved. Hardware vendors do not publish detailed information about how the prediction work, but Agner Fog’s optimization manuals contain lots of information (especially part 3, “The microarchitecture of Intel, AMD and VIA CPUs”, that also have a good overview of different ways branch prediction can be done).

Branch prediction in high-end CPUs is really good, so branches are essentially free, while conditional execution adds extra dependencies between instructions which constrain the out-of-order execution engine, so conditional execution should be avoided. This is essentially the opposite from how the simple CPUs should be handled. 😃

There is one exception — branches that cannot be predicted (such as the one in SciMark) should be generated using conditional instructions, as the branch will incur the misprediction cost of restarting the pipeline each time it is mispredicted.

The compiler should not use conditional execution unless the condition is unpredictable. The code should be structured as for the static prediction (this is not strictly necessary, but most CPUs use the static prediction the first time a branch is encountered, and it is also slightly more efficient for the instruction cache).

So branches are free, except when they cannot be predicted. I find it amusing that many algorithms (balanced search trees, etc.) aim to make the branches as random as possible. I do not know how much of a problem this is in reality, but Clang has a built-in function, __builtin_unpredictable, that can be used to tell the compiler that the condition is unpredictable.
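Using it on the SciMark condition would look something like this (a sketch – the built-in is Clang-specific)
if (__builtin_unpredictable(x*x + y*y <= 1.0))
    under_curve++;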


Heuristics for estimating branch probabilities

The compiler estimates branch probabilities in order to generate the efficient form of the branch (there are more optimizations that need to know if a code segment is likely to be executed or not, such as inlining and loop unrolling). The general idea, as described in the PLDI ’93 paper “Branch Prediction for Free”, is to look at how the branches are used. For example, code such as
if (p == NULL)
    return -1;
comparing a pointer with NULL and returning a constant, is most likely error handling, and thus unlikely to execute the return statement.

GCC has a number of such predictors, for example
  • Branch ending with returning a constant is probably not taken.
  • Branch from comparison using != is probably taken, == is probably not taken.
  • Branch to a basic block calling a cold function is probably not taken.
Each predictor provides a probability (that has been set from branch frequencies observed by instrumenting real world code), and these probabilities are combined for a final result. This is one reason why __builtin_expect often does not make any difference — the heuristics are already coming to the same conclusion!
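For comparison, an explicit hint looks like this (a minimal sketch – as noted, it often changes nothing because the heuristics already predict the same way)
if (__builtin_expect(p == NULL, 0))
    return -1;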

The predictors are defined in predict.def (some of the definitions seem reversed due to how the rules are implemented, e.g. PROB_VERY_LIKELY may mean “very unlikely”, but the comments describing each heuristic are correct). You can see how GCC is estimating the branch probabilities by passing -fdump-tree-profile_estimate to the compiler, which writes a file containing the output from the predictors for each basic block
Predictions for bb 2
  DS theory heuristics: 1.7%
  combined heuristics: 1.7%
  pointer (on trees) heuristics of edge 2->4: 30.0%
  call heuristics of edge 2->3: 33.0%
  negative return heuristics of edge 2->4: 2.0%
as well as (when using GCC 7.x) the IR annotated with the estimated probabilities.

Sunday, January 8, 2017

GCC code generation for C++ Weekly Ep 43 example

Episode 43 of “C++ Weekly” talks about evaluating and eliminating code at compile time, and the example is fun as it triggers a few different deficiencies in the GCC optimization passes (using the -O3 optimization level with GCC trunk r243987 built for x86_64-linux-gnu).

The example
#include <type_traits>
#include <numeric>
#include <iterator>

template<typename ... T>
int sum(T ... t)
{
  std::common_type_t<T...> array[sizeof...(T)]{ t... };

  return std::accumulate(std::begin(array), std::end(array), 0);
}

int main()
{
  return sum(5,4,3,2,1);
}
is meant to be optimized to return a constant value, and it does
main:
    movl    $15, %eax
    ret
But the behavior varies a lot depending on how many arguments are passed to sum: 1–5 arguments work as expected, while calling sum with 6 or 7 arguments is generated as
main:
    movdqa  .LC0(%rip), %xmm0
    movaps  %xmm0, -40(%rsp)
    movl    -32(%rsp), %edx
    movl    -28(%rsp), %eax
    leal    14(%rdx,%rax), %eax
    ret
where .LC0 is an array consisting of four constants. 9–12 arguments are similar (but with more code). 

13–28 arguments are generated as a constant again
main:
    movl    $91, %eax
    ret

29–64 arguments are optimized to a constant, but with some redundant stack adjustments when the number of arguments is not divisible by four
main:
    subq    $16, %rsp
    movl    $435, %eax
    addq    $16, %rsp
    ret

Finally, 65 or more arguments are generated as a vectorized monstrosity in a separate function, called from main by pushing all the arguments to the stack, one at a time
main:
    subq    $16, %rsp
    movl    $60, %r9d
    movl    $61, %r8d
    pushq   $1
    pushq   $2
    movl    $62, %ecx
    pushq   $3
    pushq   $4
    ...
This is essentially as far from generating a constant as you can get. 😀

The rest of the blog post looks at how GCC reasons when trying to optimize this, by examining GCC's internal representation (IR) of the program at different points in the optimization pipeline. The IR reads essentially like a restricted version of the C language, and you can get GCC to write the IR to a file after each pass by using the command-line option -fdump-tree-all.
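
For example, compiling with something like (the source file name is just a placeholder)
g++ -O3 -fdump-tree-all sum.cpp
writes one dump file per pass, named after the source file, the pass number, and the pass name (such as sum.cpp.<number>t.fre3 for the fre3 pass mentioned below), where the exact numbers depend on the GCC version.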

1–5 arguments

The use of std::accumulate and iterators expands to five functions, and the compiler starts by inlining and simplifying this to
int main() ()
{
  common_type_t array[5];
  int __init;
  int * __first;
  int _3;

  <bb 2> [16.67%]:
  array[0] = 5;
  array[1] = 4;
  array[2] = 3;
  array[3] = 2;
  array[4] = 1;

  <bb 3> [100.00%]:
  # __first_2 = PHI <&array(2), __first_6(4)>
  # __init_4 = PHI <0 __init_5>
  if (&MEM[(void *)&array + 20B] == __first_2)
    goto <bb 5>; [16.67%]
  else
    goto <bb 4>; [83.33%]

  <bb 4> [83.33%]:
  _3 = *__first_2;
  __init_5 = _3 + __init_4;
  __first_6 = __first_2 + 4;
  goto <bb 3>; [100.00%]

  <bb 5> [16.67%]:
  array ={v} {CLOBBER};
  return __init_4;
}
The loop is immediately unrolled and simplified to
int main() ()
{
  common_type_t array[5];
  int __init;
  int * __first;
  int _15;
  int _20;
  int _25;
  int _30;
  int _35;

  <bb 2> [16.70%]:
  array[0] = 5;
  array[1] = 4;
  array[2] = 3;
  array[3] = 2;
  array[4] = 1;
  _15 = MEM[(int *)&array];
  __init_16 = _15;
  _20 = MEM[(int *)&array + 4B];
  __init_21 = _15 + _20;
  _25 = MEM[(int *)&array + 8B];
  __init_26 = __init_21 + _25;
  _30 = MEM[(int *)&array + 12B];
  __init_31 = __init_26 + _30;
  _35 = MEM[(int *)&array + 16B];
  __init_36 = __init_31 + _35;
  array ={v} {CLOBBER};
  return __init_36;
}
that is then optimized to a constant by the fre3 (“Full Redundancy Elimination”) pass
int main() ()
{
  <bb 2> [16.70%]:
  return 15;
}

6–12 arguments

The early optimizations that handled the previous case are there mostly to get rid of noise before the heavier optimizations (such as the loop optimizations) kick in, and the loop, now doing 6 iterations, is considered too big to be unrolled by the early unroller.

The “real” loop optimizers determine that the loop does not iterate enough for vectorization to be profitable, so it is just unrolled. The “SLP vectorizer”, which vectorizes straight-line code, is run right after the loop optimizations, and it sees that we are copying constants into consecutive addresses, so it combines four of them into a vector assignment
MEM[(int *)&array] = { 6, 5, 4, 3 };
array[4] = 2;
array[5] = 1;
This is now simplified by the dom3 pass, which does SSA dominator optimizations (jump threading, redundancy elimination, and const/copy propagation), but it does not understand that a scalar loaded from memory written by a constant vector store is a constant, so it only propagates the constants for array[4] and array[5] that were initialized as scalars, and the code passed to the backend looks like
int main() ()
{
  common_type_t array[6];
  int __init;
  int _15;
  int _22;
  int _27;
  int _32;

  <bb 2> [14.31%]:
  MEM[(int *)&array] = { 6, 5, 4, 3 };
  _15 = MEM[(int *)&array];
  _22 = MEM[(int *)&array + 4B];
  __init_23 = _15 + _22;
  _27 = MEM[(int *)&array + 8B];
  __init_28 = __init_23 + _27;
  _32 = MEM[(int *)&array + 12B];
  __init_33 = __init_28 + _32;
  __init_43 = __init_33 + 3;
  array ={v} {CLOBBER};
  return __init_43;
}

13–28 arguments

The loop is now iterated enough times that the compiler determines that vectorization is profitable. The idea behind the vectorization is to end up with something like
tmp = { array[0], array[1], array[2], array[3] }
    + { array[4], array[5], array[6], array[7] }
    + { array[8], array[9], array[10], array[11] };
sum = tmp[0] + tmp[1] + tmp[2] + tmp[3] + array[12];
and the vectorizer generates two loops — one that consumes four elements at a time as long as possible, and one that consumes the remaining elements one at a time. The rest of the loop optimizers know how many times the loops are iterating, so the loops can then be unrolled etc. as appropriate.
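
A rough C-level sketch of that two-loop structure (not GCC’s actual output; the names are made up), where the first loop is the one that gets vectorized and the second handles the leftover elements
int sum_array(const int *array, int len)
{
  int sum = 0;
  int i = 0;
  for (; i + 4 <= len; i += 4)  /* main loop: four elements per iteration */
    sum += array[i] + array[i + 1] + array[i + 2] + array[i + 3];
  for (; i < len; i++)          /* scalar epilogue: remaining elements */
    sum += array[i];
  return sum;
}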

The vectorizer is, unfortunately, generating somewhat strange code for the checks that there are enough elements
_35 = (unsigned long) &MEM[(void *)&array + 52B];
_36 = &array + 4;
_37 = (unsigned long) _36;
_38 = _35 - _37;
_39 = _38 /[ex] 4;
_40 = _39 & 4611686018427387903;
if (_40 <= 4)
  goto ; [10.00%]
else
  goto ; [90.00%]
This confuses the rest of the loop optimizations, with the result that the IR contains lots of conditional code of this form. This is not the first time I have seen GCC having problems with the pointer arithmetic from iterators (see bug 78847), and I believe this is the same problem (as the bitwise AND should be optimized away when the pointer arithmetic has been evaluated to a constant).

The subsequent passes mostly manage to clean up these conditionals, and dom3 optimizes the vector operations to a constant. But it does not understand the expression used to decide how many scalar elements need to be handled if the iteration count is not a multiple of four (that check is eliminated by the value range propagation pass after dom3 is run), so the scalar additions are kept in the code given to the backend
int main() ()
{
  common_type_t array[13];
  int __init;
  int _23;

  <bb 2> [7.14%]:
  array[12] = 1;
  _23 = MEM[(int *)&array + 48B];
  __init_3 = _23 + 90;
  array ={v} {CLOBBER};
  return __init_3;
}
This is, however, not much of a problem for this program, as the backend manages to optimize this to
main:
    movl    $91, %eax
    ret
when generating the code.

29–64 arguments

The backend eliminates the memory accesses to the array in the previous case, but the array seems to be kept on the stack. That makes sense: the code is supposed to be optimized before being passed to the backend, so the backend should not need to eliminate variables, and there is no need to implement code doing this.

Leaf functions do not need to adjust the stack, but GCC does some stack adjustment on leaf functions too when more than 112 bytes are placed on the stack. You can see this for the meaningless function
void foo()
{
    volatile char a[113];
    a[0] = 0;
}
where the stack is adjusted when the array size is larger than 112.
foo:
    subq    $16, %rsp
    movb    $0, -120(%rsp)
    addq    $16, %rsp
    ret
I do not understand what GCC is trying to do here...

Anyway, passing 29 arguments to sum makes the array large enough that GCC adds the stack adjustments.

65– arguments

The sequence of assignments initializing the array is now large enough that sum is not inlined into main.