Sunday, April 26, 2015

Floating point, precision quaifiers, and optimization

ESSL permits optimizations that may change the value of floating point expressions (lowp and mediump precision change, reassociation of addition/multiplication, etc.), which means that identical expressions may give different results in different shaders. This may cause problems with e.g. alignment of geometry in multi-pass algorithms, so output variables may be decorated with the invariant qualifier to force the compiler to be consistent in how it generates code for them. The compiler is still allowed to do value-changing optimizations for invariant expressions, but it need to do it in the same way for all shaders. This may give us interesting problems if optimizations and code generation are done without knowledge of each other...

Example 1

As an example of the problems we may get with invariant, consider an application that is generating optimized SPIR-V using an offline ESSL compiler, and uses the IR with a Vulkan driver having a simple backend. The backend works on one basic block at a time, and is generating FMA (Fused Multiply-Add) instructions when multiplication is followed by addition. This is fine for invariant, even though FMA changes the precision, as the backend is consistent and always generates FMA when possible (i.e. identical expressions in different shaders will generate identical instructions).

#version 310 es

in float a, b, c;
out invariant float result;

void main() {
float tmp = a * b;
if (c < 0.0) {
result = tmp - 1.0;
} else {
result = tmp + 1.0;
}
}

This is generated exactly as written if no optimization is done; first a multiplication, followed by a compare and branch, and we have two basic blocks doing one addition each. But the offline compiler optimizes this with if-conversion, so it generates SPIR-V as if main was written as
void main()
{
float tmp = a * b;
result = (c < 0.0) ? (tmp - 1.0) : (tmp + 1.0);
}

The optimization has eliminated the branches, and the backend will now see that it can use FMA instructions as everything is in the same basic block.

But the application has one additional shader where main looks like
void main() {
float tmp = a * b;
if (c < 0.0) {
foo();
result = tmp - 1.0;
} else {
result = tmp + 1.0;
}
}

The optimization cannot transform the if-statement here, as the basic blocks are too complex. So this will not use FMA, and will therefore break the invariance guarantee.

Example 2

It is not only invariant expressions that are problematic — you may get surprising results from normal code too when optimizations done offline and in the backend interacts in interesting ways. For example, you can get different precision in different threads from "redundant computation elimination" optimizations. This happens for cases such as
mediump float tmp = a + b;
if (x == 0) {
/* Code not using tmp */
...
} else if (x == 1) {
/* Code using tmp */
...
} else {
/* Code using tmp */
...
}

where tmp is calculated, but not used, for the case "x == 0". The optimization moves the tmp calculation into the two basic blocks where it is used
if (x == 0) {
/* Code not using tmp */
...
} else if (x == 1) {
mediump float tmp = a + b;
/* Code using tmp */
...
} else {
mediump float tmp = a + b;
/* Code using tmp */
...
}
and the backend may now chose to use different precisions for the two mediump tmp calculations.

Offline optimization with SPIR-V

The examples above are of course silly — higher level optimizations should not be allowed to change control flow for invariant statements, and the "redundant computation elimination" does not make sense for warp-based architectures. But the first optimization would have been fine if used with a better backend that could combine instructions from different basic blocks. And not all GPUs are warp-based. That is, it is reasonable to do this kind of optimizations, but they need to be done in the driver where you have full knowledge about the backend and architecture.

My impression is that many developers believe that SPIR-V and Vulkan implies that the driver will just do simple code generation, and that all optimizations are done offline. But that will prevent some optimizations. It may work for a game engine generating IR for a known GPU, but I'm not sure that the GPU vendors will provide enough information on their architectures/backends that this will be viable either.

So my guess is that the drivers will continue to do all the current optimizations on SPIR-V too, and that offline optimizations will not matter...

Thursday, April 9, 2015

Precision qualifiers in SPIR-V

SPIR-V is a bit inconsistent in how it handles types for graphical shaders and compute kernels. Kernels are using sized types, and there are explicit conversions when converting between sizes. Shaders are using 32-bit types for everything, but there are precision decorations that indicates which size is really used, and conversions between sizes are done implicitly. I guess much of this is due to historical reasons in how ESSL defines its types, but I think it would be good to be more consistent in the IR.

ESSL 1 played fast and loose with types. For example, it has an integer type int, but the platform is allowed to implement it as floating point, so it is not necessarily true that "a+1 != a" for a sufficiently large a. ESSL 3 strengthened the type system, so for example high precision integers are now represented as 32-bit values in two's complement form. The rest of this post will use the ESSL 3 semantics.

ESSL does not care much about the size of variables; it has only one integer type "int" and one floating point type "float". But you need to specify which precision to use in calculations by adding precision qualifiers when you declare your variables, such as

highp float x;
Using highp means that the calculations must be done in 32-bit precision, mediump means at least 16-bit precision, and lowp means using at lest 9 bits (yes, "nine". You cannot fit a lowp value in a byte). The compiler may use any size for the variables, as long as the precision is preserved.

So "mediump int" is similar to the int_least16_t type in C, but ESSL permits the compiler to use different precision for different instructions. It can for example use 16-bit precision for one mediump addition, and 32-bit for another, so it is not necessarily true that "a+b == a+b" for mediump integers a and b if the addition overflow 16 bits. The reason for having this semantics is to be able to use the hardware efficiently. Consider for example a processor having two parallel arithmetic units — one 16-bit and one 32-bit. If we have a shader where all instructions are mediump, then we could only reach 50% utilization by executing all instructions as 16-bit. But the backend can now promote half of them to 32-bit and thus be able to double the performance by using both arithmetic units.

SPIR-V is representing this by always using a 32-bit type and decorating the variables and instructions with PrecisionLow, PrecisionMedium, or PrecisionHigh. The IR does not have any type conversions for the precision as the actual type is the same, and it is only the precision of the instruction that differ. But ESSL has requirements on conversions when changing precision in operations that is similar to how size change is handled in other languages:

When converting from a higher precision to a lower precision, if the value is representable by the implementation of the target precision, the conversion must also be exact. If the value is not representable, the behavior is dependent on the type:
• For signed and unsigned integers, the value is truncated; bits in positions not present in the target precision are set to zero. (Positions start at zero and the least significant bit is considered to be position zero for this purpose.)
• For floating point values, the value should either clamp to +INF or -INF, or to the maximum or minimum value that the implementation supports. While this behavior is implementation dependent, it should be consistent for a given implementation
It is of course fine to have the conversions implicit in the IR, but the conversions are explicit for the similar conversion fp32 to fp16 in kernels, so it is inconsistent. I would in general want the shader and kernel IR to be as similar as possible in order to avoid confusion when writing SPIR-V tools working on both types of IR, and I think it is possible to improve this with minor changes:
• The highp precision qualifier means that the compiler must use 32-bit precision, i.e. a highp-qualified type is the same as as the normal non-qualified 32-bit type. So the PrecisionHigh does not tell the compiler anything; it just adds noise to the IR, and can be removed from SPIR-V.
• Are GPUs really taking advantage of lowp for calculations? I can understand how lowp may be helpful for e.g. saving power in varying interpolation, and those cases are handled by having the PrecisionLow decoration on variables. But it seems unlikely to me that any GPU have added the extra hardware to do arithmetic in lowp precision, and I would assume all GPUs use 16-bit or higher for lowp arithmetic. If so, then PrecisionLow should not be a valid decoration for instructions.
• The precision decorations are placed on instructions, but it seems better to me to have the them on the type instead. If PrecisionLow and PrecisionHigh are removed, then PrecisionMedium is the only decoration left. But this can be treated as a normal 16-bit type from the optimizers point of view, so we could instead permit both 32- and 16-bit types for graphical shaders, and specify in the execution model that it is allowed to promote 16-bit to 32-bit. Optimizations and type conversions can then be done in exactly the same way as for kernels, and the backend can promote the types as appropriate for the hardware.

Tuesday, April 7, 2015

Comments on the SPIR-V provisional specification

Below are some random comments/thoughts/questions from my initial reading of the SPIR-V provisional specification (revision 30).

Many of my comments are that the specification is unclear. I may agree that it is obvious what the specification mean, but my experience from specification work is that it is often the case that everybody agree that it is obvious, but they do not agree on what the obvious thing is. So I think the specification need to be more detailed. Especially as one of the goals of SPIR-V is to "be targeted by new front ends for novel high-level languages", and those may generate constructs that are not possible in GLSL or OpenCL C, so it is important that all constraints are documented.

Some other comments are related to tradeoffs. I think the specification is OK, so my comments are mostly highlighting some limitations (and I may have chosen a different tradeoff for some of them...). It would be great to have the rationale described for this kind of decisions.

Const and Pure functions

Functions can be marked as Const or Pure. Const is described as
Compiler can assume this function has no side effects, and will not access global memory or dereference function parameters. Always computes the same result for the same argument values.
while Pure is described as
Compiler can assume this function has no side effect, but might read global memory or read through dereferenced function parameters. Always computes the same result for the same argument values.
I assume the intention is that the compiler is allowed to optimize calls to Const functions, such as moving function calls out of loops, CSE:ing function calls, etc. And similar for the Pure functions, as long as there are no writes to global memory that may affect the result.

But the specification talks about "global memory" without defining what it is. For example, is UniformConstant global variables included in this? Those cannot change, so we can do all the Const optimizations even if the function is reading from them.  And what about WorkgroupLocal? That name does not sound like global memory, but it does of course still prevent optimizations.

I would suggest the specification change to explicitly list the storage classes permitted in Const and Pure functions...

Storage Classes

I'm a bit confused by the Uniform and Function storage classes...

The Uniform storage class is a required capability for Shader. But the GLSL uniform is handled by the UniformConstant storage class, so what is the usage/semantics of Uniform?

Function is described as "A variable local to a function" and is also a required capability for Shader. But OpenCL does also have function-local variables... How are those handled? Why are they not handled in the same way for Shader and Kernel?

Restrict

The Restrict decoration is described as
Apply to a variable, to indicate the compiler may compile as if there is no aliasing.
This does not give you the full picture, as you can express that pointers do not alias as described in the Memory Model section. But pointers have different semantics compared to variables, and that introduces some complications.

OpenCL C defines restrict to work in the same way as for C99, and that is different from the SPIR-V specification. What C99 says is, much simplified, that a value pointed to by a restrict-qualified pointer cannot be modified through a pointer not based on that restrict-qualified pointer. So two pointers can alias if the have the correct "based-on" relationship, and are following some rules on how they are accessed. The frontend may of course decide to not decorate the pointers when it cannot express the semantics in the IR, but it is unclear to me that it is easy to detect the problematic cases.

I think this needs to be clarified along the line of what the LLVM Language Reference Manual does for noalias.

Volatile

There is a Memory Access value Volatile that is described as
This access cannot be optimized away; it has to be executed.
This does not really make sense... The memory model is still mostly TBD in the document, but the principle in GPU programming is that you need atomics or barriers in order to make memory accesses consistent. So there is no way you can observe the difference between the compiler respecting Volatile or not.

My understanding is that the rationale for Volatile in SPIR-V is to be able to work around compiler bugs by decorating memory operations with Volatile and in that way disable some compiler transformations. If so, then I think it would be useful to document this in order to make it more likely that compilers do the right thing. After all, I would expect the project manager to tell the team to do more useful work than fixing a bug for which you cannot see the difference between correct and incorrect behavior.

It has historically been rather common that C compilers miscompile volatile. A typical example is for optimizations such as store forwarding, that substitutes a loaded value by a previously stored value, where the developer forgets to check for volatility when writing the optimization. So a sequence such as
 7:             TypeInt 32 1
15:      7(int) Constant 0
Store 14(tmp) 15
18:      7(int) IMul 16 17
Store 10(a) 18

corresponding to
volatile int tmp = 0;
a = b * tmp;

gets the ID 17 substituted by the constant 0, and is then optimized to
 7:             TypeInt 32 1
15:      7(int) Constant 0
Store 14(tmp) 15
Store 10(a) 15

which is not what it is expected. But you can argue that this actually follows the SPIR-V specification — we have not optimized away the memory accesses!

Volatile and OpenCL

The OpenCL C specification says that
The type qualifiers const, restrict and volatile as defined by the C99 specification are supported.
which I interpret as volatile works in exactly the same way as for C99. And C99 says
An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. What constitutes an access to an object that has volatile-qualified type is implementation-defined.
That is, the compiler is not allowed to reorder volatile memory accesses, even if it know that they do not alias. So the definition of the SPIR-V Volatile need to be strengthened if that is meant to be used for implementing the OpenCL volatile. Although I guess you may get around this by a suitable implementation-defined definition of what constitutes an access to an object...

Differences between graphical shaders and OpenCL

The Validation Rules says that for graphical shaders
• Scalar integer types can be parameterized only as:
• – 32-bit signed
– 32-bit unsigned
while OpenCL cannot use signed/unsigned
• OpTypeInt validation rules
– The bit width operand can only be parameterized as 8, 16, 32 and 64 bit.
– The sign operand must always be 0
I guess this lack of signed/unsigned information is the reason why there are Function Parameter Attributes called Zext and Sext described as
Value should be zero/sign extended if needed.
Both choices regarding the signed/unsigned information are fine for an IR, but why is SPIR-V treating graphics and OpenCL differently?

Endianness

Khronos thinks that SPIR-V is an in-memory format, not a file format, which means that the words are stored in the host's native byte order. But one of of the goals of SPIR-V is "enabling shared tools to generate or operate on it", so it will be passed in files between tools. The specification has a helpful hint that you can use the magic number to detect endianness, but that means that all tools need to do the (admittedly simple) extra work to handle both big and little endian.

I think that the specification should define one file format encoding (preferably with a standardized file name extension), and say that all tools should use this encoding.

By the way, are there really any big endian platforms in the target market?