In this case, I promise that the pointer declared along with the restrict qualifier is not aliased. I certify that writes through this pointer will not effect the values read through any other pointer available in the same context which is also declared as restricted.
* Your agreement to this contract is implied by use of the restrict keyword ;)
Read on for more information on the practical use and benefits to using the restrict keyword...
The restrict keyword is a type qualifier for pointers and is a formal part of the C99 standard.
Example usage:
In other words, proper use of the restrict keyword gives the compiler enough information to select a more optimal order of loads and stores to/from memory and to potentially make better use of registers to store non-aliased objects.
typedef struct vector3 vector3;What follows is a simple example function that updates some "particles" with unrestricted pointers. Note that the pointers share the same type, so the compiler will assume they can be aliased, per the strict aliasing rule.
struct vector3
{
float x;
float y;
float z;
};
void
move( vector3* velocity,
vector3* position,
vector3* acceleration,
float time_step,
size_t count )
{
for (size_t i=0;i<count;i++)
{
velocity[i].x += acceleration[i].x * time_step;
velocity[i].y += acceleration[i].y * time_step;
velocity[i].z += acceleration[i].z * time_step;
position[i].x += velocity[i].x * time_step;
position[i].y += velocity[i].y * time_step;
position[i].z += velocity[i].z * time_step;
}
}
# This code was compiled with GCC 3.4.1 for PowerPC,Notice above that position must wait for velocity to be stored. This is because the compiler cannot gaurantee that the two are not aliased and must assume that the write to velocity can overwrite the location where position will be read. Because the compiler must effectively perform the operations in the order declared in the source, it must assume this is the behavior the programmer intended.
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
cmpwi 0,6,0
stwu 1,-16(1)
beq- 0,.L7
li 8,0
mtctr 6
.L8:
add 9,8,3
lfsx 13,8,5
add 10,8,5
lfsx 0,8,3
lfs 8,4(9)
add 11,8,4
lfs 5,8(10)
lfs 7,4(10)
lfs 6,8(9)
fmadds 4,13,1,0
fmadds 3,7,1,8
fmadds 2,5,1,6
stfsx 4,8,3 # Store velocity_x
stfs 3,4(9) # Store velocity_y
stfs 2,8(9) # Store velocity_z
lfsx 11,8,4 # Load position_x
lfs 10,4(11) # Load position_y
lfs 9,8(11) # Load position_z
fmadds 12,4,1,11
fmadds 0,3,1,10
fmadds 13,2,1,9
stfsx 12,8,4
addi 8,8,12
stfs 0,4(11)
stfs 13,8(11)
bdnz .L8
.L7:
addi 1,1,16
blr
voidThe use of restricted pointers would specifically disallow this.
call_move( vector3* some_data, float time_step, count )
{
move( some_data, some_data, some_data, time_step, count );
}
Compare this to the same function working with arrays of file scope. Working with file scope arrays represents the best case for the compiler with regard to alias analysis and should be used as the baseline for implementing functions with restricted pointers.
vector3 velocity [ PARTICLE_COUNT ];With the above code the compiler knows the arrays will be stored seperately and can determine that they are three independent data windows, or stripes and there can be no aliasing among them. A data stripe can be thought of as a data channel made up of indexable elements.
vector3 position [ PARTICLE_COUNT ];
vector3 acceleration [ PARTICLE_COUNT ];
void
move( float time_step )
{
for (size_t i=0;i<PARTICLE_COUNT;i++)
{
velocity[i].x += acceleration[i].x * time_step;
velocity[i].y += acceleration[i].y * time_step;
velocity[i].z += acceleration[i].z * time_step;
position[i].x += velocity[i].x * time_step;
position[i].y += velocity[i].y * time_step;
position[i].z += velocity[i].z * time_step;
}
}
Data Channel | Channel Elements (by Index) |
---|---|
velocity | [0] ---> [1] ---> [2] ---> [N] |
position | [0] ---> [1] ---> [2] ---> [N] |
acceleration | [0] ---> [1] ---> [2] ---> [N] |
# This code was compiled with GCC 3.4.1 for PowerPC,All the stores are completed at the end of the loop. More specifically, the load for position is scheduled before the store of velocity. This validates that the compiler has enough information to determine that the values stored do not alias the values loaded.
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
lis 3,velocity@ha
lis 11,acceleration@ha
lis 9,position@ha
la 6,velocity@l(3)
la 5,acceleration@l(11)
la 7,position@l(9)
li 8,0
stwu 1,-16(1)
li 0,8192
mtctr 0
.L18:
add 12,8,6
lfsx 12,8,6 # Load velocity + 0
add 10,8,5
lfsx 13,8,5 # Load acceleration + 0
lfs 8,4(12) # Load velocity + 4
add 4,8,7
lfs 5,8(10) # Load acceleration + 8
lfs 6,8(12) # Load velocity + 8
lfs 7,4(10) # Load acceleration + 4
fmadds 9,13,1,12
fmadds 10,7,1,8
fmadds 11,5,1,6
lfsx 4,8,7 # Load position + 0
lfs 3,4(4) # Load position + 4
lfs 2,8(4) # Load position + 8
fmadds 0,9,1,4
fmadds 13,10,1,3
fmadds 12,11,1,2
stfsx 9,8,6 # Store velocity + 0
stfs 11,8(12) # Store velocity + 8
stfs 10,4(12) # Store velocity + 4
stfsx 0,8,7 # Store position + 0
addi 8,8,12
stfs 13,4(4) # Store position + 4
stfs 12,8(4) # Store position + 8
bdnz .L18
addi 1,1,16
blr
In order to get this same behavior with non-file scope pointers, use the restrict keyword to declare that every location which is either loaded or stored has no aliases.
voidNine (9) non-aliased memory stipes were declared in the above code. This completely defines the aliasing relationships between all the loads and stores.
move( vector3* velocity,
vector3* position,
vector3* acceleration,
float time_step,
size_t count,
size_t stride )
{
float* restrict acceleration_x = &acceleration->x;
float* restrict velocity_x = &velocity->x;
float* restrict position_x = &position->x;
float* restrict acceleration_y = &acceleration->y;
float* restrict velocity_y = &velocity->y;
float* restrict position_y = &position->y;
float* restrict acceleration_z = &acceleration->z;
float* restrict velocity_z = &velocity->z;
float* restrict position_z = &position->z;
for (size_t i=0;i<count*stride;i+=stride)
{
velocity_x[i] += acceleration_x[i] * time_step;
velocity_y[i] += acceleration_y[i] * time_step;
velocity_z[i] += acceleration_z[i] * time_step;
position_x[i] += velocity_x[i] * time_step;
position_y[i] += velocity_y[i] * time_step;
position_z[i] += velocity_z[i] * time_step;
}
}
Data Channel | Channel Elements (by Index) |
---|---|
velocity_x | [0] ---> [1] ---> [2] ---> [N] |
velocity_y | [0] ---> [1] ---> [2] ---> [N] |
velocity_z | [0] ---> [1] ---> [2] ---> [N] |
position_x | [0] ---> [1] ---> [2] ---> [N] |
position_y | [0] ---> [1] ---> [2] ---> [N] |
position_z | [0] ---> [1] ---> [2] ---> [N] |
acceleration_x | [0] ---> [1] ---> [2] ---> [N] |
acceleration_y | [0] ---> [1] ---> [2] ---> [N] |
acceleration_z | [0] ---> [1] ---> [2] ---> [N] |
By copying addresses from from pointer to another, an implicit hierarchy (or tree) of pointers is created. The child pointers are usually completely aliased by the parent pointer and it's important not to use them both at the same time (i.e. in the same scope). When restricted child pointers are created, consider the parent pointer to be out of scope and do not make an accesses through it. Note that in this case, any use of velocity, position or acceleration would invalidate the restrict contract and the results would be undefined.
|---> velocity_x
velocity -------|---> velocity_y
|---> velocity_z
|---> position_x
position -------|---> position_y
|---> position_z
|---> acceleration_x
acceleration ---|---> acceleration_y
|---> acceleration_z
# This code was compiled with GCC 3.4.1 for PowerPC,This version has all the flexibility of the first (unrestricted) version and the performance of the second (file scope arrays) version. You should expect code where all aliasing information is declared with the restrict keyword to almost always perform significantly better, and never worse, than with unrestricted pointers. This is especially true on superscalar RISC, or RISC-like architectures with large register files, like the PowerPC or MIPS R4000.
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
stwu 1,-32(1)
stw 31,28(1)
mullw 31,6,7
stw 30,24(1)
cmplwi 7,31,0
mr 30,7
addi 12,3,4
addi 6,5,4
addi 8,4,4
addi 7,5,8
addi 10,3,8
addi 11,4,8
li 9,0
ble- 7,.L27
.L31:
slwi 0,9,2
lfsx 13,3,0 # Load velocity_x
add 9,9,30
lfsx 8,12,0 # Load velocity_y
cmplw 7,31,9
lfsx 6,10,0 # Load velocity_z
lfsx 12,5,0 # Load acceleration_x
lfsx 7,6,0 # Load acceleration_y
lfsx 5,7,0 # Load acceleration_z
fmadds 11,12,1,13
fmadds 10,7,1,8
fmadds 9,5,1,6
lfsx 4,4,0 # Load position_x
lfsx 3,8,0 # Load position_y
lfsx 2,11,0 # Load position_z
fmadds 0,11,1,4
fmadds 13,10,1,3
fmadds 12,9,1,2
stfsx 11,3,0 # Store velocity_x
stfsx 10,12,0 # Store velocity_y
stfsx 9,10,0 # Store velocity_z
stfsx 0,4,0 # Store position_x
stfsx 13,8,0 # Store position_y
stfsx 12,11,0 # Store position_z
bgt+ 7,.L31
.L27:
lwz 30,24(1)
lwz 31,28(1)
addi 1,1,32
blr
The asute reader may also have noticed that because nine (9) restricted stripes were used instead of three (3) file scope arrays, the compiler has been able to select a much simplier addressing scheme. Much of the pointer arithmetic has been hoisted out of the loop. The version with the restricted pointers is actually more efficient than the one with file scope arrays.
For example, the following setup would be a completely valid use of restricted pointers:
struct particleAlthough each stripe of data is part of the same "object", none of the accesses would be aliased. Some runtime systems try to determine whether or not pointers are aliased by simply checking to see if the memory windows overlap. That is not sufficient.
{
vector3 position;
vector3 velocity;
vector3 acceleration;
};
[ ... ]
void
call_move( particle* particles, float time_step, count )
{
move( &particles->position,
&particles->velocity,
&particles->acceleration,
time_step,
count,
sizeof(particle) );
}
voidNot publishing aliasing assumptions will lead to very difficult to find bugs. Programmers will not know that the data must be independent and someone, someday will find a reason to use the same array in two or more pointers.
move( vector3* restrict velocity,
vector3* restrict position,
vector3* restrict acceleration,
float time_step,
size_t count,
size_t stride );
Take for example memcpy, which has been officially changed to have the following declaration:
void*Can you guess why?
memcpy(void* restrict s1,
const void* restrict s2,
size_t n );
{
vector3* restrict position = &obj_a->position;
float* restrict position_x = &position->x; <-- UNDEFINED
{
float* restrict position_y = &position->y; <-- VALID
}
}
There is one additional problem in the assembly output above which is somewhat particular to the GCC scheduler. Notice that the load for position happens immediately before its update and store. The first multiply-add will stall waiting the first load to be completed before executing. The first float (position_x) will not be ready in three (3) cycles. It would be considerably better (and faster) if the load could be pushed closer to the top of the loop so that it is more likely to be completed by the time it is needed.
lfsx 4,4,0 # Load position_xDue to the order in which scheduling is done in GCC, it is always better to simplify expressions. Do not mix memory access with calculations. The code can be re-written as follows:
lfsx 3,8,0 # Load position_y
lfsx 2,11,0 # Load position_z
fmadds 0,11,1,4 # Update position_y
fmadds 13,10,1,3 # Update position_x
fmadds 12,9,1,2 # Update position_z
void
move( vector3* restrict velocity,
vector3* restrict position,
vector3* restrict acceleration,
float time_step,
size_t count,
size_t stride )
{
float* restrict acceleration_x = &acceleration->x;
float* restrict velocity_x = &velocity->x;
float* restrict position_x = &position->x;
float* restrict acceleration_y = &acceleration->y;
float* restrict velocity_y = &velocity->y;
float* restrict position_y = &position->y;
float* restrict acceleration_z = &acceleration->z;
float* restrict velocity_z = &velocity->z;
float* restrict position_z = &position->z;
for (size_t i=0;i<count*stride;i+=stride)
{
const float ax = acceleration_x[i];
const float ay = acceleration_y[i];
const float az = acceleration_z[i];
const float vx = velocity_x[i];
const float vy = velocity_y[i];
const float vz = velocity_z[i];
const float px = position_x[i];
const float py = position_y[i];
const float pz = position_z[i];
const float nvx = vx + ( ax * time_step );
const float nvy = vy + ( ay * time_step );
const float nvz = vz + ( az * time_step );
const float npx = px + ( vx * time_step );
const float npy = py + ( vy * time_step );
const float npz = pz + ( vz * time_step );
velocity_x[i] = nvx;
velocity_y[i] = nvy;
velocity_z[i] = nvz;
position_x[i] = npx;
position_y[i] = npy;
position_z[i] = npz;
}
}
# This code was compiled with GCC 3.4.1 for PowerPC,The loads are now properly scheduled and moved as far in advance as possible. The pattern [Load --> Update --> Store] is usually the optimal pattern for simple memory transformations on a superscalar RISC-like architecture, and is exactly what is being emitted. This is reasonably close to good hand-written assembly for the same code (without re-defining the problem), and the code now very suitable for unrolling.
# with the following options: -O3 -fstrict-aliasing -std=c99
#
move:
stwu 1,-32(1)
stw 31,28(1)
mullw 31,6,7
stw 30,24(1)
cmplwi 7,31,0
mr 30,7
addi 12,3,4
addi 6,5,4
addi 8,4,4
addi 7,5,8
addi 10,3,8
addi 11,4,8
li 9,0
ble- 7,.L47
.L51:
slwi 0,9,2
lfsx 8,3,0 # Load vx
add 9,9,30
lfsx 7,12,0 # Load vy
cmplw 7,31,9
lfsx 6,10,0 # Load vz
lfsx 10,4,0 # Load px
lfsx 9,8,0 # Load py
lfsx 5,11,0 # Load pz
lfsx 4,5,0 # Load ax
lfsx 3,6,0 # Load ay
lfsx 2,7,0 # Load az
fmadds 0,8,1,10 # Update npx
fmadds 13,7,1,9 # Update npy
fmadds 12,6,1,5 # Update npz
fmadds 11,4,1,8 # Update nvx
fmadds 10,3,1,7 # Update nvy
fmadds 9,2,1,6 # Update nvz
stfsx 0,4,0 # Store npx
stfsx 13,8,0 # Store npy
stfsx 12,11,0 # Store npz
stfsx 11,3,0 # Store nvx
stfsx 10,12,0 # Store nvy
stfsx 9,10,0 # Store nvz
bgt+ 7,.L51
.L47:
lwz 30,24(1)
lwz 31,28(1)
addi 1,1,32
blr
- Strict aliasing means that two objects of different types cannot refer to the same location in memory. Enable this option in GCC with the -fstrict-aliasing flag. Be sure that all code can safely run with this rule enabled. Enable strict aliasing related warnings with -Wstrict-aliasing, but do not expect to be warned in all cases.
- Compare the assembly output of the function with restricted pointers and file scope arrays to ensure that all of the possible aliasing information has been used.
- Only use restricted leaf pointers. Use of parent pointers may break the restrict contract.
- Publish as many assumptions as possible about aliasing information in the function declaration.
- Memory windows may be overlapping and still be without aliases. Do not limit the data design to non-overlapping windows.
- Begin using the restrict keyword immediately. Retrofit old code as soon as possible.
- Keep loads and stores separated from calculations. This results in better scheduling in GCC, and makes the relationship between the output assembly and the original source clearer.