Brooks Patola

Autovectorization in GCC

Today we will take a plunge off the deep end and see what vectorization is all about. Imagine being able to speed up a program you created many many times over, who would not want that? If this is an option you have at your disposal, you would be quite foolish not to take full advantage of the potential performance improvements…and this is where vectorization walks into your life.

Vectorization can be thought of as an optimization technique that will apply loop unrolling to your code and utilize the SIMD(Single Instruction Multiple Data) functionality of your processor. If you’re more mathematically inclined, this refers to the code being converted from a scalar implementation (processes single operands per cycle) to a vector implementation (processes multiple pairs of operands per cycle). When this occurs automagically by the compiler, it can be considered autovectorization.

An example of a simple vectorized loop may look something like this:

non-vectorized loop:

Screen Shot 2017-10-15 at 9.50.58 PM.png

vectorized loop:

Screen Shot 2017-10-15 at 9.51.04 PM.png

This displays the program rewritten to use vector operations on an entire array opposed to operations on individual array elements.

An analogy to help envision why you would be crazy not to take some time to apply vectorization to your code would be how we use highway lanes in daily life. If we have a highway consisting of four lanes for traffic yet there are a thousand cars in just one lane, this can be considered inefficient use of the lanes (scalar implementations) and slow down traffic speed. Opposed to utilizing each lane (vectorization) to speed up the flow of traffic (data).  This is how it works when we leave unused space (code not vectorized) in our SIMD registers that could be used by more data elements. If we have a 128-bit SIMD register (known as V0 – V31) and only use it for one 32-bit integer we are leaving wasted space for three additional integers.

A great article that discusses many of the loops that can be vectorized and drawbacks to vectorization (plus much more) can be found here. If you’d rather sit back and watch a great lecture on these capabilities, James Reinders (who I stole the traffic analogy from) of Intel gives an enlightening one.

So, that all sounds great but how can I inform the compiler to vectorize my code? The flag -ftree-vectorize can be used and vectorization also turns on by default at an -O3 optimization level. If we want information on which loops were vectorized and which were not, we can use the flag -ftree-vectorizer-verbose. For the geeks who are really curious, you can find a list of all SIMD vector instructions here.

Now lets get to an example using our aarchie (AArch64) system…

We are going to write a program that creates two 1000-element integer arrays and fills them with random numbers in the range -1000 to +1000, then sums those two arrays element-by-element to a third array, and finally sums the third array and prints the result.

After we do this we will use our old trusted objdump -d command to see if the compiler is autovectorizing our code for us. And for the sake of testing, lets try a couple different examples of code out.

Lets get to it…

For the first version I have implemented the instructions in a single loop…

Screen Shot 2017-10-20 at 8.44.09 PM.png

Now we will compile using gcc -O3 -o vector1 vector1.c 

I will post the <main> segment of the disassembly for us to see if any vectorization has occurred (along with my comments in green)…

Screen Shot 2017-10-20 at 9.36.16 PM.png

Although we used the -O3 flag which automatically triggers the compiler to utilize vectorization if possible, we can see from above the code has not been vectorized. This may be due to the concept that the compiler will favour a simple loop for vectorization opposed to one that attempts to do to many operations.

Lets see if we can learn anymore useful info before we change our source code…

Hmmm it appears as though after trying -ftree-vectorizer-verbose and setting various levels there was no output to display. Checking some suggestions I tried using -ftree-vectorize -maltivec -ftree-vectorizer-verbose which gave me the same result… nothing to display. This isn’t good news if we want the compiler to tell us whats wrong with out loop for it not to be vectorized…

After some more searching I found a flag that seems to work and provide us with some information to aid us as to why this loop didn’t become vectorized…

-ftree-vectorize -fopt-info-vec-all

Screen Shot 2017-10-20 at 9.49.08 PM.png

Here we can see that it complains “loop contains function calls or data references that cannot be analyzed”.

Lets give it another try with a tweak in our code, this time not encapsulating all the instructions in a single loop.

Screen Shot 2017-10-20 at 10.30.43 PM.png

and the objdump

1stvec

2ndvec.png

It looks like we have some partial vectorization happening. This is much better than no vectorization at all, but lets check and see what is happening from the compilers perspective with -ftree-vectorize -fopt-info-vec-all

Whoa……there’s tons of information pertaining to many instructions within the code, way too much to put up here. I will post the most significant piece of information though which shows us what loop was vectorized…

Screen Shot 2017-10-21 at 1.28.46 AM.png

As we can see from this the second for loop was vectorized but not the first. Lets see if we can dig deeper with -ftree-vectorize -fopt-info-vec-missed

Screen Shot 2017-10-21 at 1.30.27 AM.png

It appears the calls to rand() may be to blame for the initial loop not vectorizing properly. There are many iterations with many possible alternating values on each pass of the loop with that function call, which defeats the purpose of vectorization in theory. As you can see, there are also some “misalign” notes among others.

After playing around with some suggestions for leading the compiler towards further vectorizing the code with __builtin_assume_aligned and some more simplification of the loops, nothing would work to further vectorize from the past example. I broke the for loops into three separate statements, with a call to rand() both having their own for loop for loading the arrays, and a final one for calculating the sum. When I looked at the missed vectorization opportunities I learned that the last for loop (sum calculation) vectorized properly while the two loops with the call to rand() did not. This further supports my theory that the random function generator is to blame for not having the vectorized code as well as the statement…

Screen Shot 2017-10-21 at 1.58.27 AM.png

Although the path to vectorization may seem complicated and painstaking, the opportunities for the performance gain are still an admirable reach to have. Personally, the most tedious task was looking up the many different assembly instructions to see what they were doing at the machine level as this is not really human intuitive code. Apart from that, vectorization can lead you down a path of continuous knowledge and learning, and endless rabbit hole it seems…which is something I like getting lost into from time to time. Vectorization will be on my list of concepts to indulge further time into learning and will hopefully lead me to be able to utilize it in my own code soon.

Leave a Comment