Almost useless knowledge

Is almost useless because nobody is going to raise your salary for it!

Or maybe someone will?!?

Let’s say we ignore all the benefits of the optimizations our benevolent compiler would make on our code, and have a look at what happens when we execute two very simple loops some hundreds of thousands of times:

    //FIRST LOOP:
    for(;index<size;++index){
        *(ptr++) = index;
    }
    //SECOND LOOP:
    ptr = data - 1;
    for(index = 0;index<size;++index){
        *(++ptr) = index;
    }

It’s common knowledge that the second loop should always be preferred, and you may find lots of discussion on this topic around the web; in short, the first version requires making a copy of the object being incremented.
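
To illustrate the copy, here is a minimal sketch of mine (the Cursor type is hypothetical, not from this post): for a non-trivial, iterator-like type the canonical operators look like this, and the postfix form has to copy the whole object just to return the old value:

    //Hypothetical iterator-like type, for illustration only
    struct Cursor {
        long* p;
        Cursor& operator++() {      //prefix: increment in place, no copy
            ++p;
            return *this;
        }
        Cursor operator++(int) {    //postfix: must copy the old state first
            Cursor old(*this);
            ++p;
            return old;             //returned by value: one full copy
        }
    };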

Now let’s assume ptr points to data (this is actually where ptr points), where data is a plain array of longs. Those two loops just copy the current index value into each successive location of data, from data+0 to data+size-1.

Let’s benchmark this code and have a look at how it performs in practice (remember, no compiler optimizations allowed). I will run those two loops ten thousand times with an array of one million longs, and collect the average execution time of the first and second loop in microseconds.
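
If you want to reproduce it, a minimal harness along these lines should do (my own sketch, not the exact code used for the measurements; build it without optimizations, e.g. with -O0):

    #include <chrono>
    #include <cstdio>

    static const int size = 1000000;
    static long data[size];

    int main() {
        const int runs = 10000;
        long long first_us = 0, second_us = 0;
        for (int run = 0; run < runs; ++run) {
            long* ptr = data;
            int index = 0;
            auto t0 = std::chrono::steady_clock::now();
            //FIRST LOOP:
            for (; index < size; ++index) {
                *(ptr++) = index;
            }
            auto t1 = std::chrono::steady_clock::now();
            //SECOND LOOP:
            ptr = data - 1;
            for (index = 0; index < size; ++index) {
                *(++ptr) = index;
            }
            auto t2 = std::chrono::steady_clock::now();
            first_us += std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            second_us += std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        }
        printf("first loop:  %lld us on average\n", first_us / runs);
        printf("second loop: %lld us on average\n", second_us / runs);
        return 0;
    }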

Here we have the results:

[Chart: average execution times of the two loops, in microseconds]
Unexpectedly, the first loop is on average 16-18% faster than the second one!

What?

At the beginning of this post I wrote that the first way of looping should be avoided: the post-increment operator is generally considered slower, and it is obviously slower for non-trivial types, where copy operations are expensive.

But this little experiment shows that this is not always the case, even for trivial types! To be noted: reversing the order of the two loops doesn’t change the outcome, ++ptr is still slower than ptr++.

What’s going on here??

To investigate this problem further we need to have a look at the only real programming language, THE ASSEMBLER:

First loop:

    movl    $0, -48(%ebp)     //Loop counter set to 0
    movl    $_data, -12(%ebp) //Pointer to the data array
    movl    %eax, -96(%ebp)   //Save %eax and %edx to the stack (likely benchmark scaffolding, not part of the loop)
    movl    %edx, -92(%ebp)
    jmp L21
L22:
    // ptr++
    movl    -12(%ebp), %eax   //Get the current address
    leal    4(%eax), %edx     //Calculate the next address
    movl    %edx, -12(%ebp)   //Store the new (next) address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx   //Get the loop counter to edx
    movl    %edx, (%eax)      //Move the loop counter value to the CURRENT address, note -12(%ebp) contains already the next one
    addl    $1, -48(%ebp)     //Increase the counter
L21:
    cmpl    $999999, -48(%ebp) //Compare the counter with size-1
    jle     L22                //Loop again while index <= 999999

Second loop:

    movl    %eax, -104(%ebp)  //Save %eax and %edx to the stack (likely benchmark scaffolding, not part of the loop)
    movl    %edx, -100(%ebp)
    movl    $_data-4, -12(%ebp) //Address of the data - 1 element (4 byte)
    movl    $0, -48(%ebp)       //Set the loop counter to 0
    jmp L23
L24:
    // ++ptr
    addl    $4, -12(%ebp)       //Calculate the CURRENT address by adding one sizeof(int)==4 bytes
    movl    -12(%ebp), %eax     //Store in eax the address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx     //Store in edx the current loop counter
    movl    %edx, (%eax)        //Move the loop counter value to the current stored address location
    addl    $1, -48(%ebp)       //Increase the loop counter
L23:
    cmpl    $999999, -48(%ebp) //Compare the counter with size-1
    jle     L24                //Loop again while index <= 999999

I provided plenty of comments in the code, in case for some strange reason you’re not familiar with the venerable assembler language.

Before reading further I must underline one important fact: non-optimized assembler code shouldn’t be analyzed for the sake of drawing conclusions on how C/C++ behaves. What we’re going to learn from this experiment is something about the processor, not the language!

A smart compiler will optimize both loops in such a way that no real performance difference should be measurable (you can verify this by compiling with optimizations enabled, e.g. g++ -O2 -S, and comparing the emitted loops), but if the compiler for some reason is not that smart, then here you have an explanation of what’s going on under the covers.

The key difference in the code is between these two sequences of instructions:

    // ptr++
    movl    -12(%ebp), %eax 
    leal    4(%eax), %edx   
    movl    %edx, -12(%ebp)
   
    // ++ptr
    addl    $4, -12(%ebp)   
    movl    -12(%ebp), %eax

The faster version has an additional leal instruction and one addl less. Even looking at that difference, ++ptr should reasonably be the faster one (less code), unless leal is much, much faster than addl; that is to say, unless two movl plus one leal execute faster than one addl plus one movl.

Sadly, this is not the correct explanation! The reason is not leal being faster than addl. The explanation lies in the way the processor executes the instructions and in what they mean from the memory-access point of view.

First of all, once a new memory address is calculated the processor needs to wait about 6 cycles before being able to use data at that address. Secondly, movl and every other assembler instruction are split into one or more micro-operations (uops); those uops may potentially be executed simultaneously on different execution units (when possible), out of order (when possible), or may impact other instructions in the processor pipeline.
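
To get a feel for that load-to-use latency, here is a small pointer-chasing sketch of mine (not part of the original experiment): each load’s address depends on the result of the previous load, so consecutive iterations cannot overlap and the average time per iteration approximates the raw load latency:

    #include <chrono>
    #include <cstdio>

    int main() {
        const int count = 1024;
        static void* slots[count];
        //Build a cycle: each slot stores the address of the next one
        for (int i = 0; i < count; ++i) {
            slots[i] = &slots[(i + 1) % count];
        }
        void** p = (void**)slots[0];
        const long iters = 100000000;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i) {
            p = (void**)*p; //the next address comes from the previous load
        }
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
                    / (double)iters;
        //Print p so the compiler cannot discard the dependency chain
        printf("%.2f ns per dependent load (last p = %p)\n", ns, (void*)p);
        return 0;
    }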

The addl from the second loop:

addl    $4, -12(%ebp)

Actually consists of three uops: load + store-address + store-data. Then we have:

movl    -12(%ebp), %eax

Which is just a load uop; after this instruction we have a ~6-cycle latency before %eax is ready.

In the first loop we have:

movl    -12(%ebp), %eax

A load-only operation + latency (with no dependency on another load uop). Then:

leal    4(%eax), %edx

This is executed on the AGU (Address Generation Unit); from our perspective it is a low-latency operation that doesn’t impact other uops. Then we have:

movl    %edx, -12(%ebp)

Which is a store-data + store-address operation; at this point the first load into %eax should be available for use.

The cause of the performance difference lies in the dependencies of the addl instruction: its read-modify-write sequence limits the processor’s ability to execute the following micro-operations with low latency. Remember, the first loop has one load uop less; the addl causes a resource conflict because of that additional load.

So, adding one more load to the first version should cause the two loops to execute with a negligible performance difference, and this is exactly what happens when executing this version of the first loop:

    // ptr++
    movl    -12(%ebp), %eax 
    leal    4(%eax), %edx   
    movl    %edx, -12(%ebp)
    movl    -12(%ebp),%edx //Additional instruction

Note that the edx register is going to be overwritten anyway by the next instruction, so this code behaves exactly like the previous one, but this time we have an additional load, which causes a delay similar to the one we have in the second loop.

To be noted: addl is not somehow slow by itself, but its load uop will delay the execution of the following load uops!

For details, please have a look at this marvelous resource of information: HERE!

Thanks for reading!
