Almost useless knowledge

It is almost useless because nobody is going to raise your salary for it!

Or maybe someone will?!?

Let’s say we ignore all the benefits of the optimizations our benevolent compiler would make on our code, and have a look at what happens when we execute, some hundreds of thousands of times, two very simple loops:

    ptr = data;
    for(index = 0; index < size; ++index){
        *(ptr++) = index;
    }

    ptr = data - 1;
    for(index = 0; index < size; ++index){
        *(++ptr) = index;
    }

It is common knowledge that the second loop should always be preferred; you may find lots of discussion on this topic. In short, the first version requires making a copy of the object being incremented.

Now let’s assume ptr points to data (this is indeed where ptr points), where data is a plain array of longs. Those two loops just copy the current index value to each successive location of data, from data+0 to data+size-1.

Let’s benchmark this code and have a look at how it performs in practice (remember, no compiler optimizations allowed). I will run those two loops ten thousand times with an array of one million longs, and collect the average execution time of each loop in microseconds.

Here we have the results:

Unexpectedly, the first loop is on average 16-18% faster than the second one!


At the beginning of this post I wrote that the first way of looping should be avoided: the post-increment operator is generally considered slower, and it is obviously slower for non-trivial types, where copy operations are expensive.

But this little experiment suggests that this is not always the case, even for trivial types! Note that reversing the order of the two loops doesn’t change the outcome: ++ptr remains slower than ptr++.

What’s going on here??

To investigate this problem further we need to have a look at the only real programming language, THE ASSEMBLER:

First loop:

    movl    $0, -48(%ebp)     //Loop counter set to 0
    movl    $_data, -12(%ebp) //Pointer to the data array
    movl    %eax, -96(%ebp)
    movl    %edx, -92(%ebp)
    jmp L21
    L22:
    // ptr++
    movl    -12(%ebp), %eax   //Get the current address
    leal    4(%eax), %edx     //Calculate the next address
    movl    %edx, -12(%ebp)   //Store the new (next) address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx   //Get the loop counter to edx
    movl    %edx, (%eax)      //Move the loop counter value to the CURRENT address, note -12(%ebp) contains already the next one
    addl    $1, -48(%ebp)     //Increase the counter
    L21:
    cmpl    $999999, -48(%ebp)
    jle     L22

Second loop:

    movl    %eax, -104(%ebp)
    movl    %edx, -100(%ebp)
    movl    $_data-4, -12(%ebp) //Address of data minus one element (4 bytes)
    movl    $0, -48(%ebp)       //Set the loop counter to 0
    jmp L23
    L24:
    // ++ptr
    addl    $4, -12(%ebp)       //Calculate the CURRENT address by adding sizeof(long)==4 bytes
    movl    -12(%ebp), %eax     //Store in eax the address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx     //Store in edx the current loop counter
    movl    %edx, (%eax)        //Move the loop counter value to the current stored address location
    addl    $1, -48(%ebp)       //Increase the loop counter
    L23:
    cmpl    $999999, -48(%ebp)
    jle L24

I provided plenty of comments in the code, in case for some strange reason you’re not familiar with the venerable assembler language.

Before reading further I must underline an important fact: non-optimized assembler code shouldn’t be analyzed for the sake of drawing conclusions on how C/C++ behaves. What we’re going to learn from this experiment is something about the processor, not the language!

A smart compiler will optimize both loops in such a way that no real performance difference should be measurable, but if the compiler for some reason is not that smart, then you have here an explanation of what’s going on under the cover.

The key difference in the code is between these two sequences of instructions:

    // ptr++
    movl    -12(%ebp), %eax 
    leal    4(%eax), %edx   
    movl    %edx, -12(%ebp)
    // ++ptr
    addl    $4, -12(%ebp)   
    movl    -12(%ebp), %eax

The faster version has one additional leal instruction and one addl less. Even looking at this difference, ++ptr should reasonably be the faster one (less code), unless leal is much, much faster than addl, so that two movl plus one leal execute faster than one addl plus one movl.

Sadly, this is not the correct explanation! The reason is not leal being faster than addl. The explanation lies in the way the processor executes the instructions and in what they mean from the memory access point of view.

First of all, once a new memory address is calculated the processor needs to wait about ~6 cycles before being able to access that memory. Secondly, movl and the other assembler instructions are split into multiple micro-instructions, or micro-operations; those uops may potentially be executed simultaneously on different execution ports (when possible), out of order (when possible), or may impact other instructions in the processor pipeline.

The addl from the second loop:

addl    $4, -12(%ebp)

Actually consists of three uops: load + store address + store data. Then we have:

movl    -12(%ebp), %eax

Which is just a load uop; after this instruction we have a ~6 cycle latency before %eax is ready.

In the first loop we have:

movl    -12(%ebp), %eax

A load-only operation + latency (with no dependency on another load uop), then:

leal    4(%eax), %edx

This is executed on the AGU (address generation unit); from our perspective it is a low-latency operation not impacting other uops. Then we have:

movl    %edx, -12(%ebp)

Which is a store-data + store-address operation; at this point the first load into %eax should be available for use.

The cause of the performance difference is the dependencies of the addl instruction: its read-modify-write sequence limits the processor’s ability to execute the next micro-operations with low latency. Remember, the first loop has one load uop less; the addl causes a resource conflict because of this additional load.

So, adding one more load to the first version should cause the two loops to execute with a negligible performance difference, and this is exactly what happens when executing this version of the first loop:

    // ptr++
    movl    -12(%ebp), %eax 
    leal    4(%eax), %edx   
    movl    %edx, -12(%ebp)
    movl    -12(%ebp),%edx //Additional instruction

Note that the edx register is going to be overwritten anyway by the next instruction, so this code behaves exactly like the previous one, but this time we have an additional load, which causes a delay similar to the one we have in the second loop.

To be noted: addl is not somehow slow by itself, but its load uop will delay the execution of the following load uops!

For details, please have a look at this marvelous resource of information: HERE!

Thanks for reading!
