Poor C code generation

Asked by Gary Fuehrer

I'm grateful for the efforts of everyone involved. I know that I couldn't do any of it myself. And now, my rant.

I've been using the toolchains from this site for the last 8 months, so 4_7-2013q1 thru 4_8-2013q4. I'm seeing some minor variations in the quality of the generated machine code, but none of the strides toward what I would call an okay result. Below I've included one example for discussion fodder.

Am I missing something, like a command line option (see example below)?

If not then why is this not progressing? I can think of plenty of plausible good reasons, like focus of limited effort sunk into things like C++11 feature completeness, that no one is noticing what is happening anymore at the lowest level, or that everyone thinks the generated code is just fine the way it is (so shut up). But I'd like to hear from non-speculative sources.

Here is an example that I would argue is so fundamental that any C compiler should be able to produce the 'best' possible assembly (in this -Os case, meaning 'smallest'):

crap0.c:
#include <stdint.h>

extern uint32_t __bss_start[];
extern uint32_t __data_start[];

void Reset_Handler(void)
{
 /* Clear .bss section (initialize with zeros) */
 for (uint32_t* bss_ptr = __bss_start; bss_ptr != __data_start; ++bss_ptr) {
  *bss_ptr = 0;
 }
}
>[...]\arm-none-eabi-gcc -c -std=c11 -mcpu=cortex-m4 -mthumb -Os -Wall crap0.c -o crap0.o

The latest version (4.8.3) yields the following:
0800018c <Reset_Handler>:
 800018c: 4a06 ldr r2, [pc, #24] ; (80001a8 <Reset_Handler+0x1c>)
 800018e: 4907 ldr r1, [pc, #28] ; (80001ac <Reset_Handler+0x20>)
 8000190: 1a89 subs r1, r1, r2
 8000192: f021 0103 bic.w r1, r1, #3
 8000196: 2300 movs r3, #0
 8000198: 428b cmp r3, r1
 800019a: d003 beq.n 80001a4 <Reset_Handler+0x18>
 800019c: 2000 movs r0, #0
 800019e: 50d0 str r0, [r2, r3]
 80001a0: 3304 adds r3, #4
 80001a2: e7f9 b.n 8000198 <Reset_Handler+0xc>
 80001a4: 4770 bx lr
 80001a6: bf00 nop
 80001a8: 20000000 .word 0x20000000
 80001ac: 20000030 .word 0x20000030
 80001b0: [...]

First off, the compiler is now (vs 4.7.x) including an additional wide instruction (bic.w) because, why, it no longer trusts that 32bit ints are 4-byte aligned? Yuck! Next I see that the compiler now recognizes that it can work with an integer index, but then it squanders the opportunities opened up in so doing -- instead of counting toward zero it counts away from it necessitating both an additional register and a compare instruction in the inner loop. Finally, a forth register is utilized for the zero being stored so as to prevent a reload of the array start address at each iteration, but then that load of zero is left within the loop!

Unless the index is needed somewhere, the result should have this form (note the inner loop is 50% shorter):
0800018c <Reset_Handler>:
 800018c: ldr r1, [pc, #16] ; (80001a0)
 800018e: ldr r2, [pc, #20] ; (80001a4)
 8000190: subs r1, r1, r2
 8000192: beq.n 800019c
 8000194: movs r0, #0
 8000196: str r0, [r2, r1]
 8000198: adds r1, #4
 800019a: bne.n 8000196
 800019c: bx lr
 800019e: nop
 80001a0: 20000000 .word 0x20000000
 80001a4: 20000030 .word 0x20000030
 80001a8: [...]

In my experience, it's a fools errand trying to write in C what one hopes that the compiler will generate -- the frontend optimizations shred any optimizing ideas. But in this rare case the result is close:
#include <stdint.h>

crap1.c:
#include <stdint.h>

extern uint32_t __bss_start[];
extern uint32_t __data_start[];

void Reset_Handler(void)
{
 /* Clear .bss section (initialize with zeros) */
 uint8_t* bss_end = (uint8_t*)__data_start;
 uint32_t index = (uint8_t const*)__bss_start - bss_end;
 if (index != 0)
 {
  uint32_t zero = 0;
  do {
   // BADNESS: cast changes alignment requirements
   *(uint32_t*)(&bss_end[index]) = zero;
   index += 4;
  } while (index != 0);
 }
}

0800018c <Reset_Handler>:
 800018c: 4b04 ldr r3, [pc, #16] ; (80001a0 <Reset_Handler+0x14>)
 800018e: 4a05 ldr r2, [pc, #20] ; (80001a4 <Reset_Handler+0x18>)
 8000190: 1a98 subs r0, r3, r2
 8000192: d003 beq.n 800019c <Reset_Handler+0x10>
 8000194: 2100 movs r1, #0
 8000196: 5011 str r1, [r2, r0]
 8000198: 3004 adds r0, #4
 800019a: e7fa b.n 8000192 <Reset_Handler+0x6>
 800019c: 4770 bx lr
 800019e: bf00 nop
 80001a0: 20000000 .word 0x20000000
 80001a4: 20000030 .word 0x20000030
 80001a8: [...]

Nevertheless, notice how the zero loading was pushed into the loop and the how the loop was enlarged even further to stupidly share the conditional branch instruction at 0x8000192, all for a net anti-optimization of the original C code.

Does anyone working on gcc care about this kind of stuff? If not, no sweat, I'll not bother anymore.

-Gary

Question information

Language:
English Edit question
Status:
Solved
For:
GNU Arm Embedded Toolchain Edit question
Assignee:
No assignee Edit question
Solved by:
Gary Fuehrer
Solved:
Last query:
Last reply:
Revision history for this message
Terry Guo (terry.guo) said :
#1

Really appreciate you to try our tool chain and share thoughts. We need more people like you and we do care this kind of stuff. I will look into this and get back to you later.

Revision history for this message
Joey Ye (jinyun-ye) said :
#2

Gary,

First of all thanks for reporting this issue with so many details. As owner of this project we do care not only stability but also code size and performance. As to the issues you raised I'll try my best to cover all.

There are a number of issues in your example as well as in compiler.

Issue 1 in the example is the condition bss_ptr != data_start, which isn't a typical robust loop ending condition. By changing it to bss_ptr < data_start, 4.8 gets a much better result as it can analyze loop boundary better.

Issue 2 in the example is -Os option you used, which only optimize for code size. Then you shouldn't expect GCC to do enough work to hoisting the loop invarant (movs r3, #0) out of loop, as it doesn't help reducing code size at all but with a risk of raising register pressure to increase code size.

Issues in compiler is that 4.8 doesn't analyze the loop structure well enough to optimize for size and performance. 4.9 is doing much better after extensive loop optimizations Bin Cheng recently applied. BTW with 4.9 I get:

   0: 4b04 ldr r3, [pc, #16] ; (14 <Reset_Handler+0x14>)
   2: 4a05 ldr r2, [pc, #20] ; (18 <Reset_Handler+0x18>)
   4: 2100 movs r1, #0
   6: 4293 cmp r3, r2
   8: d002 beq.n 10 <Reset_Handler+0x10>
   a: f843 1b04 str.w r1, [r3], #4
   e: e7fa b.n 6 <Reset_Handler+0x6>
  10: 4770 bx lr
Even better than your hand written version.

As to answer your questions:
1. Additional bic: the reason of difference between 4.8 and 4.7 is under investigation. A note should be sent out later to explain it.

2. Not counting to zero: compiler ususally don't do such a transformation that may result in side effect. Imagine you have a piece of system code that first half of a buffer is writable and second half isn't. The expected behavior will be an exception raised after first half of buffer filled. If compiler reverse it the behavior will be reversed, which won't be what programmer expected. As you can see in the 4.9 that even without reversing compiler can still generated very efficient code

3. Reloading of value: refer to issue 2 in the example

My suggestion:
1. Change != to <, and you will immediately get better code with 4.8
2. What for 4.9 that can generate best code even if you don't make change above

Thanks,
Joey

Revision history for this message
chengbin (can-finner) said :
#3

One more point.

The latest version (4.8.3) yields the following:
0800018c <Reset_Handler>:
 800018c: 4a06 ldr r2, [pc, #24] ; (80001a8 <Reset_Handler+0x1c>)
 800018e: 4907 ldr r1, [pc, #28] ; (80001ac <Reset_Handler+0x20>)
 8000190: 1a89 subs r1, r1, r2
 8000192: f021 0103 bic.w r1, r1, #3
 8000196: 2300 movs r3, #0
 8000198: 428b cmp r3, r1
 800019a: d003 beq.n 80001a4 <Reset_Handler+0x18>
 800019c: 2000 movs r0, #0
 800019e: 50d0 str r0, [r2, r3]
 80001a0: 3304 adds r3, #4
 80001a2: e7f9 b.n 8000198 <Reset_Handler+0xc>
 80001a4: 4770 bx lr
 80001a6: bf00 nop
 80001a8: 20000000 .word 0x20000000
 80001ac: 20000030 .word 0x20000030
 80001b0: [...]

The code looks really bad. The biggest problem is GCC IVOPT chooses wrong induction variable candidate with arm's new cost model. There are several IVOPT patches applied in trunk can help GCC choose right candidate.

Also, the code generated by trunk can still be improved into:

 ldr r3, [pc, #16] ; (14 <Reset_Handler+0x14>)
 ldr r2, [pc, #20] ; (18 <Reset_Handler+0x18>)
 movs r1, #0
 b .L2
.L1
 str.w r1, [r3], #4
.L2
 cmp r3, r2
 bne .L1
 bx lr

Thanks,
bin

Revision history for this message
Gary Fuehrer (gfuehrer) said :
#4

Thank you Terry, Joey, and Bin for your responses. I'm relieved that this stuff is important to some, yay!

I tried the voodo with the "<" instead of "!=" and yes, the shorter-over-all but slower-running 4.7.x result came back.

I am greatly cheered that in the works is a gcc version that might do:

 for (uint32_t* p = __16; p != __20; ++p) {
 *p = 0;
 }
==>
 ldr r3, [pc, #16] ; (14 <Reset_Handler+0x14>)
 ldr r2, [pc, #20] ; (18 <Reset_Handler+0x18>)
 movs r1, #0
 b .L2
.L1
 str.w r1, [r3], #4
.L2
 cmp r3, r2
 bne .L1
 bx lr

This form has the advantage of pleasingly being a direct reflection of the C -- pointer increments from beginning to end -- while being compact over all and speedy in operation. Sweet. I agree that it is preferred over the assembly that I had posted.

However, just to be pedantic, mine does not, as Joey suggested, reverse the order of addresses written. Both are 18 instruction bytes long (plus 2 for alignment and 8 for ldr pool). But mine executes the load of zero only if necessary. When not, it is done 2 instructions sooner. And it has fewer instruction bytes in the loop. Also, don't forget that it pleasingly is a direct reflection of the crap1.c that I posted.

- Gary