Bad optimization for tiny wrappers

Asked by Andreas Fritiofson

Hi!

I noticed some bad optimization in some very small wrappers that basically just reorder the parameters. The following is compiled using

arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -Os -c test.c -o test.o
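
and disassembled with something like

arm-none-eabi-objdump -d test.o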

int f2(int b, int a);

int f(int a, int b)
{
 return f2(b, a);
}

00000000 <f>:
   0: 4603 mov r3, r0
   2: 4608 mov r0, r1
   4: 4619 mov r1, r3
   6: f7ff bffe b.w 0 <f2>

So that's as optimal as it gets. Let's apply some register pressure:

int f4(int b, int a, int c, int d);

int g(int a, int b, int c, int d)
{
 return f4(b, a, c, d);
}

0000000a <g>:
   a: b510 push {r4, lr}
   c: 4604 mov r4, r0
   e: 4608 mov r0, r1
  10: 4621 mov r1, r4
  12: e8bd 4010 ldmia.w sp!, {r4, lr}
  16: f7ff bffe b.w 0 <f4>

First issue... Why does it push and restore lr here? At -O2 it only pushes r4. Not only does it waste two cycles but it also requires the use of a wide load so it's actually larger than it has to be.

In this particular case it's still not optimal even without the lr push. Let's try to be clever:

int h(int a, int b, int c, int d)
{
 a ^= b;
 b ^= a;
 a ^= b;
 return f4(a, b, c, d);
}

0000001a <h>:
  1a: b510 push {r4, lr}
  1c: 4604 mov r4, r0
  1e: 4608 mov r0, r1
  20: 4621 mov r1, r4
  22: e8bd 4010 ldmia.w sp!, {r4, lr}
  26: f7ff bffe b.w 0 <f4>

Wow! The compiler outsmarted us. It's the exact same result as g. Impressive deduction but completely counter-productive since the most trivial translation would be both smaller and faster:

int i(int a, int b, int c, int d)
{
 asm volatile (
  "eors %0, %1 \n"
  "eors %1, %0 \n"
  "eors %0, %1 \n"
 : "=r"(a), "=r"(b));
 return f4(a, b, c, d);
}

0000002a <i>:
  2a: 4048 eors r0, r1
  2c: 4041 eors r1, r0
  2e: 4048 eors r0, r1
  30: f7ff bffe b.w 0 <f4>
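
Side note: strictly speaking the asm above should mark a and b as read-write operands and declare the flags clobber, since eors both reads its operands and sets the condition codes; with "=r" they are output-only and formally arrive with undefined contents. A corrected variant, renamed i_fixed here only to tell the two apart, should still compile to the same three eors plus the tail call:

int i_fixed(int a, int b, int c, int d)
{
 asm volatile (
  "eors %0, %1 \n"
  "eors %1, %0 \n"
  "eors %0, %1 \n"
 : "+r"(a), "+r"(b) : : "cc"); /* read-write operands, condition codes clobbered */
 return f4(a, b, c, d);
}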

Is the compiler not tracking the fact that the condition codes do not need to be preserved in this context, or is there another reason it's trying so hard to avoid the XOR trick?

By the way, these kinds of parameter permutations are common in newlib, because for some reason it inserts the reentrancy pointer as the first argument to the reentrant syscalls. For an example of lousy optimization, check the disassembly of <write> in lib/armv7-m/libc.a and lib/armv7-m/libc_s.a. Neither is optimal, but the size-optimized one is actually *larger*.
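
For reference, the wrapper in question is roughly shaped like the following. This is a simplified sketch from memory, not the actual newlib source (which uses size_t/_ssize_t and the _REENT macro rather than these exact declarations):

struct _reent;                      /* newlib's reentrancy state */
extern struct _reent *_impure_ptr;  /* stands in for newlib's _REENT here */
extern int _write_r(struct _reent *ptr, int fd, const void *buf, unsigned int cnt);

int write(int fd, const void *buf, unsigned int cnt)
{
 /* Inserting the reentrancy pointer as the first argument means fd, buf
    and cnt all have to shift one argument register to the right. */
 return _write_r(_impure_ptr, fd, buf, cnt);
}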

Question information

Status: Solved
For: GNU Arm Embedded Toolchain
Solved by: Joey Ye
Terry Guo (terry.guo) said :
#1

Thanks for reporting. I can reproduce what you said. Let me try to answer two of your questions:

1). First issue... Why does it push and restore lr here? At -O2 it only pushes r4. Not only does it waste two cycles but it also requires the use of a wide load so it's actually larger than it has to be.
Terry>> I agree that we don't need to push/restore LR here at the -Os level; we don't do so at -O2. But this has nothing to do with a wide load: both -O2 and -Os use 32-bit instructions here, so I don't think there is any code-size saving.

2). Is the compiler not tracking the fact that the condition codes do not need to be preserved in this context, or is there another reason it's trying so hard to avoid the XOR trick?
Terry>> I think the 'asm volatile' prevents GCC from considering the condition codes here. The XOR trick is the better choice in this case. I will record it as an optimization opportunity and see if we can implement it.

Andreas Fritiofson (andreas-fritiofson) said :
#2

Thanks for the response!

What I meant about the wide load was that, if I'm not mistaken, "pop {r4}" can be encoded as a 16-bit instruction, while "pop {r4, lr}" requires 32 bits. So skipping lr would, as a side effect, allow the compiler to save two bytes in some cases. On the other hand, it seems that even in the -O2 case without lr, it opts for the wide encoding.

What puzzles me about the XOR trick is that some optimization actually manages to translate the explicit C-level XOR sequence in example h into a series of register movs. Is that a side effect of a higher-level optimization (that happens to backfire in this case), or is it a specific optimization? Can such a specific xor->mov transformation ever be useful? (I don't expect a reply to that, just being curious...)

Joey Ye (jinyun-ye) said :
#3

Issue 1, the unnecessary saving/restoring of lr, is a missed optimization opportunity; I have a local hack that proves it can be avoided.

Best answer: Joey Ye (jinyun-ye) said :
#4

For issue 2, GCC converts the xors into moves in the belief that more optimization opportunities will be exposed. Though it increases register pressure, which unfortunately generates worse code in your case, I don't think it is reasonable to disable such a conversion.

For example:

a = x;
b = y;

...
a ^= b;
b ^= a;
a ^= b;
// At this point a == y and b == x, but with the xors left in place the
// compiler may fail to see that when optimizing the following branch.
// Converting the xors into moves makes it crystal clear that a equals y here.
if (a == y) ...
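
Filling that sketch in as a complete function, purely as an illustration (the name cmp and the return values are arbitrary):

int cmp(int x, int y)
{
 int a = x;
 int b = y;

 a ^= b;
 b ^= a;
 a ^= b;
 /* After the swap a == y and b == x. Once the xors are rewritten as
    moves, it is obvious to the compiler that this branch is always
    taken and it can be folded away. */
 if (a == y)
  return 1;
 return 0;
}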

Andreas Fritiofson (andreas-fritiofson) said :
#5

Thanks Joey Ye, that solved my question.

Andreas Fritiofson (andreas-fritiofson) said :
#6

Thanks for the information and for the work on improving the compiler!

Do you have any thoughts about why newlib's write wrapper compiles so badly at -Os?

push {r4, r5, lr}
mov r3, r2
ldr r2, =__REENT
mov r5, r0
mov r4, r1
ldr r0, [r2, #0]
mov r1, r5
mov r2, r4
ldmia.w sp!, {r4, r5, lr}
b.w _write_r

I get the feeling the compiler is intentionally interleaving the loads with unrelated instructions to suit a processor that can keep executing the intervening instructions while a load completes. I don't think the Cortex-M3/M4 pipelines work that way, do they? Shouldn't those optimizations be disabled/tweaked when compiling for M3/M4?

Getting rid of the unnecessary movs, moving them into the optimal order and reusing the same register for the loads gives us:

mov r3, r2
mov r2, r1
mov r1, r0
ldr r0, =__REENT
ldr r0, [r0, #0]
b.w _write_r

That is quite a lot better (at least smaller), since this entire function is just overhead anyway. I have an issue/suggestion regarding newlib on this matter, but that's another question.

Joey Ye (jinyun-ye) said :
#7