Bad optimization for tiny wrappers
Hi!
I noticed poor code generation for some very small wrappers that do nothing but reorder their parameters. Everything below was compiled with:
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -Os -c test.c -o test.o
int f2(int b, int a);
int f(int a, int b)
{
    return f2(b, a);
}
00000000 <f>:
0: 4603 mov r3, r0
2: 4608 mov r0, r1
4: 4619 mov r1, r3
6: f7ff bffe b.w 0 <f2>
So that's as optimal as it gets. Let's apply some register pressure:
int f4(int b, int a, int c, int d);
int g(int a, int b, int c, int d)
{
    return f4(b, a, c, d);
}
0000000a <g>:
a: b510 push {r4, lr}
c: 4604 mov r4, r0
e: 4608 mov r0, r1
10: 4621 mov r1, r4
12: e8bd 4010 ldmia.w sp!, {r4, lr}
16: f7ff bffe b.w 0 <f4>
First issue: why does it push and restore lr here? At -O2 it only pushes r4. Saving lr not only wastes two cycles, it also forces the 32-bit ldmia.w encoding for the restore (a 16-bit pop cannot restore lr), so the -Os code is actually larger than it has to be.
In this particular case it's still not optimal without the lr push. Let's try to be clever:
int h(int a, int b, int c, int d)
{
    a ^= b;
    b ^= a;
    a ^= b;
    return f4(a, b, c, d);
}
0000001a <h>:
1a: b510 push {r4, lr}
1c: 4604 mov r4, r0
1e: 4608 mov r0, r1
20: 4621 mov r1, r4
22: e8bd 4010 ldmia.w sp!, {r4, lr}
26: f7ff bffe b.w 0 <f4>
Wow! The compiler outsmarted us. It's the exact same result as g. Impressive deduction but completely counter-productive since the most trivial translation would be both smaller and faster:
int i(int a, int b, int c, int d)
{
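    /* a and b are both read and written, so they are "+r" operands;
       eors updates the flags, hence the "cc" clobber. */
    asm volatile (
        "eors %0, %1 \n"
        "eors %1, %0 \n"
        "eors %0, %1 \n"
        : "+r"(a), "+r"(b)
        :
        : "cc");
    return f4(a, b, c, d);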
}
0000002a <i>:
2a: 4048 eors r0, r1
2c: 4041 eors r1, r0
2e: 4048 eors r0, r1
30: f7ff bffe b.w 0 <f4>
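As a side note on why g and h can end up with identical code: the three XOR assignments are simply a swap of the two values, and a swap through a temporary is exactly what the r4 sequence does. Here is a minimal host-side sketch of that equivalence. The function names (xor_swap, tmp_swap, swaps_agree) are mine and purely illustrative; this says nothing about how GCC actually performs the deduction.

```c
#include <assert.h>

/* Swap using the three-XOR sequence from h().
   Assumes a and b point to distinct objects; if they alias,
   the first XOR zeroes the value. */
void xor_swap(int *a, int *b)
{
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

/* Swap through a temporary, mirroring the mov r4 / mov / mov
   sequence in the disassembly of g() and h(). */
void tmp_swap(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Returns 1 if both swaps produce the same result for (a, b). */
int swaps_agree(int a, int b)
{
    int xa = a, xb = b;
    int ta = a, tb = b;

    xor_swap(&xa, &xb);
    tmp_swap(&ta, &tb);

    /* The XOR version must really have exchanged the values. */
    assert(xa == b && xb == a);
    return xa == ta && xb == tb;
}
```

Calling swaps_agree(3, 5) walks through 3^5 = 6, 5^6 = 3, 6^3 = 5, so both variants end with the values exchanged.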
Is the compiler not tracking the fact that the condition codes do not need to be preserved in this context, or is there another reason it's trying so hard to avoid the XOR trick?
By the way, these types of parameter permutations are common in newlib, because for some reason it inserts the reentrancy pointer as the first argument to the reentrant syscalls. For an example of lousy optimization, check the disassembly of <write> in lib/armv7-m/libc.a and lib/armv7-
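To illustrate the kind of wrapper I mean, here is a rough sketch of the pattern. All names in it (reent_sketch, write_r_sketch, write_sketch, global_reent) are hypothetical stand-ins, not newlib's actual declarations, and the reentrant function is only a stub. The point is that inserting a context pointer in front shifts every other argument by one register, so on ARM the wrapper has to move r2 to r3, r1 to r2 and r0 to r1 before it can load the pointer into r0.

```c
#include <stddef.h>

/* Hypothetical stand-in for a reentrancy structure. */
struct reent_sketch {
    int last_error;
};

/* Hypothetical global context that the plain wrapper passes along. */
struct reent_sketch global_reent;

/* Hypothetical reentrant call: context pointer first, then the
   usual arguments. Stub body: pretends everything was written. */
long write_r_sketch(struct reent_sketch *r, int fd,
                    const void *buf, size_t len)
{
    (void)fd;
    (void)buf;
    r->last_error = 0;
    return (long)len;
}

/* The tiny wrapper: same arguments, each shifted by one position
   to make room for the context pointer. */
long write_sketch(int fd, const void *buf, size_t len)
{
    return write_r_sketch(&global_reent, fd, buf, len);
}
```

Compiling something like this with the command line at the top of the post and disassembling write_sketch should show how well the argument shuffle is handled.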
Question information
- Language: English
- Status: Solved
- Assignee: No assignee
- Solved by: Joey Ye