armcc vs gcc-arm-none-eabi performance

Asked by randy

I wanted to see the performance difference between armcc and gcc, so I took a benchmark source project to test with.
But the result was a big surprise: for the same code, the gcc build runs about 2.9 times slower than the armcc build.
I don't know whether there is something wrong with my benchmark C code or my compile options, so I have listed some information here. Any suggestion is appreciated:
compiler          environment        iterate time:1   iterate time:10   state test time   matrix test time   bin size
armcc (O3 Otime)  cpu,bus:80M FPGA   0xeb1            0x92E2            0x1a8             0xbb               14.2k
gcc (O3 Otime)                       0x2a9f           0x1AA2C0          0x3ab             0x3c8              17.1k
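
The 2.9x figure matches the "iterate time:1" column. As a rough check, the sketch below computes the gcc/armcc ratio for one pair of readings. It assumes the table values are hexadecimal counter readings in the same unit for both compilers; the `slowdown` helper name is made up for illustration.

```c
/* Hypothetical helper: how many times larger the gcc reading is than the
 * armcc reading for the same test (assumes both use the same time unit). */
double slowdown(unsigned long gcc_reading, unsigned long armcc_reading)
{
    return (double)gcc_reading / (double)armcc_reading;
}
```

For the "iterate time:1" column, `slowdown(0x2a9f, 0xeb1)` is 10911 / 3761, which is about 2.9. The other columns can be checked the same way.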

The compile options I use for gcc are: -g -flto -ffunction-sections -fdata-sections -fno-inline --specs=rdimon.specs --specs=nosys.specs --specs=rdimon.specs -mthumb -mcpu=cortex-m3

Update 2016/07/05:

I found an unnecessary uxth when gcc converts a 32-bit value to 16 bits:
    160e: 2001       movs    r0, #1
    1610: f001 fad4  bl      2bbc <get_seed_32>
    1614: 4603       mov     r3, r0
    1616: b29b       uxth    r3, r3
    1618: f8a7 37d4  strh.w  r3, [r7, #2004] ; 0x7d4

while the armcc output looks like:
    0x0000066c: 2001      .     MOVS  r0,#1
    0x0000066e: f000fdd7  ....  BL    get_seed_32 ; 0x1220
    0x00000672: f8ad07d8  ....  STRH  r0,[sp,#0x7d8]

armcc saves 2 instructions.

So, are there any options I can use to avoid this?

Update 2016/07/06:

More unnecessary instructions with gcc:
    18d6: f7ff f941  bl    b5c <glbcnt_read>
    18da: 4603       mov   r3, r0
    18dc: 460c       mov   r4, r1
    18de: 4a57       ldr   r2, [pc, #348] ; (1a3c <main+0x5d8>)
    18e0: e9c2 3400  strd  r3, r4, [r2]

while the armcc output looks like the following:
    0x000007fa: f7fffcdd  ....  BL    glbcnt_read ; 0x1b8
    0x000007fe: 465d      ]F    MOV   r5,r11
    0x00000800: e9cb0102  ....  STRD  r0,r1,[r11,#8]

It seems that moving the values from r0 to r3 and from r1 to r4 is unnecessary. Is this a bug?

Question information

Language: English
Status: Solved
For: GNU Arm Embedded Toolchain
Assignee: No assignee
Solved by: randy
Andre Vieira (andre-simoesdiasvieira) said :
#1

Hi Randy,

As far as I know armcc will unroll loops with -O3, whereas gcc does not. Try passing -funroll-all-loops to gcc, that might make it a fairer fight.

Out of curiosity, what versions of armcc and gcc are you using?

Cheers,
Andre
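
To illustrate what unrolling changes, here is a hand-written sketch in C. `sum_rolled` and `sum_unrolled4` are made-up names, not CoreMark functions; the second one shows roughly the shape a compiler might produce when it unrolls a loop by four. The transformation actually chosen by armcc, or by gcc with -funroll-all-loops, may well differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one add plus one compare-and-branch per element. */
uint32_t sum_rolled(const uint16_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Hand-unrolled by four (illustrative only): the loop overhead is paid
 * once per four elements instead of once per element. */
uint32_t sum_unrolled4(const uint16_t *data, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }
    for (; i < n; i++)   /* remainder when n is not a multiple of 4 */
        sum += data[i];
    return sum;
}
```

Both functions return the same result; the unrolled form trades code size for fewer branches, which is one reason bin size and run time can move in opposite directions.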

Andre Vieira (andre-simoesdiasvieira) said :
#2

Oh and an unrelated curiosity, why are you using both nosys.specs and rdimon.specs in the command line?

randy (qiuxiaoyu) said :
#3

Hi Andre,
Thanks for your reply!

---->>>> As far as I know armcc will unroll loops with -O3, whereas gcc does not. Try passing -funroll-all-loops to gcc, that might make it a fairer fight.
I tried with O2 and the result is similar, but I will try with -funroll-all-loops.

---->>>> what versions of armcc and gcc are you using?
I am using armcc 4.1 and gcc 4.9 (gcc-arm-none-eabi-4_9-2015q3).

---->>>> why are you using both nosys.specs and rdimon.specs in the command line?
Sorry, I did not know there was any conflict between nosys.specs and rdimon.specs; I was just trying one possible way to get the build to pass. Can you explain a bit about the two options, or point me to some documentation about them?

Thanks a lot again!

randy (qiuxiaoyu) said :
#4

I found an unnecessary uxth when gcc converts a 32-bit value to 16 bits:
    160e: 2001       movs    r0, #1
    1610: f001 fad4  bl      2bbc <get_seed_32>
    1614: 4603       mov     r3, r0
    1616: b29b       uxth    r3, r3
    1618: f8a7 37d4  strh.w  r3, [r7, #2004] ; 0x7d4

while the armcc output looks like:
    0x0000066c: 2001      .     MOVS  r0,#1
    0x0000066e: f000fdd7  ....  BL    get_seed_32 ; 0x1220
    0x00000672: f8ad07d8  ....  STRH  r0,[sp,#0x7d8]

armcc saves 2 instructions.

So, are there any options I can use to avoid this?

Thomas Preud'homme (thomas-preudhomme) said :
#5

A similar bug was fixed in GCC 6, I believe (so you would benefit from the fix once this toolchain moves to GCC 6), and we are currently working on another case of useless uxth. My recommendation would be to try again when the next major release happens and see if it is fixed. You can also try to reduce the code to a minimal testcase, and we can then check whether it compiles fine on the latest GCC.

Best regards.
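
A reduced testcase for the uxth case might look like the sketch below. It assumes `get_seed_32` returns a 32-bit integer that the caller stores into a 16-bit variable, which is what the disassembly suggests; the stub body, the unsigned 16-bit destination type, and the `seed_store` name are invented here so that the file builds on its own. `noinline` is a GCC function attribute, used to keep the call visible in the generated code.

```c
#include <stdint.h>

/* Stand-in for the benchmark's get_seed_32(); the real body is not shown
 * in this thread. noinline keeps the call in the generated code. */
__attribute__((noinline)) int32_t get_seed_32(int i)
{
    return 0x12340000 + i;
}

/* Hypothetical reduction: truncate the 32-bit result to 16 bits and store it.
 * A halfword store (strh) writes only the low 16 bits of the register, so an
 * explicit uxth before it should not be needed. */
void seed_store(volatile uint16_t *dst)
{
    *dst = (uint16_t)get_seed_32(1);
}
```

Compiling this with something like `arm-none-eabi-gcc -O2 -mthumb -mcpu=cortex-m3 -c` and inspecting the object with `arm-none-eabi-objdump -d` would show whether the extra uxth still appears in `seed_store`.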

randy (qiuxiaoyu) said :
#6

Hi Thomas, thanks for your reply!
I will try it again when gcc 6 is released.
BTW, I'd like to give you some more info about unnecessary instructions compared to armcc:
    18d6: f7ff f941  bl    b5c <glbcnt_read>
    18da: 4603       mov   r3, r0
    18dc: 460c       mov   r4, r1
    18de: 4a57       ldr   r2, [pc, #348] ; (1a3c <main+0x5d8>)
    18e0: e9c2 3400  strd  r3, r4, [r2]

while the armcc output looks like the following:
    0x000007fa: f7fffcdd  ....  BL    glbcnt_read ; 0x1b8
    0x000007fe: 465d      ]F    MOV   r5,r11
    0x00000800: e9cb0102  ....  STRD  r0,r1,[r11,#8]

It seems that moving the values from r0 to r3 and from r1 to r4 is unnecessary. Is this a bug?

Thomas Preud'homme (thomas-preudhomme) said :
#7

Hi Randy,

I'd be very curious to see the source code that generated this. It's quite surprising to see such suboptimal code at -O3.

Best regards.

randy (qiuxiaoyu) said :
#8

Hi Thomas, thanks for your reply.
The whole project is the CPU benchmark located at http://www.eembc.org/coremark/download.php,
and the corresponding code snippet is as follows:
    t1 = glbcnt_read();
    __dsb(0xF);
    t12 = t1 - t0;

    iterate(&results[0]);

    t2 = glbcnt_read();
    __dsb(0xF);
The disassembly output with gcc looks like:
    18dc: f7ff f93c  bl     b58 <glbcnt_read>
    18e0: 4602       mov    r2, r0
    18e2: 460b       mov    r3, r1
    18e4: 4959       ldr    r1, [pc, #356] ; (1a4c <main+0x5ec>)
    18e6: e9c1 2300  strd   r2, r3, [r1]
    18ea: f3bf 8f4f  dsb    sy
    18ee: 4b57       ldr    r3, [pc, #348] ; (1a4c <main+0x5ec>)
    18f0: e9d3 0100  ldrd   r0, r1, [r3]
    18f4: 4b56       ldr    r3, [pc, #344] ; (1a50 <main+0x5f0>)
    18f6: e9d3 2300  ldrd   r2, r3, [r3]
    18fa: 1a82       subs   r2, r0, r2
    18fc: eb61 0303  sbc.w  r3, r1, r3
    1900: 4954       ldr    r1, [pc, #336] ; (1a54 <main+0x5f4>)
    1902: e9c1 2300  strd   r2, r3, [r1]
    1906: f207 73d4  addw   r3, r7, #2004 ; 0x7d4
    190a: 4618       mov    r0, r3
    190c: f7ff fd5a  bl     13c4 <iterate>
    1910: f7ff f922  bl     b58 <glbcnt_read>
    1914: 4602       mov    r2, r0
    1916: 460b       mov    r3, r1
    1918: 494f       ldr    r1, [pc, #316] ; (1a58 <main+0x5f8>)
    191a: e9c1 2300  strd   r2, r3, [r1]
    191e: f3bf 8f4f  dsb    sy

randy (qiuxiaoyu) said :
#9

Hi Thomas, I forgot to mention one thing: glbcnt_read() returns a long long value (64 bits).
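
With that return type, a self-contained reduction of the timing snippet could look like the sketch below. The `glbcnt_read` body here is only a stand-in (the real one presumably reads a hardware counter and is not shown in this thread), the timestamps are assumed to be globals because the disassembly loads their addresses from the literal pool, and `take_timestamps` is a made-up name. The `__dsb` barrier is left out to keep the file portable.

```c
/* Stand-in for glbcnt_read(): returns a fake, monotonically increasing
 * 64-bit count. Under the Arm procedure call standard the result comes
 * back in r0 (low word) and r1 (high word). */
__attribute__((noinline)) long long glbcnt_read(void)
{
    static long long fake_counter = 0;
    fake_counter += 1000;
    return fake_counter;
}

long long t0, t1, t12;   /* assumed to be globals, as in the benchmark main() */

/* Hypothetical reduction of the timing code: two reads and a 64-bit subtract. */
void take_timestamps(void)
{
    t0 = glbcnt_read();
    t1 = glbcnt_read();
    t12 = t1 - t0;
}
```

Since the result already sits in r0/r1, an optimized build would be expected to store it with a single strd from those registers, as armcc does.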

randy (qiuxiaoyu) said :
#10

Does anyone have any suggestions?

randy (qiuxiaoyu) said :
#11

I found the answer:
I had used -o2 instead of -O2, and the two are not the same at all.
Please see: https://answers.launchpad.net/gcc-arm-embedded/+question/306365
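
For context, GCC parses a lower-case -o2 as the -o option with the argument 2, that is, "write the output to a file named 2", so no optimization level is set and the build falls back to the default -O0. One way to catch this kind of slip is to check `__OPTIMIZE__`, a macro GCC predefines whenever optimization is enabled. The sketch below is only a suggestion, and the function name is made up.

```c
/* Returns 1 when this translation unit was compiled with optimization
 * enabled (GCC defines __OPTIMIZE__ at -O1 and above), 0 otherwise. */
int built_with_optimization(void)
{
#ifdef __OPTIMIZE__
    return 1;
#else
    return 0;
#endif
}
```

Printing this value at the start of a benchmark run, or replacing the `#else` branch with an `#error`, makes an accidentally unoptimized build obvious before any timings are compared.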