GNU Arm Embedded Toolchain

registers pushed on stack when not needed

Asked by InuY4sha on 2014-09-01

Hi all,
I'm compiling (g++) the following code

foo_t * bar_t::zoo(mydata_t *data){
    foo_t * foo;
    if(!subroutine())
        foo=foo1;
    else
        foo=foo2;

    return foo->yasubroutine(data);
}

with these options: -c -W -Wall -Os -g -mthumb -march=armv7-m -mcpu=cortex-m3 -nostartfiles -fno-rtti -fno-exceptions -ffreestanding -ffunction-sections -fdata-sections --specs=nano.specs -Os -g

the assembly outcome is this:

   0: b538 push {r3, r4, r5, lr}
   2: 4604 mov r4, r0
   4: f101 0008 add.w r0, r1, #8
   8: 460d mov r5, r1
   a: f7ff fffe bl 0 <endpoint_t::is_null()>
   e: 68a2 ldr r2, [r4, #8]
  10: 6863 ldr r3, [r4, #4]
  12: 4629 mov r1, r5
  14: 2800 cmp r0, #0
  16: bf08 it eq
  18: 4613 moveq r3, r2
  1a: 681a ldr r2, [r3, #0]
  1c: 4618 mov r0, r3
  1e: 6812 ldr r2, [r2, #0]
  20: 4790 blx r2
  22: bd38 pop {r3, r4, r5, pc}

When "foo->yasubroutine" is invoked, the subroutine arguments are in R0 and R1. I was wondering why the last instruction (pop {r3, r4, r5, pc}) cannot be executed BEFORE the "BLX R2" invocation.
Assuming the called subroutine will save R3-R5 anyway (just like this one does), couldn't the code instead be something like this?

22: bd38 pop {r3, r4, r5, pc}
20: 4790 bx r2

The result would be saving space in the task stack. Is there any optimization to do this?
Regards,
R

Question information

Language: English

Status: Answered

For: GNU Arm Embedded Toolchain

Assignee: No assignee

Last query: 2014-09-02

Last reply: 2014-09-03

TM (tm1234) said on 2014-09-01:

Wouldn't

pop {r3, r4, r5, pc}
bx r2

return from the current function (by virtue of the fact that the stacked lr gets popped into the pc) and so the bx r2 would never get executed?

Also - I'm not sure if all functions necessarily save/restore r3-r5 or if what they save/restore depends on their automatic/stack variables? You might want to check the relevant [E]ABI.

Hope this helps

Zhenqiang Chen (zhenqiang-chen) said on 2014-09-02:

r0-r3 are caller-saved registers, so a function does not need to save/restore them. In this case, r3 is pushed and popped only to keep the stack 8-byte aligned.
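
The alignment point can be checked with a little arithmetic. The sketch below is only an illustration, not compiler output: the helper names `push_bytes` and `is_8_byte_aligned` are made up here, and it assumes each pushed core register occupies 4 bytes and that the stack was 8-byte aligned on entry.

```cpp
#include <cstddef>

// Hypothetical helpers, for illustration only.
// Each 32-bit core register takes 4 bytes on the stack.
constexpr std::size_t push_bytes(std::size_t register_count) {
    return register_count * 4;
}

// True when a frame of this size keeps an aligned stack 8-byte aligned.
constexpr bool is_8_byte_aligned(std::size_t bytes) {
    return bytes % 8 == 0;
}

// push {r4, r5, lr}: 12 bytes, which would leave the stack misaligned.
static_assert(!is_8_byte_aligned(push_bytes(3)), "3 registers = 12 bytes");

// push {r3, r4, r5, lr}: 16 bytes, alignment preserved; r3 acts as padding.
static_assert(is_8_byte_aligned(push_bytes(4)), "4 registers = 16 bytes");
```

So the extra 4 bytes for r3 look like the price of alignment rather than a wasted save of a live value.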

InuY4sha (riccardomanfrin) said on 2014-09-02:

@TM you're absolutely right,
the correct sequence I would have expected is

pop {r3, r4, r5}
bx r2
pop pc

Wouldn't this save 12 bytes of stacked registers?

InuY4sha (riccardomanfrin) said on 2014-09-02:

Correction:
pop {r3, r4, r5}
--- bx r2
+++ blx r2
pop pc

InuY4sha (riccardomanfrin) said on 2014-09-02:

@TM
I'm compiling with the gcc options described above, using an arm-none-eabi- toolchain (the one provided on this website).
Could you point me to the correct EABI document to read?
I really need to understand this, as I cannot afford to lose 16 bytes of stack on every nested call by default. My code *can* end up using recursion, in which case I need to be extremely efficient with stack consumption.
Regards
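
To put a number on the concern above, here is a rough worst-case estimate. It is a sketch under assumptions: the 16-byte frame is the push {r3, r4, r5, lr} from the listing in the question, while the depth of 32 and the names `kFrameBytes`, `kMaxDepth` and `stack_needed` are hypothetical values chosen for illustration.

```cpp
#include <cstddef>

// Assumed frame size: push {r3, r4, r5, lr} = 4 registers * 4 bytes.
constexpr std::size_t kFrameBytes = 16;

// Hypothetical recursion bound; the real limit depends on the application.
constexpr std::size_t kMaxDepth = 32;

// Worst-case stack used by the nested frames alone.
constexpr std::size_t stack_needed(std::size_t depth, std::size_t frame_bytes) {
    return depth * frame_bytes;
}

static_assert(stack_needed(kMaxDepth, kFrameBytes) == 512,
              "32 nested calls at 16 bytes each");
```

For real figures rather than estimates, GCC's -fstack-usage option reports the static stack usage of each function.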

Thomas Preud'homme (thomas-preudhomme) said on 2014-09-02:

As Zhenqiang explained, registers r0-r3 are caller-saved, which in your case means that if you don't save them before the blx, they might be corrupted by the called routine. This is explained in section 5.1.1, "Core registers", of the [AAPCS].

[AAPCS] http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042e/IHI0042E_aapcs.pdf

Best regards.
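
As a quick reference, the register roles described in that section can be summarised in code. This is an illustrative sketch only: `RegRole` and `aapcs_role` are hypothetical names, registers are identified by their number (0 to 15), and the function needs C++14 or later because it is a multi-statement constexpr.

```cpp
// Hypothetical summary of the AAPCS core-register roles.
enum class RegRole { CallerSaved, CalleeSaved, PlatformDefined, Special };

constexpr RegRole aapcs_role(int reg) {
    // r0-r3: arguments, results and scratch; a callee may overwrite them.
    if (reg >= 0 && reg <= 3) return RegRole::CallerSaved;
    // r9: its role is defined by the platform.
    if (reg == 9) return RegRole::PlatformDefined;
    // r4-r8, r10, r11: a callee must preserve these.
    if (reg >= 4 && reg <= 11) return RegRole::CalleeSaved;
    // r12 (ip): intra-procedure-call scratch register.
    if (reg == 12) return RegRole::CallerSaved;
    // r13 (sp), r14 (lr), r15 (pc), and any out-of-range number.
    return RegRole::Special;
}

static_assert(aapcs_role(3) == RegRole::CallerSaved, "r3 is scratch");
static_assert(aapcs_role(4) == RegRole::CalleeSaved, "r4 must be preserved");
```

This is why r4 and r5 appear in the push in the question's listing: they hold values across the first call, and zoo() itself must hand them back unchanged to its own caller.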

InuY4sha (riccardomanfrin) said on 2014-09-02:

@Thomas and Zhenqiang
I am afraid I don't fully understand your answers. What I do not get, in particular, is why the compiler should care about restoring registers if they are not needed after the BLX.

If you go like this

PUSH {R3, R4, R5, LR}
...
BLX R2
POP {r3, r4, r5, pc}

you are restoring R3, R4 and R5 to their previous values before leaving the function. But there is absolutely no need for this; the last POP loads the PC with the stored LR value, throwing me back to the caller.
If you consider the following different sequence of assembly instructions

PUSH {R3, R4, R5, LR}
...
POP {R3, R4, R5, LR}
BX R2

it should not bring any corruption in.

The link register will stay the same, so the inner subroutine will return directly to the caller of this function, without any registers left on the stack.
Can you build a counterexample to show why the last instruction in the original asm sequence (22: bd38 pop {r3, r4, r5, pc}) is needed at all (apart from PC, which in that sequence is needed to return to the caller)?

InuY4sha (riccardomanfrin) said on 2014-09-02:

Edit:
The last asm sequence IS required, but it could be executed before the BX R2, thereby saving stack space.

TM (tm1234) said on 2014-09-02:

> Could you point me with some reference on the correct EABI doc to read?

Thomas Preud'homme (thomas-preudhomme) in comment #6 has linked to the relevant doc.

TM (tm1234) said on 2014-09-02:

#10

> My code *can* end up using recursion

Slightly off topic from your original/core question but this may not be a good idea in an embedded system.
After all, coding standards for embedded/critical systems such as MISRA-C prohibit recursion for a reason. :-)
If you were not using recursion would it render your original/core question moot?

InuY4sha (riccardomanfrin) said on 2014-09-02:

#11

> this may not be a good idea in an embedded system
I'm aware of this. The recursion is under control: it cannot be avoided because of the very nature of the task I need to perform, but it is limited to well-defined cases and its depth is bounded.

I've read the doc. It is quite insightful, but what do you think about this sequence?

PUSH {R3, R4, R5, LR}
...
POP {R3, R4, R5, LR}
BX R2

Does it make sense to you?

Joey Ye (jinyun-ye) said on 2014-09-02:

#12

> PUSH {R3, R4, R5, LR}
> ...
> POP {R3, R4, R5, LR}
> BX R2
This IS tail call merging, which is a standard optimization in GCC. However, the GCC version you used is conservative and does not merge tail calls when the call is indirect.

The good news is that the next major release should enable this. Here is what I got with a newer version:

        pop {r4, lr}
        ldr r3, [r3]
        bx r3

Thanks,
Joey
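
For readers who want to try this, the source shape that makes the call a tail-call candidate can be sketched as follows. The type definitions are hypothetical stand-ins for the ones in the question (the originals are not shown in this thread), and `use_first` replaces the question's `subroutine()` check.

```cpp
// Hypothetical stand-ins for the types in the original question.
struct mydata_t { int value; };

struct foo_t {
    virtual ~foo_t() = default;
    // Virtual, so the call in zoo() goes through the vtable (an indirect call).
    virtual foo_t* yasubroutine(mydata_t* data) {
        data->value += 1;
        return this;
    }
};

struct bar_t {
    foo_t* foo1;
    foo_t* foo2;
    bool use_first;

    foo_t* zoo(mydata_t* data) {
        foo_t* foo = use_first ? foo1 : foo2;
        // Tail position: nothing in zoo() runs after this call and its result
        // is returned unchanged, so a compiler may release zoo()'s frame
        // first and branch to the target instead of calling it.
        return foo->yasubroutine(data);
    }
};
```

In GCC this transformation is controlled by -foptimize-sibling-calls, which is among the optimizations enabled at -Os. Whether it is actually applied to an indirect call like this one depends on the compiler version, as described above, so the generated assembly is the thing to check.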

InuY4sha (riccardomanfrin) said on 2014-09-02:

#13

>> PUSH {R3, R4, R5, LR}
>> ...
>> POP {R3, R4, R5, LR}
>> BX R2
>
> This IS tail call merge, which is a standard optimization in GCC. However, the GCC version you used does not merge tail call in case of indirect call due to some conservative consideration.
>
> The good news is that the next major release should enable this, and here is what I got with a new version:
>
> pop {r4, lr}
> ldr r3, [r3]
> bx r3

Thanks. It's unfortunate that I did not know the proper name for what I was asking about. So it is called "tail call merge".
Can you tell me which next major release you are using and how to obtain it?
I have no problem with compiling the toolchain from scratch if that's what is needed.
Hopefully this information will solve my problem.
R

Joey Ye (jinyun-ye) said on 2014-09-03:

#14

The next release including this optimization will be 4.9, targeted for the end of 2014.

> Question #253914 on GCC ARM Embedded changed:
> https://answers.launchpad.net/gcc-arm-embedded/+question/253914
>
> Status: Answered => Open