GNU Arm Embedded Toolchain

Version 7-2017-q4-major degrades CPU performance by about 2.5 times

Asked by genadi on 2018-01-01

My project, which has heavy FPU operations, has problems with the latest GCC.
The old magic of -flto -Ofast -fno-math-errno -funroll-loops -fgraphite-identity -ffunction-sections -fdata-sections -ffat-lto-objects does not help.
Similar problems occur on Cortex-A9 and Cortex-M7.

Question information

Language: English

Status: Answered

For: GNU Arm Embedded Toolchain

Assignee: No assignee

Last query: 2018-01-02

Last reply: 2018-01-26

genadi (genaspb) said on 2018-01-02:

Investigation of powf and logf shows calls to __aeabi_dmul and __aeabi_dsub where the code from GCC 6.3 used vsub.f32 and vmul.f32.

genadi (genaspb) said on 2018-01-02:

I'm using an STM32F746 target with settings for Cortex-M7 (hard FP):
-mthumb -march=armv7e-m -mfloat-abi=hard -mfpu=fpv5-sp-d16

readme.txt promises a thumb/v7e-m/fpv5-sp/hard library path, but fpv5-sp is missing from the installation directory.
fpv4-sp is used instead.

Tejas Belagod (belagod-tejas) said on 2018-01-02:

Hi,

Sorry, yes the readme.txt needs updating to point to the right multilib. For all fpv5-sp, we default to fpv4-sp now.

Also, in your statement above, I don't understand how f32 operations are replaced by __aeabi_d* calls. Can you please provide a small test case with the command-line so that we can reproduce the issue on our side?

If you have double-precision ops in your source code and you select a sp multilib, the ops will fall-back to soft emulation.

Thanks,
Tejas.

Thomas Preud'homme (thomas-preudhomme) said on 2018-01-02:

Hi Genadi,

Thanks for the bug report. The FPv5-SP multilib is not missing; the documentation is just out of date. FPv4-SP and FPv5-SP had the same code, so we changed the toolchain to use the FPv4-SP multilib when -mfpu=fpv5-sp-d16 is used.

Regarding your problem, we are investigating why powf and logf end up calling the double-precision helpers. These functions were changed to use double-precision instructions when available, as that is actually faster than all the extra code needed to keep the right precision when using single precision. It seems the logic to detect whether double-precision instructions are available is flawed. We'll keep you posted on any update.

Best regards.

genadi (genaspb) said on 2018-01-02:

You say: "If you have double-precision ops in your source code and you select a sp multilib, the ops will fall-back to soft emulation"

I call powf and logf with exactly single-precision operands and expect a float result. Again, GCC 6.x computes these operations without overhead and without additional command-line switches.
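
For contrast, the fall-back described in the quoted statement concerns double-precision operations written in the source itself. A minimal sketch of how that can happen by accident is shown below; the function names are hypothetical, and the exact code generated depends on the compiler version and optimization flags.

```c
/* 0.1 without a suffix is a double constant, so x is promoted to double
 * and the multiply is done in double precision. On a single-precision-only
 * FPU that multiply is typically a library call such as __aeabi_dmul. */
float scale_double(float x)
{
    return x * 0.1;
}

/* With the f suffix the whole expression stays in single precision. */
float scale_single(float x)
{
    return x * 0.1f;
}
```

GCC's `-Wdouble-promotion` warning can help find the first kind of expression. The case reported in this thread is different: the calls use only `float`, as in `scale_single`.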

genadi (genaspb) said on 2018-01-02:

If I (and maybe other people) need extra precision, the pow and log functions can be used instead of the single-precision versions.

genadi (genaspb) said on 2018-01-02:

>Also, in your statement above, I don't understand how f32 operations are replaced by __aeabi_d* calls.
> Can you please provide a small test case with the command-line so that we can reproduce the issue on our side?

I analyzed the objdump output of two ELFs compiled with different versions of GCC.

"C:\Program Files (x86)\GNU Tools ARM Embedded\7 2017-q4-major\bin\arm-none-eabi-objdump" --disassemble-all -S tc1_stm32f746zg_rom.elf >7list.txt

Here are the listings:
https://drive.google.com/open?id=1TPucdMv9GCto3WYF_bdoRqACS1aTMZVD
https://drive.google.com/open?id=1kmLjWAHkXY0IJbyeVl4FbymNZXTC7noh

See the text after <logf> in each listing.

genadi (genaspb) said on 2018-01-02:

Temporarily solved: the old PDP-11 libc sources are a helpful resource. A local copy of pow/log with a little editing works with enough accuracy and performance comparable to GCC 6.x.
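
The idea of a local replacement that stays in single precision can be sketched as follows. This is not the PDP-11 code mentioned above and not newlib's algorithm; `log_sp` is a hypothetical name, and the method (range reduction with `frexpf` plus a short atanh series) is simply one textbook approach chosen for illustration. Its accuracy is assumed to be adequate for ordinary use, not verified to the last bit.

```c
#include <math.h>

/* Hypothetical single-precision natural log. Every operation and constant
 * is float, so no double-precision helper calls are needed. */
float log_sp(float x)
{
    if (x != x || x < 0.0f) {
        return NAN;                 /* NaN input or negative argument */
    }
    if (x == 0.0f) {
        return -INFINITY;
    }
    if (isinf(x)) {
        return x;
    }

    int e;
    float m = frexpf(x, &e);        /* x = m * 2^e with m in [0.5, 1) */
    if (m < 0.70710678f) {          /* shift m into [sqrt(1/2), sqrt(2)) */
        m *= 2.0f;
        e -= 1;
    }

    /* log(m) = 2 * atanh(s) = 2 * (s + s^3/3 + s^5/5 + s^7/7 + ...) */
    float s  = (m - 1.0f) / (m + 1.0f);
    float s2 = s * s;
    float p  = 1.0f + s2 * (1.0f / 3.0f + s2 * (1.0f / 5.0f + s2 * (1.0f / 7.0f)));

    return (float)e * 0.69314718f + 2.0f * s * p;
}
```

Keeping the `f` suffix on every constant matters here: a single unsuffixed literal would pull the expression back into double precision.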

Andreas Fritiofson (andreas-fritiofson) said on 2018-01-16:

We're also affected by this; new implementation of logf uses double-precision operations which are software emulated on fpv4-sp.

This seems to be considered a "feature": https://sourceware.org/ml/newlib/2017/msg00999.html

Not sure what they were thinking... Can this be reverted ASAP?

Note that I haven't actually benchmarked the performance of the new implementation compared to the old, I'm just assuming that all the __aeabi_dsub etc. calls made in the new version can't possibly be faster than the old algorithm using only the SP FPU instructions.

Noticed this because we were calling directly into __ieee_logf with known good arguments, but the link step failed with the latest version.

Andreas Fritiofson (andreas-fritiofson) said on 2018-01-16:

#10

Maybe this is just a build configuration issue? "Targets with a single precision FPU may still prefer the old implementation." from the thread linked above.

Thomas Preud'homme (thomas-preudhomme) said on 2018-01-23:

#11

A patch has been posted for integration in newlib at https://sourceware.org/ml/newlib/2018/msg00029.html

If this gets accepted, the regression should be fixed in the next release of our toolchain.

Best regards.

Andreas Fritiofson (andreas-fritiofson) said on 2018-01-23:

#12

Sounds good, thanks a lot!

Thomas Preud'homme (thomas-preudhomme) said on 2018-01-26:

#13

Patch has been committed to newlib. This means that our next update release will have the performance regression fixed. Thanks for the report.
