Version 7-2017-q4-major degrades CPU performance by about 2.5 times

Asked by genadi

My project, which makes heavy use of FPU operations, has performance problems with the latest GCC.
The old magic flags -flto -Ofast -fno-math-errno -funroll-loops -fgraphite-identity -ffunction-sections -fdata-sections -ffat-lto-objects do not help.
The problem is similar on Cortex-A9 and Cortex-M7.

Question information

Language: English
Status: Answered
For: GNU Arm Embedded Toolchain
Assignee: No assignee

genadi (genaspb) said :
#1

Investigation shows that powf and logf now use __aeabi_dmul and __aeabi_dsub calls, where the code from GCC 6.3 used vsub.f32 and vmul.f32.

genadi (genaspb) said :
#2

I'm using an STM32F746 target with settings for Cortex-M7 (hard FP):
-mthumb -march=armv7e-m -mfloat-abi=hard -mfpu=fpv5-sp-d16

readme.txt promises a thumb/v7e-m/fpv5-sp/hard library path, but fpv5-sp is missing from the installation directory.
fpv4-sp is used instead.

Tejas Belagod (belagod-tejas) said :
#3

Hi,

Sorry, yes the readme.txt needs updating to point to the right multilib. For all fpv5-sp, we default to fpv4-sp now.

Also, in your statement above, I don't understand how f32 operations are replaced by __aeabi_d* calls. Can you please provide a small test case with the command-line so that we can reproduce the issue on our side?

If you have double-precision ops in your source code and you select an SP multilib, the ops will fall back to soft emulation.
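
As an illustration of the fallback described above, here is a minimal sketch with two hypothetical helpers (the names are made up and are not from the reporter's project). The first accumulates in a double, so on a single-precision-only FPU each addition would typically turn into a soft-float library call such as __aeabi_dadd. The second stays in float throughout. GCC's -Wdouble-promotion warning may help spot implicit promotions of this kind.

```c
#include <stddef.h>

/* Hypothetical example: the accumulator is a double, so every addition
   is a double-precision operation even though the inputs and the result
   are float. On an SP-only FPU these operations are emulated in software. */
float mean_double_acc(const float *v, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += v[i];               /* float promoted to double here */
    }
    return n ? (float)(acc / (double)n) : 0.0f;
}

/* Same computation kept entirely in single precision. */
float mean_single_acc(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += v[i];
    }
    return n ? acc / (float)n : 0.0f;
}
```

Both functions return the same value for small, exactly representable inputs; the difference is only in which instructions or library calls the compiler has to emit.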

Thanks,
Tejas.

Thomas Preud'homme (thomas-preudhomme) said :
#4

Hi Genadi,

Thanks for the bug report. FPv5-SP multilib is not missing, it is just the documentation being out of date. FPv4-SP and FPv5-SP had the same code so we changed the toolchain to use FPv4-SP multilib when -mfpu=fpv5-sp-d16 is used.

Regarding your problem, we are investigating why powf and logf are affected. These functions were changed to use double-precision instructions when available, since that is actually faster than all the extra code needed to keep the right precision when using single precision. It seems the logic to detect whether double-precision instructions are available is flawed. We'll keep you posted of any update.
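
To make the idea of detecting double-precision hardware concrete, here is a minimal sketch of one way such a compile-time check could look. It is not the logic used in newlib. It assumes the ACLE macro __ARM_FP, a bitmask in which bit 3 (0x08) indicates hardware double-precision support; HAVE_HW_DOUBLE, calc_t and exp_taylor2 are hypothetical names introduced for illustration.

```c
/* Assumption: per ACLE, __ARM_FP is a bitmask of hardware-supported
   floating-point formats, and bit 3 (0x08) means double precision. */
#if defined(__ARM_FP)
#  define HAVE_HW_DOUBLE ((__ARM_FP & 0x08) != 0)
#elif defined(__arm__)
#  define HAVE_HW_DOUBLE 0   /* Arm target without FP hardware */
#else
#  define HAVE_HW_DOUBLE 1   /* assumption: non-Arm host with fast double */
#endif

/* Pick the internal working type: double only when it is done in hardware. */
#if HAVE_HW_DOUBLE
typedef double calc_t;
#else
typedef float calc_t;
#endif

/* Hypothetical single-precision routine: second-order Taylor
   approximation of exp(x), evaluated in the selected working type. */
float exp_taylor2(float x)
{
    const calc_t t = (calc_t)x;
    return (float)((calc_t)1 + t + (calc_t)0.5 * t * t);
}
```

With -mfpu=fpv5-sp-d16 the double-precision bit should be clear, so a check of this kind would keep the arithmetic in float.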

Best regards.

genadi (genaspb) said :
#5

You say: "If you have double-precision ops in your source code and you select a sp multilib, the ops will fall-back to soft emulation."

I call powf and logf with exactly single-precision operands and expect a float result. Again, GCC 6.x computes these operations without overhead and without additional command-line switches.

genadi (genaspb) said :
#6

If I (and maybe other people) need extra precision, the pow and log functions can be used instead of the single-precision versions.

genadi (genaspb) said :
#7

>Also, in your statement above, I don't understand how f32 operations are replaced by __aeabi_d* calls.
> Can you please provide a small test case with the command-line so that we can reproduce the issue on our side?

I analyzed the objdump output of two ELFs compiled with different versions of GCC.

"C:\Program Files (x86)\GNU Tools ARM Embedded\7 2017-q4-major\bin\arm-none-eabi-objdump" --disassemble-all -S tc1_stm32f746zg_rom.elf >7list.txt

Here are the listings:
https://drive.google.com/open?id=1TPucdMv9GCto3WYF_bdoRqACS1aTMZVD
https://drive.google.com/open?id=1kmLjWAHkXY0IJbyeVl4FbymNZXTC7noh

See the text after <logf>.

genadi (genaspb) said :
#8

Temporarily solved: the old PDP-11 libc sources are a helpful resource. A local copy of pow/log with a little editing works with sufficient accuracy and with performance comparable to GCC 6.x.
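
For readers who want to try a similar workaround, here is a minimal sketch of a natural logarithm that uses only single-precision arithmetic. It is not the PDP-11 code mentioned above and not newlib's implementation; logf_sp_sketch is a hypothetical name, the routine assumes IEEE 754 binary32 floats, and it handles only positive, normal, finite inputs. It is not correctly rounded.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: natural log of a positive, normal, finite float,
   computed with single-precision operations only. Zero, negative values,
   subnormals, infinities and NaN are not handled. */
float logf_sp_sketch(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    /* Split x = m * 2^e with m in [1, 2). */
    int32_t e = (int32_t)((bits >> 23) & 0xFFu) - 127;
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;
    float m;
    memcpy(&m, &bits, sizeof m);

    /* Move m into [sqrt(1/2), sqrt(2)) so that |s| below stays under 0.172. */
    if (m > 1.41421356f) {
        m *= 0.5f;
        e += 1;
    }

    /* log(m) = 2 * atanh(s) with s = (m - 1) / (m + 1);
       odd power series truncated after the s^9 term. */
    const float s  = (m - 1.0f) / (m + 1.0f);
    const float s2 = s * s;
    const float series =
        s * (1.0f + s2 * (1.0f / 3.0f + s2 * (1.0f / 5.0f
              + s2 * (1.0f / 7.0f + s2 * (1.0f / 9.0f)))));

    /* log(x) = log(m) + e * ln(2). */
    return 2.0f * series + (float)e * 0.69314718f;
}
```

The range reduction keeps the series argument small, so five terms are enough for roughly float-level accuracy on the reduced interval. Whether that is good enough depends on the application.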

Andreas Fritiofson (andreas-fritiofson) said :
#9

We're also affected by this; new implementation of logf uses double-precision operations which are software emulated on fpv4-sp.

This seems to be considered a "feature": https://sourceware.org/ml/newlib/2017/msg00999.html

Not sure what they were thinking... Can this be reverted ASAP?

Note that I haven't actually benchmarked the performance of the new implementation compared to the old, I'm just assuming that all the __aeabi_dsub etc. calls made in the new version can't possibly be faster than the old algorithm using only the SP FPU instructions.

Noticed this because we were calling directly into __ieee_logf with known good arguments, but the link step failed with the latest version.

Andreas Fritiofson (andreas-fritiofson) said :
#10

Maybe this is just a build configuration issue? "Targets with a single precision FPU may still prefer the old implementation." from the thread linked above.

Thomas Preud'homme (thomas-preudhomme) said :
#11

A patch has been posted for integration in newlib at https://sourceware.org/ml/newlib/2018/msg00029.html

If this gets accepted, the regression should be fixed in the next release of our toolchain.

Best regards.

Andreas Fritiofson (andreas-fritiofson) said :
#12

Sounds good, thanks a lot!

Thomas Preud'homme (thomas-preudhomme) said :
#13

Patch has been committed to newlib. This means that our next update release will have the performance regression fixed. Thanks for the report.
