Ternary optimization improves with superfluous cast

Asked by Nachum Kanovsky on 2018-02-04

In the code below, defining WITH_CAST significantly improves the compiled output (I see the same effect in my larger codebase), even though the cast appears to be superfluous. I am running this in Keil 5.25pre2, used only as a simulator. I measure execution speed with the Keil simulator by reading how many microseconds the t1 timer shows as elapsed.

Snippet from code:
#if defined (WITH_CAST)
#define MAX(a,b) (((a) > (b)) ? (decltype(a)(a)) : (decltype(b)(b)))
#else
#define MAX(a,b) (((a) > (b)) ? ((a)) : ((b)))
#endif

Toolchain: GNU Arm Embedded Toolchain, version 7-2017-q4-major.

Compiler options:
-c -mcpu=cortex-m4 -mthumb -gdwarf-2 -MD -Wall -O -mapcs-frame -mthumb-interwork -std=c++14 -Ofast -I./RTE/_Target_1 -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/CMSIS/Include -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/Device/ARM/ARMCM4/Include -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/lib/gcc/arm-none-eabi/7.2.1/include" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1/arm-none-eabi" -D__UVISION_VERSION="525" -D__GCC -D__GCC_VERSION="721" -D_RTE_ -DARMCM4 -Wa,-alhms="*.lst" -o *.o

Assembler options:
-mcpu=cortex-m4 -mthumb --gdwarf-2 -mthumb-interwork --MD *.d -I./RTE/_Target_1 -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/CMSIS/Include -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/Device/ARM/ARMCM4/Include -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/lib/gcc/arm-none-eabi/7.2.1/include" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1/arm-none-eabi" -alhms="*.lst" -o *.o

Linker options:
-T ./RTE/Device/ARMCM4/gcc_arm.ld -mcpu=cortex-m4 -mthumb -mthumb-interwork -Wl,-Map="./Optimization.map"
-o Optimization.elf
*.o -lm

#include <cstdlib>
#include <cstring>
#include <cstdint>

#define WITH_CAST
struct mytype {
 uint32_t value;
 __attribute__((const, always_inline)) constexpr friend bool operator>(const mytype & t, const mytype & a) {
  return t.value > a.value;
 }
};
static mytype output_buf [32];
static mytype * output_memory_ptr = output_buf;
static mytype * volatile * output_memory_tmpp = &output_memory_ptr;
static mytype input_buf [32];
static mytype * input_memory_ptr = input_buf;
static mytype * volatile * input_memory_tmpp = &input_memory_ptr;
#if defined (WITH_CAST)
#define MAX(a,b) (((a) > (b)) ? (decltype(a)(a)) : (decltype(b)(b)))
#else
#define MAX(a,b) (((a) > (b)) ? ((a)) : ((b)))
#endif
int main (void) {
 const mytype * input = *input_memory_tmpp;
 mytype * output = *output_memory_tmpp;
 mytype p = input[0];
 mytype c = input[1];
 mytype pc = MAX(p, c);
 output[0] = pc;
 for (int i = 1; i < 31; i ++) {
  mytype n = input[i + 1];
  mytype cn = MAX(c, n);
  output[i] = MAX(pc, cn);
  p = c;
  c = n;
  pc = cn;
 }
 output[31] = pc;
}

Question information

Language:
English
Status:
Expired
For:
GNU Arm Embedded Toolchain
Assignee:
No assignee
Last query:
2018-02-12
Last reply:
2018-02-27

Hi Nachum,

This is a tough one, and I'll add the disclaimer that my C++ mojo isn't strong enough to guarantee that what I am about to say is 100% correct, but...

It is my understanding that your use of 'decltype' here is creating a prvalue. When 'a' or 'b' is passed to decltype as a plain, unparenthesized name, decltype returns its declared type, 'mytype', so 'decltype(a)(a)' is a cast that creates a temporary copy (a prvalue). With both branches being prvalues, the outcome of the ternary operation is a prvalue too. This means the compiler knows it is a temporary and is free to do copy elision.

Without the decltype cast, the result of the ternary is an lvalue and the compiler can't do the copy elision, hence the extra copy code. If you want to keep the cast but get the same lvalue behavior as without it, you would need to pass '(a)' to decltype, i.e. 'decltype((a))'. The input to decltype would then be an lvalue expression, the cast would be to (mytype&), and the result of the ternary operation would remain an lvalue.
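The value-category difference described above can be checked at compile time. The following is a minimal sketch, not part of the original posts: the names MAX_CAST, MAX_PLAIN, check_categories and check_runtime are made up here so that both macro variants can coexist in one file, and the struct omits the original attributes for brevity. It relies on the rule that decltype((expr)) yields T& for an lvalue and plain T for a prvalue. It says nothing about which code GCC generates, only about how the language classifies each expression.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

struct mytype {
    uint32_t value;
    friend constexpr bool operator>(const mytype& t, const mytype& a) {
        return t.value > a.value;
    }
};

// The two macro variants from the question, renamed (hypothetical names)
// so that both can be used side by side.
#define MAX_CAST(a, b)  (((a) > (b)) ? (decltype(a)(a)) : (decltype(b)(b)))
#define MAX_PLAIN(a, b) (((a) > (b)) ? ((a)) : ((b)))

// Compile-time checks: decltype((expr)) is T& for an lvalue, T for a prvalue.
void check_categories() {
    mytype p{1};
    mytype c{2};
    (void)p;
    (void)c;

    // Unparenthesized name: the declared type, not a reference.
    static_assert(std::is_same<decltype(p), mytype>::value, "declared type");
    // Parenthesized name: an lvalue expression, hence mytype&.
    static_assert(std::is_same<decltype((p)), mytype&>::value, "lvalue");
    // Without the cast, the ternary result is an lvalue.
    static_assert(std::is_same<decltype((MAX_PLAIN(p, c))), mytype&>::value,
                  "plain ternary is an lvalue");
    // With the cast, the ternary result is a prvalue.
    static_assert(std::is_same<decltype((MAX_CAST(p, c))), mytype>::value,
                  "cast ternary is a prvalue");
}

// Runtime check: the plain macro refers to one of its arguments directly,
// while the cast macro produces a separate temporary with the same value.
void check_runtime() {
    mytype p{1};
    mytype c{2};
    assert(&MAX_PLAIN(p, c) == &c);
    assert(MAX_CAST(p, c).value == 2);
}
```

Replacing decltype(a) with decltype((a)) in MAX_CAST would turn the cast into a cast to mytype&, and the third and fourth assertions would then both see an lvalue.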

Hope this helps a bit.

Regards,
Andre

Nachum Kanovsky (nachumk) said : #2

Hi Andre,

Thank you, your analysis looks correct, but I still expect GCC to optimize this better. In comparison, when I compile the same code using:
typedef uint32_t mytype;
and comment out the mytype struct, the code compiles optimally whether WITH_STRUCT is defined or not.

I would expect GCC to optimize this struct the same as just a uint32_t.
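For anyone wanting to reproduce this comparison, the sketch below puts both type definitions behind one switch. It is an illustration only: USE_PLAIN_UINT32, raw(), sliding_max() and check_sliding_max() are hypothetical names that do not appear in the original code, the struct omits the original attributes, and the loop from main() has been lifted into a function so it can be called on ordinary arrays. It shows that the four combinations (struct or typedef, with or without WITH_CAST) compile from the same source; the timing results themselves are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical switch (not in the original post): define USE_PLAIN_UINT32
// to replace the struct with the typedef described above.
#if defined(USE_PLAIN_UINT32)
typedef uint32_t mytype;
static inline uint32_t raw(mytype v) { return v; }
#else
struct mytype {
    uint32_t value;
    friend constexpr bool operator>(const mytype& t, const mytype& a) {
        return t.value > a.value;
    }
};
static inline uint32_t raw(mytype v) { return v.value; }
#endif

#if defined(WITH_CAST)
#define MAX(a, b) (((a) > (b)) ? (decltype(a)(a)) : (decltype(b)(b)))
#else
#define MAX(a, b) (((a) > (b)) ? ((a)) : ((b)))
#endif

// Same loop as main() in the question, operating on 32-element buffers.
void sliding_max(const mytype* input, mytype* output) {
    mytype p = input[0];
    mytype c = input[1];
    mytype pc = MAX(p, c);
    output[0] = pc;
    for (int i = 1; i < 31; i++) {
        mytype n = input[i + 1];
        mytype cn = MAX(c, n);
        output[i] = MAX(pc, cn);
        p = c;
        c = n;
        pc = cn;
    }
    (void)p;  // p is assigned but never read, as in the original
    output[31] = pc;
}

// Small usage example: an ascending ramp 0..31 as input.
void check_sliding_max() {
    mytype in[32];
    mytype out[32];
    for (uint32_t i = 0; i < 32; i++) {
        in[i] = mytype{i};
    }
    sliding_max(in, out);
    assert(raw(out[0]) == 1);    // max(in[0], in[1])
    assert(raw(out[1]) == 2);    // max over in[0..2]
    assert(raw(out[31]) == 31);  // max(in[30], in[31])
}
```

Building the file four times, with and without -DWITH_CAST and -DUSE_PLAIN_UINT32, and comparing the generated listings should make the difference between the struct and typedef cases visible.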

Thanks,
Nachum

Nachum Kanovsky (nachumk) said : #3

Correction: meant WITH_CAST instead of WITH_STRUCT

Launchpad Janitor (janitor) said : #4

This question was expired because it remained in the 'Open' state without activity for the last 15 days.