Bug: incorrect 'mul' used for "invariant division with multiplication" optimisation

Asked by John Adriaan on 2019-09-14

SYNOPSIS

gcc is attempting to use the "invariant division using multiplication" optimisation to avoid a 'div' instruction. The idea is that instead of "divide by 3", you can "multiply by 0xaaaaaaab then shift right 33 places". Only, the generated code is wrong in two respects.

THE PROJECT

To demonstrate the bug, I made a Mickey Mouse project:

/* heap.h */
struct Node {
 unsigned canary;
 unsigned size;
 unsigned data;
}; // Node
extern struct Node heap[];
extern struct Node heap_end;

/* heap.c */
#include "heap.h"
struct Node heap[1024];
struct Node heap_end;

/* main.c */
#include "heap.h"
int main() {
 int size = &heap_end - heap;
 return size;
} // main()

THE DESCRIPTION

The compiler sees that the two pointers-to-Node are subtracted, so it takes the difference between the addresses then divides by 12. Only, it actually divides by 4 with a shift, then (attempts to) divide by 3 using the above optimisation.

The two problems are:
1) It uses "mul.w", which is a "32-bit by 32-bit multiply, keeping the lower 32-bit result". That immediately throws away the desired top half! It should use "umull" instead, which is 32x32=>64.
2) There is no attempt to do the 33-bit shift right. If "umull" had have been used, then the upper 32-bit result could have been shifted right once, then used as the result.

If I change the size of Node to 8 or 16, then no explicit division is required (a shift will do) and the correct code is produced. It's a Node size of 12 that's the problem.

THE HOST

Windows 10 64-bit, latest update.

THE COMPILERS

I used the one that came with STM32CubeIDE, and then updated to the latest. Both produced identical incorrect code.

* arm-none-eabi-gcc (GNU Tools for STM32 7-2018-q2-update.20190328-1800) 7.3.1 20180622 (release) [ARM/embedded-7-branch revision 261907]
* arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 8.3.1 20190703 (release) [gcc-8-branch revision 273027]

THE COMPILE LINE

Note I get identical results whether the compile is -O0 or -O3

arm-none-eabi-gcc "../project/main.c" -mcpu=cortex-m4 -std=gnu11 -g3 -c -O0 -ffunction-sections -fdata-sections -Wall -Wextra -Wswitch-default -Wswitch-enum -Wconversion -fno-tree-loop-distribute-patterns --version -fstrict-volatile-bitfields -fno-strict-aliasing -fwrapv -fstack-usage -MMD -MP -MF"Project/main.d" -MT"Project/main.o" --specs=nano.specs -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb -o "project/main.o"

Ditto for heap.c

THE ASSEMBLER LISTING

int main() {
 8000000: b480 push {r7}
 8000002: b083 sub sp, #12
 8000004: af00 add r7, sp, #0
 int size = &heap_end - heap;
 8000006: 4a07 ldr r2, [pc, #28] ; (8000024 <main+0x24>)
 8000008: 4b07 ldr r3, [pc, #28] ; (8000028 <main+0x28>)
 800000a: 1ad3 subs r3, r2, r3
 800000c: 109b asrs r3, r3, #2
 800000e: 4a07 ldr r2, [pc, #28] ; (800002c <main+0x2c>)
 8000010: fb02 f303 mul.w r3, r2, r3
 8000014: 607b str r3, [r7, #4]
 return size;
 8000016: 687b ldr r3, [r7, #4]
} // main()
 8000018: 4618 mov r0, r3
 800001a: 370c adds r7, #12
 800001c: 46bd mov sp, r7
 800001e: f85d 7b04 ldr.w r7, [sp], #4
 8000022: 4770 bx lr
 8000024: 20003000 .word 0x20003000
 8000028: 20000000 .word 0x20000000
 800002c: aaaaaaab .word 0xaaaaaaab

Question information

Language:
English Edit question
Status:
Solved
For:
GNU Arm Embedded Toolchain Edit question
Assignee:
No assignee Edit question
Solved by:
john
Solved:
2019-09-16
Last query:
2019-09-16
Last reply:
2019-09-16
Best john (jkovach) said : #1

Provided that x % 3 == 0 (which is the case here), 0xaaaaaaab * x == x / 3.
So everything looks good here, unless I'm missing something.

John Adriaan (john-adriaan) said : #2

And THAT was what I was missing! My code was (of course) more complicated than the above, and the symbols involved weren't a neat distance apart - which I didn't realise. I was getting a size of 0xaaaababa or something.

When I realised that the full optimisation wasn't being followed completely for my code, I created the Mickey Mouse project. And when I saw that it had the same "issue" I posted this question immediately, without actually running the code.

Note that if the extra steps had have been followed, then the remainder would have been shifted into oblivion., and no problem would have occurred. Of course, that requires more complicated code. And given that C specifies that pointer subtraction is only legal for elements in the same array, this optimisation-of-an-optimisation is quite legal - but (as for all optimisations) horrible when the rules are broken, even if only accidentally!

Thank you!

John Adriaan (john-adriaan) said : #3

Thanks john, that solved my question.

john (jkovach) said : #4

"pointer subtraction is only legal for elements in the same array". Nice one. It's always interesting to see the kinds of unexpected behaviour you can get when you break these subtle rules. Thank you.

john (jkovach) said : #5

By the way, the assembler listing looks like at optimization level zero except this "multiplication instead of division" thing. If so, I would consider this a bug. Optimization level zero should imply straight division.

John Adriaan (john-adriaan) said : #6

> Optimization level zero should imply straight division.

I agree completely. Level 1 or 2 could do the full version of this optimisation. Only Level 3 should do this abbreviated version.