GNU Arm Embedded Toolchain

Floating-point stall due to data dependency

Asked by Dan Lewis on 2020-06-23

The ARM Cortex-M4 Processor Technical Reference Manual (r0p1) states, "Floating-point arithmetic data processing instructions, such as add, subtract, multiply, divide, square root, all forms of multiply with accumulate, in addition to conversions of all types take one cycle longer if their result is consumed by the following instruction."

I'm trying to measure this effect, with mixed results. For example, the execution time of the following sequence is 200 clock cycles:

.rept 100
VADD.F32 S1,S0,S0
VMOV R2,S2
.endr

And as expected, changing the VMOV to introduce a data dependency as shown below increases the execution time to 300 clock cycles.

.rept 100
VADD.F32 S1,S0,S0
VMOV R1,S1 // delayed waiting for S1
.endr

However, the measured execution times of the two code fragments below are identical at 600 cycles, even though only the second contains data dependencies between adjacent instructions (two of them). That seems correct for the second sequence, but I would have expected the first to take 400 clock cycles.

.rept 100
VADD.F32 S1,S0,S0
VADD.F32 S2,S0,S0
VMOV R1,S1
VMOV R2,S2
.endr

.rept 100
VADD.F32 S1,S0,S0
VMOV R1,S1 // delayed waiting for S1
VADD.F32 S2,S0,S0
VMOV R2,S2 // delayed waiting for S2
.endr
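
To spell out where the expected figures come from, here is a minimal sketch in C of the arithmetic implied by a literal reading of the TRM statement: one cycle per instruction, plus one extra cycle whenever an instruction's result is read by the instruction immediately after it. This is only that naive reading, not a model of the real Cortex-M4 pipeline, and the names fp_insn and model_cycles are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One instruction in the simplified model. Registers are S-register
   numbers; -1 means "no floating-point register involved". */
typedef struct {
    int dst;  /* S register written, or -1 (e.g. VMOV Rn,Sm writes a core register) */
    int src;  /* S register read, or -1 */
} fp_insn;

/* Cycles predicted for `reps` back-to-back copies of `body` (as .rept does):
   1 cycle per instruction, +1 if the next instruction reads this one's result. */
unsigned model_cycles(const fp_insn *body, size_t n, unsigned reps)
{
    assert(body != NULL && n > 0);
    unsigned cycles = 0;
    size_t total = n * (size_t)reps;
    for (size_t i = 0; i < total; i++) {
        const fp_insn *cur = &body[i % n];
        cycles += 1;
        if (i + 1 < total) {
            const fp_insn *next = &body[(i + 1) % n];
            if (cur->dst >= 0 && cur->dst == next->src)
                cycles += 1;  /* result consumed by the following instruction */
        }
    }
    return cycles;
}
```

Encoding VADD.F32 S1,S0,S0 as {1, 0} and VMOV R1,S1 as {-1, 1}, this arithmetic gives 200 and 300 for the first two sequences and 600 for the interleaved one, all matching the measurements. For the sequence with both VADDs first it gives 400, which is the figure that disagrees with the measured 600.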

FYI, I am measuring execution time by reading the DWT_CYCCNT clock cycle counter immediately before and after each sequence and taking the difference.
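
For anyone wanting to reproduce this, a minimal sketch of this kind of measurement in C follows. It assumes the standard ARMv7-M addresses for DEMCR, DWT_CTRL and DWT_CYCCNT; the helper names cyccnt_enable, cyccnt_read and cycles_elapsed are hypothetical, and the register accesses only make sense on the target itself.

```c
#include <stdint.h>

/* ARMv7-M debug registers (memory-mapped; valid on the target only). */
#define DEMCR       ((volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL    ((volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT  ((volatile uint32_t *)(uintptr_t)0xE0001004u)

/* Turn on the DWT cycle counter; call once before measuring. */
static inline void cyccnt_enable(void)
{
    *DEMCR      |= (1u << 24);  /* TRCENA: enable the DWT unit */
    *DWT_CYCCNT  = 0;           /* start from a known value */
    *DWT_CTRL   |= (1u << 0);   /* CYCCNTENA: start counting */
}

/* Read the current cycle count. */
static inline uint32_t cyccnt_read(void)
{
    return *DWT_CYCCNT;
}

/* Difference of two readings; unsigned arithmetic keeps the result
   correct even if the 32-bit counter wrapped once in between. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Intended use on the target:
 *     cyccnt_enable();
 *     uint32_t t0 = cyccnt_read();
 *     ... sequence under test ...
 *     uint32_t t1 = cyccnt_read();
 *     uint32_t n  = cycles_elapsed(t0, t1);
 */
```

The difference includes the cost of taking the two readings themselves, so measuring an empty sequence first and subtracting that baseline gives a cleaner figure.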

Can anyone please shed some light on this?

Dan