Floating-point stall due to data dependency
The ARM Cortex-M4 Processor Technical Reference Manual (r0p1) states, "Floating-point arithmetic data processing instructions, such as add, subtract, multiply, divide, square root, all forms of multiply with accumulate, in addition to conversions of all types take one cycle longer if their result is consumed by the following instruction."
I'm trying to measure this effect, with mixed results. For example, the following sequence executes in 200 clock cycles:
.rept 100
VADD.F32 S1,S0,S0
VMOV R2,S2
.endr
And as expected, changing the VMOV to introduce a data dependency as shown below increases the execution time to 300 clock cycles.
.rept 100
VADD.F32 S1,S0,S0
VMOV R1,S1 // delayed waiting for S1
.endr
However, the measured execution times of the two code fragments below are identical at 600 cycles, even though only the second one contains data dependencies between adjacent instructions (two of them per iteration). That seems correct for the second sequence (400 cycles for the instructions plus 200 for the stalls), but I would have expected the first to take 400 clock cycles.
.rept 100
VADD.F32 S1,S0,S0
VADD.F32 S2,S0,S0
VMOV R1,S1
VMOV R2,S2
.endr
.rept 100
VADD.F32 S1,S0,S0
VMOV R1,S1 // delayed waiting for S1
VADD.F32 S2,S0,S0
VMOV R2,S2 // delayed waiting for S2
.endr
FYI, I am measuring execution time by reading the DWT_CYCCNT clock cycle counter immediately before and after each sequence and taking the difference.
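A minimal C sketch of this kind of measurement follows, assuming the architecturally defined Cortex-M debug register addresses (DEMCR, DWT_CTRL, DWT_CYCCNT). The helper names `cyccnt_enable`, `cyccnt_read`, and `cycles_elapsed`, and the `overhead` parameter, are illustrative assumptions and are not taken from the original test code.

```c
#include <stdint.h>

/* Architecturally defined Cortex-M debug registers. */
#define DEMCR       (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* enables the cycle counter */

/* Hypothetical helper: turn on the cycle counter (target only). */
static inline void cyccnt_enable(void)
{
    DEMCR      |= DEMCR_TRCENA;
    DWT_CYCCNT  = 0;
    DWT_CTRL   |= DWT_CTRL_CYCCNTENA;
}

/* Hypothetical helper: read the current cycle count (target only). */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}

/*
 * Elapsed cycles between two counter readings. Unsigned subtraction
 * stays correct across a single 32-bit wraparound. 'overhead' is the
 * assumed fixed cost of the measurement itself, e.g. the difference
 * measured around an empty sequence.
 */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end,
                                      uint32_t overhead)
{
    return (end - start) - overhead;
}
```

On the target, the sequence under test would sit between two `cyccnt_read()` calls, and the two readings would be passed to `cycles_elapsed()`. Whether any overhead needs to be subtracted depends on how the counter reads are placed around the sequence.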
Can anyone please shed some light on this?
Dan