FPU performance calculation - optimizing real-time math

Hi

I am using the nRF52832 with S132 / SDK 17, and I am implementing an algorithm that requires some math.

For example, I am doing a matrix multiplication with about 800 float multiplications. As I understand it, a multiplication takes 3 cycles on the ARM Cortex-M4, running at 32 MHz. With optimization set for time, I would expect about 2400 cycles, but I am seeing it take more than 200 µs. Does that make sense?

Is there some way (other than algorithmic changes) to improve this performance? Another optimization flag to set, enabling the FPU, or a way to allocate the memory so that it is accessed more efficiently?

Is there some example/reference you can refer me to?

Thanks!
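
The expectation in the question can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope helper: `estimate_ns` and its parameters are hypothetical names, not part of the nRF5 SDK, and the 3 cycles per multiplication is the figure assumed in the question, not a measured value.

```c
#include <stdint.h>

/* Estimated execution time in nanoseconds for `ops` operations that each
 * take `cycles_per_op` CPU cycles on a core clocked at `cpu_mhz` MHz.
 * Hypothetical helper for illustration only. */
static uint32_t estimate_ns(uint32_t ops, uint32_t cycles_per_op,
                            uint32_t cpu_mhz)
{
    uint32_t cycles = ops * cycles_per_op;

    /* cycles / MHz gives microseconds; scale by 1000 first to get ns. */
    return (cycles * 1000u) / cpu_mhz;
}
```

For 800 multiplications at 3 cycles each, `estimate_ns(800, 3, 32)` gives 75000 ns (75 µs) and `estimate_ns(800, 3, 64)` gives 37500 ns (37.5 µs), so the multiplications alone would not account for a measured 200 µs at either clock figure.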

  • Hi

    What IDE are you using for development? In SEGGER Embedded Studio you can choose an optimization level based on what your application needs.

    I haven't done the math, but optimization level 3 (if you have room for it) should provide the highest possible speed for your application AFAIK. FPU should also be enabled by default in most of our SDK v17.1.0 examples.

    Best regards,

    Simon

  • Dear Simon

    I am using uVision5 

    This is the setup; it's based on the fft_fpu example.

    Thanks

  • Hi

    Okay, that seems fine. If you've also added FLOAT_ABI_HARD to your preprocessor definitions, FPU should be enabled. In that case I'm not aware of any further ways to speed up calculations on an nRF52832. Is it necessary for your application to do these calculations faster?

    Best regards,

    Simon

  • Dear Simon

    I would like to do it faster, but I am also generally surprised that it takes so long.

    Can you give me a rough estimate of the computation time on the nRF52 with the FPU? I think a float multiplication should take 3 cycles at 64 MHz; according to my calculation, 1000 multiplications should take about 15 µs.

    Thanks! 

  • Hi

    Okay, so I've not been able to track down any expected numbers for computation time. For the calculations alone, I think the timings you refer to are correct. But if you also write each result to a buffer, read the next input value from another buffer, and then update two buffer pointers, then the ~16 clock cycles per computation you're seeing (200 µs for 800 computations at 64 MHz) start to make sense, I think, as some cycles will be lost to data handling as well. How exactly are you doing these operations on your end? You can also check the disassembly to see exactly what's happening.

    Best regards,

    Simon
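
The ~16 cycles per computation mentioned in the last reply, and the point about data handling, can be illustrated with a small sketch. This is not the asker's code, which the thread does not show; `cycles_per_op` and `mat_mul_f32` are hypothetical names, and the matrix layout (row-major, plain `float` arrays) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Average CPU cycles spent per operation for a measured duration.
 * Hypothetical helper for illustration only. */
static uint32_t cycles_per_op(uint32_t elapsed_us, uint32_t cpu_mhz,
                              uint32_t ops)
{
    return (elapsed_us * cpu_mhz) / ops;
}

/* c = a * b for row-major matrices: a is n x k, b is k x m, c is n x m.
 * `restrict` tells the compiler the three buffers do not overlap. */
static void mat_mul_f32(const float *restrict a, const float *restrict b,
                        float *restrict c, size_t n, size_t k, size_t m)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            /* Local accumulator that an optimizing compiler can keep in
             * a register; 0.0f keeps the arithmetic in single precision. */
            float acc = 0.0f;

            for (size_t p = 0; p < k; p++) {
                /* Each multiplication also needs two loads and the index
                 * arithmetic, so it costs more than the multiply alone. */
                acc += a[i * k + p] * b[p * m + j];
            }
            c[i * m + j] = acc; /* one store per output element */
        }
    }
}
```

`cycles_per_op(200, 64, 800)` gives 16, matching the figure in the reply. In the inner loop every multiplication comes with two loads, index arithmetic and loop control, which is the kind of overhead the disassembly would show.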
