Most efficient way to delay a few microseconds on nrf52840

Just for fun, I thought it'd be interesting to see how to implement a delay on the nrf52840 using the least possible power.

For any long delay you should use the RTC or a TIMER and put the processor into a lower power state, but I needed some tiny pauses, and I was curious about the most efficient way to do them.
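To put a rough number on why the RTC doesn't help for tiny pauses: it is clocked from the 32.768 kHz low-frequency clock, so one tick is about 30.5us. Here is a minimal sketch of the conversion. The helper name us_to_rtc_ticks is made up for illustration, and it assumes the RTC prescaler is left at 0.

```c
#include <stdint.h>

#define RTC_HZ 32768u  /* assumed: RTC counting at the full LFCLK rate (prescaler 0) */

/* Convert microseconds to RTC ticks, rounding up so the wait is never
 * shorter than requested. Hypothetical helper, not part of the nRF SDK. */
static uint32_t us_to_rtc_ticks(uint32_t us) {
    return (uint32_t)(((uint64_t)us * RTC_HZ + 999999u) / 1000000u);
}
```

With this, us_to_rtc_ticks(5) is already a full tick (about 30.5us), which is why a pause of a few microseconds ends up as a busy loop instead.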

Here are some measurements... I tried a few different things:

nrf_delay_us: 3.3mA

This solution: 1.9mA

The final code I'm using is this:

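#include <stdint.h>

__attribute__((noinline)) void DelayMicros(uint32_t micros) {
  // micros must be at least 1: the do/while below decrements before testing,
  // so 0 would wrap around and spin for 2^32 iterations.
  uint32_t count = micros;

  // Inline pause.
  // Measured at 64 cycles per loop -- at 64MHz for nrf52840, one loop is one microsecond.
  // 0xFFFFFFFF / 1 keeps each udiv at the slow end of its cycle range.
  int a = -1;
  int b = 1;
  int c;  // Result is discarded; only the time spent dividing matters.

  // This is 40% lower power consumption compared to nrf_delay_us.
  do {
    // "=r": c is write-only, so it is never read while uninitialized.
    asm volatile("udiv %0, %1, %2" : "=r"(c) : "r"(a), "r"(b));
    asm volatile("udiv %0, %1, %2" : "=r"(c) : "r"(a), "r"(b));
    asm volatile("udiv %0, %1, %2" : "=r"(c) : "r"(a), "r"(b));
    asm volatile("udiv %0, %1, %2" : "=r"(c) : "r"(a), "r"(b));
    asm volatile("udiv %0, %1, %2" : "=r"(c) : "r"(a), "r"(b));
  } while (--count);
}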

There's an overhead of about 14 cycles to call and return from the function and set up the variables.
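As a rough sketch of what that overhead means in practice, the figures above (64 cycles per loop, about 14 cycles of call overhead, 64MHz clock) can be turned into an estimate of the real delay. The function name and constants below are only for illustration, and the estimate applies to micros of 1 or more.

```c
#include <stdint.h>

#define CPU_HZ          64000000u  /* nRF52840 core clock */
#define CYCLES_PER_LOOP 64u        /* measured figure quoted above */
#define CALL_OVERHEAD   14u        /* approximate figure quoted above */

/* Estimated wall-clock time of DelayMicros(micros) in nanoseconds,
 * rounded down. Hypothetical helper for back-of-envelope checks only. */
static uint64_t estimated_delay_ns(uint32_t micros) {
    uint64_t cycles = CALL_OVERHEAD + (uint64_t)micros * CYCLES_PER_LOOP;
    return cycles * 1000000000ull / CPU_HZ;
}
```

For a 5us request this comes out to 334 cycles, or roughly 5.2us, so the fixed overhead is about 4% at that length and shrinks for longer delays.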

Why udiv? My theory was that picking the instruction that takes the most cycles to execute (and that doesn't access memory) would keep the fetch and decode logic idle most of the time. The Cortex-M4 documentation lists udiv at 2-12 cycles, so I've given it operands that maximize the cycle count. I have no idea whether this is the reason it works better, but that's how I ended up trying it.
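One consequence of that 2-12 cycle range is that the loop timing depends on the operands, not just the power draw. Here is a small sketch of the arithmetic. It assumes the 4 cycles left over from the measured 64 (64 minus 5 x 12) are the subs plus the taken branch, and that udiv really hits the documented extremes. Neither of those has been verified here.

```c
#include <stdint.h>

#define UDIVS_PER_LOOP 5u
#define LOOP_OVERHEAD  4u  /* assumed: 64 measured cycles minus 5 * 12 */

/* Cycles for one pass through the delay loop, given how long each udiv takes. */
static uint32_t loop_cycles(uint32_t udiv_cycles) {
    return UDIVS_PER_LOOP * udiv_cycles + LOOP_OVERHEAD;
}
```

loop_cycles(12) gives the measured 64 cycles. loop_cycles(2) gives 14, so with operands that let udiv finish early the same loop would run several times faster and no longer be one microsecond per iteration.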

Just thought I'd share it in case anyone else wants to use it.

note: The graph was taken with BLE central & peripheral active, as I wanted to ensure long cycle count instructions didn't cause problems.

  • Nice post! I can confirm your findings of reduced power consumption on an nRF52832, which I happen to have running with nothing attached to any pins and no errata workarounds; IAR compiler. This is useful.

    Conditions are 3.3V VDD supply, DCDC enabled

    void TestDelayMicros(void)
    {
       NRF_POWER->DCDCEN = 1;
       while(1)
       {
           DelayMicros(1000);  // 1.96mA with VDD supply 3.3V, DCDCEN enabled
         //nrf_delay_us(1000); // 2.61mA with VDD supply 3.3V, DCDCEN enabled
       }
    }

  • Hi jthlim, 
    Thanks for the interesting post. I can confirm what you reported. I checked a little internally, and what I was told is that the consumption of the CPU depends largely on what's put on the bus and on whether the instruction is executed from cache or RAM. 
    The number we provide in the specification was measured with the CPU running the CoreMark test, which I assume involves random values on the bus. 

    When I swap the values of a and b in your function, I can see the power consumption jump drastically, to above what nrf_delay_us() consumes. You can see in the attached graph that the function with a and b swapped is called right after your DelayMicros:

    The next one is with b=1 and a=0xFFFF; the power consumption is much lower and close to what you have. 


    We can clearly see that the value on the bus is really affecting the power consumption. 

    Have you tried different optimization levels to see whether the power consumption changes and whether the timing stays consistent? I think the timing may not be the same when you change the optimization level. 

    How long does the delay need to be? If it's >500us, using a TIMER and putting the CPU into IDLE may be a better option. 

  • As I mentioned in my original post, yes, for any moderate length of time you should use something other than a busy loop. My use case is just for 5us. I'm also using DCDCEN. 

    In your tests, if you swap a and b, you're going to have different power consumption -- that's completely expected. udiv takes anywhere from 2 to 12 clock cycles on the M4, so fetch/decode will be far more active at 2 cycles per instruction than at 12.

    Optimization is currently at -O2, and I have no intention of changing this for my project. It produces the following:

    00027534 <_ZN9Nrf5Clock11DelayMicrosEm>:
    27534: f04f 31ff mov.w r1, #4294967295 @ 0xffffffff
    27538: 2201 movs r2, #1
    2753a: fbb1 f3f2 udiv r3, r1, r2
    2753e: fbb1 f3f2 udiv r3, r1, r2
    27542: fbb1 f3f2 udiv r3, r1, r2
    27546: fbb1 f3f2 udiv r3, r1, r2
    2754a: fbb1 f3f2 udiv r3, r1, r2
    2754e: 3801 subs r0, #1
    27550: d1f3 bne.n 2753a <_ZN9Nrf5Clock11DelayMicrosEm+0x6>
    27552: 4770 bx lr

    At the end of the day, it's not a huge difference as I'm only using it for tiny delays; but afaik no reason not to do it this way :)

  • Hi jthlim,

    I agree. I will forward this internally. Even if it only applies to short delays, I think there are many use cases that would benefit from this lower-power wait. I will let you know when I have feedback. 
