How can you make code run faster? Analysing micro-performance with the nRF52

Working on fast calculation of CRCs, I have an algorithm that uses a precalculated, fixed lookup table to process data one byte at a time, generating a 32-bit CRC. If the current CRC is C and the next byte is b, the calculation it actually does is

C = LOOKUP[ ( C >> 24 ) ^ b ] ^ ( C << 8 )

where C is in a 32-bit register and >> is a logical shift.
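As a C sketch (the function and parameter names are my own), the whole loop is:

    #include <stdint.h>

    uint32_t add_crc_bytes_c(uint32_t crc, const uint8_t *data,
                             uint32_t count, const uint32_t lookup[256])
    {
        while (count--)
            crc = lookup[(crc >> 24) ^ *data++] ^ (crc << 8);
        return crc;
    }

Here's the assembler for the nRF52; it uses the wide 4-byte Thumb-2 encodings: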

    //  on entry
    //
    //  r0         - current CRC
    //  r1         - pointer to byte data to add
    //  r2         - count of bytes to add
    //  r3         - pointer to 256 entry uint32_t CRC lookup table
    //
    //  on exit
    //  r0         - the new CRC value

    add_crc_bytes:
        push { r4, lr }
        tst  r2, r2                     // any bytes to add?
        beq  1f
    0:
        ldrb r4, [r1], #1               // r4 = b = *data++
        eors r0, r0, r4, lsl #24        // r0 = C ^ (b << 24)
        lsrs r4, r0, #24                // r4 = (C >> 24) ^ b
        ldr  r4, [r3, r4, lsl #2]       // r4 = LOOKUP[(C >> 24) ^ b]
        eors r0, r4, r0, lsl #8         // r0 = LOOKUP[...] ^ (C << 8)
        subs r2, #1                     // count--
        bne  0b
    1:
        pop  { r4, pc }

The main loop is 7 instructions, so I tested it by running 50 bytes through it, expecting something like 350 cycles. I was fairly surprised to find it took 800. (DWT->CYCCNT gives you the current cycle count, so this is easy to measure.)
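The measurement itself is just a subtraction of two CYCCNT reads. A minimal sketch, assuming the CMSIS core headers are included and using my own names for the buffer and table:

    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   // power up the DWT unit
    DWT->CTRL        |= DWT_CTRL_CYCCNTENA_Msk;       // start the cycle counter

    uint32_t start  = DWT->CYCCNT;
    crc = add_crc_bytes(crc, buffer, 50, crc_table);  // routine under test
    uint32_t cycles = DWT->CYCCNT - start;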

Single-stepping it, I found that the two loads and the bne each took more than one cycle. The loads must be hitting wait states, and the bne can take extra cycles because the taken branch empties the pipeline, which then has to be refilled. So I wanted to try some things to make this faster, and to quantify where the waits were coming from.

The DWT debug unit on the nRF52 has some useful features: it can count LSU and CPI cycles. An LSU cycle is an additional cycle spent on a load or store, and a CPI cycle is an extra cycle spent on a multi-cycle instruction or an instruction fetch stall. To use them you disable, then re-enable them in the DWT->CTRL register (which clears them to zero), do some work, then read DWT->CPICNT and DWT->LSUCNT.
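A sketch of that sequence (CMSIS names again; note that these two counters are only 8 bits wide, so keep the measured stretch short enough that they can't wrap):

    // Disable then re-enable the counters, which zeroes them
    DWT->CTRL &= ~(DWT_CTRL_CPIEVTENA_Msk | DWT_CTRL_LSUEVTENA_Msk);
    DWT->CTRL |=  (DWT_CTRL_CPIEVTENA_Msk | DWT_CTRL_LSUEVTENA_Msk);

    crc = add_crc_bytes(crc, buffer, 50, crc_table);  // do some work

    uint32_t cpi = DWT->CPICNT & 0xFF;  // fetch stalls and multi-cycle instructions
    uint32_t lsu = DWT->LSUCNT & 0xFF;  // extra load/store cycles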

Adding some code to initialise and read these registers, I found those 800 cycles broke down like this:

     CYCLES       LSU        CPI        Useful
     ======       ===        ===        ======
        800       248        201           351

351 makes sense; that's the 7 cycles per byte I was expecting. But the numbers also show about 5 wait states per byte on loads and 4 per byte on instruction stalls. So how to make it faster?

The byte data being processed is already in RAM, so not much can be done there. The precalculated CRC table, however, was declared static const, so it ends up in the .rodata section, which lives in FLASH. FLASH is slow, so a lot of the memory stalls probably come from the table lookup. Moving the lookup table from FLASH to RAM should help: dropping the const (so it lands in .data) or forcing it into the .data section with an attribute does that.
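For example (a sketch; the table names are mine and the initialiser contents are elided):

    // Without const, an initialised table goes into .data, i.e. RAM:
    static uint32_t crc_table[256] = { /* 256 precalculated entries */ };

    // Or be explicit about the placement with a section attribute:
    static uint32_t crc_table_ram[256] __attribute__((section(".data")))
        = { /* 256 precalculated entries */ };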

The code stalls probably also come from the code being in FLASH, so refilling the pipeline after a branch (what's up with branch prediction here??) takes a few wait states. There are a few things we can try. We can enable the CPU cache: there's 1k of instruction cache right on the chip, and since the routine is small it should all be loaded into the cache on the first run and never need fetching from FLASH again. Another option is to run the code from RAM, which you usually do by putting it in the .fast section. On the nRF52 there are two places to run code from RAM: regular system RAM at 0x20000000 (which is where most linker maps put it by default), or 0x00800000. It's the same RAM, just mapped at two different addresses, but the processor reaches the two mappings over different buses: addresses in the code region (below 0x20000000) are served by the ICODE bus for instruction fetches and the DCODE bus for data, while system RAM at 0x20000000 is reached over the single System bus. Mapped down at 0x00800000, instruction fetches come in over the dedicated ICODE bus, which should be better for code.
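With GCC, placing a routine is just a section attribute on its definition; everything else lives in the linker script. A sketch, assuming your script maps a .fast output section to the 0x00800000 region and the startup code copies it out of FLASH at boot:

    // The C version from earlier, tagged so the linker places it in .fast.
    // Whether .fast lands at 0x00800000 or in system RAM depends on your
    // linker map, and that turns out to matter (see the results below).
    __attribute__((section(".fast"), noinline))
    uint32_t add_crc_bytes_fast(uint32_t crc, const uint8_t *data,
                                uint32_t count, const uint32_t lookup[256])
    {
        while (count--)
            crc = lookup[(crc >> 24) ^ *data++] ^ (crc << 8);
        return crc;
    }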

So the options are: lookup table in FLASH or RAM; processor cache on or off; and code running from FLASH, regular RAM (0x20000000), or CODE RAM (0x00800000). It's not hard to run all 12 combinations and see what you get. Here are the results:

CPU cache OFF:

      CODE     LOOKUP     CYCLES       LSU        CPI        Useful
      ====     ======     ======       ===        ===        ======
      FLASH     FLASH        800       248        201           351
      FLASH       RAM        653       101        201           351
        RAM     FLASH        847       249        247           351
        RAM       RAM        700       102        247           351
    CODERAM     FLASH        748       298         99           351
    CODERAM       RAM        601       151         99           351


CPU cache ON:

      CODE     LOOKUP     CYCLES       LSU        CPI        Useful
      ====     ======     ======       ===        ===        ======
      FLASH     FLASH        698       248         99           351
      FLASH       RAM        551       101         99           351
        RAM     FLASH        850       252        247           351
        RAM       RAM        700       102        247           351
    CODERAM     FLASH        751       301         99           351
    CODERAM       RAM        601       151         99           351

In each case the 'useful work' is 351 cycles, which means the numbers add up (CYCLES = LSU + CPI + Useful in every row). Some of the results make sense; some are quite surprising.

Looking at the LSU column, those are the stalls due to data loads. For the most part, when the lookup table is in FLASH it's 5 wait cycles per byte, and when it's in RAM it's 2. Each byte does two loads (the data byte and the table entry), so we can estimate that a RAM load costs 1 wait cycle and a FLASH load 4. However, when the code is in CODERAM (0x00800000) there's an extra load stall per byte for some reason, which I cannot explain.

With the CPU cache off, running from FLASH gives 4 wait states per byte in processor stalls (the CPI column); that's the cost of fetching instructions from FLASH. Running from CODERAM you get about 2 stalls per byte, because RAM is quick to read and CODERAM instruction fetches come over the ICODE bus.

The strange result is that when running from normal system RAM, the processor stalls go UP by about a cycle per byte. I believe what's happening is that both the instruction fetches and the data loads from system RAM go over the same System bus (and a load is the first thing the loop does), so the bus has to round-robin between fetching instructions and fetching data. I've also noticed that when two successive accesses hit quite different addresses you get an extra cycle or two, presumably while the memory unit decodes the address. So with lots of data loads, code in normal system RAM doesn't do well. If you do want to put code in the .fast section, make sure it goes into the ICODE-bus-mapped region at 0x00800000, not into normal system RAM. (Again, note that many linker maps put .fast code in regular system RAM by default.)

Turning the CPU cache on (NVMC->ICACHECNF = 0x01; the cache belongs to the NVMC, the flash controller) makes FLASH as fast as CODERAM, just a couple of cycles per byte to refill the pipeline. However, it doesn't help code running from system RAM at all. The cache only serves instruction fetches from the code region; code running from system RAM is fetched over the System bus, so it never gets cached. Another strike against putting code in regular system RAM.
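Using the bitfield names from the nRF52 MDK headers, enabling it is one line (a sketch; check the names against your SDK version):

    // Enable the NVMC instruction cache
    NRF_NVMC->ICACHECNF = NVMC_ICACHECNF_CACHEEN_Enabled << NVMC_ICACHECNF_CACHEEN_Pos;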

So in the end, for this particular example, keeping the lookup table in RAM (obvious), leaving the code in FLASH, and enabling the cache gave the best performance. I expected code in CODERAM to be just as fast, but there seems to be one extra load stall per byte in that case, which I cannot explain.

My takeaways from this were:

  1. nRF52 performance is complicated. The processor itself is fast, but memory accesses can stall it quite severely. Keep frequently-looked-up data in RAM.
  2. Running code from RAM instead of FLASH doesn't necessarily help you. If you are going to run from RAM, make sure the code is placed in the dual-mapped space down at 0x800000 where the processor can access it on a separate bus, not in regular system RAM.
  3. The CPU cache can dramatically increase performance, turn it on and try it.
  4. The DWT unit in the nRF52 is very powerful and can give you a lot of insight into how your code runs at the processor level. There are more counters you can turn on (sleep, exception handling, code folding), and I believe you can push some of this information out over SWO via the TPIU... although that's going to take more reading to figure out.

Remember, faster processing means less power(*), so spend some time making your code fast and it becomes more battery-friendly in the process.

(*) Yes, the CPU cache itself consumes power, so there is a tradeoff between speed and current.