Arduino Nano 33 BLE (nRF52840) seems not to run on max. clock speed

We're porting some software to the Arduino Nano 33 BLE Sense. This board comes with an nRF52840 clocked at up to 64 MHz. However, the same code runs about 2x slower on this board than on the ST L475VG platform, which also has a Cortex-M4F (running at 80 MHz). This makes me suspect that we're not running the board at its maximum frequency (but rather at 16 MHz or something). The only place where the main clock seems to be defined is here: https://github.com/arduino/ArduinoCore-nRF528x-mbedos/blob/beac74ca3cd9d07363f66cf9cda6b143e4385cd2/cores/arduino/mbed/targets/TARGET_NORDIC/TARGET_NRF5x/TARGET_NRF51/TARGET_MCU_NRF51822_UNIFIED/sdk/nrf_drv_config.h#L60, which should be 64 MHz. Both targets are using Mbed OS underneath.

Here's a code sample of a pretty simple matrix multiplication in software. We see a similar slowdown when using CMSIS-DSP:

#include "mbed.h"

/**
 * A matrix structure that allocates a matrix on the heap.
 * Freeing happens by calling `delete` on the object or letting the object go out of scope.
 */
typedef struct ei_matrix2 {
    float *buffer;
    uint16_t rows;
    uint16_t cols;
    bool buffer_managed_by_me;

    /**
     * Create a new matrix
     * @param n_rows Number of rows
     * @param n_cols Number of columns
     * @param a_buffer Buffer, if not provided we'll alloc on the heap
     */
    ei_matrix2(
        uint16_t n_rows,
        uint16_t n_cols,
        float *a_buffer = NULL
        )
    {
        if (a_buffer) {
            buffer = a_buffer;
            buffer_managed_by_me = false;
        }
        else {
            buffer = (float*)calloc(n_rows * n_cols, sizeof(float));
            buffer_managed_by_me = true;
        }
        rows = n_rows;
        cols = n_cols;
    }

    ~ei_matrix2() {
        if (buffer && buffer_managed_by_me) {
            free(buffer);
        }
    }
} matrix2_t;

/**
* Multiply two matrices lazily per row in matrix 1 (MxN * NxK matrix)
* @param i matrix1 row index
* @param row matrix1 row
* @param matrix1_cols matrix1 row size (1xN)
* @param matrix2 Pointer to matrix2 (NxK)
* @param out_matrix Pointer to out matrix (MxK)
* @returns 0 if OK
*/
static inline int dot_by_row(int i, float *row, size_t matrix1_cols, matrix2_t *matrix2, matrix2_t *out_matrix) {
    if (matrix1_cols != matrix2->rows) {
        return -1;
    }

    for (size_t j = 0; j < matrix2->cols; j++) {
        for (size_t k = 0; k < matrix1_cols; k++) {
            out_matrix->buffer[i * matrix2->cols + j] +=
                row[k] * matrix2->buffer[k * matrix2->cols + j];
        }
    }

    return 0;
}

/**
* Multiply two matrices (MxN * NxK matrix)
* @param matrix1 Pointer to matrix1 (MxN)
* @param matrix2 Pointer to quantized matrix2 (NxK)
* @param out_matrix Pointer to out matrix (MxK)
* @returns 0 if OK
*/
static int dot(matrix2_t *matrix1,
                matrix2_t *matrix2,
                matrix2_t *out_matrix)
{
    if (matrix1->cols != matrix2->rows) {
        return -1;
    }

    // the output must be (rows of matrix1) x (cols of matrix2)
    if (matrix1->rows != out_matrix->rows || matrix2->cols != out_matrix->cols) {
        return -1;
    }

    memset(out_matrix->buffer, 0, out_matrix->rows * out_matrix->cols * sizeof(float));

    for (size_t i = 0; i < matrix1->rows; i++) {
        dot_by_row(i,
            matrix1->buffer + (i * matrix1->cols),
            matrix1->cols,
            matrix2,
            out_matrix);
    }

    return 0;
}

int main() {
    Timer t;
    t.start();
    uint64_t ticks = 0;
    while (1) {
        if (++ticks > 10000) {
            break;
        }


        float matrix1_buffer[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        float matrix2_buffer[] = { 3, 2, 1, 6, 5, 4, 9, 8, 7 };
        matrix2_t matrix1(3, 3, matrix1_buffer);
        matrix2_t matrix2(3, 3, matrix2_buffer);
        matrix2_t out_matrix(3, 3);

        int r = dot(&matrix1, &matrix2, &out_matrix);
        if (r != 0) {
            printf("matrix multiply failed (%d)\n", r);
            break;
        }
    }
    t.stop();
    printf("matrix multiplications took %d ms. (%llu)\n", t.read_ms(), ticks);
}

This takes 326 ms on the ST L475, and 2403 ms on the Nano 33 BLE.

Does anyone have an idea how we can check the actual frequency without an oscilloscope?
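
One way to check the core clock without a scope is to compare the Cortex-M4 DWT cycle counter against a timer with a known rate. The following is a minimal sketch, not something from the original post: `estimate_core_hz` is a hypothetical helper, and the target-side usage in the comment assumes the CMSIS `DWT`/`CoreDebug` definitions and Mbed's `wait_us` are available in the build.

```cpp
#include <cstdint>

// Estimate the core clock from two cycle-counter samples taken elapsed_us apart.
// Unsigned subtraction keeps the delta correct across one 32-bit wraparound.
uint64_t estimate_core_hz(uint32_t cycles_start, uint32_t cycles_end, uint32_t elapsed_us) {
    if (elapsed_us == 0) {
        return 0; // no interval, nothing to estimate
    }
    uint32_t delta = cycles_end - cycles_start;
    return (uint64_t)delta * 1000000ULL / elapsed_us;
}

// Assumed usage on the target (not compiled here):
//   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the DWT block
//   DWT->CYCCNT = 0;
//   DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start counting core cycles
//   uint32_t c0 = DWT->CYCCNT;
//   wait_us(10000);                                  // reference interval from the Mbed ticker
//   uint32_t c1 = DWT->CYCCNT;
//   printf("core clock ~ %llu Hz\n",
//          (unsigned long long)estimate_core_hz(c0, c1, 10000));
```

The result is only as accurate as the reference interval, so a longer wait gives a better estimate as long as the 32-bit counter does not wrap more than once.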

Edit: I managed to speed this up to 786 ms by enabling `-mfpu=fpv4-sp-d16`, setting `-mfloat-abi=hard` (over softfp), and placing and executing the matrix functions in RAM. Still a significant slowdown unfortunately.

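
To confirm which floating-point settings actually reached the compiler, the predefined macros can be inspected at build time. This is a rough sketch under the assumption of a GCC or Clang toolchain for 32-bit ARM, where `__arm__`, `__SOFTFP__`, `__ARM_PCS_VFP` and `__ARM_FP` are predefined; `float_build_mode` is a hypothetical helper name and other compilers may use different macros.

```cpp
// Report which floating-point configuration this translation unit was built with.
// Assumes GCC/Clang predefined macros for 32-bit ARM targets.
const char *float_build_mode() {
#if !defined(__arm__)
    return "not a 32-bit ARM build";
#elif defined(__SOFTFP__)
    return "soft-float: FP math is done in library calls";
#elif defined(__ARM_PCS_VFP)
    return "hard-float ABI: FP instructions, FP arguments in FPU registers";
#elif defined(__ARM_FP)
    return "softfp ABI: FP instructions, FP arguments in core registers";
#else
    return "ARM build, but no FP macros recognised";
#endif
}
```

Printing `float_build_mode()` once at startup shows whether the flags were picked up by the file that contains the hot loop.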
  • Hi Jan, 

    The CPU frequency of the nRF52840 is not configurable, so there must be another explanation for the slower execution. Could it be that the FPU on the nRF52840 isn't enabled in the compilation settings? How are you compiling the source code? GCC?

    I copy pasted this from another DevZone case:

    You can enable the FPU in your compiler setting.

    For Keil and IAR, you can find the "hardware floating point" option in the project settings, while in GCC you set this in the makefile: 

    CFLAGS += -mfloat-abi=hard -mfpu=fpv4-sp-d16

    LDFLAGS += -mfloat-abi=hard -mfpu=fpv4-sp-d16

     

    All the above methods will define __FPU_USED, which will enable the FPU in system_nrf52.c:

    /* Enable the FPU if the compiler used floating point unit instructions.
     * __FPU_USED is a MACRO defined by the compiler. Since the FPU consumes
     * energy, remember to disable FPU use in the compiler if floating point
     * unit operations are not used in your code. */
    #if (__FPU_USED == 1)
        SCB->CPACR |= (3UL << 20) | (3UL << 22);
        __DSB();
        __ISB();
    #endif

     

    __DSB and __ISB are intrinsics, which are C-wrapped assembly-calls.

    Their function is to provide "Data synchronization barrier" and "Instruction synchronization barrier", to ensure that the write of the memory/instruction is performed before moving along in the execution of code.

    The ARM infocenter website didn't want to load at my end, but I found this stackoverflow thread that explains the functionality:

    http://stackoverflow.com/questions/15491751/real-life-use-cases-of-barriers-dsb-dmb-isb-in-arm
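
The CPACR write shown in this reply can also be checked at runtime. A minimal sketch follows, assuming the register value is read from `SCB->CPACR` on the target; `fpu_access_enabled` and `kCpacrFpuFullAccess` are hypothetical names, not part of the Nordic SDK or Mbed OS.

```cpp
#include <cstdint>

// CPACR fields for coprocessors CP10 and CP11 (the FPU) sit in bits 20-23.
// 0b11 in both fields means full access, which is what the startup code sets.
constexpr uint32_t kCpacrFpuFullAccess = (3UL << 20) | (3UL << 22);

// Returns true only when both CP10 and CP11 are set to full access.
bool fpu_access_enabled(uint32_t cpacr) {
    return (cpacr & kCpacrFpuFullAccess) == kCpacrFpuFullAccess;
}

// Assumed usage on the target (not compiled here):
//   printf("FPU enabled: %d\n", fpu_access_enabled(SCB->CPACR));
```

This only shows that the FPU is accessible. Whether the compiler emitted FPU instructions for the hot loop still depends on the `-mfpu` and `-mfloat-abi` flags.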

  • Hi Bjorn, thanks a lot for your reply. I indeed realized that `-mfpu=fpv4-sp-d16` was not declared in the compile options for my project (I assumed this would have been set automatically, but apparently not), and adding it sped up execution significantly. With GCC9, `-O3`, `-mfloat-abi=hard` (over softfp, which I had before) and by placing the functions in RAM, this yields an execution time of 786 ms. Still significantly higher than on the ST target unfortunately.

  • Hmm, CoreMark-wise the STM32L475xx (3.42 CoreMark/MHz @ 80 MHz) is faster than the nRF52840 (3.3 CoreMark/MHz, running CoreMark from flash, cache enabled), which is expected given that it runs at 80 MHz vs. 64 MHz on the nRF52, but 2 times slower execution on the nRF52840 seems like a lot.

    I was thinking that the cache might not be enabled, but if you are running the code from RAM then it can't be that. I see that the STM32L475xx has an Adaptive real-time memory accelerator (ART Accelerator), which may improve the execution time on the ST compared to the nRF52840.

  • Thanks Bjorn, Arduino was pointing me in the direction of the ART as well. I've ordered an nRF52-DK and will do some experiments on that with the Nordic SDK in the next few days.
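
To separate the clock-speed difference from everything else, the timings reported in this thread can be converted to core cycles per matrix multiplication. The sketch below assumes both parts really run at their nominal clocks (80 MHz and 64 MHz) and that the benchmark loop performs 10000 multiplications; `cycles_per_iteration` is a hypothetical helper.

```cpp
// Convert a measured wall time into core cycles per benchmark iteration.
// clock_hz is the assumed (nominal) core clock, not a measured value.
double cycles_per_iteration(double elapsed_ms, double clock_hz, double iterations) {
    return elapsed_ms * (clock_hz / 1000.0) / iterations;
}

// Figures taken from the thread:
//   cycles_per_iteration(326.0,  80e6, 10000.0)  // ST L475:             2608 cycles
//   cycles_per_iteration(786.0,  64e6, 10000.0)  // nRF52840, hard FPU:  about 5030 cycles
//   cycles_per_iteration(2403.0, 64e6, 10000.0)  // nRF52840, original:  about 15379 cycles
```

Under those assumptions the nRF52840 still spends nearly twice as many cycles per multiplication as the ST part, so the gap is not explained by the 80 MHz vs. 64 MHz clock alone.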
