Arduino Nano 33 BLE (nRF52840) seems not to run on max. clock speed

We're porting some software to the Arduino Nano 33 BLE Sense. This board comes with an nRF52840 clocked at a maximum of 64 MHz. However, the same code runs about 2x slower on this board than on the ST L475VG platform, which also has a Cortex-M4F (running at 80 MHz). This makes me suspect that the board is not running at its maximum frequency (but rather at 16 MHz or so). The only place where the main clock seems to be defined is here: https://github.com/arduino/ArduinoCore-nRF528x-mbedos/blob/beac74ca3cd9d07363f66cf9cda6b143e4385cd2/cores/arduino/mbed/targets/TARGET_NORDIC/TARGET_NRF5x/TARGET_NRF51/TARGET_MCU_NRF51822_UNIFIED/sdk/nrf_drv_config.h#L60, and that should be 64 MHz. Both targets are using Mbed OS underneath.

Here's a code sample of a fairly simple matrix multiplication in software. We see a similar slowdown when using CMSIS-DSP:

#include "mbed.h"

/**
 * A matrix structure that allocates a matrix on the heap.
 * Freeing happens by calling `delete` on the object or letting the object go out of scope.
 */
typedef struct ei_matrix2 {
    float *buffer;
    uint16_t rows;
    uint16_t cols;
    bool buffer_managed_by_me;

    /**
     * Create a new matrix
     * @param n_rows Number of rows
     * @param n_cols Number of columns
     * @param a_buffer Buffer, if not provided we'll alloc on the heap
     */
    ei_matrix2(
        uint16_t n_rows,
        uint16_t n_cols,
        float *a_buffer = NULL
        )
    {
        if (a_buffer) {
            buffer = a_buffer;
            buffer_managed_by_me = false;
        }
        else {
            buffer = (float*)calloc(n_rows * n_cols, sizeof(float));
            buffer_managed_by_me = true;
        }
        rows = n_rows;
        cols = n_cols;
    }

    ~ei_matrix2() {
        if (buffer && buffer_managed_by_me) {
            free(buffer);
        }
    }
} matrix2_t;

/**
* Multiply two matrices lazily per row in matrix 1 (MxN * NxK matrix)
* @param i matrix1 row index
* @param row matrix1 row
* @param matrix1_cols matrix1 row size (1xN)
* @param matrix2 Pointer to matrix2 (NxK)
* @param out_matrix Pointer to out matrix (MxK)
* @returns 0 if OK
*/
static inline int dot_by_row(int i, float *row, size_t matrix1_cols, matrix2_t *matrix2, matrix2_t *out_matrix) {
    if (matrix1_cols != matrix2->rows) {
        return -1;
    }

    for (size_t j = 0; j < matrix2->cols; j++) {
        for (size_t k = 0; k < matrix1_cols; k++) {
            out_matrix->buffer[i * matrix2->cols + j] +=
                row[k] * matrix2->buffer[k * matrix2->cols + j];
        }
    }

    return 0;
}

/**
* Multiply two matrices (MxN * NxK matrix)
* @param matrix1 Pointer to matrix1 (MxN)
* @param matrix2 Pointer to matrix2 (NxK)
* @param out_matrix Pointer to out matrix (MxK)
* @returns 0 if OK
*/
static int dot(matrix2_t *matrix1,
                matrix2_t *matrix2,
                matrix2_t *out_matrix)
{
    if (matrix1->cols != matrix2->rows) {
        return -1;
    }

    // the output matrix must be (rows of matrix1) x (cols of matrix2)
    if (matrix1->rows != out_matrix->rows || matrix2->cols != out_matrix->cols) {
        return -1;
    }

    memset(out_matrix->buffer, 0, out_matrix->rows * out_matrix->cols * sizeof(float));

    for (size_t i = 0; i < matrix1->rows; i++) {
        dot_by_row(i,
            matrix1->buffer + (i * matrix1->cols),
            matrix1->cols,
            matrix2,
            out_matrix);
    }

    return 0;
}

int main() {
    Timer t;
    t.start();
    uint64_t ticks = 0;
    while (1) {
        if (++ticks > 10000) {
            break;
        }


        float matrix1_buffer[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        float matrix2_buffer[] = { 3, 2, 1, 6, 5, 4, 9, 8, 7 };
        matrix2_t matrix1(3, 3, matrix1_buffer);
        matrix2_t matrix2(3, 3, matrix2_buffer);
        matrix2_t out_matrix(3, 3);

        int r = dot(&matrix1, &matrix2, &out_matrix);
        if (r != 0) {
            printf("matrix multiply failed (%d)\n", r);
            break;
        }
    }
    t.stop();
    printf("matrix multiplications took %d ms. (%llu)\n", t.read_ms(), ticks);
}

This takes 326 ms on the ST L475 and 2403 ms on the Nano 33 BLE.

Does anyone have an idea how we can check the actual frequency without an oscilloscope?
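
One possible way to sanity-check the core clock without a scope, as a rough sketch: count CPU cycles with the Cortex-M4 DWT cycle counter over an interval timed by an Mbed `Timer`, then divide. The helper names `estimate_core_hz` and `measure_core_hz` are made up for this example. The device-only part assumes the build defines `__MBED__` and that `mbed.h` pulls in the CMSIS core header providing `DWT` and `CoreDebug`. Note that the timer and the CPU may be derived from the same oscillator, so this shows the ratio between core clock and timer tick rather than an independent absolute measurement.

```cpp
#include <cstdint>

// Pure helper: convert a cycle count measured over `elapsed_us` microseconds
// into an estimated core frequency in Hz. Returns 0 for an empty interval.
static uint64_t estimate_core_hz(uint32_t cycles, uint32_t elapsed_us) {
    if (elapsed_us == 0) {
        return 0;
    }
    return (static_cast<uint64_t>(cycles) * 1000000ULL) / elapsed_us;
}

#if defined(__MBED__)
// Device-only part (assumption: built with Mbed OS for a Cortex-M4 target).
#include "mbed.h"

// Hypothetical helper, not part of Mbed OS.
static uint64_t measure_core_hz() {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the DWT block
    DWT->CYCCNT = 0;                                  // reset the cycle counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;              // start counting cycles

    Timer t;
    t.start();
    uint32_t start_cycles = DWT->CYCCNT;
    wait_us(100000);                                  // busy-wait roughly 100 ms
    uint32_t cycles = DWT->CYCCNT - start_cycles;     // unsigned math handles wrap
    t.stop();

    return estimate_core_hz(cycles, static_cast<uint32_t>(t.read_us()));
}
#endif
```

Calling `measure_core_hz()` once at the start of `main()` on both boards and printing the result should show whether the Nano 33 BLE is anywhere near 64 MHz. The estimate is only approximate, since the timer start/stop calls are not perfectly aligned with the counter reads.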

Edit: I managed to speed this up to 786 ms by enabling `-mfpu=fpv4-sp-d16`, setting `-mfloat-abi=hard` (instead of `softfp`), and placing and executing the matrix functions in RAM. Unfortunately, that is still a significant slowdown.

  • Hi Jan, 

    The CPU frequency of the nRF52840 is not configurable, so there must be another explanation for the slower execution. Could it be that the FPU on the nRF52840 isn't enabled in the compilation settings? How are you compiling the source code? GCC?

    I copy pasted this from another DevZone case:

    You can enable the FPU in your compiler setting.

    For Keil and IAR, you can find the "hardware floating point" option in the project settings, while in GCC you set this in the makefile: 

    CFLAGS += -mfloat-abi=hard -mfpu=fpv4-sp-d16
    LDFLAGS += -mfloat-abi=hard -mfpu=fpv4-sp-d16

     

    All the above methods will define __FPU_USED, which will enable the FPU in system_nrf52.c:

    /* Enable the FPU if the compiler used floating point unit instructions.
     * __FPU_USED is a MACRO defined by the compiler. Since the FPU consumes
     * energy, remember to disable FPU use in the compiler if floating point
     * unit operations are not used in your code. */
    #if (__FPU_USED == 1)
    SCB->CPACR |= (3UL << 20) | (3UL << 22);
    __DSB();
    __ISB();
    #endif

     

    __DSB and __ISB are intrinsics, i.e. C wrappers around single assembly instructions.

    They provide a data synchronization barrier and an instruction synchronization barrier: the DSB ensures that the preceding register write has completed, and the ISB flushes the pipeline so that the following instructions execute with the new setting in effect.

    The ARM infocenter website didn't want to load at my end, but I found this stackoverflow thread that explains the functionality:

    http://stackoverflow.com/questions/15491751/real-life-use-cases-of-barriers-dsb-dmb-isb-in-arm
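
    To check what the build actually does, something along these lines might help. It is only a sketch: `fpu_cpacr_mask` and `float_abi_name` are names made up for this example, and the ABI check relies on the `__arm__`, `__ARM_PCS_VFP`, `__SOFTFP__` and `__ARM_FP` macros that GCC and Clang predefine for ARM targets (other toolchains may behave differently).

```cpp
#include <cstdint>

// Bit mask written to SCB->CPACR in the snippet above: each coprocessor has a
// 2-bit access field, and 0b11 grants full access. CP10 sits at bits 20-21 and
// CP11 at bits 22-23; together they control access to the FPU.
static constexpr uint32_t fpu_cpacr_mask() {
    return (3UL << 20) | (3UL << 22);
}

// Report which float ABI this translation unit was compiled for, based on
// macros predefined by the compiler (assumption: GCC or Clang).
static const char *float_abi_name() {
#if !defined(__arm__)
    return "not a 32-bit ARM build";
#elif defined(__ARM_PCS_VFP)
    return "hard";    // -mfloat-abi=hard: floats are passed in FPU registers
#elif defined(__SOFTFP__)
    return "soft";    // -mfloat-abi=soft: no FPU instructions are emitted
#elif defined(__ARM_FP)
    return "softfp";  // FPU instructions, but soft-float calling convention
#else
    return "unknown";
#endif
}
```

    Printing `float_abi_name()` at startup on both boards, and comparing `SCB->CPACR & fpu_cpacr_mask()` against `fpu_cpacr_mask()` on the device, would show whether the two builds really use the same floating point settings.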
