Arduino Nano 33 BLE (nRF52840) does not seem to run at max. clock speed

Question

We're porting some software to the Arduino Nano 33 BLE Sense. This board comes with an nRF52840 clocked at a maximum of 64MHz. However, the same code runs about 2x slower on this board than on the ST L475VG platform, which also has a Cortex-M4F (running at 80MHz). This makes me suspect that the board is not running at its maximum frequency (but rather at 16MHz or so). The only place where the main clock seems to be defined is here: https://github.com/arduino/ArduinoCore-nRF528x-mbedos/blob/beac74ca3cd9d07363f66cf9cda6b143e4385cd2/cores/arduino/mbed/targets/TARGET_NORDIC/TARGET_NRF5x/TARGET_NRF51/TARGET_MCU_NRF51822_UNIFIED/sdk/nrf_drv_config.h#L60, which should be 64MHz. Both targets are using Mbed OS underneath.

Here's a code sample of a pretty simple matrix multiplication in software. We see a similar slowdown when using CMSIS-DSP:

#include "mbed.h"

/**
 * A matrix structure that allocates a matrix on the heap.
 * Freeing happens by calling `delete` on the object or letting the object go out of scope.
 */
typedef struct ei_matrix2 {
    float *buffer;
    uint16_t rows;
    uint16_t cols;
    bool buffer_managed_by_me;

    /**
     * Create a new matrix
     * @param n_rows Number of rows
     * @param n_cols Number of columns
     * @param a_buffer Buffer, if not provided we'll alloc on the heap
     */
    ei_matrix2(
        uint16_t n_rows,
        uint16_t n_cols,
        float *a_buffer = NULL
        )
    {
        if (a_buffer) {
            buffer = a_buffer;
            buffer_managed_by_me = false;
        }
        else {
            buffer = (float*)calloc(n_rows * n_cols, sizeof(float));
            buffer_managed_by_me = true;
        }
        rows = n_rows;
        cols = n_cols;
    }

    ~ei_matrix2() {
        if (buffer && buffer_managed_by_me) {
            free(buffer);
        }
    }
} matrix2_t;

/**
* Multiply two matrices lazily per row in matrix 1 (MxN * NxK matrix)
* @param i matrix1 row index
* @param row matrix1 row
* @param matrix1_cols matrix1 row size (1xN)
* @param matrix2 Pointer to matrix2 (NxK)
* @param out_matrix Pointer to out matrix (MxK)
* @returns 0 if OK
*/
static inline int dot_by_row(int i, float *row, size_t matrix1_cols, matrix2_t *matrix2, matrix2_t *out_matrix) {
    if (matrix1_cols != matrix2->rows) {
        return -1;
    }

    for (size_t j = 0; j < matrix2->cols; j++) {
        for (size_t k = 0; k < matrix1_cols; k++) {
            out_matrix->buffer[i * matrix2->cols + j] +=
                row[k] * matrix2->buffer[k * matrix2->cols + j];
        }
    }

    return 0;
}

/**
* Multiply two matrices (MxN * NxK matrix)
* @param matrix1 Pointer to matrix1 (MxN)
* @param matrix2 Pointer to quantized matrix2 (NxK)
* @param out_matrix Pointer to out matrix (MxK)
* @returns 0 if OK
*/
static int dot(matrix2_t *matrix1,
                matrix2_t *matrix2,
                matrix2_t *out_matrix)
{
    if (matrix1->cols != matrix2->rows) {
        return -1;
    }

    // out_matrix must have matrix1's row count and matrix2's column count
    if (matrix1->rows != out_matrix->rows || matrix2->cols != out_matrix->cols) {
        return -1;
    }

    memset(out_matrix->buffer, 0, out_matrix->rows * out_matrix->cols * sizeof(float));

    for (size_t i = 0; i < matrix1->rows; i++) {
        dot_by_row(i,
            matrix1->buffer + (i * matrix1->cols),
            matrix1->cols,
            matrix2,
            out_matrix);
    }

    return 0;
}

int main() {
    Timer t;
    t.start();
    uint64_t ticks = 0;
    while (1) {
        if (++ticks > 10000) {
            break;
        }


        float matrix1_buffer[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        float matrix2_buffer[] = { 3, 2, 1, 6, 5, 4, 9, 8, 7 };
        matrix2_t matrix1(3, 3, matrix1_buffer);
        matrix2_t matrix2(3, 3, matrix2_buffer);
        matrix2_t out_matrix(3, 3);

        int r = dot(&matrix1, &matrix2, &out_matrix);
        if (r != 0) {
            printf("matrix multiply failed (%d)\n", r);
            break;
        }
    }
    t.stop();
    printf("matrix multiplications took %d ms. (%llu)\n", t.read_ms(), ticks);
}

This takes 326ms. on the ST L475, and 2403ms. on the Nano 33 BLE.

Does anyone have an idea how we can check the actual clock frequency without an oscilloscope?
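One approach that might work is to compare the Cortex-M4 DWT cycle counter against an Mbed `Timer`, and to print the CMSIS `SystemCoreClock` variable next to it. The following is a minimal sketch, not verified on this board. It assumes the CMSIS core definitions (`DWT`, `CoreDebug`, `SystemCoreClock`) are pulled in through `mbed.h`, that the build defines `__MBED__`, and that the `Timer` time base itself is accurate. `estimate_mhz` is a hypothetical helper name.

```cpp
#include <cstdint>

// Hypothetical helper: average core frequency in MHz from a cycle count and
// the elapsed time in microseconds (cycles per microsecond equals MHz).
static double estimate_mhz(uint32_t cycles, uint32_t elapsed_us) {
    return elapsed_us ? static_cast<double>(cycles) / elapsed_us : 0.0;
}

#if defined(__MBED__)  // assumption: this macro is set by the Mbed build
#include "mbed.h"

int main() {
    // Enable the DWT cycle counter (CMSIS names from core_cm4.h).
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

    Timer t;
    t.start();
    uint32_t start_cycles = DWT->CYCCNT;

    // Busy-wait for about 100 ms so the core never sleeps while we count.
    while (t.read_us() < 100000) { }

    uint32_t cycles = DWT->CYCCNT - start_cycles;
    uint32_t elapsed_us = (uint32_t)t.read_us();

    // Integer formatting only, in case float printf support is disabled.
    printf("SystemCoreClock reports %lu Hz\n", (unsigned long)SystemCoreClock);
    printf("measured about %d MHz\n", (int)(estimate_mhz(cycles, elapsed_us) + 0.5));
}
#endif
```

Note that this only checks the core clock relative to the timer's time base. If both were derived from the same misconfigured source, the result would still look correct.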

Edit: I managed to speed this up to 786ms. by enabling `-mfpu=fpv4-sp-d16`, setting `-mfloat-abi=hard` (instead of `softfp`), and placing and executing the matrix functions in RAM. Unfortunately, that is still a significant slowdown.