Precise data bus error in psa_generate_random (mutex.c\nrf_cc3xx_platform_mutex_zephyr.c)

Hello,

ncs 2.6.0

I'm encountering a weird bus data error when I'm calling psa_generate_random inside a work queue thread (it has name "api"):

00> [00:01:54.764,038] <err> os: ***** BUS FAULT *****
00> [00:01:54.764,068] <err> os:   Precise data bus error
00> [00:01:54.764,068] <err> os:   BFAR Address: 0x40f04fc3
00> [00:01:54.764,099] <err> os: r0/a1:  0x40f04fb5  r1/a2:  0xffffffff  r2/a3:  0x20003678
00> [00:01:54.764,099] <err> os: r3/a4:  0x00000000 r12/ip:  0xffffffff r14/lr:  0x00014281
00> [00:01:54.764,099] <err> os:  xpsr:  0xa1000000
00> [00:01:54.764,129] <err> os: Faulting instruction address (r15/pc): 0x0002b4aa
00> [00:01:54.764,160] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
00> [00:01:54.764,190] <err> os: Current thread: 0x20003678 (api)
 

The code I'm running inside work queue is the following:

    uint8_t test[16];
    psa_status_t status;

    status = psa_generate_random(test, ARRAY_SIZE(test));
    if (status != PSA_SUCCESS) {
        LOG_ERR("gen rand err: %d", status);
        return;
    }

    LOG_HEXDUMP_INF(test, ARRAY_SIZE(test), "test data");

Faulting instruction address points to: workspace/zephyr/kernel/mutex.c:147 , which contains the following code (please see my comment in the code):

	SYS_PORT_TRACING_OBJ_FUNC_BLOCKING(k_mutex, lock, mutex, timeout);

	new_prio = new_prio_for_inheritance(_current->base.prio,
					    mutex->owner->base.prio); // THIS IS LINE #147

	LOG_DBG("adjusting prio up on mutex %p", mutex);

r14/lr points to: workspace/nrfxlib/crypto/nrf_cc310_platform/src/nrf_cc3xx_platform_mutex_zephyr.c:306, which contains the following code (please see my comment again):

        ret = k_mutex_lock(p_mutex, K_FOREVER);
        if (ret == 0) { // !!!! This is line #306 !!!!
            return NRF_CC3XX_PLATFORM_SUCCESS;
        } else {
            return NRF_CC3XX_PLATFORM_ERROR_MUTEX_FAILED;
        }

If I comment psa_generate_random - everything works just fine.

The fun part of this bug - it appears only when the data published to the work queue is from USB CDC ACM. When the same data comes from BLE, the psa_generate_random (in the same place) works just fine. So I might assume it could be somehow related to USB.

Please help to resolve this issue. 

Kind regards,

Oleh Hordiichuk 

  • Hi,

    Due to the Easter holidays in Norway, we are less staffed than usual. You can expect an answer sometime next week,

    Sorry for the inconvenience and thanks for your patience. 

    Regards, 
    Amanda

  • Hi, 

    it appears only when the data published to the work queue is from USB CDC ACM.

    I wonder what you mean by that. Is the "published" part done in the work context or does it come from another thread?

    Could you extend these arrays or put them outside the stack (make it static) to check if corruption does not happen in our code?

    Regards,
    Amanda H.

  • Hello Amanda,

    as suggested, I made the variables static and the result is same (bus error in the same place):

    void rx_work(struct k_work *item) {
        static uint8_t test[16];
        static psa_status_t status;
    
        status = psa_generate_random(test, ARRAY_SIZE(test));
        if (status != PSA_SUCCESS) {
            LOG_ERR("gen rand err: %d", status);
            return;
        }
        LOG_HEXDUMP_INF(test, ARRAY_SIZE(test), "test data");
        
    /// Here starts our code///
    

    > I wonder what you mean by that. Is the "published" part done in the work context or does it come from another thread?

    Let me clarify this. We process data that come either from BLE (characteristic write without response callback) or USB CDC ACM (in callback set via uart_irq_callback_set). These data copied into heap and submitted to the workqueue. The function that processes it is called "rx_work" as I pasted snippet above. I put psa_generate_random into very beginning of the "rx_work" in order to ensure our code doesn't impact on stack.

    So when the work is submitted from BLE - this code works well without BUS errors. But when it's submitted from USB CDC ACM callback - it crashes.

    If we remove psa_generate_random from the rx_work - we don't encounter any bus errors neither from USB CDC ACM nor from BLE data. 

    Kind regards,

    Oleh Hordiichuk

  • After additional investigations we found out that this was stack overflow issue in USB thread. So it's not related to psa_generate_random. However, I have few additional questions in that regard:

    1. What's so special about psa_generate_random that the kernel has detected Bus fault, but it was unable to detect it many other places of our\zephyr\nordic code, which was called by the same thread multiple times. Why only this function was capable to raise the error?

    2. It would be great if Nordic could create a guide on how to debug those kind of bugs, since they're very tricky and time consuming. Are there any best practices?

    For those who will encounter with similar bugs, the following config options may help you with debugging:

    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_STACK_INFO=y
    CONFIG_THREAD_MONITOR=y
    CONFIG_THREAD_ANALYZER_ISR_STACK_USAGE=y
    CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5
    CONFIG_THREAD_ANALYZER_AUTO=y
    CONFIG_THREAD_ANALYZER_USE_LOG=y
    CONFIG_STACK_CANARIES=y

Related