802.15.4 stack unusable due to immediate crashes stemming from unsound swi usage in driver code

Question

In theory all driver calls are suspect, but for example I see crashes in nrf_802154_buffer_free_raw, at this assert:

    result = nrf_802154_request_buffer_free(p_data);
    NRF_802154_ASSERT(result);

This is easily reproducible by running 802154_phy_test example on nrf52840. It crashes less an a second after startup. I also tried openthread coap_server example and it always crashes after a few seconds as well.

So the investigation:

nrf_802154_request_buffer_free is really a swi call to nrf_802154_core_notify_buffer_free. And nrf_802154_core_notify_buffer_free always returns true, so the return value is not getting back correctly. The way return values are passed back from swi calls is a pointer where the value will be written is pushed to the queue along with the params. Now let's have a look at the assembly to understand what went wrong:

/home/andrew/ncs/v3.2.4/nrfxlib/nrf_802154/driver/src/nrf_802154_request_swi.c:274
   44cb8:       480a            ldr     r0, [pc, #40]   @ (44ce4 <nrf_802154_request_buffer_free+0x7c>)
   44cba:       f7ff fd6d       bl      44798 <nrf_802154_queue_push_commit>
nrf_egu_task_trigger():
/home/andrew/ncs/v3.2.4/modules/hal/nordic/nrfx/hal/nrf_egu.h:333
   44cbe:       4b0a            ldr     r3, [pc, #40]   @ (44ce8 <nrf_802154_request_buffer_free+0x80>)
   44cc0:       2201            movs    r2, #1
   44cc2:       609a            str     r2, [r3, #8]
req_exit():
/home/andrew/ncs/v3.2.4/nrfxlib/nrf_802154/driver/src/nrf_802154_request_swi.c:278
   44cc4:       682b            ldr     r3, [r5, #0]
__set_PRIMASK():
/home/andrew/ncs/v3.2.4/modules/hal/cmsis_6/CMSIS/Core/Include/m-profile/cmsis_gcc_m.h:396
   44cc6:       f383 8810       msr     PRIMASK, r3
nrf_802154_request_buffer_free():
/home/andrew/ncs/v3.2.4/nrfxlib/nrf_802154/driver/src/nrf_802154_request_swi.c:767
   44cca:       f89d 0007       ldrb.w  r0, [sp, #7]
   44cce:       b003            add     sp, #12
   44cd0:       bd30            pop     {r4, r5, pc}
req_enter():
/home/andrew/ncs/v3.2.4/nrfxlib/nrf_802154/driver/src/nrf_802154_request_swi.c:261
   44cd2:       f7fc fd03       bl      416dc <nrf_802154_assert_handler>
   44cd6:       e7e5            b.n     44ca4 <nrf_802154_request_buffer_free+0x3c>
...
   44ce8:       40014000        .word   0x40014000

This instruction writes to the EGU register requesting an interrupt `44cc2: str r2, [r3, #8]`

Two instructions later, this instruction unmasks interrupts: `44cc6: msr PRIMASK, r3`

The very next instruction the return value is read from the memory location where it's expected to be written: `44cca: ldrb.w r0, [sp, #7]`

I hope it's obvious by now what can go wrong ... what if the interrupt gets delivered a few cycles later? The we read the return value before it is written! Running in a debugger we can see that the interrupt sometimes actually gets delivered when executing the pop instructions, two instruction later.

To confirm this is the issue, I tried adding a few nops after unmasking interrupts, and lo and behold, no more crashes.

static void req_exit(void)
{
    nrf_802154_queue_push_commit(&m_requests_queue);

    nrf_egu_task_trigger(NRF_802154_EGU_INSTANCE, REQ_TASK);

    nrf_802154_mcu_critical_exit(m_mcu_cs);

    for (int i = 0; i < 4; i++)
        __asm__ volatile("nop");
}

I don't think this is an appropriate long term fix, I think you need to have a flag that's set after the request is actually executed, e.g. in irq_handler_req_event, and spin on that flag, of course now risking that it spins forever :( Or even better, stop doing all this crap and use SVC.

P.S. seems that other people have also hit this issue, e.g. devzone.nordicsemi.com/.../558728