In theory all driver calls are suspect; as one example, I see crashes in nrf_802154_buffer_free_raw, at this assert:
result = nrf_802154_request_buffer_free(p_data);
NRF_802154_ASSERT(result);
This is easily reproducible by running the 802154_phy_test example on nrf52840. It crashes less than a second after startup. I also tried the openthread coap_server example, and it always crashes after a few seconds as well.
So the investigation:
nrf_802154_request_buffer_free is really a SWI call to nrf_802154_core_notify_buffer_free, and nrf_802154_core_notify_buffer_free always returns true, so the return value is not making it back correctly. SWI calls pass return values back through a pointer: the address where the result will be written is pushed to the queue along with the params. Now let's have a look at the assembly to understand what went wrong:
/home/andrew/ncs/v3.2.4/nrfxlib/nrf_802154/driver/src/nrf_802154_request_swi.c:274
44cb8: 480a ldr r0, [pc, #40] @ (44ce4 <nrf_802154_request_buffer_free+0x7c>)
44cba: f7ff fd6d bl 44798 <nrf_802154_queue_push_commit>
nrf_egu_task_trigger():
/home/andrew/ncs/v3.2.4/modules/hal/nordic/nrfx/hal/nrf_egu.h:333
44cbe: 4b0a ldr r3, [pc, #40] @ (44ce8 <nrf_802154_request_buffer_free+0x80>)
44cc0: 2201 movs r2, #1
44cc2: 609a str r2, [r3, #8]
req_exit():
/home/andrew/ncs/v3.2.4/nrfxlib/nrf_802154/driver/src/nrf_802154_request_swi.c:278
44cc4: 682b ldr r3, [r5, #0]
__set_PRIMASK():
/home/andrew/ncs/v3.2.4/modules/hal/cmsis_6/CMSIS/Core/Include/m-profile/cmsis_gcc_m.h:396
44cc6: f383 8810 msr PRIMASK, r3
nrf_802154_request_buffer_free():
/home/andrew/ncs/v3.2.4/nrfxlib/nrf_802154/driver/src/nrf_802154_request_swi.c:767
44cca: f89d 0007 ldrb.w r0, [sp, #7]
44cce: b003 add sp, #12
44cd0: bd30 pop {r4, r5, pc}
req_enter():
/home/andrew/ncs/v3.2.4/nrfxlib/nrf_802154/driver/src/nrf_802154_request_swi.c:261
44cd2: f7fc fd03 bl 416dc <nrf_802154_assert_handler>
44cd6: e7e5 b.n 44ca4 <nrf_802154_request_buffer_free+0x3c>
...
44ce8: 40014000 .word 0x40014000
This instruction writes to the EGU register, requesting an interrupt: `44cc2: str r2, [r3, #8]`
Two instructions later, this instruction unmasks interrupts: `44cc6: msr PRIMASK, r3`
The very next instruction reads the return value from the memory location where it is expected to have been written: `44cca: ldrb.w r0, [sp, #7]`
I hope it's obvious by now what can go wrong: what if the interrupt gets delivered a few cycles later? Then we read the return value before it is written! Running in a debugger, we can see that the interrupt sometimes actually gets delivered while executing the pop instruction, two instructions later.
To confirm this is the issue, I tried adding a few nops after unmasking interrupts, and lo and behold, no more crashes.
static void req_exit(void)
{
    nrf_802154_queue_push_commit(&m_requests_queue);
    nrf_egu_task_trigger(NRF_802154_EGU_INSTANCE, REQ_TASK);
    nrf_802154_mcu_critical_exit(m_mcu_cs);
    for (int i = 0; i < 4; i++)
        __asm__ volatile("nop");
}
I don't think this is an appropriate long-term fix. I think you need a flag that is set after the request has actually been executed, e.g. in irq_handler_req_event, and spin on that flag, of course now risking that it spins forever :( Or even better, stop doing all this crap and use SVC.
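To make the flag idea concrete, here is a rough host-side sketch, not driver code. Everything in it is hypothetical (sim_request_t, sim_irq_handler, sim_tick, request_buffer_free_sync and the m_sim_* variables are names I made up); sim_tick stands in for the time that passes before the EGU interrupt is actually delivered, and the spin is bounded so it cannot hang forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request slot: where to write the result, plus a flag
 * that the handler sets only after the result has been written. */
typedef struct
{
    volatile bool * p_result;
    volatile bool * p_done;
} sim_request_t;

static sim_request_t m_sim_request;   /* one-entry stand-in for the request queue */
static bool          m_sim_pending;   /* true while a request waits for the handler */
static unsigned      m_sim_irq_delay; /* simulated interrupt latency, in ticks */

/* Stand-in for the request handler (the role irq_handler_req_event plays). */
static void sim_irq_handler(void)
{
    *m_sim_request.p_result = true; /* models the callee always returning true */
    *m_sim_request.p_done   = true; /* written last, so done implies result is valid */
    m_sim_pending           = false;
}

/* Simulated hardware: delivers the pending interrupt once the delay runs out. */
static void sim_tick(void)
{
    if (!m_sim_pending)
    {
        return;
    }
    if (m_sim_irq_delay == 0)
    {
        sim_irq_handler();
    }
    else
    {
        m_sim_irq_delay--;
    }
}

/* Hypothetical synchronous request: queue it, then wait for the done flag
 * instead of assuming the handler ran right after interrupts were unmasked. */
bool request_buffer_free_sync(unsigned irq_delay, unsigned max_spins, bool * p_timed_out)
{
    volatile bool result = false;
    volatile bool done   = false;
    unsigned      spins  = 0;

    assert(p_timed_out != NULL);

    m_sim_request.p_result = &result;
    m_sim_request.p_done   = &done;
    m_sim_irq_delay        = irq_delay;
    m_sim_pending          = true; /* models queue push + EGU trigger + unmask */

    while (!done && spins < max_spins)
    {
        sim_tick(); /* on real hardware this would just be a spin or a nop */
        spins++;
    }

    *p_timed_out = !done;
    if (!done)
    {
        /* Cancel the request so a late handler cannot write to a dead stack
         * frame. Assumption: a real driver would need interrupts masked here. */
        m_sim_pending = false;
    }
    return result;
}
```

With this shape, request_buffer_free_sync(5, 100, &timed_out) returns true even though the simulated interrupt arrives five ticks late, while request_buffer_free_sync(50, 10, &timed_out) gives up and reports the timeout instead of returning a value that was never written. The cancel-on-timeout step is the part I'm least sure how to do safely in the real driver.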
P.S. It seems that other people have also hit this issue, e.g. devzone.nordicsemi.com/.../558728