Application freeze or recursive spinlock assert when using k_msgq in an MPSL timeslot

Hello,

I am working on a project that uses MPSL timeslots. Within the timeslots I use the radio, and on certain radio interrupts I want to send data to another thread. For this purpose I want to use a message queue.

However, something is not working correctly, as the program typically freezes after a few minutes, sometimes even after a few seconds.

I created a simpler version of my project, based on the timeslot sample, in which I can also reproduce this problem. I left out the radio activity and now just send several messages in the MPSL_TIMESLOT_SIGNAL_START case.

The message queue is defined as follows:

K_MSGQ_DEFINE(data_msgq, 1, 120, 1); /* 1-byte messages, up to 120 queued, 1-byte alignment */

In the MPSL_TIMESLOT_SIGNAL_START case I send several messages. (In my actual program I would send just one data item, but the interrupt itself would happen several times within one timeslot; this seems to produce the same problem.)

static uint8_t test_val = 0;

for (int count = 0; count < 20; count++)
{
    test_val++;
    k_msgq_put(&data_msgq, &test_val, K_NO_WAIT);
}

Another thread periodically requests the next earliest timeslot and then (after some sleep) tries to read all available data from the queue (up to a maximum limit per cycle):

uint8_t data_recv;
uint8_t max_data_count = 0;

do {
    if(k_msgq_get(&data_msgq, &data_recv, K_NO_WAIT) != 0)
    {
        break;
    }
    
    LOG_INF("%d", data_recv);

    max_data_count++;

    if(max_data_count == 40)
    {
        break;
    }
} while (1);

As already stated, the problem is that after some time the program completely freezes and no more log messages are printed. I also added several GPIO outputs for testing. Everything stops; even the separate main thread, which just toggles a GPIO output every 15 ms, is not executed anymore.

void main(void)
{
    while(1)
    {
        gpio_pin_toggle_dt(&gpio_main);
        k_sleep(K_MSEC(15));
    }
}

From the measurements it seems that the freeze happens within MPSL_TIMESLOT_SIGNAL_START: it halts when trying to add an element to the message queue and never finishes the interrupt routine. From the implementation it should not be possible to block there (K_NO_WAIT is used), and blocking would not even be allowed anyway, since this runs in the context of the TIMER0 interrupt.

When setting the log mode to CONFIG_LOG_MODE_DEFERRED, it did not just freeze; I also got an assert message printed before an automatic reboot, which names a recursive spinlock as the issue. I have already gotten such spinlock asserts when trying out other things with the MPSL timeslots. I never investigated this in detail, but it seemed to happen when doing a lot of logging within the interrupts themselves or when scheduling several work queue items within the interrupts, neither of which I am doing in this example. But maybe the current issue is still related somehow.

00> ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/zephyr/spinlock.h:148
00> Recursive spinlock 0x200012dc
00> [00:01:04.149,536] <err> os: ***** HARD FAULT *****
00> [00:01:04.149,902] <err> os: Fault escalation (see below)
00> [00:01:04.150,299] <err> os: ARCH_EXCEPT with reason 4
00>
00> [00:01:04.150,695] <err> os: r0/a1: 0x00000004 r1/a2: 0x00000094 r2/a3: 0x00000000
00> [00:01:04.151,214] <err> os: r3/a4: 0x20000bac r12/ip: 0x00000000 r14/lr: 0x0000ae8d
00> [00:01:04.151,733] <err> os: xpsr: 0x41000018
00> [00:01:04.152,130] <err> os: Faulting instruction address (r15/pc): 0x0000c164
00> [00:01:04.152,587] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
00> [00:01:04.153,076] <err> os: Fault during interrupt handling
00>
00> [00:01:04.153,442] <err> os: Current thread: 0x20000a20 (main)
00> [00:01:04.163,879] <err> fatal_error: Resetting system

The error is more consistently reproducible when fewer messages are read from the queue than are put into it, so the queue slowly fills up. Sending more messages and reducing the timeslot duration also increases the chance of getting an error early.

The weird thing is that the logging seems to have an impact: when disabling logging completely, I could at least not see any error occurring according to the GPIO traces.

But it cannot be the logging alone that causes the problems. When I do not use the message queue and instead just store the data in a global array from which the other thread reads it, no problems arise either, even with the same amount of log output.

Any ideas what might be causing the problems?

In case it helps, here is a link to the project on GitHub.

  • Hi Bernhard

    Could it be that you are staying too long in the timeslot when the message queue fills up, since you are printing to the log every time k_msgq_put(..) fails? 

    You mention a 500us runtime for the test when sending 100 messages, but I assume this is when there is room in the message queue for all of them? 

    What if you keep the log message on the error, but exit the loop as soon as an error is returned (to avoid multiple errors in a row)? Will the issue still happen?
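
    Something like this is what I mean (just a sketch, reusing the loop and names from your post, assuming the 100-message fast case and that a log module is registered in that file):

    for (int count = 0; count < 100; count++) {
        test_val++;
        if (k_msgq_put(&data_msgq, &test_val, K_NO_WAIT) != 0) {
            /* Queue is full: log once and leave the loop instead of
             * logging an error for every remaining message. */
            LOG_ERR("data_msgq full after %d messages", count);
            break;
        }
    }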

    Best regards
    Torbjørn

  • Hi Torbjørn,

    thanks for looking into my issue.

    Maybe the fast example is a bit extreme, but the application freezes can happen even in the slow example, where the buffer never gets full.

    In the example, the timing ensures that all messages put into the buffer in one cycle are also read afterwards. When activating the option that lets the buffer fill up, after some time only the last two of the 20 messages per cycle (or of the 100 in the fast case) no longer find space in the buffer. Exiting the loop in this case does not resolve the problem.

    I commented out the logging when the buffer is full, because I wanted to make sure that the error does not appear due to logging in the interrupt routine. In my actual application, freezes sometimes also happened when not using a message queue at all and just having some log messages in the interrupt routine (which should still be allowed), but for now I just wanted to focus on the problem the message queue might be causing.

    The timing should not be a problem. Here are some of my captures. At the top there is the main thread toggling every 15 ms. The next two rows indicate whether the timeslot request issued by the mpsl_timeslot_demo task is granted or not. The next few rows are toggled on certain timeslot events (start of timeslot, TIMER0 interrupt, when the idle status is reached; the others are not relevant here). The last GPIO is configured to toggle before reading a value from the message queue in the mpsl_timeslot_demo thread and before putting an element in the queue in the timeslot.

    This is how it looks as long as everything is working.

    Between two timeslots there are about 100 ms, as I always sleep for that long before attempting to request a new timeslot. With the chosen timing there is never the possibility that the timeslot request (guarded by a semaphore) is denied.

    The duration of one timeslot is 6.5 ms (it is configured as 6.7 ms, but TIMER0 triggers 200 us earlier), and in this implementation the timeslot is not ended in any other way. In my actual application I would end the timeslot once all data has been sent, but by keeping the timeslot open until TIMER0 triggers it can easily be seen that the operation fits into the timeslot without problems. Before the timeslot is requested, 20 messages are read from the message queue. Putting the 20 values into the message queue after the start of the timeslot is even faster and took just 116 us.

    And this is how it looks when it freezes. The rising edge at the bottom happens before the first element is put into the message queue, and then no GPIO changes anymore.

    Even if I stayed too long in the timeslot, from the documentation I would expect to get the MPSL_TIMESLOT_SIGNAL_OVERSTAYED case, which should then trigger an assert. However, from the timings it seems that it simply freezes at the beginning of the timeslot without any further execution.

    I looked into the actual code of the message queue. My guess is that it reaches k_spin_lock() but never k_spin_unlock(). This could also relate to the spinlock asserts in some cases.

    Side note: from the implementation I would guess that when the message queue is full, the processing should even be faster, as the message does not have to be stored anywhere.

    int z_impl_k_msgq_put(struct k_msgq *msgq, const void *data, k_timeout_t timeout)
    {
    	__ASSERT(!arch_is_in_isr() || K_TIMEOUT_EQ(timeout, K_NO_WAIT), "");
    
    	struct k_thread *pending_thread;
    	k_spinlock_key_t key;
    	int result;
    
    	key = k_spin_lock(&msgq->lock);
    
    	SYS_PORT_TRACING_OBJ_FUNC_ENTER(k_msgq, put, msgq, timeout);
    
    	if (msgq->used_msgs < msgq->max_msgs) {
    		/* message queue isn't full */
    		pending_thread = z_unpend_first_thread(&msgq->wait_q);
    		if (pending_thread != NULL) {
    			SYS_PORT_TRACING_OBJ_FUNC_EXIT(k_msgq, put, msgq, timeout, 0);
    
    			/* give message to waiting thread */
    			(void)memcpy(pending_thread->base.swap_data, data,
    			       msgq->msg_size);
    			/* wake up waiting thread */
    			arch_thread_return_value_set(pending_thread, 0);
    			z_ready_thread(pending_thread);
    			z_reschedule(&msgq->lock, key);
    			return 0;
    		} else {
    			/* put message in queue */
    			(void)memcpy(msgq->write_ptr, data, msgq->msg_size);
    			msgq->write_ptr += msgq->msg_size;
    			if (msgq->write_ptr == msgq->buffer_end) {
    				msgq->write_ptr = msgq->buffer_start;
    			}
    			msgq->used_msgs++;
    #ifdef CONFIG_POLL
    			handle_poll_events(msgq, K_POLL_STATE_MSGQ_DATA_AVAILABLE);
    #endif /* CONFIG_POLL */
    		}
    		result = 0;
    	} else if (K_TIMEOUT_EQ(timeout, K_NO_WAIT)) {
    		/* don't wait for message space to become available */
    		result = -ENOMSG;
    	} else {
    		SYS_PORT_TRACING_OBJ_FUNC_BLOCKING(k_msgq, put, msgq, timeout);
    
    		/* wait for put message success, failure, or timeout */
    		_current->base.swap_data = (void *) data;
    
    		result = z_pend_curr(&msgq->lock, key, &msgq->wait_q, timeout);
    		SYS_PORT_TRACING_OBJ_FUNC_EXIT(k_msgq, put, msgq, timeout, result);
    		return result;
    	}
    
    	SYS_PORT_TRACING_OBJ_FUNC_EXIT(k_msgq, put, msgq, timeout, result);
    
    	k_spin_unlock(&msgq->lock, key);
    
    	return result;
    }

    As already stated in my initial question, I was getting similar issues when submitting work queue items in the timeslot interrupts. I checked the work queue implementation, and submitting work items also uses spinlocks. I could not figure out whether logging itself may also use spinlocks in some cases. If so, I guess that all the freezes from using message queues, submitting work items, or logging in the interrupts are somehow caused by spinlock problems.

    But the question is still how these problems can be prevented.

    Best Regards,

    Bernhard

     

  • Hi Bernhard

    I think we should start by verifying whether or not this is related to logging only, or if the logging is just one of the issues. 

    When using deferred logging it should be fine to write log messages in interrupt context, but I see that this still fails on my side. 

    One thing that does work is to switch over to the RTT backend, where logs are printed through the debug interface rather than the UART; then I can write log messages in the timeslot callback without any apparent issues. 

    Are you able to try this on your end to see if it fixes all the issues, or if you can still get spinlock issues even with RTT logging enabled? 

    Once we know whether or not RTT logging fixes it we can look more specifically at why the UART log backend is unreliable when used in the timeslot. 

    To enable RTT logging simply add the following lines to your project configuration:

    CONFIG_LOG_BACKEND_UART=n 
    CONFIG_LOG_BACKEND_RTT=y

    Afterwards, use the Segger J-Link RTT Viewer application to view the logs. 

    Best regards
    Torbjørn

  • Hi Torbjørn,

    I tried out adding the configuration you provided. However, the same issues still happen from time to time (freezes with deferred logging and the recursive spinlock assert when using immediate logging).

    In my main application, where I am also using the radio interrupts, I even had freezes appear when setting CONFIG_LOG=n. Of course, since I am doing much more there, I cannot guarantee that something else was not causing the problem. At least so far I have not seen problems in the example application when logging is turned off completely.

    Best Regards,

    Bernhard

  • Hi Bernhard

    Ignore my last reply; shortly after sending it I got another crash (with RTT logging enabled). 

    After reading more into the spin lock documentation it seems clear that if a spin lock is already locked by a thread, and you try to lock it again from an interrupt, you will get a crash (deadlock). 

    To prevent this from being an issue, the k_spin_lock function runs arch_irq_lock() at the start, but arch_irq_lock() does not mask so-called zero-latency IRQs. 

    The MPSL library seems to enable zero-latency IRQs by default, as shown here, with the implication that the timeslot interrupts can fire while a spinlock is already held, leading to the recursive spinlock error. 

    If the theory is correct it means you can't call any Zephyr kernel functions from the MPSL_TIMESLOT_SIGNAL_START or MPSL_TIMESLOT_SIGNAL_TIMER0 callbacks (including the k_msgq API and the LOG API), but I need some more time to confirm this. 
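
    Simplified, the sequence I have in mind looks something like this (only an illustration of the theory, not actual code):

    /* Thread context (e.g. your reader thread):
     *   k_msgq_get(&data_msgq, ...)
     *     k_spin_lock(&msgq->lock)   -> arch_irq_lock() masks normal IRQs,
     *                                    but not zero-latency IRQs; the lock
     *                                    is now held by the thread
     *
     * Zero-latency timeslot interrupt fires anyway:
     *   k_msgq_put(&data_msgq, ...)
     *     k_spin_lock(&msgq->lock)   -> z_spin_lock_valid() sees the lock is
     *                                    already held on this CPU and the
     *                                    "Recursive spinlock" assert fires
     *                                    (spinlock.h:148 in your log); without
     *                                    the validation enabled the behaviour
     *                                    is presumably undefined, which would
     *                                    match the freezes you see
     */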

    A workaround to the issue can be seen in this example.
    I have seen this code before, but to be honest I didn't understand at the time why they used a ring buffer and an SWI interrupt to process some of the timeslot callbacks. Doing this defers execution to a regular interrupt (not a zero-latency IRQ), which allows you to run Zephyr kernel functions safely. 
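
    In your case the pattern could look roughly like the sketch below. This is only a sketch based on my current understanding: the interrupt line (I use SWI1_EGU1_IRQn, nRF52-series naming, as an example), the priority, the ring buffer size and the helper names are all assumptions you would need to adapt to your project.

    #include <zephyr/kernel.h>
    #include <zephyr/irq.h>
    #include <zephyr/sys/ring_buffer.h>
    #include <nrfx.h>  /* CMSIS NVIC_SetPendingIRQ() and the IRQn definitions */

    #define DEFER_IRQN      SWI1_EGU1_IRQn  /* assumption: an IRQ line unused by your application */
    #define DEFER_IRQ_PRIO  4               /* a normal (non-zero-latency) priority */

    RING_BUF_DECLARE(timeslot_rb, 128);  /* single producer (timeslot callback) / single consumer (ISR) */
    K_MSGQ_DEFINE(data_msgq, 1, 120, 1);

    /* Called from the timeslot callback (zero-latency IRQ context):
     * only write to the ring buffer and pend the SWI, no kernel calls here. */
    static void timeslot_produce(uint8_t val)
    {
        (void)ring_buf_put(&timeslot_rb, &val, sizeof(val));
        NVIC_SetPendingIRQ(DEFER_IRQN);
    }

    /* Runs as a regular Zephyr ISR, so kernel APIs like k_msgq_put() are allowed. */
    static void defer_isr(const void *arg)
    {
        ARG_UNUSED(arg);
        uint8_t val;

        while (ring_buf_get(&timeslot_rb, &val, sizeof(val)) == sizeof(val)) {
            (void)k_msgq_put(&data_msgq, &val, K_NO_WAIT);
        }
    }

    /* Call once during initialization, before opening the timeslot session. */
    static void defer_init(void)
    {
        IRQ_CONNECT(DEFER_IRQN, DEFER_IRQ_PRIO, defer_isr, NULL, 0);
        irq_enable(DEFER_IRQN);
    }

    The key point is that the timeslot callback itself only writes plain data and pends the interrupt, while everything that takes kernel spinlocks (k_msgq_put, the LOG API, work queue submission and so on) runs in the regular ISR or in a thread woken from it.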

    Best regards
    Torbjørn

  • Hi Torbjørn,

    thanks a lot for the further investigation and the information you found; that makes a lot of sense. I had also wondered why the ring buffer is used in the timeslot example.

    I guess that my application, which also uses the radio interrupts (and where I already got rid of the message queue and the logs), then likely crashed due to a k_sem_give, which also uses a spinlock. I have to test a bit more, but at least in a quick test without the semaphore it seems to work, even with heavy logging. Maybe I can make use of this workaround to preserve the logic of my application.

    However, knowing about these zero-latency interrupts, I now have some general follow-up questions about MPSL/timeslots, just to be sure that I am using everything correctly:

    • I saw that in the MPSL documentation it is stated: "The application must configure interrupts for RTC0, TIMER0, and RADIO for priority level 0 (MPSL_HIGH_IRQ_PRIORITY)."
      As even the sample application does not explicitly set up these interrupts, I have not done so either. Does this actually have to be done, or is it probably done automatically when using the timeslots (and not just with CONFIG_MPSL=y alone)?
    • Do the zero-latency interrupts then only apply to the cases where the TIMER0 or the RADIO interrupt triggers, so that the restrictions do not apply to the context running in mpsl_low_priority_process?
    • Even with the zero-latency interrupts, a TIMER0 interrupt would not preempt a RADIO interrupt (and vice versa), as they are at the same priority (MPSL_HIGH_IRQ_PRIORITY)?

    Best Regards,

    Bernhard
