Application freeze or recursive spinlock assert using k_msgq in MPSL timeslot

Hello,

I am working on a project with the MPSL timeslots. There I am using the radio, and at certain radio interrupts I want to send data to another thread. For this purpose I want to use a message queue.

However, something is not working correctly, as the program typically freezes after a few minutes, sometimes even after a few seconds.

I created a simpler version of my project based on the timeslot example where I also managed to reproduce this problem. I left out the radio activity part and I am now just sending several messages in the case of MPSL_TIMESLOT_SIGNAL_START.

The message queue is defined as following:

K_MSGQ_DEFINE(data_msgq, 1, 120, 1);

In the case of MPSL_TIMESLOT_SIGNAL_START I am sending several messages. (In my actual program I would just send one data item, but the interrupt itself would happen several times within one timeslot. But this seems to produce the same problem.)

static uint8_t test_val = 0;

for(int count = 0; count < 20; count ++)
{
    test_val++;
    k_msgq_put(&data_msgq, &test_val, K_NO_WAIT);
}

Another thread periodically tries to request the next earliest timeslot and then (after some sleep) tries to read all available data from the queue (with a maximum allowed limit to read):

uint8_t data_recv;
uint8_t max_data_count = 0;

do {
    if(k_msgq_get(&data_msgq, &data_recv, K_NO_WAIT) != 0)
    {
        break;
    }
    
    LOG_INF("%d", data_recv);

    max_data_count++;

    if(max_data_count == 40)
    {
        break;
    }
}while(1);

As already stated, the problem is, that after some time the program completely freezes, no more log messages are printed. I also added several GPIO outputs for testing. Everything stops, even the separate main thread that just would toggle a GPIO output every 15 ms is not executed anymore.

void main(void)
{
    while(1)
    {
        gpio_pin_toggle_dt(&gpio_main);
        k_sleep(K_MSEC(15));
    }
}

From the measurements it seems that the freeze happens within MPSL_TIMESLOT_SIGNAL_START where it appears to stop when trying to add an element to the message queue, where it halts and then never finishes the interrupt routine. But from the implementation it should not be possible to wait there (K_NO_WAIT), something else would even be not allowed as it is running in the context of the TIMER0 interrupt.

When setting the log mode to CONFIG_LOG_MODE_DEFERRED it now did not just freeze, but I also had an assert message printed before an automatic reboot, which names a recursive spinlock as the issue. I have already gotten such spinlock asserts when trying out other things with the MPSL timeslots. I never investigated this in detail, but it seems that this happened when doing much logging within the interrupts itself or scheduling several work queue tasks within the Interrupts, which I am both not doing in this example. But maybe the current issue is still somehow related.

00> ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/zephyr/spinlock.h:148
00> Recursive spinlock 0x200012dc
00> [00:01:04.149,536] <err> os: ***** HARD FAULT *****
00> [00:01:04.149,902] <err> os: Fault escalation (see below)
00> [00:01:04.150,299] <err> os: ARCH_EXCEPT with reason 4
00>
00> [00:01:04.150,695] <err> os: r0/a1: 0x00000004 r1/a2: 0x00000094 r2/a3: 0x00000000
00> [00:01:04.151,214] <err> os: r3/a4: 0x20000bac r12/ip: 0x00000000 r14/lr: 0x0000ae8d
00> [00:01:04.151,733] <err> os: xpsr: 0x41000018
00> [00:01:04.152,130] <err> os: Faulting instruction address (r15/pc): 0x0000c164
00> [00:01:04.152,587] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
00> [00:01:04.153,076] <err> os: Fault during interrupt handling
00>
00> [00:01:04.153,442] <err> os: Current thread: 0x20000a20 (main)
00> [00:01:04.163,879] <err> fatal_error: Resetting system

The error is more consistently to reproduce, when always less messages are read from the queue than put into it. So the queue slowly fills up. Also sending more messages and reducing the timeslot duration increases the chances of getting an error early.

The weird thing is, that the logging seems to have an impact. Because, when disabling the logging completely I at least could not see an error occurring according to the GPIO traces.

But it cannot be the logging alone that causes the problems. When not using the message queue, but just storing the data in a global array from which the other thread reads the data, also no problems arise, even with the same amount of log output.

Any ideas what might be causing the problems?

In case it helps, here a link to the project on GitHub.

  • Hi Torbjørn,

    thanks a lot for the further investigation and the information you found out, that makes a lot of sense. I also wondered in the timeslot example why the ring buffer is used.

    I guess that my application that also uses the radio interrupts (where I already got rid of the message queue and logs) then likely crashed due to a k_sem_give, which also uses a spinlock. I have to test a little bit more, but at least from a quick test not using the semaphore it seems to work, even with heavy logging. Maybe I can somehow make use of this workaround to preserve the logic of my application.

    However, knowing about this low latency interrupt I now have some general follow up questions about the MPSL/timeslots, just to be sure that I am using everything correctly:

    • I saw that in the MPSL documentation it is stated: "The application must configure interrupts for RTC0, TIMER0, and RADIO for priority level 0 (MPSL_HIGH_IRQ_PRIORITY)."
      As even the sample application did not explicitly set the interrupts I also did not do this so far. Should this actually be done, or is this done probably automatically when using the timeslots (and not just using CONFIG_MPSL=y alone)?
    • The zero latency interrupts then only apply to the cases where the TIMER0 or the RADIO interrupt trigger, so the restrictions do not apply to the context run in mpsl_low_priority_process?
    • Even with the zero latency interrupt a TIMER0 interrupt would not interrupt a RADIO interrupt (and vice versa) as they are at the same priority (MPSL_HIGH_IRQ_PRIORITY)?

    Best Regards,

    Bernhard

  • Hi Bernhard

    B3rni said:
    As even the sample application did not explicitly set the interrupts I also did not do this so far. Should this actually be done, or is this done probably automatically when using the timeslots (and not just using CONFIG_MPSL=y alone)?

    It shouldn't be necessary to set these priorities every time, as long as you are sure these priorities are not set to something else somewhere else in the application. 

    Myself I am working on an ESB/timeslot example. and got issues because the ESB library was setting the RADIO IRQ priority to something other than 0. 

    B3rni said:
    The zero latency interrupts then only apply to the cases where the TIMER0 or the RADIO interrupt trigger, so the restrictions do not apply to the context run in mpsl_low_priority_process?

    That is correct. Only the RADIO and TIMER0 callbacks are zero latency interrupts. 

    B3rni said:
    Even with the zero latency interrupt a TIMER0 interrupt would not interrupt a RADIO interrupt (and vice versa) as they are at the same priority (MPSL_HIGH_IRQ_PRIORITY)?

    Correct. 

    Best regards
    Torbjørn

  • Hi Torbjørn,

    thanks again for your help and answering the follow-up questions.

    I had my application running over night (over 10 hours) and there were no freezes/crashes.

    Best Regards,

    Bernhard

  • Hi Bernhard

    Thanks for sharing the good news!

    I will consider the case resolved then Slight smile

    Best regards
    Torbjørn

Related