Application freeze or recursive spinlock assert when using k_msgq in an MPSL timeslot

Hello,

I am working on a project that uses MPSL timeslots. Within the timeslots I use the radio, and on certain radio interrupts I want to send data to another thread. For this purpose I want to use a message queue.

However, something is not working correctly, as the program typically freezes after a few minutes, sometimes even after a few seconds.

I created a simpler version of my project, based on the timeslot sample, in which I can also reproduce this problem. I left out the radio activity and now just send several messages in the MPSL_TIMESLOT_SIGNAL_START case.

The message queue is defined as follows:

K_MSGQ_DEFINE(data_msgq, 1, 120, 1); /* 1-byte messages, up to 120 queued, 1-byte alignment */

In the MPSL_TIMESLOT_SIGNAL_START case I send several messages. (In my actual program I would send just one data item, but the interrupt itself would happen several times within one timeslot; this seems to produce the same problem.)

static uint8_t test_val = 0;

for (int count = 0; count < 20; count++)
{
    test_val++;
    k_msgq_put(&data_msgq, &test_val, K_NO_WAIT);
}

Another thread periodically requests the next earliest timeslot and then (after some sleep) tries to read all available data from the queue (up to a maximum limit per cycle):

uint8_t data_recv;
uint8_t max_data_count = 0;

do {
    if(k_msgq_get(&data_msgq, &data_recv, K_NO_WAIT) != 0)
    {
        break;
    }
    
    LOG_INF("%d", data_recv);

    max_data_count++;

    if(max_data_count == 40)
    {
        break;
    }
} while (1);

As already stated, the problem is that after some time the program completely freezes and no more log messages are printed. I also added several GPIO outputs for testing. Everything stops; even the separate main thread, which just toggles a GPIO output every 15 ms, is not executed anymore.

void main(void)
{
    while(1)
    {
        gpio_pin_toggle_dt(&gpio_main);
        k_sleep(K_MSEC(15));
    }
}

From the measurements it seems that the freeze happens within MPSL_TIMESLOT_SIGNAL_START: it halts when trying to add an element to the message queue and never finishes the interrupt routine. From the implementation it should not be possible to block there (K_NO_WAIT is used), and blocking would not even be allowed anyway, since this runs in the context of the TIMER0 interrupt.

When setting the log mode to CONFIG_LOG_MODE_DEFERRED, it did not just freeze; I also got an assert message printed before an automatic reboot, which names a recursive spinlock as the issue. I have already gotten such spinlock asserts when trying out other things with the MPSL timeslots. I never investigated this in detail, but it seemed to happen when doing a lot of logging within the interrupts themselves or when scheduling several work queue items within the interrupts, neither of which I am doing in this example. But maybe the current issue is still related somehow.

00> ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/zephyr/spinlock.h:148
00> Recursive spinlock 0x200012dc
00> [00:01:04.149,536] <err> os: ***** HARD FAULT *****
00> [00:01:04.149,902] <err> os: Fault escalation (see below)
00> [00:01:04.150,299] <err> os: ARCH_EXCEPT with reason 4
00>
00> [00:01:04.150,695] <err> os: r0/a1: 0x00000004 r1/a2: 0x00000094 r2/a3: 0x00000000
00> [00:01:04.151,214] <err> os: r3/a4: 0x20000bac r12/ip: 0x00000000 r14/lr: 0x0000ae8d
00> [00:01:04.151,733] <err> os: xpsr: 0x41000018
00> [00:01:04.152,130] <err> os: Faulting instruction address (r15/pc): 0x0000c164
00> [00:01:04.152,587] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
00> [00:01:04.153,076] <err> os: Fault during interrupt handling
00>
00> [00:01:04.153,442] <err> os: Current thread: 0x20000a20 (main)
00> [00:01:04.163,879] <err> fatal_error: Resetting system

The error is more consistently reproducible when fewer messages are read from the queue than are put into it, so the queue slowly fills up. Sending more messages and reducing the timeslot duration also increases the chance of getting an error early.

The weird thing is that the logging seems to have an impact: when disabling logging completely, I could at least not see any error occurring according to the GPIO traces.

But it cannot be the logging alone that causes the problems. When I do not use the message queue and instead just store the data in a global array from which the other thread reads it, no problems arise either, even with the same amount of log output.

Any ideas what might be causing the problems?

In case it helps, here is a link to the project on GitHub.

  • Hi Bernhard

    Could it be that you are staying too long in the timeslot when the message queue fills up, since you are printing to the log every time k_msgq_put(..) fails? 

    You mention a 500us runtime for the test when sending 100 messages, but I assume this is when there is room in the message queue for all of them? 

    What if you keep the log message on the error, but exit the loop as soon as an error is returned (to avoid multiple errors in a row)? Will the issue still happen?
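
    Something like this is what I mean (just a sketch, reusing the loop and names from your post, assuming the 100-message fast case and that a log module is registered in that file):

    for (int count = 0; count < 100; count++) {
        test_val++;
        if (k_msgq_put(&data_msgq, &test_val, K_NO_WAIT) != 0) {
            /* Queue is full: log once and leave the loop instead of
             * logging an error for every remaining message. */
            LOG_ERR("data_msgq full after %d messages", count);
            break;
        }
    }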

    Best regards
    Torbjørn

  • Hi Torbjørn,

    thanks for looking into my issue.

    Maybe the fast example is a bit extreme, but the application freezes can happen even in the slow example, where the buffer never gets full.

    In the example, the timing ensures that all messages put into the buffer in one cycle are also read afterwards. When activating the option that lets the buffer fill up, after some time only the last two of the 20 messages per cycle (or of the 100 in the fast case) no longer find space in the buffer. Exiting the loop in this case does not resolve the problem.

    I commented out the logging when the buffer is full, because I wanted to make sure that the error does not appear due to logging in the interrupt routine. In my actual application, freezes sometimes also happened when not using a message queue at all and just having some log messages in the interrupt routine (which should still be allowed), but for now I just wanted to focus on the problem the message queue might be causing.

    The timing should not be a problem. Here are some of my captures. At the top there is the main thread toggling every 15 ms. The next two rows indicate whether the timeslot request issued by the mpsl_timeslot_demo task is granted or not. The next few rows are toggled on certain timeslot events (start of timeslot, TIMER0 interrupt, when the idle status is reached; the others are not relevant here). The last GPIO is configured to toggle before reading a value from the message queue in the mpsl_timeslot_demo thread and before putting an element in the queue in the timeslot.

    This is how it looks as long as everything is working.

    Between two timeslots there are about 100 ms, as I always sleep for that long before attempting to request a new timeslot. With the chosen timing there is never the possibility that the timeslot request (guarded by a semaphore) is denied.

    The duration of one timeslot is 6.5 ms (it is configured as 6.7 ms, but TIMER0 triggers 200 us earlier), and in this implementation the timeslot is not ended in any other way. In my actual application I would end the timeslot once all data has been sent, but by keeping the timeslot open until TIMER0 triggers it can easily be seen that the operation fits into the timeslot without problems. Before the timeslot is requested, 20 messages are read from the message queue. Putting the 20 values into the message queue after the start of the timeslot is even faster and took just 116 us.

    And this is how it looks when it freezes. The rising edge at the bottom happens before the first element is put into the message queue, and then no GPIO changes anymore.

    Even if I stayed too long in the timeslot, from the documentation I would expect to get the MPSL_TIMESLOT_SIGNAL_OVERSTAYED case, which should then trigger an assert. However, from the timings it seems that it simply freezes at the beginning of the timeslot without any further execution.

    I looked into the actual code of the message queue. My guess is that it reaches k_spin_lock() but never k_spin_unlock(). This could also relate to the spinlock asserts in some cases.

    Side note: from the implementation I would guess that when the message queue is full, the processing should even be faster, as the message does not have to be stored anywhere.

    int z_impl_k_msgq_put(struct k_msgq *msgq, const void *data, k_timeout_t timeout)
    {
    	__ASSERT(!arch_is_in_isr() || K_TIMEOUT_EQ(timeout, K_NO_WAIT), "");
    
    	struct k_thread *pending_thread;
    	k_spinlock_key_t key;
    	int result;
    
    	key = k_spin_lock(&msgq->lock);
    
    	SYS_PORT_TRACING_OBJ_FUNC_ENTER(k_msgq, put, msgq, timeout);
    
    	if (msgq->used_msgs < msgq->max_msgs) {
    		/* message queue isn't full */
    		pending_thread = z_unpend_first_thread(&msgq->wait_q);
    		if (pending_thread != NULL) {
    			SYS_PORT_TRACING_OBJ_FUNC_EXIT(k_msgq, put, msgq, timeout, 0);
    
    			/* give message to waiting thread */
    			(void)memcpy(pending_thread->base.swap_data, data,
    			       msgq->msg_size);
    			/* wake up waiting thread */
    			arch_thread_return_value_set(pending_thread, 0);
    			z_ready_thread(pending_thread);
    			z_reschedule(&msgq->lock, key);
    			return 0;
    		} else {
    			/* put message in queue */
    			(void)memcpy(msgq->write_ptr, data, msgq->msg_size);
    			msgq->write_ptr += msgq->msg_size;
    			if (msgq->write_ptr == msgq->buffer_end) {
    				msgq->write_ptr = msgq->buffer_start;
    			}
    			msgq->used_msgs++;
    #ifdef CONFIG_POLL
    			handle_poll_events(msgq, K_POLL_STATE_MSGQ_DATA_AVAILABLE);
    #endif /* CONFIG_POLL */
    		}
    		result = 0;
    	} else if (K_TIMEOUT_EQ(timeout, K_NO_WAIT)) {
    		/* don't wait for message space to become available */
    		result = -ENOMSG;
    	} else {
    		SYS_PORT_TRACING_OBJ_FUNC_BLOCKING(k_msgq, put, msgq, timeout);
    
    		/* wait for put message success, failure, or timeout */
    		_current->base.swap_data = (void *) data;
    
    		result = z_pend_curr(&msgq->lock, key, &msgq->wait_q, timeout);
    		SYS_PORT_TRACING_OBJ_FUNC_EXIT(k_msgq, put, msgq, timeout, result);
    		return result;
    	}
    
    	SYS_PORT_TRACING_OBJ_FUNC_EXIT(k_msgq, put, msgq, timeout, result);
    
    	k_spin_unlock(&msgq->lock, key);
    
    	return result;
    }

    As already stated in my initial question, I was getting similar issues when submitting work queue items in the timeslot interrupts. I checked the work queue implementation, and submitting work items also uses spinlocks. I could not figure out whether logging itself may also use spinlocks in some cases. If so, I guess that all the freezes from using message queues, submitting work items, or logging in the interrupts are somehow caused by spinlock problems.

    But the question is still how these problems can be prevented.

    Best Regards,

    Bernhard

     

  • Hi Bernhard

    I think we should start by verifying whether or not this is related to logging only, or if the logging is just one of the issues. 

    When using deferred logging it should be fine to write log messages in interrupt context, but I see that this still fails on my side. 

    One thing that does work is to switch over to the RTT backend, where logs are printed through the debug interface rather than the UART; then I can write log messages in the timeslot callback without any apparent issues. 

    Are you able to try this on your end to see if it fixes all the issues, or if you can still get spinlock issues even with RTT logging enabled? 

    Once we know whether or not RTT logging fixes it we can look more specifically at why the UART log backend is unreliable when used in the timeslot. 

    To enable RTT logging simply add the following lines to your project configuration:

    CONFIG_LOG_BACKEND_UART=n 
    CONFIG_LOG_BACKEND_RTT=y

    Afterwards, use the Segger J-Link RTT Viewer application to view the logs. 

    Best regards
    Torbjørn

  • Hi Torbjørn,

    I tried out adding the configuration you provided. However, the same issues still happen from time to time (freezes with deferred logging and the recursive spinlock assert when using immediate logging).

    In my main application, where I am also using the radio interrupts, I even had freezes appear when setting CONFIG_LOG=n. Of course, since I am doing much more there, I cannot guarantee that something else was not causing the problem. At least so far I have not seen problems in the example application when logging is turned off completely.

    Best Regards,

    Bernhard

  • Hi Bernhard

    Ignore my last reply; shortly after sending it I got another crash (with RTT logging enabled). 

    After reading more into the spin lock documentation it seems clear that if a spin lock is already locked by a thread, and you try to lock it again from an interrupt, you will get a crash (deadlock). 

    To prevent this from being an issue, the k_spin_lock function runs arch_irq_lock() at the start, but arch_irq_lock() does not mask so-called zero-latency IRQs. 

    The MPSL library seems to enable zero-latency IRQs by default, as shown here, with the implication that the timeslot interrupts can fire while a spinlock is already held, leading to the recursive spinlock error. 

    If the theory is correct it means you can't call any Zephyr kernel functions from the MPSL_TIMESLOT_SIGNAL_START or MPSL_TIMESLOT_SIGNAL_TIMER0 callbacks (including the k_msgq API and the LOG API), but I need some more time to confirm this. 
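
    Simplified, the sequence I have in mind looks something like this (only an illustration of the theory, not actual code):

    /* Thread context (e.g. your reader thread):
     *   k_msgq_get(&data_msgq, ...)
     *     k_spin_lock(&msgq->lock)   -> arch_irq_lock() masks normal IRQs,
     *                                    but not zero-latency IRQs; the lock
     *                                    is now held by the thread
     *
     * Zero-latency timeslot interrupt fires anyway:
     *   k_msgq_put(&data_msgq, ...)
     *     k_spin_lock(&msgq->lock)   -> z_spin_lock_valid() sees the lock is
     *                                    already held on this CPU and the
     *                                    "Recursive spinlock" assert fires
     *                                    (spinlock.h:148 in your log); without
     *                                    the validation enabled the behaviour
     *                                    is presumably undefined, which would
     *                                    match the freezes you see
     */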

    A workaround to the issue can be seen in this example.
    I have seen this code before, but to be honest I didn't understand at the time why they used a ring buffer and an SWI interrupt to process some of the timeslot callbacks. Doing this defers execution to a regular interrupt (not a zero-latency IRQ), which allows you to run Zephyr kernel functions safely. 
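
    In your case the pattern could look roughly like the sketch below. This is only a sketch based on my current understanding: the interrupt line (I use SWI1_EGU1_IRQn, nRF52-series naming, as an example), the priority, the ring buffer size and the helper names are all assumptions you would need to adapt to your project.

    #include <zephyr/kernel.h>
    #include <zephyr/irq.h>
    #include <zephyr/sys/ring_buffer.h>
    #include <nrfx.h>  /* CMSIS NVIC_SetPendingIRQ() and the IRQn definitions */

    #define DEFER_IRQN      SWI1_EGU1_IRQn  /* assumption: an IRQ line unused by your application */
    #define DEFER_IRQ_PRIO  4               /* a normal (non-zero-latency) priority */

    RING_BUF_DECLARE(timeslot_rb, 128);  /* single producer (timeslot callback) / single consumer (ISR) */
    K_MSGQ_DEFINE(data_msgq, 1, 120, 1);

    /* Called from the timeslot callback (zero-latency IRQ context):
     * only write to the ring buffer and pend the SWI, no kernel calls here. */
    static void timeslot_produce(uint8_t val)
    {
        (void)ring_buf_put(&timeslot_rb, &val, sizeof(val));
        NVIC_SetPendingIRQ(DEFER_IRQN);
    }

    /* Runs as a regular Zephyr ISR, so kernel APIs like k_msgq_put() are allowed. */
    static void defer_isr(const void *arg)
    {
        ARG_UNUSED(arg);
        uint8_t val;

        while (ring_buf_get(&timeslot_rb, &val, sizeof(val)) == sizeof(val)) {
            (void)k_msgq_put(&data_msgq, &val, K_NO_WAIT);
        }
    }

    /* Call once during initialization, before opening the timeslot session. */
    static void defer_init(void)
    {
        IRQ_CONNECT(DEFER_IRQN, DEFER_IRQ_PRIO, defer_isr, NULL, 0);
        irq_enable(DEFER_IRQN);
    }

    The key point is that the timeslot callback itself only writes plain data and pends the interrupt, while everything that takes kernel spinlocks (k_msgq_put, the LOG API, work queue submission and so on) runs in the regular ISR or in a thread woken from it.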

    Best regards
    Torbjørn

  • Hi Torbjørn,

    thanks a lot for the further investigation and the information you found; that makes a lot of sense. I had also wondered why the ring buffer is used in the timeslot example.

    I guess that my application, which also uses the radio interrupts (and where I already got rid of the message queue and the logs), then likely crashed due to a k_sem_give, which also uses a spinlock. I have to test a bit more, but at least in a quick test without the semaphore it seems to work, even with heavy logging. Maybe I can make use of this workaround to preserve the logic of my application.

    However, knowing about these zero-latency interrupts, I now have some general follow-up questions about MPSL/timeslots, just to be sure that I am using everything correctly:

    • I saw that in the MPSL documentation it is stated: "The application must configure interrupts for RTC0, TIMER0, and RADIO for priority level 0 (MPSL_HIGH_IRQ_PRIORITY)."
      As even the sample application does not explicitly set up these interrupts, I have not done so either. Does this actually have to be done, or is it probably done automatically when using the timeslots (and not just with CONFIG_MPSL=y alone)?
    • Do the zero-latency interrupts then only apply to the cases where the TIMER0 or the RADIO interrupt triggers, so that the restrictions do not apply to the context running in mpsl_low_priority_process?
    • Even with the zero-latency interrupts, a TIMER0 interrupt would not preempt a RADIO interrupt (and vice versa), as they are at the same priority (MPSL_HIGH_IRQ_PRIORITY)?

    Best Regards,

    Bernhard
