Random hangs and crashes when sending IPC messages from netcore to appcore

There is already a case on github:

https://github.com/zephyrproject-rtos/zephyr/issues/44586

In short, we see random crashes and hangs on the nRF5340 when sending IPC messages from the netcore to the appcore. I have created a fairly bare-bones application, linked in the GitHub issue, that reproduces the problem. I have ordered some nrf5340dk kits but have not received them yet; porting the application to the nrf5340dk should be straightforward. The issue also seems very hardware dependent: it happens on all of our devices, but currently it happens often on only two of them.

The application sends a message from the netcore to the application core every second, and the application core logs it using the logging infrastructure. The logging backend is UART. What we see is that the application either hangs (most common) or reboots, sometimes with a core dump and sometimes without.
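
The test application itself is linked from the GitHub issue and is not reproduced here. As a hypothetical aid for telling a stalled sender from a restarted one, the once-per-second message could carry a counter that the receiving core checks. This is only a sketch under stated assumptions: the 32-bit counter payload and the names `seq_tracker_t`, `seq_status_t` and `seq_check` are illustrative and do not come from the original application.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of comparing a received counter with the expected one. */
typedef enum {
    SEQ_OK,               /* counter is the expected next value */
    SEQ_GAP,              /* one or more messages were missed */
    SEQ_SENDER_RESTARTED  /* counter unexpectedly went back to 0 */
} seq_status_t;

/* Receiver-side state (hypothetical): the next counter value we expect. */
typedef struct {
    uint32_t expected;
} seq_tracker_t;

/* Classify a received counter and resynchronise on it. */
static seq_status_t seq_check(seq_tracker_t *tracker, uint32_t counter)
{
    seq_status_t status;

    assert(tracker != NULL);

    if (counter == tracker->expected) {
        status = SEQ_OK;
    } else if (counter == 0) {
        /* Assumption: the sender starts counting from 0 after a reset. */
        status = SEQ_SENDER_RESTARTED;
    } else {
        status = SEQ_GAP;
    }

    /* Continue from whatever was actually received. */
    tracker->expected = counter + 1;
    return status;
}
```

With a zero-initialised tracker on the receiving core, a run of `SEQ_OK` followed by silence would point at a hang, while `SEQ_SENDER_RESTARTED` would point at a reset of the sending core.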

  • Hi,

     

    I have been trying to replicate this issue on my end with the fw that you attached in the github issue, without much success, unfortunately.

    Is the project configured with the standard "nrf5340dk_nrf5340_cpu*" boards, or are you using a custom board when configuring?

    Does the issue also occur if you disable logging? It might be helpful to add a blinking LED so you can see whether the fw is still running when you try this.

     

    Kind regards,

    Håkon

  • I ran a test without logging, only a blinking LED, from 13:42 yesterday until now (08:11). It does not seem to have failed: it has not hung, and from the log it doesn't seem to have rebooted either. I ran this using the test application that usually hangs after 15-20 seconds. This roughly aligns with what we have seen before: if we set the logging to synchronous mode instead of deferred, it runs longer. However, I have not previously run a test like this without logging or with deferred logging.

    I will run another test over the weekend on the production firmware without logging and see if I get the same result.

    This does not explain why this happens, but it can be an OK workaround in the short run, and it may narrow down where to look for what is actually going on.

    I also have another device now that fails. It does not fail as frequently as the other, but it does fail.

    BTW:

    I have not had any luck with the RTOS plugin in GDB. Each time it crashes and I attach to the device, it is in a fault handler with most of the registers set to 0xdeadbeef, and there is no thread information.

  • Hi,

     

    Could you try these steps?

    * Go back to a known "bad state" of the firmware, where the issue presents itself quickly (logging and all enabled), just to ensure that it's still present

    * Make a small change, by disabling the UART RX. At the very top of main, add this snippet:

    NRF_UARTE0->TASKS_STOPRX = 1;
    while(NRF_UARTE0->EVENTS_RXTO == 0);
    NRF_UARTE0->EVENTS_RXTO = 0;

     

    Then see if it still fails afterwards.

    Even Falch-Larsen said:
    I have not had any luck with the RTOS plugin in GDB. Each time it crashes and I attach to the device, it is in a fault handler with most of the registers set to 0xdeadbeef, and there is no thread information.

    0xdeadbeef is usually an indication that the debugger connection was not successful. In the JLinkGDBServer log, you will most likely see some connection warnings.

     

    Kind regards,

    Håkon
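
The RX-stop sequence suggested above spins until `EVENTS_RXTO` is set. As a minimal sketch, the same three register accesses can be wrapped in a helper with a bounded wait, so that startup cannot stall if the event never arrives. Assumptions: `uarte_regs_t`, `uarte_stop_rx` and `max_polls` are illustrative names that come neither from the thread nor from nrfx, and the struct is only a stand-in that lets the sketch compile off-target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two UARTE registers used below.
 * On the nRF5340, use the real NRF_UARTE_Type instance (for example
 * NRF_UARTE0) from the MDK headers instead of this struct. */
typedef struct {
    volatile uint32_t TASKS_STOPRX;
    volatile uint32_t EVENTS_RXTO;
} uarte_regs_t;

/* Request an RX stop, then poll for RXTO until the poll budget
 * max_polls is used up.
 * Returns true if RXTO was seen (and cleared), false on timeout. */
static bool uarte_stop_rx(uarte_regs_t *uarte, uint32_t max_polls)
{
    assert(uarte != NULL);

    uarte->TASKS_STOPRX = 1;

    while (uarte->EVENTS_RXTO == 0) {
        if (max_polls == 0) {
            return false; /* event never arrived within the budget */
        }
        max_polls--;
    }

    uarte->EVENTS_RXTO = 0;
    return true;
}
```

A `false` return could be signalled with an LED, since the point of the experiment is to see how the firmware behaves with the receiver stopped.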

  • This seems to change how often it happens. Without this code, it typically hangs after 15-20 seconds on this device. Over three runs with the code, it has happened after 2, 4 and 5 minutes. When I ran this device just before trying the code, it happened after 17 seconds.

  • Could you share your build/zephyr/zephyr.dts file? If there's more than one uart enabled, we'll need to disable RX on all of them.

    Did you try to run the same sequence of stopping UART RX on the netcore as well?

     

    Kind regards,

    Håkon

    appcore_zephyr.dts, netcore_zephyr.dts

    I disabled RX on UART instance 0 on netcore:

    NRF_UARTE0->TASKS_STOPRX = 1;
    while(NRF_UARTE0->EVENTS_RXTO == 0);
    NRF_UARTE0->EVENTS_RXTO = 0;

    And I disabled RX on UART instances 0 and 3 on appcore:

    NRF_UARTE0->TASKS_STOPRX = 1;
    while(NRF_UARTE0->EVENTS_RXTO == 0);
    NRF_UARTE0->EVENTS_RXTO = 0;
    
    NRF_UARTE3->TASKS_STOPRX = 1;
    while(NRF_UARTE3->EVENTS_RXTO == 0);
    NRF_UARTE3->EVENTS_RXTO = 0;

    This has been running without any crash for 18 minutes as of writing. That is unusual on this device, but I will have to run it for much longer to be sure that this worked.

    I will however need RX on UART3 on appcore in the production firmware.
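
The per-instance sequences above repeat the same three register accesses. As a sketch, they could be folded into one helper that takes a list of instances, which also makes it easy to leave out an instance whose RX is still needed, such as UART3 on appcore in production. Assumptions: `uarte_regs_t`, `uarte_stop_rx_all` and `max_polls` are illustrative names, the struct stands in for the MDK's `NRF_UARTE_Type` so the sketch compiles off-target, and the bounded wait is an addition that the original snippets do not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two UARTE registers used below.
 * On target, use NRF_UARTE_Type from the MDK headers instead. */
typedef struct {
    volatile uint32_t TASKS_STOPRX;
    volatile uint32_t EVENTS_RXTO;
} uarte_regs_t;

/* Stop RX on every instance in the list (at most 32 entries).
 * Returns a bitmask in which bit i is set when instance i did not
 * report RXTO before its poll budget max_polls was used up. */
static uint32_t uarte_stop_rx_all(uarte_regs_t *const uartes[],
                                  size_t count, uint32_t max_polls)
{
    uint32_t timed_out = 0;

    assert(uartes != NULL);
    assert(count <= 32);

    for (size_t i = 0; i < count; i++) {
        uint32_t budget = max_polls;

        uartes[i]->TASKS_STOPRX = 1;

        while (uartes[i]->EVENTS_RXTO == 0 && budget > 0) {
            budget--;
        }

        if (uartes[i]->EVENTS_RXTO == 0) {
            timed_out |= (uint32_t)1 << i; /* no RXTO seen */
        } else {
            uartes[i]->EVENTS_RXTO = 0;    /* clear the event */
        }
    }

    return timed_out;
}
```

On target, with the parameter type changed to `NRF_UARTE_Type`, the appcore list for this experiment would hold `NRF_UARTE0` and `NRF_UARTE3`, and the netcore list only `NRF_UARTE0`.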

  • Unfortunately, it seems like it still crashes. I ran a test with both our production firmware and the test application, and both eventually crashed. The test application is harder to crash now, but the production firmware (based on one run) crashed much faster than expected.

  • You have both UART0 and UART3 enabled on the appcore. Are you using both of these?

    Try disabling the unused UART in dts, in case it is causing problems.

     

    Kind regards,

    Håkon

  • Yes, on appcore I use UART0 for logging and UART3 for some communication we need in some cases. On netcore I use UART0 for logging. I may be able to disable logging in production, because we currently have no interface for getting the logs from the device in production. I was actually planning to run a test this weekend to see if the crashes are gone if I do that.
