Random hangs and crashes when sending IPC messages from netcore to appcore

There is already an issue on GitHub:

https://github.com/zephyrproject-rtos/zephyr/issues/44586

In short, we see an issue on the nRF5340 where we get random crashes and hangs when sending IPC messages from the netcore to the appcore. I have created a fairly bare-bones application, linked in the GitHub issue, that reproduces the problem. I have ordered some nRF5340 DK kits but have not received them yet; it should be fairly straightforward to port the application to the nRF5340 DK. The issue also seems very hardware dependent: it happens on all of our devices, but so far only two of them show it often.

The application sends a message from the netcore to the appcore every second, and the appcore then logs it using the logging infrastructure. The logging backend is UART. What we see is that the application either hangs (most common) or reboots, both with and without core dumps.

  • Hi,

     

    I have been trying to replicate this issue on my end with the fw that you attached in the github issue, without much success, unfortunately.

    Is the project configured with the standard "nrf5340dk_nrf5340_cpu*" boards, or are you using a custom board when configuring?

    Does the issue occur if you disable logging as well? It might be beneficial to add a blinking LED to see whether the fw still runs if you try this approach.

     

    Kind regards,

    Håkon

  • I ran a test without logging, with only a blinking LED, from 13:42 yesterday until 08:11 today. It does not seem to have failed: it has not hung, and from the log it does not seem to have rebooted either. I ran this using the test application that usually hangs after 15 - 20 seconds. This aligns with what we have seen before: if we set logging to synchronous mode instead of deferred, it runs longer. However, I have not previously run a test like this without logging or with deferred logging.

    I will run another test over the weekend on the production firmware without logging and see if I get the same result.

    This does not explain why this happens, but it can be an OK workaround in the short run, and it may limit the scope of where to look for what is actually going on.

    I also have another device now that fails, not with the same frequency as the others, but it fails.

    BTW:

    I have not had any luck with the RTOS plugin in GDB: each time it crashes and I attach to the device, it is in a fault handler with most of the registers set to 0xdeadbeef and no thread information.

  • You have both UART0 and UART3 enabled on the appcore - are you using both these?

    Try disabling the unused uart in dts, in case this one is causing problems.

     

    Kind regards,

    Håkon

  • Yes, on the appcore we use UART0 for logging and UART3 for some communication we need in some cases. On the netcore I use UART0 for logging. I may be able to disable logging in production because we currently don't have any interface to get the logs from the device in production. I was actually planning to run a test this weekend to see if the crashes are gone if I do that.

  • Hi,

    Even Falch-Larsen said:
    I was actually planning to run a test this weekend to see if the crashes are gone if I do that.

    Did you have any success with the over-weekend testing?

     

    Kind regards,

    Håkon

  • Yes, it seems that a floating RX pin on the UART3 instance is the culprit. The issue seems to go away if I remove the pin from the dts definition (using /delete-property/). This pin is used for single-wire UART when the unit is docked in a charger. This is a bit strange, however, because when I undock the unit, it disables the UART, the RX pin, and the interrupts using this code:

    nrf_uarte_disable(uarte);
    nrf_uarte_txrx_pins_disconnect(uarte);
    nrf_uarte_enable(uarte);
    
    uart_irq_rx_disable(uart_dev);
    uart_irq_tx_disable(uart_dev);

    This code looks a bit strange because single-wire UART is not really supported by the UARTE hardware. It seems like the init code of the UARTE driver in Zephyr does something with the RX pin that is not covered by this disable code.

    Then I enable the unit using this code when the unit docks:

    nrf_uarte_disable(uarte);
    nrf_uarte_txrx_pins_set(uarte, UART_IDLE_PIN, UART_DATA_PIN);
    nrf_uarte_enable(uarte);
    nrf_uarte_task_trigger(uarte, NRF_UARTE_TASK_STARTRX);
    
    uart_irq_tx_disable(uart_dev);
    uart_irq_rx_enable(uart_dev);

    The connection between the unit and the charger is through pogo pins. The pin should in theory not be floating, and I really don't understand how a floating RX pin can crash the device this badly. I am planning to investigate a bit more, now that the scope of the issue has been narrowed down considerably.

  • Even Falch-Larsen said:
    Yes, it seems that a floating RX pin on the UART3 instance is the culprit. The issue seems to go away if I remove the pin from the dts definition (using /delete-property/). This pin is used for single-wire UART when the unit is docked in a charger. This is a bit strange, however, because when I undock the unit, it disables the UART, the RX pin, and the interrupts using this code:

    That is great news.

    The UART disable routine doesn't stop RX and wait for the RXTO event (see here: https://infocenter.nordicsemi.com/topic/ps_nrf9160/uarte.html?cp=2_0_0_5_18_2#concept_uzb_p2m_wr), so the hardware module remains enabled for a short period, during which it can potentially receive one more byte.

    Even Falch-Larsen said:
    The connection between the unit and the charger is through pogo pins. The pin should in theory not be floating, and I really don't understand how a floating RX pin can crash the device this badly. I am planning to investigate a bit more, now that the scope of the issue has been narrowed down considerably.

    A floating RXD pin should not cause a fault in the firmware, but it can potentially generate an error interrupt in the UART module and disable reception. I will try to replicate this fault on my end as well.

    Kind regards,

    Håkon
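
The stop sequence described in the reply above (trigger STOPRX, wait for RXTO, and only then disable the peripheral and disconnect the pins) can be sketched roughly as follows. This is a host-side illustration, not the driver's code: `fake_uarte_t`, `fake_uarte_poll` and `uarte_stop_rx_then_disable` are hypothetical names, and the register struct is a stand-in so the logic can be compiled off-target. The comments name the nrfx HAL calls that would presumably take their place on the device. A poll limit is assumed so the wait cannot spin forever if RXTO never arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PSEL_DISCONNECTED 0xFFFFFFFFu
#define UARTE_ENABLED     8u  /* ENABLE register value for UARTE */
#define UARTE_DISABLED    0u

/* Hypothetical host-side stand-in for the UARTE registers used here. */
typedef struct {
    uint32_t tasks_stoprx;
    uint32_t events_rxto;
    uint32_t enable;
    uint32_t psel_rxd;
    uint32_t psel_txd;
    uint32_t sim_delay; /* simulation only: polls left until RXTO fires */
} fake_uarte_t;

/* Simulation only: once STOPRX has been triggered, raise RXTO after
 * sim_delay polls. Real hardware raises the event by itself. */
static void fake_uarte_poll(fake_uarte_t *u)
{
    if (u->tasks_stoprx && !u->events_rxto) {
        if (u->sim_delay == 0) {
            u->events_rxto = 1;
            u->tasks_stoprx = 0;
        } else {
            u->sim_delay--;
        }
    }
}

/* Stop reception, wait for RXTO, then disable and disconnect.
 * Returns false (leaving the peripheral untouched) on timeout. */
static bool uarte_stop_rx_then_disable(fake_uarte_t *u, uint32_t max_polls)
{
    u->events_rxto = 0;   /* on target: nrf_uarte_event_clear(..., NRF_UARTE_EVENT_RXTO) */
    u->tasks_stoprx = 1;  /* on target: nrf_uarte_task_trigger(..., NRF_UARTE_TASK_STOPRX) */

    while (!u->events_rxto) { /* on target: nrf_uarte_event_check(...) */
        if (max_polls == 0) {
            return false;
        }
        max_polls--;
        fake_uarte_poll(u);
    }

    u->events_rxto = 0;
    u->enable = UARTE_DISABLED;        /* on target: nrf_uarte_disable(...) */
    u->psel_rxd = PSEL_DISCONNECTED;   /* on target: nrf_uarte_txrx_pins_disconnect(...) */
    u->psel_txd = PSEL_DISCONNECTED;
    return true;
}
```

The point of the ordering is that the receiver has fully stopped before the pins are released, so a floating RXD line can no longer start a new byte while the module is being torn down.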

  • I finally figured out where the problem most likely originates. Because we use the UARTE hardware as a single-wire UART, it seems we lose track of which memory the CPU and EasyDMA are accessing, and in some cases we end up in a situation where both the CPU and EasyDMA access the same memory. Reading the fine print in the datasheet, this seems to explain all of the issues we are seeing.

    I think this issue can be closed now.
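
The thread does not show the final fix. One common way to keep track of which memory the CPU and the DMA engine may touch is to use two receive buffers and record an owner for each, so that the CPU only reads a buffer after the transfer into it has completed. The sketch below illustrates that bookkeeping only; `rx_double_buf`, `rx_start`, `rx_on_transfer_done` and `cpu_may_access` are hypothetical names, and the assumption is that `rx_on_transfer_done` would be called from the handler for the receive-complete event (ENDRX on UARTE).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_LEN 16u

enum buf_owner { OWNER_CPU, OWNER_DMA };

/* Two receive buffers with an explicit owner for each (hypothetical). */
struct rx_double_buf {
    uint8_t data[2][RX_BUF_LEN];
    enum buf_owner owner[2];
    unsigned dma_idx; /* index of the buffer currently handed to DMA */
};

/* Hand buffer 0 to DMA; buffer 1 stays with the CPU. */
static void rx_start(struct rx_double_buf *b)
{
    b->owner[0] = OWNER_DMA;
    b->owner[1] = OWNER_CPU;
    b->dma_idx = 0;
}

/* Transfer finished: give the other buffer to DMA and return the
 * completed buffer, which the CPU now owns. */
static uint8_t *rx_on_transfer_done(struct rx_double_buf *b)
{
    unsigned done = b->dma_idx;
    unsigned next = done ^ 1u;

    b->owner[next] = OWNER_DMA;
    b->dma_idx = next;
    b->owner[done] = OWNER_CPU;
    return b->data[done];
}

/* Non-zero if the CPU may read or write the buffer p points at.
 * p is expected to be one of the two buffers in b. */
static int cpu_may_access(const struct rx_double_buf *b, const uint8_t *p)
{
    unsigned i = (p == b->data[1]) ? 1u : 0u;
    return b->owner[i] == OWNER_CPU;
}
```

With this bookkeeping, code that wants to parse received bytes can check `cpu_may_access` first, which makes an overlap between CPU and DMA accesses visible instead of silent.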
