Memory corruption while running RADIO_IRQHandler

Hello,

I'm facing an issue where the whole RAM ends up filled with the same 12-byte pattern, repeated over and over.

This issue can happen a few minutes or even days after startup. Some of the functions called just before the crash are:
RADIO_IRQHandler
received_frame_notify_and_nesting_allow
nrf_802154_rx_buffers
nrf_802154_received_raw
last_rx_frame_timestamp_get
nrf_802154_timer_coord_timestamp_get
__aeabi_ldivmod

I'm using the nRF52840 with the nRF5 SDK for Thread and Zigbee v4.2.0.

To debug this, we wrote the following macro, which adds a local variable holding the interrupt's name to each interrupt handler:

#define ISR_HANDLER_DEF(A,B) \
  extern void B(void); \
  void A(void)\
  {char intPos[]=#A; \
  B();}

ISR_HANDLER_DEF(extIRQ_Handler02, RADIO_IRQHandler);

//from ses_startup_nrf52840:
  .thumb_func
  .weak   extIRQ_Handler02
extIRQ_Handler02:
  b     .

Here is a file containing the last bytes of the RAM section, corresponding to the main stack, which contains the string "RADIO_IRQHandler":
2287.corruptedMemNordic_until_RAM_segment_end.bin

The following file contains the first bytes of the RAM section, corresponding to the interrupt vector table (here it has been overwritten):
4807.corruptedMemNordic_from_RAM_segment_start.bin

File containing last bytes of RAM section from a different test:
1157.corruptedMemNordic_until_RAM_segment_end_test2.bin
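
The 12-byte repetition in dumps like these can be confirmed programmatically rather than by eye. Below is a minimal sketch, assuming the dump has already been loaded into a byte buffer; `find_repeat_period` is a hypothetical helper name and is not part of the nRF5 SDK or any toolchain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the SDK): returns the shortest period p,
 * with 1 <= p <= max_period, such that buf[i] == buf[i - p] for every
 * i >= p. Returns 0 if no such period exists in that range. */
size_t find_repeat_period(const uint8_t *buf, size_t len, size_t max_period)
{
    for (size_t p = 1; p <= max_period && p < len; p++) {
        size_t i = p;
        /* Walk the buffer while each byte matches the one p bytes earlier. */
        while (i < len && buf[i] == buf[i - p]) {
            i++;
        }
        if (i == len) {
            return p; /* the whole buffer repeats with period p */
        }
    }
    return 0;
}
```

Running this over the corrupted region only (not the whole file, since the intact part would break the pattern) should report the period if the region really is one pattern repeated.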

VECTORS_IN_RAM is defined in our project.

__RAM_segment_start__ = 0x20000000
__RAM_segment_end__ = 0x20040000

Here are some addresses from the disassembly and the .map file (the binary files above refer to these addresses):
0x8D1AF extIRQ_Handler02
0x2001DB8C nrf_radio_driver.a(nrf_802154_timer_coord.nosd.o)
0x4665B RADIO_IRQHandler
0x2001DB20 nrf_radio_driver.a(nrf_802154_core.nosd.o)
0x45279 received_frame_notify_and_nesting_allow
0x2001E55C nrf_802154_rx_buffers
0x4665B nrf_802154_received_raw
0x44C31 last_rx_frame_timestamp_get
0x47401 nrf_802154_timer_coord_timestamp_get
0x090871 __aeabi_ldivmod

If you need the .map file, I can share it if you make the ticket private.

Do you have any suggestion to identify the issue?

Best regards,

Laura

Update:

After further investigation, we saw that the issue is reproducible as follows:

  1. We stop the code in __aeabi_ldivmod. The call stack shows the functions listed above, including RADIO_IRQHandler.
  2. We single-step and enter __int64_udiv.
  3. At instruction 0x90770, we set registers r2 and r3 to 0 to force a division by zero. As a result, the Z bit in the APSR register is set.
  4. PC jumps to 0x9083A.
  5. __aeabi_ldivmod and __int64_udiv are then executed recursively until the end of RAM.

It seems that a division by zero is causing the issue.
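
One way to keep a zero divisor from ever reaching the runtime's 64-bit division routine is to guard the division in code that can be rebuilt. The following is a minimal sketch under that assumption; `safe_div_i64` is a hypothetical helper, not part of the nRF5 SDK, ZBOSS or any runtime library, and it does not help for divisions performed inside prebuilt libraries.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the SDK): performs a signed 64-bit
 * division only when the result is well defined. Returns false and
 * leaves *quot untouched otherwise. */
bool safe_div_i64(int64_t num, int64_t den, int64_t *quot)
{
    if (den == 0) {
        /* On this target the division would be compiled to a call to
         * __aeabi_ldivmod with a zero divisor in r2:r3. */
        return false;
    }
    if (num == INT64_MIN && den == -1) {
        /* The quotient does not fit in int64_t. */
        return false;
    }
    *quot = num / den;
    return true;
}
```

The caller decides what to do on failure (for example, drop the timestamp or log an error), which keeps the failure local instead of leaving it to the runtime library's zero-division path.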

  

  • Hi Jørgen,

    Please read the last portion of my initial post to see how to reproduce the issue, in particular the section under "Update", which I wrote before your first comment.

    That portion also contains a screenshot of the call stack, showing that the only function written by us is the debug one attached there. All the others are from Nordic, ZBOSS and ARM. Bear in mind that we don't have access to the ZBOSS and ARM library sources, so we don't know much more than this.

    Best regards,

    Laura

  • Hi Laura,

    I did see the suggested method to reproduce the issue, but I'm not sure how useful it would be to reproduce it by manually modifying register values inside a function. We might be able to see the issue happening using this method, but the source of the issue is still not known. A stack violation can manifest itself in many strange ways, and I still suspect that the root cause could be in the application, even if you only see ZBOSS/ARM functions in the call stack when the issue manifests itself by overwriting RAM.

    Having a project that can be used to reproduce the issue without any register modifications would help a lot in finding a solution.

    Best regards,
    Jørgen

  • Hello Jørgen,

    I don't have access to the code or documentation for most of the functions in the call stack shown in my first screenshot, so I cannot help more in debugging the issue without directly modifying registers.

    __aeabi_uldivmod checks whether a division by zero is being performed, but there is no jump instruction at the end of __aeabi_uldivmod, so the PC wrongly falls through to the next instruction, which is actually the start of __aeabi_ldivmod.

    I'm not looking for the cause of the division by zero right now; that would be the next step. What I'd like is that, if it does happen, it doesn't overwrite the entire content of RAM.

    To reproduce the issue you can just write:

    __ASM volatile(
        "   mov r2, #0                              \n"
        "   mov r3, #0                              \n"
        "   bl __aeabi_ldivmod                      \n"
        );

    Best regards,

    Laura

  • Hi,

    Can you check which runtime library your project is configured to use?

    I reproduced the behavior you are seeing using the provided assembly code, but after changing the runtime library configuration from "Embedded Studio" to SEGGER, I'm no longer able to reproduce.

    Best regards,
    Jørgen

  • Hello Jørgen,

    I was using the Embedded Studio library. Now that I have moved to the Segger one, the issue seems to be solved.

    Thanks for your help

    Best regards,

    Laura

  • Hi,

    Great to hear that changing the runtime library variant resolved the issue!

    I would also recommend that you contact Segger and have them confirm whether the Nordic SES license covers this runtime library or only the old one. If it is not covered, Segger should be the ones to fix the issue in their runtime library implementation.

    Best regards,
    Jørgen
