nRF52832 with SDK v11.0.0 -- System Randomly Freeze and Doesn't Recover via WDog

Hello,

Currently, we are working on a product which leverages the following:

nRF52832 (64 KB RAM and 512 KB flash)

SDK v11.0.0

SoftDevice S132

Units in the field have been reported to "freeze" at random. After weeks of firmware debugging, testing, and analysis, our team has yet to find the root cause.

Why is the watchdog (WDog) not overriding the system to reset it? If the CPU has a hard fault, shouldn't the WDog take priority over it? Our hard-fault handlers have been updated to reset the system; however, the system remains stuck. The only way we are able to recover the system is to pull the battery out and plug it back in. Essentially, we need to make sure that the WDog is not being blocked by any other executing instructions of higher priority. Please advise on what to do. Is this a known issue? Are there other ways to recover the system from an indeterminate state besides a WDog reset or a hard-fault handler?

Please note, our product does not have a hardware reset pin.

0 bjorn-spockeli over 5 years ago

Hi Sami,

Have you been able to determine whether the HardFaults occur at the same location, or do they occur randomly? We do have a library, documented under "nRF5 SDK v11.0.0: HardFault handling library", that is very useful when it comes to debugging these issues.

Are you enabling the WDT upon entering main()? Do you have a bootloader present on the nRF52832? Is your device entering System OFF?

The WDT in the nRF52 Series does not rely on an interrupt to reset the SoC: if the WDT times out, the system will be reset regardless of the state of the CPU. The WDT is decoupled from the CPU to avoid the scenario that you are describing.
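As a rough illustration of the timeout side of this, here is a minimal sketch that converts a timeout in milliseconds into a WDT counter reload value. It assumes the relationship timeout = (CRV + 1) / 32768 seconds and a minimum CRV of 0xF, as I recall them from the nRF52832 Product Specification; please check both against the specification. The function name `wdt_crv_from_ms` is hypothetical, and the register writes in the comment are only an outline using MDK names, not tested code.

```c
#include <stdint.h>

#define WDT_CLOCK_HZ 32768u      /* WDT counts cycles of the 32.768 kHz clock */
#define WDT_CRV_MIN  0x0000000Fu /* assumed minimum reload value, check the spec */

/* Hypothetical helper: timeout in ms -> CRV, assuming timeout = (CRV + 1) / 32768 s. */
uint32_t wdt_crv_from_ms(uint32_t timeout_ms)
{
    uint64_t ticks = ((uint64_t)timeout_ms * WDT_CLOCK_HZ) / 1000u;

    if (ticks <= WDT_CRV_MIN) {
        return WDT_CRV_MIN;            /* too short: clamp to the minimum */
    }
    uint64_t crv = ticks - 1u;
    if (crv > UINT32_MAX) {
        crv = UINT32_MAX;              /* too long: clamp to the 32-bit maximum */
    }
    return (uint32_t)crv;
}

/*
 * Outline of use on target (assumes "nrf.h" from the MDK; not tested here):
 *
 *   NRF_WDT->CONFIG = (WDT_CONFIG_SLEEP_Run  << WDT_CONFIG_SLEEP_Pos) |
 *                     (WDT_CONFIG_HALT_Pause << WDT_CONFIG_HALT_Pos);
 *   NRF_WDT->CRV    = wdt_crv_from_ms(2000);
 *   NRF_WDT->RREN   = WDT_RREN_RR0_Msk;
 *   NRF_WDT->TASKS_START = 1;
 *   ...
 *   NRF_WDT->RR[0]  = WDT_RR_RR_Reload;   // feed the watchdog
 */
```

One thing that may be worth checking in a setup like this is the SLEEP bit in CONFIG: if the WDT is configured to pause while the CPU sleeps, it will not count down during sleep.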

Best regards

Bjørn

0 bjorn-spockeli over 5 years ago in reply to bjorn-spockeli

Also, which MDK version are you using in your code? SDK v11.0.0 uses v8.5.0, so if you have not updated the MDK to v8.9.0, I strongly recommend adding the workaround for Errata 108.

[108] RAM: RAM content cannot be trusted upon waking up from System ON Idle or System OFF mode

0 Sami over 5 years ago in reply to bjorn-spockeli

We have not been able to reproduce the issue in-house, so we don't know if a hard-fault is occurring, let alone occurring in the same location.

- Yes, we are enabling the WDT in main(), before we enter the executive task ("scheduler loop").

- Yes, we have a bootloader present on the nRF52832.

- Yes, the device enters System OFF once and once only, while it is sitting on the shelf. Once transitioned to System ON mode, it will never enter System OFF mode again.

- We are using MDK v8.5.0, and we have not implemented the workaround. I became aware of Errata 108 last week, but wasn't sure how it was affecting our system, if at all. It sounds like RAM content can be deemed indeterminate if System OFF mode is ever entered and then woken up from without the workaround?

0 bjorn-spockeli over 5 years ago in reply to Sami

Sami said:
- We are using MDK v8.5.0, and we have not implemented the workaround. I became aware of Errata 108 last week, but wasn't sure how it was affecting our system, if at all. It sounds like RAM content can be deemed indeterminate if System OFF mode is ever entered and then woken up from without the workaround?

Errata 108 may occur when the system is waking up from System ON Idle as well, not just System OFF. So every time you wake up from System ON Idle, there is a chance that RAM is not retained correctly, which may lead to random HardFaults if values on the stack are modified to a value outside the valid memory range.

0 Sami over 5 years ago in reply to bjorn-spockeli

So I could conceivably run a test where I loop the system between System ON Idle and system active in order to induce this hard fault? Could you provide some context as to how I could induce this issue?

0 bjorn-spockeli over 5 years ago in reply to Sami

Hi Sami,
The location of the bits that flip may vary from chip to chip, but it seems to be fixed for each chip that does fail. Only a few bits (e.g. 1-5), spread across RAM, are affected.

The way we test the issue is that we initialize the RAM blocks not used by the stack to all 0's and then enter System ON Idle and wake up from a timer. We then examine the RAM blocks after wake-up and check if any bits have been flipped.
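The fill-and-check part of this procedure could look roughly like the sketch below. It is not the internal test code; the function names `ram_fill` and `ram_count_flipped_bits` are hypothetical, and the choice of which RAM region to pass in (anything not used by the stack or by the code doing the check) is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a RAM region with a known pattern (all zeros in the test described above). */
void ram_fill(volatile uint32_t *region, size_t words, uint32_t pattern)
{
    for (size_t i = 0; i < words; i++) {
        region[i] = pattern;
    }
}

/*
 * Count the bits that differ from the expected pattern.
 * If first_bad_word is not NULL, it receives the index of the first
 * corrupted word; it is left untouched when nothing has flipped.
 */
size_t ram_count_flipped_bits(const volatile uint32_t *region, size_t words,
                              uint32_t pattern, size_t *first_bad_word)
{
    size_t flipped = 0;
    int found = 0;

    for (size_t i = 0; i < words; i++) {
        uint32_t diff = region[i] ^ pattern;   /* set bits mark flipped positions */

        if (diff != 0u && !found) {
            if (first_bad_word != NULL) {
                *first_bad_word = i;
            }
            found = 1;
        }
        while (diff != 0u) {                   /* clear one set bit per iteration */
            diff &= diff - 1u;
            flipped++;
        }
    }
    return flipped;
}
```

On target, the idea would be to call `ram_fill` with a zero pattern, enter System ON Idle (for example with `__WFE()`, or `sd_app_evt_wait()` when the SoftDevice is enabled), wake up from a timer, and then call `ram_count_flipped_bits` on the same region. Since the failing bit locations appear to be fixed per chip, logging the index of any corrupted word should help when repeating the test.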

Best regards

Bjørn

0 Sami over 5 years ago in reply to bjorn-spockeli

I created an external array:

uint32_t errata108Arr[10] = {0};

so that it lives in the .bss section (i.e., not on the stack). Then I put the system into System OFF mode, waited a few seconds, and woke the system back up (System ON mode). The array still preserved its contents. Do I need to check all memory blocks outside of the stack, given what you said: "the location of the bits that flip may vary from chip to chip"?

0 bjorn-spockeli over 5 years ago in reply to Sami

Hi Sami,

Yes, you should create an array that covers as much of RAM as possible.

I have attached the code we have been using internally to check if devices are affected.

errata_108_test_code.zip

Best regards

Bjørn

0 Sami over 5 years ago in reply to bjorn-spockeli

Thank you. If the bug, i.e., Errata 108, does occur, could it lead to a "screen freeze"/"blank screen" that does not recover via a WDog timeout reset? Are interrupts frozen as well?

0 bjorn-spockeli over 5 years ago in reply to Sami
Hi Sami Balbaky,

If RAM corruption occurs due to Errata 108, it can cause a HardFault if the CPU fetches a corrupted instruction that points to a memory location outside the nRF52832's memory area. The HardFault exception will then trigger HardFault_Handler in the application. The HardFault exception is always enabled and has a fixed priority of -1 (higher than every interrupt and exception with configurable priority), so while the CPU is in the HardFault handler, interrupts with lower priority will not be serviced.

I took a closer look at the default HardFault handler implementation in the system_nrf52.c file, and it will just loop indefinitely. The HardFault_Handler in the system_nrf52.c file is a weak implementation:

    .weak HardFault_Handler
    HardFault_Handler:
        b .

which is basically equivalent to

    HardFault_Handler:
        b HardFault_Handler

So if you want the HardFault exception to trigger a reset, you need to override the weak implementation and issue a soft reset, i.e. NVIC_SystemReset().
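A minimal sketch of such an override is shown here. It assumes the project defines `NRF52` or `NRF52832_XXAA` when building for the target, so that "nrf.h" brings in the CMSIS declaration of NVIC_SystemReset(). The host-side stand-in and the `reset_requested` flag are hypothetical and exist only so the sketch can be compiled and exercised off-target.

```c
#include <stdbool.h>

#if defined(NRF52) || defined(NRF52832_XXAA)
#include "nrf.h"   /* on target: pulls in the CMSIS core header with NVIC_SystemReset() */
#else
/* Host-side stand-in (hypothetical) so the sketch compiles off-target. */
bool reset_requested = false;
void NVIC_SystemReset(void) { reset_requested = true; }
#endif

/* Strong definition: the linker picks this over the weak "b ." default. */
void HardFault_Handler(void)
{
    /* On target this call does not return; it requests a system reset. */
    NVIC_SystemReset();
}
```

If the fault location matters for debugging, the stacked registers would have to be saved somewhere that survives the reset before NVIC_SystemReset() is called; that part is not shown here.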

However, the WDT will reset the device if it is running when the HardFault occurs. As discussed previously, the WDT is independent of the CPU and will trigger a reset regardless of the CPU state. When the HardFault occurs, the CPU is basically just looping in the HardFault handler.

Best regards

Bjørn