Beware that this post is related to an SDK in maintenance mode
More Info: Consider nRF Connect SDK for new designs

nRF5 SDK for Mesh v4.1.0 Flash Manager Corrupt Page Due to Power Cycle

Hi,

I have been trying to chase down an issue where the flash manager occasionally raises an assertion in mesh_config_backend_init() after a power cycle of the nRF52840 we use in our product. Occasionally, when we re-provision the radio and then power cycle the nRF52840 from the external host processor, we encounter the issue detailed below. We have deployed past versions of the nRF5 SDK for Mesh (~2.0.0, I believe) and never saw this issue, but we are worried it is a latent problem that we simply hadn't encountered before. It is certainly a critical issue for the current product, as it effectively bricks the nRF52840 in the field. We are definitely reprovisioning more often in the current product, so that may be contributing to this being more likely to occur.

The flash manager's approach to writing and sealing entries seems fairly reasonable, so what I'm seeing should never be possible given my understanding of the implementation. See the memory dump for the flash page in question below. Based on the values that still exist in flash, this page is most likely being used to store the subnet key(s). The first entry header on this page has been replaced with all zeroes. A zero handle is simply an invalid handle, but because the size stored in the header is also zero, an assertion is raised in flash_manager_internal.h :: get_next_entry(). Even without the assertion, the code would ultimately be stuck in an infinite loop: adding zero bytes to the current pointer means the caller, flash_manager.c :: get_invalid_bytes(), would get the same invalid entry over and over again without ever making progress.
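To make the failure mode above concrete, here is a minimal sketch of walking entries by their stored length. It is not the SDK's code: the header layout (length in the low half-word, handle in the high half-word), the 1024-word page, and the names `entry_len`, `next_index_naive`, and `next_index_checked` are all assumptions made for illustration, and page metadata is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size: 4 KiB, expressed in 32-bit words. */
#define PAGE_WORDS 1024u

/* Assumed layout: length (in words, header included) in the low half-word. */
static uint16_t entry_len(uint32_t header_word)
{
    return (uint16_t)(header_word & 0xFFFFu);
}

/* Naive step: advance by the stored length. With a zero length this returns
 * the same index, so a loop built on it never makes progress. */
static size_t next_index_naive(const uint32_t *page, size_t index)
{
    return index + entry_len(page[index]);
}

/* Guarded step: refuses a zero length or a length that runs past the page,
 * so the caller can treat the page as corrupt instead of spinning. */
static bool next_index_checked(const uint32_t *page, size_t index, size_t *p_next)
{
    assert(index < PAGE_WORDS);
    uint16_t len = entry_len(page[index]);
    if (len == 0 || len > PAGE_WORDS - index)
    {
        return false;
    }
    *p_next = index + len;
    return true;
}
```

With an all-zero header word at index 0, `next_index_naive` returns 0 again, which is the "no progress" case; `next_index_checked` returns false instead, leaving it to the caller to decide how to recover the page.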

I understand that previous entries can be invalidated by the flash manager by writing 0x0000 to their handle, causing the next instance of the handle to be considered the valid one. However, writing zero to the size field next to the handle clearly shouldn't be happening. My only guess is that there's some hardware limitation where a flash write being executed at power off causes the entire word to be written as zero rather than only a half word. Is this a known issue? Is there a workaround/fix, such as changing the headers to be 8 bytes in size so that invalidating the handle cannot also clear the size?
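For reference, the invalidation step described above can be modeled as follows. This is a sketch under assumptions, not the SDK's implementation: it assumes a 4-byte header with the length in the low half-word and the handle in the high half-word, and it simulates flash programming as a bitwise AND, since flash bits can only be cleared from 1 to 0 and the nRF52 NVMC programs whole 32-bit words. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HANDLE_INVALID 0x0000u

/* Simulated flash programming: a bit can only go from 1 to 0. */
static uint32_t flash_program_word(uint32_t current, uint32_t value)
{
    return current & value;
}

/* Assumed header layout: handle in the high half-word, length in the low. */
static uint32_t make_header(uint16_t len_words, uint16_t handle)
{
    assert(len_words != 0); /* a valid entry always covers its own header */
    return ((uint32_t)handle << 16) | len_words;
}

/* Word to program when invalidating: handle bits cleared, length bits left
 * as all ones so the AND preserves the stored length. */
static uint32_t invalidate_mask(void)
{
    return ((uint32_t)HANDLE_INVALID << 16) | 0xFFFFu;
}
```

In this model, `flash_program_word(make_header(5, 0x1234), invalidate_mask())` yields a word with length 5 and handle 0x0000, which is the expected result. Only programming an all-zero word would clear the length as well, which matches the dump; whether an interrupted word write can produce that on real hardware is exactly the open question.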

  • Hi,

    What kind of assert are you getting from mesh_config_backend_init()?

    "We are definitely reprovisioning more often in the current product, so that may be contributing to this being more likely to occur."

    How often are you reprovisioning?

  • Specifically we are seeing the assertion in `flash_manager_internal.h :: get_next_entry()`

    The device roams between different meshes based on its location. I'd say the design goal is once a day over a ten year life span, but realistically it should only happen on a weekly basis.

    At this point in testing, to reproduce the error I reprovision the device and then power cycle the radio by resetting the host (and therefore cutting power to the nRF52840) within a second of the configuration/provisioning being sent to the host. It is not reproducible every time, but it happens at least once every 50 cycles, and closer to once every ten. I assume this has to do with the flash manager's queued write-back mechanism, but I'm not certain.

    I have also verified and seen this effect over dozens of nRF52840s, but I have not been debugging each time and cannot point to this particular flash page as part of the issue.

    I should clarify that we are using the nRF52840 as a mesh radio with the nRF5 SDK for Mesh Serial Interface module connected to a host processor over UART.

  • Hi,

    After our developer had a look at the issue, it seems that something similar was reported for v4.1.0. The issue was fixed in v4.2.0. It is highly recommended to use our latest version (v5.0.0), as a lot of issues have been fixed since v4.1.0.

  • Hi Mttrinh,

    Could you please explain what exactly the issue was and what fix was put in place?
