FDS - data corruption during an interrupted swap

Hi,

I have found a bug in the fds module (Flash Data Storage) that leads to losing "installed pages". If the affected page was the swap page, the fds module will be bricked.

Tested on SDK 14.2.0, but the bug is also present in 15.2.0. The hardware is not relevant, but I tested with a PCA10040 (nRF52832).

Pages used by fds are tagged with 2 magic words. Please see the code below:

// Tags a page as swap, i.e., reserved for GC.
static ret_code_t page_tag_write_swap(void)
{
    // The tag needs to be statically allocated since it is not buffered by fstorage.
    static uint32_t const page_tag_swap[] = {FDS_PAGE_TAG_MAGIC, FDS_PAGE_TAG_SWAP};
    return nrf_fstorage_write(&m_fs, (uint32_t)m_swap_page.p_addr, page_tag_swap, FDS_PAGE_TAG_SIZE * sizeof(uint32_t), NULL);
}


// Tags a page as data, i.e, ready for storage.
static ret_code_t page_tag_write_data(uint32_t const * const p_page_addr)
{
    // The tag needs to be statically allocated since it is not buffered by fstorage.
    static uint32_t const page_tag_data[] = {FDS_PAGE_TAG_MAGIC, FDS_PAGE_TAG_DATA};
    return nrf_fstorage_write(&m_fs, (uint32_t)p_page_addr, page_tag_data, FDS_PAGE_TAG_SIZE * sizeof(uint32_t), NULL);
}

The problem occurs when this write is interrupted (by a reset). It may result in only the first word being written:

#define FDS_PAGE_TAG_MAGIC      (0xDEADC0DE)

After reset, fds checks this magic tag to determine the page type. It uses these functions:

// Reads a page tag, and determines if the page is used to store data or as swap.
static fds_page_type_t page_identify(uint32_t const * const p_page_addr)
{
    if (   (p_page_addr == NULL)    // Should never happen.
        || (p_page_addr[FDS_PAGE_TAG_WORD_0] != FDS_PAGE_TAG_MAGIC))
    {
        return FDS_PAGE_UNDEFINED;
    }

    switch (p_page_addr[FDS_PAGE_TAG_WORD_1])
    {
        case FDS_PAGE_TAG_SWAP:
            return FDS_PAGE_SWAP;

        case FDS_PAGE_TAG_DATA:
            return FDS_PAGE_DATA;

        default:
            return FDS_PAGE_UNDEFINED;
    }
}


static bool page_is_erased(uint32_t const * const p_page_addr)
{
    for (uint32_t i = 0; i < FDS_PAGE_SIZE; i++)
    {
        if (*(p_page_addr + i) != FDS_ERASED_WORD)
        {
            return false;
        }
    }

    return true;
}

These functions will classify such a page as FDS_PAGE_UNDEFINED and not erased.

I would like to propose a change to the page_is_erased() function.
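
The misclassification can be checked off-target with a small host-side model. This is only a sketch, not the SDK code: it copies the two functions above, shrinks the page to 16 words, and assumes the values of FDS_PAGE_TAG_SWAP, FDS_PAGE_TAG_DATA and FDS_ERASED_WORD (please verify them against fds_internal_defs.h in your SDK). The helper make_half_tagged_page() is hypothetical and only models a tag write that was cut off after the first word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; verify against fds_internal_defs.h in your SDK. */
#define FDS_PAGE_TAG_MAGIC  (0xDEADC0DE)
#define FDS_PAGE_TAG_SWAP   (0xF11E01FF)
#define FDS_PAGE_TAG_DATA   (0xF11E01FE)
#define FDS_ERASED_WORD     (0xFFFFFFFF)
#define FDS_PAGE_TAG_WORD_0 (0)
#define FDS_PAGE_TAG_WORD_1 (1)
#define FDS_PAGE_SIZE       (16) /* In words; shrunk for this model. */

typedef enum
{
    FDS_PAGE_DATA,
    FDS_PAGE_SWAP,
    FDS_PAGE_ERASED,
    FDS_PAGE_UNDEFINED,
} fds_page_type_t;

/* Same logic as the SDK function quoted above. */
static fds_page_type_t page_identify(uint32_t const * const p_page_addr)
{
    if (   (p_page_addr == NULL)
        || (p_page_addr[FDS_PAGE_TAG_WORD_0] != FDS_PAGE_TAG_MAGIC))
    {
        return FDS_PAGE_UNDEFINED;
    }

    switch (p_page_addr[FDS_PAGE_TAG_WORD_1])
    {
        case FDS_PAGE_TAG_SWAP: return FDS_PAGE_SWAP;
        case FDS_PAGE_TAG_DATA: return FDS_PAGE_DATA;
        default:                return FDS_PAGE_UNDEFINED;
    }
}

/* Same logic as the SDK function quoted above: every word must be erased. */
static bool page_is_erased(uint32_t const * const p_page_addr)
{
    for (uint32_t i = 0; i < FDS_PAGE_SIZE; i++)
    {
        if (*(p_page_addr + i) != FDS_ERASED_WORD)
        {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: an erased page where only tag word 0 was written. */
static void make_half_tagged_page(uint32_t * const p_page)
{
    assert(p_page != NULL);
    for (uint32_t i = 0; i < FDS_PAGE_SIZE; i++)
    {
        p_page[i] = FDS_ERASED_WORD;
    }
    p_page[FDS_PAGE_TAG_WORD_0] = FDS_PAGE_TAG_MAGIC;
}
```

For a page prepared by make_half_tagged_page(), page_identify() returns FDS_PAGE_UNDEFINED and page_is_erased() returns false, so the page is neither usable as data or swap nor recognised as free.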

static bool page_is_erased(uint32_t const * const p_page_addr)
{
    if ((p_page_addr[FDS_PAGE_TAG_WORD_0] != FDS_ERASED_WORD) &&
        (p_page_addr[FDS_PAGE_TAG_WORD_0] != FDS_PAGE_TAG_MAGIC))
    {
        return false;
    }

    for (uint32_t i = FDS_PAGE_TAG_WORD_1; i < FDS_PAGE_SIZE; i++)
    {
        if (*(p_page_addr + i) != FDS_ERASED_WORD)
        {
            return false;
        }
    }

    return true;
}

Thanks to that, the module will correctly classify this page as FDS_PAGE_ERASED.
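
A quick host-side check of the proposed function, as a sketch under assumptions: the page is shrunk to 16 words, the values of FDS_PAGE_TAG_DATA and FDS_ERASED_WORD are assumed (verify them against your SDK headers), and page_fill_erased() is a hypothetical helper that is not part of fds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; verify against fds_internal_defs.h in your SDK. */
#define FDS_PAGE_TAG_MAGIC  (0xDEADC0DE)
#define FDS_PAGE_TAG_DATA   (0xF11E01FE)
#define FDS_ERASED_WORD     (0xFFFFFFFF)
#define FDS_PAGE_TAG_WORD_0 (0)
#define FDS_PAGE_TAG_WORD_1 (1)
#define FDS_PAGE_SIZE       (16) /* In words; shrunk for this model. */

/* Proposed variant: word 0 may be erased OR already hold the magic word. */
static bool page_is_erased(uint32_t const * const p_page_addr)
{
    if ((p_page_addr[FDS_PAGE_TAG_WORD_0] != FDS_ERASED_WORD) &&
        (p_page_addr[FDS_PAGE_TAG_WORD_0] != FDS_PAGE_TAG_MAGIC))
    {
        return false;
    }

    /* Everything from tag word 1 onwards must still be erased. */
    for (uint32_t i = FDS_PAGE_TAG_WORD_1; i < FDS_PAGE_SIZE; i++)
    {
        if (*(p_page_addr + i) != FDS_ERASED_WORD)
        {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: sets every word of the model page to the erased value. */
static void page_fill_erased(uint32_t * const p_page)
{
    assert(p_page != NULL);
    for (uint32_t i = 0; i < FDS_PAGE_SIZE; i++)
    {
        p_page[i] = FDS_ERASED_WORD;
    }
}
```

With this variant, a fully erased page and a page holding only FDS_PAGE_TAG_MAGIC both count as erased, while a fully tagged page or a page with any other value in word 0 does not.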

  • Hi,

    Thank you for the report! Yes, this looks like a bug. I can confirm that you will get NO_SWAP in the way that you describe, if FDS_PAGE_TAG_MAGIC is present but neither FDS_PAGE_TAG_SWAP nor FDS_PAGE_TAG_DATA is.

    I have reported the issue to the SDK team.

    From what I can tell, your proposed fix should solve the issue, although with that change the name of the function page_is_erased() becomes misleading (since the first word of the page is not erased, but written to!). If we fix it this way, we might end up introducing other errors at a later point in time when we rework related portions of the code. For that reason we may choose to solve this in a different way from what you propose.

    Regards,
    Terje

  • Thanks for the feedback. I had similar doubts regarding the function name. Anyway, from the file system's point of view this page is actually erased. I treat this code as a workaround, but I am sure you can find a cleaner solution.
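
    One way to address the naming concern, purely as an illustration and not the fix Nordic chose, is to keep page_is_erased() strict and report the interrupted tag write as its own state. The enum page_raw_state_t and the function page_raw_state() below are hypothetical names, the page is shrunk to 16 words, and the value of FDS_ERASED_WORD is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; verify against fds_internal_defs.h in your SDK. */
#define FDS_PAGE_TAG_MAGIC  (0xDEADC0DE)
#define FDS_ERASED_WORD     (0xFFFFFFFF)
#define FDS_PAGE_TAG_WORD_0 (0)
#define FDS_PAGE_TAG_WORD_1 (1)
#define FDS_PAGE_SIZE       (16) /* In words; shrunk for this model. */

/* Hypothetical tri-state result; not part of the SDK. */
typedef enum
{
    PAGE_RAW_ERASED,          /* Every word is erased. */
    PAGE_RAW_TAG_INTERRUPTED, /* Only the magic word was written. */
    PAGE_RAW_OTHER,           /* Anything else: tagged, in use, or corrupt. */
} page_raw_state_t;

static page_raw_state_t page_raw_state(uint32_t const * const p_page_addr)
{
    assert(p_page_addr != NULL);

    /* Both recoverable states require everything after word 0 to be erased. */
    for (uint32_t i = FDS_PAGE_TAG_WORD_1; i < FDS_PAGE_SIZE; i++)
    {
        if (p_page_addr[i] != FDS_ERASED_WORD)
        {
            return PAGE_RAW_OTHER;
        }
    }

    if (p_page_addr[FDS_PAGE_TAG_WORD_0] == FDS_ERASED_WORD)
    {
        return PAGE_RAW_ERASED;
    }
    if (p_page_addr[FDS_PAGE_TAG_WORD_0] == FDS_PAGE_TAG_MAGIC)
    {
        return PAGE_RAW_TAG_INTERRUPTED;
    }
    return PAGE_RAW_OTHER;
}
```

    A caller could then treat PAGE_RAW_TAG_INTERRUPTED the same way as PAGE_RAW_ERASED during initialization, without the word "erased" being applied to a page that has already been written to.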

  • Hello:

    We seem to have a similar problem. It has now happened on a few different boards. We are using the nRF52832 and SDK 15. The issue seems to happen when the battery gets low, and we are NOT able to reproduce it. The issue is:

    1) As the battery gets low or close to dead, the device powers down after a few seconds, which is expected.

    2) When we put in a new battery, the board still will not power up. We assume the flash may have been corrupted (maybe by a brownout), since we write to flash every minute.

    3) I compared the flash from the failed device with that of a good device, and the only differences appear in the 3 flash pages assigned to data at addresses 0x75000, 0x76000, and 0x77000. The bootloader sits at 0x78000. On the good device, page 0x76000 is tagged as a swap page (FDS_PAGE_TAG_SWAP), whereas the bad board does NOT tag any of its 3 pages as swap. How can this happen?

    4) On the "corrupted" FW, when I change the tag from FDS_PAGE_TAG_DATA to FDS_PAGE_TAG_SWAP on page 0x76000, the board boots up and functions as expected.
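
    The symptom in 3) can be checked mechanically once the first two words of each fds page have been read out of a dump. The sketch below assumes the tag values shown (verify them against your SDK headers); page_tag_t and count_swap_pages() are hypothetical names, not SDK API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; verify against fds_internal_defs.h in your SDK. */
#define FDS_PAGE_TAG_MAGIC (0xDEADC0DE)
#define FDS_PAGE_TAG_SWAP  (0xF11E01FF)
#define FDS_PAGE_TAG_DATA  (0xF11E01FE)

/* Hypothetical holder for the first two words of one flash page. */
typedef struct
{
    uint32_t word0;
    uint32_t word1;
} page_tag_t;

/* Counts the pages whose two tag words mark them as the swap page. */
static size_t count_swap_pages(page_tag_t const * const p_tags, size_t const count)
{
    assert(p_tags != NULL);

    size_t swap_pages = 0;
    for (size_t i = 0; i < count; i++)
    {
        if (   (p_tags[i].word0 == FDS_PAGE_TAG_MAGIC)
            && (p_tags[i].word1 == FDS_PAGE_TAG_SWAP))
        {
            swap_pages++;
        }
    }
    return swap_pages;
}
```

    A healthy set of pages should give a count of 1. A count of 0 matches the failed board described above, where no page carries the swap tag.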

    I have a very strong suspicion these problems are related. Your time and effort would be greatly appreciated in helping us nail down a quick solution and maybe some instructions on how to reproduce the problem. I have attached the hex files for review.

    memory dump from working board.hex  

    memory dump from failed board 3_1_23.hex

  • Hi,

    I have observed 3 different issues with fds module:

    Two of them are fixed, and I recommend that you review and apply my patches.

    The last one I have reported to Nordic as a Private Ticket (12 Mar 2019), and I have not found the root cause. Please see https://drive.google.com/drive/folders/1y_KOyIhVw9d-ZAAIc8SSa8k9vUOobDVz?usp=sharing This is a dump from my reproduction of this issue.

  • Hi Wojciech:

    Thanks for your reply! Did you get a response on how to fix my issue (the 3rd issue), which you reported via a private ticket?

    Thanks
