Mesh Config Reset

I know they are older libraries, but I'm still using the Mesh v5.0.0 SDK and nRF5 v17.1.0 for a product that uses an nRF52832 as its radio. I've been able to implement my own models and have everything working, but I have an occasional bug.

The symptom is reboot looping. To investigate, I took an image of a "bad" device and inserted a slightly modified application image (still compiled for release, but with some debugging enabled) along with a modified bootloader that skips CRC validation. With that I'm able to catch the application entering my implementation of 'app_error_fault_handler', and the source of the problem appears to be an assert being thrown while setting up the flash manager in mesh_config_init.

First, the device is plugged into the wall and runs the risk of being unplugged at any time, so I suspect this may be data corruption due to interrupted flash write operations. Is this a logical assessment, or is it more likely that I have some other bug causing the corruption? We have many of these devices running in the field without error, so if it is a bug in the code, it is expressed very rarely, which would be confusing.

Second, how can I recover dynamically?

In my modified application image I added lines in the app_error_fault_handler to disable the SoftDevice and then issue nrf_nvmc_page_erase() commands to every data page being used in the buggy image (between Application and Bootloader). This resolves the issue: the device boots normally as a brand new device and is able to rejoin the network. However, it required that I inspect the memory of the buggy image and target the data pages explicitly in my recovery logic.
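For reference, the erase step described above can be sketched roughly as follows. This is only a minimal sketch, not the code from the product: page_align_up, erase_data_pages, page_erase_fn_t and the stub_page_erase stand-in are hypothetical names introduced for illustration, and the two range bounds are assumed to be supplied by the caller. The erase routine is passed in as a function pointer so the loop can be exercised off-target; on the device the pointer would be nrf_nvmc_page_erase, called only after the SoftDevice has been disabled.

```c
#include <stddef.h>
#include <stdint.h>

/* nRF52832 flash page size (4 kB). */
#define FLASH_PAGE_SIZE 0x1000u

/* Same shape as nrf_nvmc_page_erase(uint32_t address) in the nRF5 SDK. */
typedef void (*page_erase_fn_t)(uint32_t page_addr);

/* Round an address up to the next page boundary (unchanged if already aligned). */
static uint32_t page_align_up(uint32_t addr)
{
    return (addr + (FLASH_PAGE_SIZE - 1u)) & ~(FLASH_PAGE_SIZE - 1u);
}

/* Erase every whole page in [data_start, bootloader_start).
 * Both bounds are assumptions supplied by the caller.
 * Returns the number of pages erased. */
static uint32_t erase_data_pages(uint32_t data_start,
                                 uint32_t bootloader_start,
                                 page_erase_fn_t erase_page)
{
    uint32_t erased = 0;
    /* Round the end down so a misaligned bound never reaches into the bootloader. */
    uint32_t end = bootloader_start & ~(FLASH_PAGE_SIZE - 1u);

    if (erase_page == NULL)
    {
        return 0;
    }

    for (uint32_t addr = page_align_up(data_start); addr < end; addr += FLASH_PAGE_SIZE)
    {
        erase_page(addr);
        erased++;
    }
    return erased;
}

/* Hypothetical host-side stand-in for nrf_nvmc_page_erase():
 * records the calls instead of touching flash. */
static uint32_t m_stub_calls;
static uint32_t m_stub_last_addr;

static void stub_page_erase(uint32_t page_addr)
{
    m_stub_last_addr = page_addr;
    m_stub_calls++;
}
```

On target the call would look something like erase_data_pages(first_data_page, bootloader_start, nrf_nvmc_page_erase). Note that this wipes everything in the range, including any non-mesh data stored there.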

How can I get a list of all pages that mesh_config is using, or the address of the end of the main application data so I can kill every page between it and the bootloader? I did try using the FOR_EACH_FILE iterator in mesh_config.c, but that didn't work (I'm assuming because this error occurs during mesh_config_init and it isn't set up yet).

  • By adding a ProgramSection named ".app_end_addr" to the FlashPlacement.xml file, between the last section used by the application and my bootloader reservation entry, I'm able to retrieve the address of that section and round it up to the next page boundary.

    volatile uint32_t m_app_end_addr  __attribute__ ((section(".app_end_addr")));
    
    int main(void)
    {
        // Round UP to nearest page boundary
        uint32_t first_data_page = ((uint32_t)&m_app_end_addr + 0xFFF) & ~0xFFF;
        NRF_LOG_INFO("Addr of first page after main Application: 0x%08X", first_data_page);
    }

    I can then loop through and erase every page between this boundary and the start of the bootloader to perform the hard reset I was looking for.

    This appears to work, but I would love some feedback if this is a bad idea or I'm missing some hidden pitfall with this approach.

    Thanks!

  • I think we are having the same issue. I am also using Mesh 5.0.0 and am getting nodes resetting themselves. It seems to be correlated with power failures.

    What is your current strategy for saving the message cache to flash? Continuous, power down, or never?

    I think it happens more frequently on networks with heavy traffic when using the continuous flash-saving strategy.

    It would be nice to get a solution to this that doesn't kick nodes out of a mesh network as this can be a huge issue on large networks.
