I know they are older libraries, but I'm still using the Mesh v5.0.0 SDK and nRF5 v17.1.0 for a product that uses an nRF52832 as its radio. I've been able to implement my own models and have everything working, but I have an occasional bug.
The symptom is reboot looping. To debug it, I took a flash image of a "bad" device and swapped in a slightly modified application image (still compiled for release, but with some debugging enabled) and a modified bootloader that skips CRC validation. With that I'm able to catch the application entering my implementation of 'app_error_fault_handler', and the source of the problem appears to be an assert being thrown while setting up the flash manager in mesh_config_init.
First, the device is plugged into the wall and runs the risk of being unplugged at any time, so I suspect this may be data corruption due to interrupted flash write operations. Is this a logical assessment, or is it more likely that I have some other bug causing the corruption? We have many of these devices running in the field without error, so if it is a bug in the code, it shows up very rarely, which makes it hard to pin down.
Second, how can I recover dynamically?
In my modified application image I added lines in the app_error_fault_handler to disable the SoftDevice and then issue nrf_nvmc_page_erase() commands for every data page used in the buggy image (between the application and the bootloader). This resolves the issue: the device boots normally as a brand new device and is able to rejoin the network. However, it required me to inspect the memory of the buggy image and target the data pages explicitly in my recovery logic.
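To make the recovery step concrete, here is a minimal host-side sketch of the page loop only. It is not my exact code: `erase_data_region`, `page_align_up`, `page_erase_fn`, and `stub_erase` are hypothetical names for illustration, and both addresses are placeholders that currently have to be hard-coded after inspecting the image. On the target, the callback would wrap nrf_nvmc_page_erase() and would run only after the SoftDevice has been disabled.

```c
#include <stddef.h>
#include <stdint.h>

/* nRF52832 flash page size is 4 kB. */
#define PAGE_SIZE 0x1000u

/* Hypothetical erase callback. On target this would wrap
 * nrf_nvmc_page_erase(); here it is a stub so the loop can run on a host. */
typedef void (*page_erase_fn)(uint32_t page_addr);

/* Host-side stub that records the last page address it was asked to erase. */
static uint32_t g_last_erased;
static void stub_erase(uint32_t page_addr)
{
    g_last_erased = page_addr;
}

/* Round an address up to the next page boundary. */
static uint32_t page_align_up(uint32_t addr)
{
    return (addr + PAGE_SIZE - 1u) & ~(PAGE_SIZE - 1u);
}

/* Erase every whole page in [app_end, bootloader_start).
 * Both addresses are placeholders supplied by the caller.
 * Returns the number of pages erased. */
static size_t erase_data_region(uint32_t app_end,
                                uint32_t bootloader_start,
                                page_erase_fn erase)
{
    size_t count = 0;
    for (uint32_t addr = page_align_up(app_end);
         addr + PAGE_SIZE <= bootloader_start;
         addr += PAGE_SIZE)
    {
        erase(addr);
        count++;
    }
    return count;
}
```

With made-up addresses, `erase_data_region(0x6F800, 0x78000, stub_erase)` walks the pages from 0x70000 through 0x77000. The missing piece is a reliable way to obtain those two addresses at runtime instead of hard-coding them.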
How can I get a list of all pages that mesh_config is using, or the address of the end of the main application data so I can kill every page between it and the bootloader? I did try using the FOR_EACH_FILE iterator in mesh_config.c, but that didn't work (I'm assuming because this error occurs during mesh_config_init and it isn't set up yet).