We are preparing an application to enter production and seeking to harden it to protect against unexpected system lockup.
The application is a datalogger peripheral based on the nRF52840 and SDK 14.2, with BLE used to intermittently transfer data to a central. The application code was developed around the CGMS example code from the SDK, and the system has the following attributes:
- Buttonless and battery powered (battery not user replaceable)
- Shipped in system_off mode (system will wake when a sensor is triggered, monitored using LPCOMP event)
- Device can be factory reset by writing to a BLE characteristic, which places the system back into system_off mode and again awaits LPCOMP event
- While awake the device increments a counter based on LPCOMP interrupts and writes the counter value to a circular RAM buffer at 60 second intervals
- Between the counter interrupts the system sleeps with power_manage()
- Bootloader is present and firmware can be updated using buttonless DFU
Our plan is to add a watchdog that runs during the CPU sleep with a 90 second timeout, and then feed the watchdog in the handler for the datalogging event - the approach being simply that if the device ceases to log data it must have hung.
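To make the timing concrete, here is a minimal host-side sketch of the arithmetic behind this plan. It is not SDK code and does not touch the real WDT peripheral. It assumes (from my reading of the product specification) that the watchdog counts from the 32.768 kHz low-frequency clock and times out after (CRV + 1) / 32768 seconds. The names wdt_crv_from_seconds, wdt_model_t and simulate are hypothetical helpers made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: WDT runs from the 32.768 kHz clock and
 * timeout [s] = (CRV + 1) / 32768. */
#define WDT_CLOCK_HZ 32768u

/* Hypothetical helper: reload value (CRV) for a timeout in whole seconds. */
static uint32_t wdt_crv_from_seconds(uint32_t seconds)
{
    assert(seconds > 0 && seconds <= UINT32_MAX / WDT_CLOCK_HZ);
    return seconds * WDT_CLOCK_HZ - 1u;
}

/* Hypothetical host-side model of the watchdog, advanced in 1 s steps. */
typedef struct {
    uint32_t timeout_s; /* configured timeout */
    uint32_t elapsed_s; /* seconds since the last feed */
    bool     expired;   /* latched once the timeout is reached */
} wdt_model_t;

static void wdt_model_init(wdt_model_t *w, uint32_t timeout_s)
{
    w->timeout_s = timeout_s;
    w->elapsed_s = 0;
    w->expired   = false;
}

static void wdt_model_feed(wdt_model_t *w)
{
    if (!w->expired) {
        w->elapsed_s = 0;
    }
}

static void wdt_model_tick(wdt_model_t *w)
{
    if (!w->expired && ++w->elapsed_s >= w->timeout_s) {
        w->expired = true;
    }
}

/* Simulate total_s seconds. The logging handler feeds the watchdog every
 * log_interval_s seconds until it "hangs" at hang_at_s (0 = never hangs).
 * Returns the second at which the watchdog expires, or 0 if it never does. */
static uint32_t simulate(uint32_t timeout_s, uint32_t log_interval_s,
                         uint32_t hang_at_s, uint32_t total_s)
{
    wdt_model_t w;
    wdt_model_init(&w, timeout_s);

    for (uint32_t t = 1; t <= total_s; ++t) {
        wdt_model_tick(&w);
        if (w.expired) {
            return t;
        }
        bool hung = (hang_at_s != 0 && t >= hang_at_s);
        if (!hung && (t % log_interval_s) == 0) {
            wdt_model_feed(&w); /* fed from the datalogging handler */
        }
    }
    return 0;
}
```

Under these assumptions a 90 second timeout corresponds to a reload value of 2949119, and with a 60 second logging interval the model never expires while logging continues. If logging stops, the model expires 90 seconds after the last successful log, e.g. simulate(90, 60, 100, 600) reports expiry at second 150.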
However, after reading the watchdog documentation and various previous questions on DevZone, I have realised our application is at risk of hitting a number of corner cases, particularly regarding the interaction between the watchdog and buttonless DFU, e.g.:
https://devzone.nordicsemi.com/f/nordic-q-a/28106/dfu-and-watchdog-timer-wdt-reset
From this I understand we will need to make the following changes to the bootloader in order to use the watchdog with our application:
- Remove the clock uninit from the bootloader code
- Add a timer to the bootloader to kick the dog during DFU
Assuming this is correct I have the following additional questions:
- Are any further bootloader modifications required to ensure that buttonless DFU and the watchdog are compatible? The modifications noted above don't appear to be formally documented, and we don't want to ship the product and then find out we missed something!
- Are there any other corner cases we should be aware of for our application, such as the behaviour of the watchdog when the application is placed in system_off?
- What is the best way to check that the SoftDevice tasks are still functioning correctly (i.e. advertising or maintaining a BLE connection correctly), so that we can check for this before feeding the dog in our application code?
- Are there any other recommendations, in addition to enabling the watchdog, for hardening our application for production (e.g. relating to power management)?
This is our first time taking a buttonless embedded system into production, so any resources or whitepapers (from Nordic or elsewhere) that would help guide us on things we should consider would be very welcome!