
Watchdog Implementation Advice

We are preparing an application to enter production and seeking to harden it to protect against unexpected system lockup.

The application is a datalogger peripheral based on the nRF52840 and SDK 14.2, with BLE used to intermittently transfer data to a central. The application code was developed around the CGMS example code from the SDK, and the system has the following attributes:

- Buttonless and battery powered (battery not user replaceable)

- Shipped in system_off mode (system will wake when a sensor is triggered, monitored using LPCOMP event)

- Device can be factory reset by writing to a BLE characteristic, which places the system back into system_off mode and again awaits LPCOMP event

- While awake the device increments a counter based on LPCOMP interrupts and writes the counter value to a circular RAM buffer at 60 second intervals

- Between the counter interrupts the system sleeps with power_manage()

- Bootloader is present and firmware can be updated using buttonless DFU

Our plan is to add a watchdog with a 90 second timeout that keeps running during CPU sleep, and to feed it in the handler for the datalogging event. The reasoning is simply that if the device stops logging data, it must have hung.
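
For reference, the nRF52 WDT counts down from its CRV register on the 32.768 kHz clock, with timeout [s] = (CRV + 1) / 32768. The following is a minimal sketch of that conversion for the 90 second timeout mentioned above. The function name `wdt_crv_from_ms` and the clamping behaviour are assumptions made for illustration, not SDK API.

```c
#include <stdint.h>

/* The WDT counter is clocked at 32.768 kHz. */
#define WDT_CLOCK_HZ 32768u
/* Smallest CRV value accepted by the peripheral. */
#define WDT_CRV_MIN  0x0000000Fu

/* Convert a timeout in milliseconds to a CRV register value.
 * timeout [s] = (CRV + 1) / 32768, so CRV = ticks - 1.
 * Out-of-range requests are clamped (an illustrative choice). */
uint32_t wdt_crv_from_ms(uint32_t timeout_ms)
{
    uint64_t ticks = ((uint64_t)timeout_ms * WDT_CLOCK_HZ) / 1000u;

    if (ticks < (uint64_t)WDT_CRV_MIN + 1u) {
        return WDT_CRV_MIN;
    }
    if (ticks - 1u > UINT32_MAX) {
        return UINT32_MAX;
    }
    return (uint32_t)(ticks - 1u);
}
```

With this formula, a 90 second timeout corresponds to a CRV of 0x2CFFFF.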

However, after reading the documentation for the watchdog and various previous questions on the devzone, I have realised our application is at risk of encountering a number of corner cases, particularly with regard to the interaction between the watchdog and buttonless DFU, e.g.:

https://devzone.nordicsemi.com/f/nordic-q-a/28106/dfu-and-watchdog-timer-wdt-reset

https://devzone.nordicsemi.com/f/nordic-q-a/30677/bootloader-doesn-t-load-application-with-watchdog/121425#121425

From this I understand we will need to make the following changes to the boot loader in order to use the watchdog with our application:

- Remove clock uninit from boot loader code

- Add a timer to the boot loader to kick the dog during DFU

Assuming this is correct I have the following additional questions:

- Are there any further modifications to the boot loader required to ensure compatibility of buttonless DFU and the watchdog? The modifications noted above don't appear to be formally documented and we don't want to ship the product and then find out we missed something!

- Are there any other corner cases we should be aware of for our application, such as the behaviour of the watchdog when the application is placed in system_off?

- What is the best way to check that the soft device tasks are still functioning correctly (i.e. advertising or maintaining BLE connection correctly) such that we can check for this before feeding the dog in our application code?

- Are there any other recommendations in addition to enabling the watchdog for hardening our application for production (e.g. relating to power management, etc.)?

This is our first time taking a buttonless embedded system into production so if there are any resources/whitepapers (from Nordic or elsewhere) that would help guide us on things we should consider they would be welcomed!

Parents
  • Hi,

    From this I understand we will need to make the following changes to the boot loader in order to use the watchdog with our application:

    - Remove clock uninit from boot loader code

    - Add a timer to the boot loader to kick the dog during DFU

    Assuming this is correct I have the following additional questions:

    Yes, this is correct. It should not be necessary to make any other changes. That said, I would strongly recommend that you port your existing code to SDK 15.2/s140 v6.1.0. The SDK14/s140v5 combination should not be used in production because of limited testing and verification for the 52840. The MBR shipped with s140 v5, for instance, does not allow bootloader updates, which will prevent DFU of future softdevices (i.e., major updates with breaking API changes).

    The WDT issues have been addressed in the SDK 15 bootloader.

    Are there any other corner cases we should be aware of for our application, such as the behaviour of the watchdog when the application is placed in system_off

    Only that the WDT will not run while the chip is in System OFF mode since all clock sources are powered down. The WDT will start to run again once the device is woken up by an LPCOMP event.

    What is the best way to check that the soft device tasks are still functioning correctly (i.e. advertising or maintaining BLE connection correctly) such that we can check for this before feeding the dog in our application code?

    The softdevice will notify the application if an unrecoverable error occurs (SD error handling). It should be safe to assume that the softdevice is functioning correctly as long as this doesn't happen. You may use the radio notification feature if you want to verify that there are periodic radio events, but keep in mind that it may have a slight impact on current consumption as the application will be woken up on radio events that would otherwise have been running in the "background".
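
One way to turn this into a check before feeding the watchdog is to record when the last radio event was seen and whether a fault has been reported. The sketch below is a plain C model of that idea under stated assumptions: all names (`sd_liveness_t`, `sd_liveness_on_radio_event`, and so on) are hypothetical, the seconds counter is supplied by the caller, and hooking the functions into the radio notification and fault handlers is left to the application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical liveness record, updated from application callbacks. */
typedef struct {
    uint32_t last_radio_event_s;  /* time of the most recent radio event */
    uint32_t max_silence_s;       /* longest acceptable gap between events */
    bool     fault_seen;          /* set when a fatal error is reported */
} sd_liveness_t;

void sd_liveness_init(sd_liveness_t *s, uint32_t now_s, uint32_t max_silence_s)
{
    s->last_radio_event_s = now_s;
    s->max_silence_s = max_silence_s;
    s->fault_seen = false;
}

/* Assumed to be called from the radio notification handler. */
void sd_liveness_on_radio_event(sd_liveness_t *s, uint32_t now_s)
{
    s->last_radio_event_s = now_s;
}

/* Assumed to be called from the application's fault handler. */
void sd_liveness_on_fault(sd_liveness_t *s)
{
    s->fault_seen = true;
}

/* True when no fault was reported and a radio event was seen recently.
 * Unsigned subtraction keeps the comparison valid across counter wrap. */
bool sd_liveness_ok(const sd_liveness_t *s, uint32_t now_s)
{
    if (s->fault_seen) {
        return false;
    }
    return (uint32_t)(now_s - s->last_radio_event_s) <= s->max_silence_s;
}
```

The application would then feed the watchdog only when `sd_liveness_ok()` returns true. The acceptable silence has to be chosen to match the advertising or connection interval in use.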

    Are there any other recommendations in addition to enabling the watchdog for hardening our application for production (e.g. relating to power management, etc.)?

    This is our first time taking a buttonless embedded system into production so if there are any resources/whitepapers (from Nordic or elsewhere) that would help guide us on things we should consider they would be welcomed!

    Not that I can think of. The WDT should be very effective in preventing the FW app from becoming stuck as long as it is properly implemented. You may consider using more than one RR register and reloading them in separate parts of the code to ensure that the program flow is correct. E.g., the program can potentially get stuck in an error handler, etc., while the WDT is still being reloaded by a higher priority interrupt event.
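
To illustrate why several RR registers help, here is a small host-side model of the reload rule: the counter restarts only once every enabled reload request channel has been written. This is a sketch and not driver code; the type and function names (`wdt_model_t`, `wdt_model_feed`, ...) are invented for the example, and time is advanced manually in arbitrary ticks.

```c
#include <stdbool.h>
#include <stdint.h>

#define WDT_RR_COUNT 8u  /* the nRF52 WDT has eight reload request registers */

typedef struct {
    uint8_t  enabled_mask;  /* channels that must all be fed */
    uint8_t  pending_mask;  /* channels fed since the last reload */
    uint32_t reload_ticks;  /* value the counter restarts from */
    uint32_t counter;       /* ticks left before a watchdog reset */
} wdt_model_t;

void wdt_model_start(wdt_model_t *w, uint8_t enabled_mask, uint32_t reload_ticks)
{
    w->enabled_mask = enabled_mask;
    w->pending_mask = 0;
    w->reload_ticks = reload_ticks;
    w->counter = reload_ticks;
}

/* Request a reload on one channel. The counter restarts only when
 * every enabled channel has requested a reload. */
void wdt_model_feed(wdt_model_t *w, uint8_t channel)
{
    if (channel >= WDT_RR_COUNT) {
        return;
    }
    w->pending_mask |= (uint8_t)((1u << channel) & w->enabled_mask);
    if (w->pending_mask == w->enabled_mask) {
        w->counter = w->reload_ticks;
        w->pending_mask = 0;
    }
}

/* Advance time; returns true if the watchdog would have expired. */
bool wdt_model_advance(wdt_model_t *w, uint32_t ticks)
{
    if (ticks >= w->counter) {
        w->counter = 0;
        return true;
    }
    w->counter -= ticks;
    return false;
}
```

With two channels enabled, feeding only channel 0 (for example from an interrupt that still fires) does not restart the counter, so a main loop that never reaches its own feed on channel 1 still leads to a reset.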

Children
  • Hi Vidar - thanks for your help.

    Noted on the recommendation to move to SDK 15.2 for the 52840. Can you confirm if the fix for the peer manager bug identified in SDK 14.2 as described in the below case has been rolled into 15.2 or will we need to re-patch the newer SDK?

    https://devzone.nordicsemi.com/f/nordic-q-a/29926/bug-in-peer-manager-in-sdk-14-2

    Regarding proper implementation of the WDT, I understand the benefit of multiple RR registers when there is more than one predictable code path through the application, but our application spends the majority of its time sleeping to conserve power and waiting for external events to wake it up. Sleep periods may happen for extended periods, e.g. when advertising. I therefore don't believe we can use RR registers directly to monitor the system is working correctly, as we can't rely on events coming in to reset them.

    There are a great many event handlers that link the SDK to our application and act as entry points to the codebase, none of which can be relied upon to be called regularly enough to reload an RR register.

    One potential approach would be to set a flag at the start and end of each event handler in our application code and then have a monitor task (running on a timer) to check the flags and reset the RR register if all ok, but this seems like a complex solution. It also would not catch any issues occurring within the SDK code itself (like the peer manager bug discussed above).
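
A rough sketch of what such a flag-based monitor could look like, written as plain C so it can be tried on a host. Everything here is an assumption for illustration: the names, the number of instrumented handlers, and the rule that a handler seen busy on two consecutive monitor passes counts as stuck.

```c
#include <stdbool.h>
#include <stdint.h>

#define HANDLER_COUNT 3u  /* hypothetical number of instrumented handlers */
#define STUCK_LIMIT   2u  /* consecutive monitor passes a handler may stay busy */

typedef struct {
    volatile bool busy[HANDLER_COUNT];    /* true between entry and exit */
    uint8_t busy_passes[HANDLER_COUNT];   /* consecutive passes seen busy */
} flow_monitor_t;

/* Call at the start of an instrumented event handler. */
void flow_enter(flow_monitor_t *m, uint32_t id)
{
    if (id < HANDLER_COUNT) {
        m->busy[id] = true;
    }
}

/* Call at the end of an instrumented event handler. */
void flow_exit(flow_monitor_t *m, uint32_t id)
{
    if (id < HANDLER_COUNT) {
        m->busy[id] = false;
    }
}

/* Called from the monitor timer. Returns true when no handler has been
 * busy for STUCK_LIMIT consecutive passes, i.e. it is safe to feed. */
bool flow_monitor_check(flow_monitor_t *m)
{
    bool healthy = true;

    for (uint32_t id = 0; id < HANDLER_COUNT; id++) {
        if (!m->busy[id]) {
            m->busy_passes[id] = 0;
            continue;
        }
        if (m->busy_passes[id] < STUCK_LIMIT) {
            m->busy_passes[id]++;  /* saturate instead of wrapping */
        }
        if (m->busy_passes[id] >= STUCK_LIMIT) {
            healthy = false;
        }
    }
    return healthy;
}
```

The monitor timer would reload the RR register only when `flow_monitor_check()` returns true. If a handler hangs at a priority that blocks the monitor timer itself, the monitor never runs and the watchdog expires anyway.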

    Do you have a recommended way to ensure coverage of all code paths in an event based system like Nordic have implemented in the SDK? Is there a single place where such a check can be implemented for all SDK/soft device generated events?

  • Unfortunately, this bug was not fixed in SDK 15.2.0, so it needs to be re-patched.

    I don't think it's necessary to have WD coverage for all code paths. Maybe reload one register from the data logging event and another one from the main loop before entering sleep. This should ensure that the logging event is triggered periodically, and that it always returns to the main loop afterwards.

    You can use the advertising timeout to wake the application when advertising for extended periods of time. E.g. start advertising with the timeout set to 80 seconds, then on timeout, feed the WDT and restart advertising (remember to reload both RR registers in this event if data logging is not running in "adv. mode").
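
The timing margin in this scheme can be checked with a small host-side simulation: a watchdog with a given timeout is fed only when a periodic event (here standing in for the advertising timeout) fires. This is a sketch with one second resolution; the function name `wdt_expires` is made up, and a single feed stands in for reloading all enabled RR registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a watchdog with wdt_timeout_s would expire within
 * run_s seconds when the only feed happens every feed_period_s seconds.
 * A feed that lands exactly on the timeout is treated as too late. */
bool wdt_expires(uint32_t wdt_timeout_s, uint32_t feed_period_s, uint32_t run_s)
{
    if (wdt_timeout_s == 0u) {
        return true;
    }

    uint32_t remaining = wdt_timeout_s;
    uint32_t since_feed = 0;

    for (uint32_t t = 0; t < run_s; t++) {
        /* One second passes. */
        remaining--;
        since_feed++;

        if (remaining == 0u) {
            return true;  /* watchdog reset */
        }
        if (since_feed == feed_period_s) {
            remaining = wdt_timeout_s;  /* periodic event feeds the WDT */
            since_feed = 0;
        }
    }
    return false;
}
```

With a 90 second watchdog, an 80 second advertising timeout keeps the device alive, while a period of 90 seconds or more does not leave any margin.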
