Best practices for robust fault recovery with nRF51822?

I will be deploying nRF51822-powered devices in the field, where the end user will have no physical access to the board, for example to power cycle it. End users need to be able to connect to the peripheral via BLE (S110 stack 5.2.1 running on the nRF51822), receive some data, and disconnect, after which the device must begin advertising again. I have this all working, but I am concerned about robustness in a long-term, multi-device deployment where I cannot access the devices for many months.

Are there any recommended best practices from experienced developers on how to configure the device to recover / reboot from any issues? It is OK if the device resets. It is NOT OK if the device is left in a state where it is no longer advertising, stuck in an infinite loop, etc.

I am currently considering:

  1. A first watchdog timer, reset every time that sd_app_event_wait() returns. (Should I look for only certain return values?)
  2. After a connection is established, start a second watchdog timer that is reset only by a field set by the remote client? This would cover a fault within the S110 or BLE stack where calls still return valid error codes but the device is not truly in communication with the remote.
  3. HardFault_Handler() -- do I need to do anything here?
  4. NMI_Handler() -- do I need to do anything here?
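
To make options 1 and 2 concrete, here is a minimal, hardware-independent sketch of the idea: one hardware watchdog is fed only while every software liveness check passes, so a stalled main loop or a silent client both end in a reset. Every name in it (supervisor_t, supervisor_may_feed(), and so on) is hypothetical and not part of the S110 or nRF51 SDK; the tick source, the timeout values, and the point where the real watchdog is reloaded are assumptions that would have to be adapted to the actual design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical liveness supervisor (illustrative names, not an SDK API).
 * Times are in arbitrary ticks from a free-running 32-bit counter. */
typedef struct {
    uint32_t loop_timeout;     /* max ticks allowed between main-loop check-ins */
    uint32_t client_timeout;   /* max ticks allowed between client heartbeats   */
    uint32_t last_loop_tick;   /* last time the main loop reported progress     */
    uint32_t last_client_tick; /* last time the remote client wrote its field   */
    bool     connected;        /* client check applies only while connected     */
} supervisor_t;

void supervisor_init(supervisor_t *s, uint32_t now,
                     uint32_t loop_timeout, uint32_t client_timeout)
{
    s->loop_timeout     = loop_timeout;
    s->client_timeout   = client_timeout;
    s->last_loop_tick   = now;
    s->last_client_tick = now;
    s->connected        = false;
}

/* Call from the main loop, e.g. each time the event-wait call returns. */
void supervisor_loop_alive(supervisor_t *s, uint32_t now)
{
    s->last_loop_tick = now;
}

/* Call from the BLE connect handler: start the client deadline. */
void supervisor_on_connect(supervisor_t *s, uint32_t now)
{
    s->connected        = true;
    s->last_client_tick = now;
}

/* Call from the BLE disconnect handler: stop checking the client. */
void supervisor_on_disconnect(supervisor_t *s)
{
    s->connected = false;
}

/* Call when the remote client writes the agreed heartbeat field. */
void supervisor_client_heartbeat(supervisor_t *s, uint32_t now)
{
    s->last_client_tick = now;
}

/* Returns true only if every active check is within its deadline.
 * Unsigned subtraction keeps the comparison correct across counter wrap. */
bool supervisor_may_feed(const supervisor_t *s, uint32_t now)
{
    if ((uint32_t)(now - s->last_loop_tick) > s->loop_timeout) {
        return false;
    }
    if (s->connected &&
        (uint32_t)(now - s->last_client_tick) > s->client_timeout) {
        return false;
    }
    return true;
}
```

In use, a periodic timer handler would call supervisor_may_feed() and reload the hardware watchdog only when it returns true; if the handler itself stops running, the watchdog expires anyway. Both timeouts need to be longer than the longest legitimate idle period, otherwise a healthy but quiet device would reset itself.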

I hope these responses will be useful to anyone needing this kind of fault tolerance and recovery in their own designs. Many thanks in advance for contributing your experience and advice!
