Firmware randomly hangs for minutes or hours

Hello all,

First, thank you for the support and community here, which is very helpful.

I have been developing a device based on the XIAO nRF52840 board (link: https://wiki.seeedstudio.com/XIAO_BLE/).

Development has been going well with our custom PCB soldered to the XIAO BLE. We are using the latest nRF Connect SDK, 2.9.1 (the same problem occurs on 2.9.0 and 2.8.0).
We have implemented all features successfully: BLE pairing, BLE advertising with an accept list, battery reading, charging state, etc.

We have only one major bug:
Sometimes the board stops working, randomly and in the field. Here is a list of facts describing what happens:

- The UF2 bootloader (from Adafruit) works properly.
- The UF2 bootloader launches our firmware, but the firmware does not start: no logs, no LED, nothing. A ghost.
- This is the bug: Zephyr OS never starts properly. No logs over USB, no advertising, no LED, nothing.
- 10 seconds, 10 minutes, or a few hours later (the delay varies widely), the firmware starts, the LED turns on, and BLE advertising begins.
- Not all boards are affected equally. Some boards hit the bug very often; others have never shown it so far.


- Setting the 32k clock source to RC instead of XTAL (CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC instead of CONFIG_CLOCK_CONTROL_NRF_K32SRC_XTAL) seems to greatly mitigate the problem but does not fix it.
- Heating the PCB with our hands or with a soldering iron seems to trigger the bug more often.
- Otherwise the bug triggers very randomly, roughly once every 5 days, which makes it hard to debug.


- The bug also happens with the Zephyr OS Blinky sample. The sample runs fine until the bug occurs, and then it stops working too. I therefore conclude that my own code on top of Zephyr OS is probably not the source of the problem.


- The problem persists across a physical reset (through the reset button) and across power loss. I can unplug the board for five minutes, plug it back in, and the bug is still there. Then, a few minutes or a few hours later, the board starts up again.


- Our custom PCB, which is soldered onto the XIAO BLE, is minimal: just wires to two connectors, a few capacitors for the buttons, the buttons themselves, and a battery. Nothing more.


- The bug happens on battery, on USB, and on battery + USB.


- The bug is rare on the bench but shows up regularly in the field. So far I have not been able to catch it over SWD with the Segger debugger I have. Perhaps a debug session will show that the code is blocked somewhere obvious.

The first thing I can conclude is that the UF2 bootloader works but Zephyr OS does not, so the chip and the board themselves still work. Therefore, Zephyr OS must be starting but then hanging somewhere in its initialization (reminder: the Blinky sample hangs in the same way).

I have read everything I could find online on any subject that might help. I found some ideas:


- The 32k clock isn't working properly. It is required for all scheduling, so maybe Zephyr is waiting for the clock to tick. This seems unlikely, though, because the bug happens with both clock sources (XTAL and RC). (source: many places on the internet and this forum)
- The Devicetree currently contains the following: zephyr,entropy = &cryptocell;. Maybe some kind of entropy starvation explains the late starts. I have yet to test the product in the field with this potential fix: zephyr,entropy = &rng;. (source: devzone.nordicsemi.com/.../random-zephyr-hangs-due-to-rng-entropy)
- The watchdog that I configured misbehaves, even across restarts. This seems unlikely because the bug survives power loss.
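
One way to check the watchdog idea is to look at the reset reason once the firmware finally starts. Below is a minimal sketch in plain C (so it can run off-target) that decodes a raw POWER->RESETREAS value using the bit positions from the nRF52840 product specification. The function name describe_resetreas and the lookup table are hypothetical helpers, not part of any SDK. On target, the raw value would have to come from the register itself or from Zephyr's hwinfo_get_reset_cause(), and it is an assumption that the bootloader has not already cleared or changed it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions of POWER->RESETREAS on the nRF52840. */
struct reset_cause {
    uint32_t mask;
    const char *name;
};

static const struct reset_cause causes[] = {
    { 1u << 0,  "reset pin" },
    { 1u << 1,  "watchdog" },
    { 1u << 2,  "soft reset" },
    { 1u << 3,  "CPU lockup" },
    { 1u << 16, "wake from System OFF (GPIO)" },
    { 1u << 17, "wake from System OFF (LPCOMP)" },
    { 1u << 18, "wake from System OFF (debug interface)" },
    { 1u << 19, "wake from System OFF (NFC)" },
    { 1u << 20, "wake from System OFF (VBUS)" },
};

/*
 * Hypothetical helper: writes a comma-separated list of the flagged causes
 * into buf and returns how many flags were set. A value of 0 means no flag
 * is set, which the product specification describes as a power-on or
 * brownout reset.
 */
int describe_resetreas(uint32_t reg, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    int count = 0;
    size_t used = 0;

    buf[0] = '\0';
    if (reg == 0) {
        snprintf(buf, len, "power-on or brownout");
        return 0;
    }

    for (size_t i = 0; i < sizeof(causes) / sizeof(causes[0]); i++) {
        if ((reg & causes[i].mask) == 0) {
            continue;
        }
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? "," : "", causes[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            break; /* output truncated, stop appending */
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

If the watchdog bit shows up after one of the long hangs, the watchdog idea deserves a second look; if the value is 0 after a power cycle, that matches a plain power-on reset.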

In any case, here is my prj.conf.

CONFIG_BT_DIS_FW_REV_STR="1.0.1+0"

# I used this when testing RC instead of the crystal for the 32k clock
# CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y
# CONFIG_CLOCK_CONTROL_NRF_K32SRC_500PPM=y
# CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC_CALIBRATION=y
# CONFIG_CLOCK_CONTROL_NRF_CALIBRATION_LF_ALWAYS_ON=y

# I use this when connecting through SWD to view logs
# CONFIG_USE_SEGGER_RTT=y
# CONFIG_RTT_CONSOLE=y

# USB
CONFIG_USB_DEVICE_REMOTE_WAKEUP=n
CONFIG_USB_DEVICE_MANUFACTURER="XX"
CONFIG_USB_DEVICE_PRODUCT="XX"

# GPIOs
CONFIG_GPIO=y
CONFIG_INPUT=y
CONFIG_PWM=y
CONFIG_LED=y
CONFIG_LED_PWM=y
CONFIG_ADC=y

# Floating point support
CONFIG_FP16=y
CONFIG_CBPRINTF_FP_SUPPORT=y

# Allow to read the hardware unique ID in the BLE MAC Address format.
CONFIG_HW_ID_LIBRARY=y
CONFIG_HW_ID_LIBRARY_SOURCE_DEVICE_ID=y

# Watchdog
CONFIG_WATCHDOG=y

# BLE Transmit power
CONFIG_BT_CTLR_TX_PWR_PLUS_4=y

# Power management and poweroff
CONFIG_REBOOT=y
CONFIG_POWEROFF=y
CONFIG_PM_DEVICE=y
CONFIG_NRFX_QSPI=y

# Enable logging
CONFIG_LOG=y
CONFIG_LOG_BUFFER_SIZE=2048
CONFIG_CONSOLE=y
CONFIG_SERIAL=y

# Enable persistent settings
CONFIG_FLASH=y
CONFIG_NVS=y
CONFIG_NVS_LOG_LEVEL_WRN=y
CONFIG_FLASH_MAP=y
CONFIG_SETTINGS=y
CONFIG_BT_SETTINGS=y
CONFIG_SETTINGS_RUNTIME=y

# Enable BLE
CONFIG_BT=y
CONFIG_BT_PRIVACY=y
CONFIG_BT_PERIPHERAL=y
CONFIG_BT_DEVICE_NAME="XX"
CONFIG_BT_LOG_LEVEL_WRN=y
CONFIG_BT_SMP=y
CONFIG_BT_KEYS_OVERWRITE_OLDEST=y
CONFIG_BT_SMP_ALLOW_UNAUTH_OVERWRITE=y
CONFIG_BT_FILTER_ACCEPT_LIST=y
CONFIG_BT_MAX_PAIRED=5

# Enable Device Information Service (DIS)
CONFIG_BT_DIS=y
CONFIG_BT_DIS_SETTINGS=y
CONFIG_BT_DIS_STR_MAX=21
CONFIG_BT_DIS_PNP=n
CONFIG_BT_DIS_SERIAL_NUMBER=y
CONFIG_BT_DIS_MODEL="XX"
CONFIG_BT_DIS_MANUF="XX"
CONFIG_BT_DIS_FW_REV=y
CONFIG_BT_DIS_HW_REV=y
CONFIG_BT_DIS_HW_REV_STR="1"

# Enable Battery Service (BAS)
CONFIG_BT_BAS=y
CONFIG_BT_BAS_BLS=y
CONFIG_BT_BAS_BLS_BATTERY_LEVEL_PRESENT=y
CONFIG_BT_BAS_BLS_ADDITIONAL_STATUS_PRESENT=y

# Set our preferred BLE connection parameters
# Minimum preferred connection interval 7.5ms
# Maximum preferred connection interval 15.0ms
# Preferred peripheral latency is 30 connection intervals
# Preferred timeout is 2 seconds
CONFIG_BT_GAP_AUTO_UPDATE_CONN_PARAMS=y
CONFIG_BT_PERIPHERAL_PREF_MIN_INT=6   
CONFIG_BT_PERIPHERAL_PREF_MAX_INT=12  
CONFIG_BT_PERIPHERAL_PREF_LATENCY=30  
CONFIG_BT_PERIPHERAL_PREF_TIMEOUT=200 
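
As a sanity check on the connection parameters above, the raw values can be converted to time units. This is a minimal sketch in plain C, assuming the units from the Bluetooth Core Specification (1.25 ms per connection interval unit, 10 ms per supervision timeout unit) and its rule that the supervision timeout must be greater than (1 + latency) * maximum interval * 2. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Connection interval: units of 1.25 ms, valid range 6..3200. */
uint32_t conn_interval_us(uint16_t units)
{
    return units * 1250u;
}

/* Supervision timeout: units of 10 ms. */
uint32_t supervision_timeout_us(uint16_t units)
{
    return units * 10000u;
}

/*
 * Returns 1 when the supervision timeout is strictly greater than
 * (1 + latency) * max_interval * 2, and 0 otherwise.
 */
int conn_params_valid(uint16_t max_int, uint16_t latency, uint16_t timeout)
{
    assert(max_int >= 6 && max_int <= 3200);

    uint32_t min_timeout_us = (1u + latency) * conn_interval_us(max_int) * 2u;

    return supervision_timeout_us(timeout) > min_timeout_us;
}
```

With the values from this prj.conf (6, 12, 30, 200), the intervals are 7.5 ms and 15 ms, the timeout is 2 s, and the smallest allowed timeout is 930 ms, so the set is consistent with the comments.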

Any help would be very welcome. I am out of ideas.

Thank you in advance for your answers!

Martin
