ZBOSS Stack Locking Threads

HW/SW Versions:

nRF52840 Dongle

nRF Connect SDK 2.5.2

Issue:

I'm developing a Zigbee device based on ZBOSS, using an nRF52840 Dongle, and there seems to be an issue with the ZBOSS stack in my project configuration that I don't know how to debug (or whether it's even possible without access to the source code).

Periodically (anywhere from seconds to hours after the stack is initialized) the device ends up in a state where no other tasks are executing. Pausing with a debugger shows that the ZBOSS thread is the only one being executed.

It's evident that no other tasks are running: the USB shell no longer responds, periodic logging over RTT from a different thread stops appearing in the debugger, and a test GPIO task (toggling a pin from an OS-scheduled task) no longer changes the GPIO value.

During this time ZBOSS osif/trace doesn't produce any logs, and a device reset puts the device back into a regular state.

Sniffing the Zigbee network shows that the issue is more likely to happen during device discovery.

The entire project configuration can be found at https://git.metznet.ca/noah/jamie. Both `prj_release.conf` and `prj_debug.conf` exhibit this behaviour (the difference between the two is that `debug` is built with additional debugging information, while `release` has FOTA enabled and uses MCUboot instead of booting the image directly).

I'm not sure what other logs or information would be relevant here, so let me know and I can attach them. The ZBOSS osif and trace logs (using synchronous logging over RTT) show different output each time the system is in this failure mode, and pausing with a debugger shows the ZBOSS thread's stack in a different state each time. Resuming then triggers the breakpoint on the hard fault handler, with no relevant information in the hard fault registers and a corrupt stack for the ZBOSS thread. The ZBOSS thread seems to be the likely culprit, since the stacks of all the other threads show them having yielded and only the ZBOSS thread has a corrupt stack. Enabling asserts doesn't trigger any additional logs or faults, so it doesn't seem like the information being passed to the stack is incorrect.
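
For anyone wanting to double-check the fault registers on a similar hang, here is a minimal host-side sketch that decodes a raw Cortex-M4 CFSR value (read from `0xE000ED28` in the debugger) into flag names. The bit positions follow the ARMv7-M architecture; the names `cfsr_flags` and `cfsr_decode` are hypothetical helpers for illustration and are not part of Zephyr, NCS, or ZBOSS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One named bit of the Configurable Fault Status Register (ARMv7-M). */
struct cfsr_flag {
    uint32_t mask;
    const char *name;
};

/* Hypothetical lookup table, ordered by ascending bit position. */
static const struct cfsr_flag cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL" },    /* MemManage: instruction access violation */
    { 1u << 1,  "DACCVIOL" },    /* MemManage: data access violation */
    { 1u << 3,  "MUNSTKERR" },   /* MemManage: fault on exception return unstacking */
    { 1u << 4,  "MSTKERR" },     /* MemManage: fault on exception entry stacking */
    { 1u << 7,  "MMARVALID" },   /* MMFAR holds a valid address */
    { 1u << 8,  "IBUSERR" },     /* BusFault: instruction bus error */
    { 1u << 9,  "PRECISERR" },   /* BusFault: precise data bus error */
    { 1u << 10, "IMPRECISERR" }, /* BusFault: imprecise data bus error */
    { 1u << 11, "UNSTKERR" },    /* BusFault: fault on exception return unstacking */
    { 1u << 12, "STKERR" },      /* BusFault: fault on exception entry stacking */
    { 1u << 15, "BFARVALID" },   /* BFAR holds a valid address */
    { 1u << 16, "UNDEFINSTR" },  /* UsageFault: undefined instruction */
    { 1u << 17, "INVSTATE" },    /* UsageFault: invalid EPSR state */
    { 1u << 18, "INVPC" },       /* UsageFault: invalid PC load on exception return */
    { 1u << 19, "NOCP" },        /* UsageFault: no coprocessor */
    { 1u << 24, "UNALIGNED" },   /* UsageFault: unaligned access */
    { 1u << 25, "DIVBYZERO" },   /* UsageFault: divide by zero */
};

/*
 * Write the space-separated names of all set CFSR flags into buf.
 * Returns the number of names written; stops early if buf is too small.
 */
static int cfsr_decode(uint32_t cfsr, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    size_t used = 0;
    int count = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(cfsr_flags) / sizeof(cfsr_flags[0]); i++) {
        if ((cfsr & cfsr_flags[i].mask) == 0) {
            continue;
        }
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? " " : "", cfsr_flags[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0'; /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Calling `cfsr_decode(value, buf, sizeof buf)` with the value read from the debugger fills `buf` with the flag names. A stacking flag such as `MSTKERR` or `STKERR` would be one hint that a thread stack has overflowed or been overwritten, while a value of zero matches the "no relevant information" case described above.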

Thanks for reading.

  • Thank you for the clarification. There were two reasons for needing this to be clear:

    • Only one of the Zigbee samples in NCS supports the nRF52840 Dongle out of the box
    • I saw references to more than one configurable switch in your code. The Dongle has one user configurable switch and a reset button. See the Connections and IOs part of the nRF52840 Dongle overview.

    I don't have a Dongle at the moment. I will request one and see if I can reproduce the behaviour. You'll hear from me before the end of the week.

    Be sure to continue to update us if you discover anything on your own.

    Best regards,

    Maria

  • Yeah, the shell sample in NCS is the only one I could find that supported the Dongle, so I ended up using it as an example to build on for this application (adding four Zigbee endpoints for OnOff/OnOffSwitchConfig and an endpoint for FOTA in the release build).

    The sw0-sw3 switches are defined in `boards/nrf52840dongle_nrf52840.overlay`, along with the app's LED and reset button assignments. The intention is to move away from the Dongle to a custom nRF52840 board after proving the prototype, so I'm intentionally not using any pre-existing pin assignments, which lets the code be moved to a custom board in the future.

    I was able to reproduce the same state on an "Adafruit Feather nRF52840" (with modifications to the DTS overlay for pin assignments) after causing the device to be interviewed a few times (by calling `zb_bdb_reset_via_local_action` after a successful interview), so you should be able to reproduce the issue on any nRF52840-based development kit.

    I was also able to reproduce the same state on the nRF52840 Dongle using the Zigbee shell example in NCS by changing the `zb_zcl_basic_attrs_t` to a `zb_zcl_basic_attrs_ext_t` and registering/populating the fields accordingly. Upon interview there was a chance for the shell to stop taking input, as the ZBOSS thread took all available CPU time without completing the interview.


    So to summarize (since I've spread info across a few posts now):

    - The switches in the code I've shared are defined in the board overlay file.

    - The issue is not limited to a single board or bootloader: running with no bootloader, with MCUboot, and with the nRF secure DFU bootloader (that the Dongle is sold with) all reproduce the issue, across two different boards (Dongle and Adafruit Feather).

    - The issue is reproducible in the Zigbee shell example when switching the basic_attrs to their extended versions, so that would make a better minimal example.

    - In the short-term case the issue is reproducible when the device is interviewed (there's a chance the interview completes successfully, and a chance the ZBOSS thread locks the OS). In the long-term case I don't know what caused the failure, but it could be similar read requests to the INFO endpoint with extended attributes.
