ZBOSS Stack Locking Threads

HW/SW Versions:

nrf52840 Dongle DK

nrf Connect SDK 2.5.2

Issue:

I'm developing a Zigbee device based on the ZBOSS stack, using an nrf52840 dongle development kit, and there seems to be an issue with the ZBOSS stack in my project configuration that I don't know how to debug (or whether it's even possible to debug without access to the source code).

Intermittently (anywhere from seconds to hours after the stack is initialized), the device is left in a state where no other tasks are executing. When I pause with a debugger, the debugging information shows that the ZBOSS thread is the only one being executed.

It's evident that no other tasks are running, since the USB shell no longer responds, periodic logging over RTT from a different thread no longer shows up in the debugger, and a test GPIO task (toggling a pin from an OS-scheduled task) no longer changes the GPIO value.

During this time ZBOSS osif/trace doesn't produce any logs, and a device reset puts the device back into a regular state.

From sniffing the Zigbee network, the issue appears more likely to happen during device discovery.

The entire project configuration can be found at https://git.metznet.ca/noah/jamie. Both `prj_release.conf` and `prj_debug.conf` exhibit this behaviour (the difference between the two is that `debug` is built with additional debugging information, and `release` has FOTA enabled and uses MCUboot instead of booting the image directly).

I'm not sure what other logs or information would be relevant here, so let me know and I can attach them. The ZBOSS osif and trace logs always show different outputs when the system is in this failure mode (using synchronous logging over RTT), and pausing with a debugger always shows the ZBOSS thread's stack in a different state. Resuming then triggers the breakpoint on the hard fault handler (with no relevant information in the hard fault registers, and a corrupt stack for the ZBOSS thread). The ZBOSS thread seems to be the likely culprit, since the stacks of all the other threads show them having yielded, and only the ZBOSS thread has a corrupt stack. Enabling asserts doesn't trigger any additional logs or faults, so it doesn't seem like the information being passed to the stack is incorrect.

Thanks for reading.

  • Creating a new top-level reply since I've found a clearer reproduction method.

    Starting from the zigbee shell sample in nrf connect v2.5.2 with nrf52840dongle_nrf52840 selected as the board:

    1. In `struct zb_device_ctx`, replace `zb_zcl_basic_attrs_t` with `zb_zcl_basic_attrs_ext_t`
    2. Replace `ZB_ZCL_DECLARE_BASIC_ATTRIB_LIST` with `ZB_ZCL_DECLARE_BASIC_ATTRIB_LIST_EXT` and update the macro with all of the pointers to `dev_ctx.basic_attr`
    3. Update `app_clusters_attr_init` to populate the strings with `ZB_ZCL_SET_STRING_VAL` from constant static strings, and set the other basic attributes using constant values
    4. Boot the device in the range of a joinable network.
    5. Wait until the device has completed the interview with the coordinator and the strings are visible on the coordinator (zigbee2mqtt in my case)
    6. Run `bdb factory_reset` in the zigbee shell to schedule a call to `zb_bdb_reset_via_local_action` in the ZBOSS thread
    7. Go to 5

    During the 5-7 loop, step 5 will eventually time out and put the device into the described failure state, where no threads are getting CPU time except the ZBOSS thread, and the ZBOSS thread is not processing any packets. This can be observed by the serial console no longer responding to input, and the network LED no longer showing any activity. Adding a GPIO callback that logs over RTT shows that hardware interrupts are still able to write to the RTT region of memory.
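
    Since step 3 writes strings into fixed-size attribute buffers, one thing worth ruling out is a write past the end of one of those buffers. The following is a minimal sketch, assuming the standard ZCL character string layout (one length octet followed by the characters, with no NUL terminator). `zcl_string_set_checked` is a hypothetical helper and not a ZBOSS API; as far as I can tell, `ZB_ZCL_SET_STRING_VAL` takes no buffer size and so does no bounds checking of its own.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper (not part of ZBOSS): store src in dst as a ZCL
     * character string, i.e. dst[0] holds the length and the characters
     * follow without a NUL terminator. Returns false and leaves dst
     * untouched if the value would not fit. */
    static bool zcl_string_set_checked(uint8_t *dst, size_t dst_size, const char *src)
    {
        assert(dst != NULL && src != NULL);

        size_t len = strlen(src);

        /* One octet is needed for the length prefix, and a length of 0xFF
         * is reserved by ZCL to mean "invalid string". */
        if (len > 0xFE || len + 1 > dst_size) {
            return false;
        }

        dst[0] = (uint8_t)len;
        memcpy(&dst[1], src, len);
        return true;
    }
    ```

    Calling this with `sizeof()` of each string field in `dev_ctx.basic_attr` in place of the raw macro would at least show whether any of the constant strings is longer than the field it is copied into. It does not explain the lock-up by itself.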

  • Have there been any updates on this?

    I'm able to consistently reproduce this by calling `bdb device_reset` in the sample after changing it to use extended basic attributes. Since this is a blocker for any long-term Zigbee device, what alternatives do I have while I await a resolution?

  • Hello,

    Sorry about the delay.

    I have not been able to reproduce the issue so far.

    Here are a few more things you can do:

    1. Double check that the coordinator has opened the network before your device starts to rejoin.

    2. See if the workaround for KRKNWK-12017 ("Zigbee End Device does not recover from broken rejoin procedure") in the known Zigbee issues list fixes your issue.

    3. Share your sniffer log (and network key) in a reply.

    Best regards,

    Maria
