ZBOSS Stack Locking Threads

HW/SW Versions:

nRF52840 Dongle DK

nRF Connect SDK 2.5.2

Issue:

I'm developing a Zigbee device based on the ZBOSS stack, using an nRF52840 dongle development kit. There seems to be an issue with the ZBOSS stack in my project configuration that I don't know how to debug (or whether debugging it is even possible without access to the source code).

Periodically (anywhere from seconds to hours after the stack is initialized) the device ends up in a state where no other tasks are executing. Pausing with a debugger shows that the ZBOSS thread is the only one being executed.

It's evident that no other tasks are running: the USB shell no longer responds, periodic logging over RTT from a different thread no longer shows up in the debugger, and a test GPIO task (toggling a pin from an OS-scheduled task) no longer changes the GPIO value.

During this time ZBOSS osif/trace doesn't produce any logs, and a device reset puts the device back into a regular state.

Sniffing the Zigbee network shows that the issue is more likely to happen during device discovery.

The entire project configuration can be found at https://git.metznet.ca/noah/jamie. Both `prj_release.conf` and `prj_debug.conf` exhibit this behaviour (the difference between the two is that `debug` is built with additional debugging information, while `release` has FOTA enabled and uses MCUboot instead of booting the image directly).

I'm not sure what other logs or information would be relevant here, so let me know and I can attach them. The ZBOSS osif and trace logs (using synchronous logging over RTT) show different output each time the system enters this failure mode, and pausing with a debugger always shows the ZBOSS thread's stack in a different state. Resuming then triggers the breakpoint on the hard fault handler, with no relevant information in the hard fault registers and a corrupt stack for the ZBOSS thread. The ZBOSS thread seems to be the likely culprit, since the stacks of all the other threads show them having yielded, and only the ZBOSS thread has a corrupt stack. Enabling asserts doesn't trigger any additional logs or faults, so it doesn't seem like the information being passed to the stack is incorrect.

Thanks for reading.

timl2415 over 1 year ago

I've also seen this issue crop up in both a coordinator and a Zigbee end device application. On the coordinator, during device discovery, I've seen the ZBOSS thread block other threads, eventually triggering a WDT reset. In Ozone, halting during the "hang" always lands in a ZBOSS API.

timl2415 over 1 year ago in reply to timl2415

I also configured SystemView to observe the system while it is in this state. It seems like there are a ton of ISRs firing in the ZBOSS thread. I'm still working on figuring out what these signals are. CSV attached.

zboss_freeze_sysview.csv

Maria Gilje over 1 year ago in reply to timl2415

Hi Tim,

I missed your update to this ticket. If you still need support on this, please create a new ticket.

Best regards,

Maria

NoahMetz over 1 year ago in reply to Maria Gilje

Hi Maria,

I would appreciate it if all relevant information could be kept within this ticket, including reproductions by other people.

The independent reproductions seem to confirm that this issue is present in the Zigbee stack. Is there any confirmation of a reproduction on your end?

timl2415 over 1 year ago in reply to NoahMetz

I agree with the above: this ticket is unresolved and recent, and it is likely the same issue as the one I am experiencing.

I dug further into this by adding print statements to the radio layer ZBOSS uses, in `zb_nrf_transceiver.c`. It seems like ZBOSS gets into this resource-hogging state when it runs out of ZB buffers. I already have `zb_mem_config_max.h` included, but I still run into cases where the buffers run out.

It seems like `zb_trans_get_next_packet` is polled until no more data is left on the radio; this API receives a new ZBOSS buffer on each loop iteration until all data is collected. During periods with a lot of data to receive (device interview, network start presumably due to PAN ID conflict resolution), I've seen the number of unique buffers handed to this API reach some limit, after which the stack invariably gets into the bad state.

In this state it seems like ZBOSS cannot handle the data that is already in the packets, since there was an error during RX, but it also cannot sleep, since there is data to process. Funnily enough, increasing the number of net_bufs actually exacerbates the issue: more bufs end up being allocated during an RX call.

I ended up having to make a custom config with the largest possible number of buffers in the buffer pool, so as to avoid ever running out.
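
For anyone who wants to reason about the counting I describe above, here is a minimal host-side sketch in plain C. It is a toy model only, not ZBOSS code: `buf_pool`, `pool_get`, `pool_put`, `drain_rx` and `POOL_SIZE` are hypothetical names and values made up for illustration. It assumes a fixed-size pool and a drain loop that takes one buffer per pending frame, and it tracks a high-water mark plus a count of failed allocations, which is roughly the information my print statements were giving me.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 8 /* hypothetical pool depth, not a real ZBOSS value */

/* Toy stand-in for a fixed-size buffer pool with simple statistics. */
struct buf_pool {
    size_t in_use;     /* buffers currently handed out */
    size_t high_water; /* largest in_use value seen so far */
    size_t alloc_fail; /* allocation attempts that found the pool empty */
};

/* Take one buffer from the pool; returns false when none are free. */
static bool pool_get(struct buf_pool *p)
{
    if (p->in_use >= POOL_SIZE) {
        p->alloc_fail++;
        return false;
    }
    p->in_use++;
    if (p->in_use > p->high_water) {
        p->high_water = p->in_use;
    }
    return true;
}

/* Return one buffer to the pool once its frame has been processed. */
static void pool_put(struct buf_pool *p)
{
    assert(p->in_use > 0);
    p->in_use--;
}

/*
 * Model of a poll-until-empty receive loop: one buffer is taken per
 * pending frame. Returns the number of frames that could not be
 * collected because the pool ran dry.
 */
static size_t drain_rx(struct buf_pool *p, size_t pending)
{
    while (pending > 0 && pool_get(p)) {
        pending--;
    }
    return pending;
}
```

In this model, a burst of frames that arrives before earlier buffers are released with `pool_put` leaves `drain_rx` returning a non-zero count, with `high_water` pinned at `POOL_SIZE`. That matches the pattern I saw just before the stack went bad, although the real stack's internal behaviour may well differ.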