Zboss "crash" in zb_schedule_alarm()

Hello,

I'm using the ZBOSS stack to implement a Zigbee end device on the nRF5340.

My device can join a Zigbee network and report sensor events over Zigbee, but after a random amount of time a crash occurs on the stack side.

I'm quite stuck debugging this issue.

It seems that the stack stops because of an error while executing zb_schedule_alarm(), but I have no other clues.

How would you debug such an issue?

Environment:

* nRF5340

* NRF_SDK_VERSION=v1.0.1
  NRF_SDK_NAME=ncs-zigbee-r22
  NRF_MAIN_SDK_VERSION=v2.9.0

Regards,

Gaël

  • Hi Gaël

    The team has decoded the traces you provided, and they show the exact line in zb_schedule_alarm where the assert occurs. The cause is that the alarm queue is full, having reached ZB_SCHEDULER_Q_SIZE entries.

    One possible explanation is that zb_buf_get_out_delayed_ext in the provided code is being called from interrupt context. zb_buf_get_out_delayed_ext calls zb_schedule_alarm internally, so race conditions might occur.

    A safer approach would be to use ZB_SCHEDULE_APP_CALLBACK to schedule the test_send_event_handler actions using ZBOSS context.
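To picture what "the queue is full" means here, below is a minimal toy model of a fixed-capacity callback queue. It is only a sketch and not the ZBOSS implementation: all `toy_*` names are hypothetical, the capacity of 24 is an assumed value, and where the real stack asserts, this model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed capacity for illustration; stands in for ZB_SCHEDULER_Q_SIZE. */
#define TOY_Q_SIZE 24

typedef void (*toy_cb_t)(int param);

struct toy_queue {
	toy_cb_t cb[TOY_Q_SIZE];
	int param[TOY_Q_SIZE];
	size_t head;  /* index of the oldest pending entry */
	size_t count; /* number of pending entries */
};

/* Example callback that adds its parameter to a running total. */
int toy_runs;
void toy_count(int param)
{
	toy_runs += param;
}

/* Queue one callback; returns false when no slot is free. */
bool toy_schedule(struct toy_queue *q, toy_cb_t cb, int param)
{
	assert(cb != NULL);
	if (q->count == TOY_Q_SIZE) {
		return false; /* the toy model reports failure here */
	}
	size_t tail = (q->head + q->count) % TOY_Q_SIZE;
	q->cb[tail] = cb;
	q->param[tail] = param;
	q->count++;
	return true;
}

/* Run and remove the oldest entry; returns false when nothing is pending. */
bool toy_run_one(struct toy_queue *q)
{
	if (q->count == 0) {
		return false;
	}
	toy_cb_t cb = q->cb[q->head];
	int param = q->param[q->head];
	q->head = (q->head + 1) % TOY_Q_SIZE;
	q->count--;
	cb(param);
	return true;
}
```

In this model a slot is freed only when `toy_run_one()` executes an entry, so any producer that schedules entries faster than they are run will eventually make `toy_schedule()` fail.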

    Regards,
    Amanda H.

  • Hello Amanda,

    Thanks for your feedback.

    1. The trace I sent you was from my firmware (not the sample app), and in my firmware I already use ZB_SCHEDULE_APP_CALLBACK() to send an event over Zigbee.

    2. I made the change in the sample app, but the issue still happens. I will get back to you with a trace log from the sample.

    static void send_step_cmd_cb(zb_bufid_t cmd_id)
    {
    	zb_ret_t zb_err_code;
    
    	/* Allocate output buffer and send step command. */
    	zb_err_code = zb_buf_get_out_delayed_ext(light_switch_send_on_off,
    							cmd_id,
    							0);
    	if (!zb_err_code) {
    		LOG_WRN("Buffer is full");
    	}
    }
    
    static void test_send_event_handler(struct k_timer *timer)
    {
    	zb_uint16_t cmd_id;
    
    	/* Toggle button state. */
    	if (buttons_ctx.state == BUTTON_ON) {
    		buttons_ctx.state = BUTTON_OFF;
    		cmd_id = ZB_ZCL_CMD_ON_OFF_OFF_ID;
    	} else {
    		buttons_ctx.state = BUTTON_ON;
    		cmd_id = ZB_ZCL_CMD_ON_OFF_ON_ID;
    	}
    
    	zb_ret_t ret = ZB_SCHEDULE_APP_CALLBACK(send_step_cmd_cb, cmd_id);
    	if (ret != 0) {
    		LOG_WRN("Err scheduling send_step_cmd_cb");
    	}
    }

    3. In my test the send period is 5 s, and I get the "Buffer is full" warning about 200 times in about 20 minutes.

    4. I noticed that zb_buf_get_out_delayed_ext() calls zb_schedule_alarm(), as ZB_SCHEDULE_APP_CALLBACK() does. Looking at the example (if I read it correctly), the original call to zb_buf_get_out_delayed_ext() is made from the system workqueue context, so it looks like the same situation as calling it from a timer (i.e. not the ZBOSS thread).

    5. I understand that an alarm queue is full when the issue happens. Can you share when the alarm queue is emptied, and what might prevent it from being emptied?

  • 0xGael said:
    I will get back to you with a trace log from the sample.

    A trace log would help investigate the issue. 

  • zboss_trace_20260108_155224.bin

    I used the same log level mask as before:

    CONFIG_ZBOSS_TRACE_MASK=0x00000C5F
  • We’ve been trying to reproduce the issue from your latest provided code with no success, and have tried different periodic event intervals. Maybe the reason is that we are using another Nordic DK as ZC, while you use a Home Assistant one. It would be good to see if you are able to reproduce with our network coordinator sample.

    One more thing: the "Buffer is full" log shows up so often because the check condition is inverted. RET_OK is 0, so if (!zb_err_code) is true when the call succeeds; the check should be if (zb_err_code != RET_OK). I realized the Light Switch sample code has the same wrong check (if (!zb_err_code)), which should be fixed.
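The difference between the two checks can be sketched as follows. This is an illustration only: the `example_*` type and macros are local stand-ins so the snippet builds without the ZBOSS headers, and the value -1 is a hypothetical failure code.

```c
#include <stdbool.h>

/* Local stand-ins, not the ZBOSS definitions. */
typedef int example_ret_t;
#define EXAMPLE_RET_OK    0    /* success is represented by 0, like RET_OK */
#define EXAMPLE_RET_ERROR (-1) /* hypothetical failure code */

/* Check as posted: true when the code is 0, i.e. it warns on success. */
bool posted_check_warns(example_ret_t code)
{
	return !code;
}

/* Corrected check: true only when the call did not return success. */
bool fixed_check_warns(example_ret_t code)
{
	return code != EXAMPLE_RET_OK;
}
```

With the posted check, every successful buffer request would log "Buffer is full" and a failure would go unreported, which would fit the roughly 200 warnings seen in about 20 minutes at a 5 s send period.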

    What traces show is that there is a function flooding the queue, 0003f661. Can you try to find which function it is?
    arm-none-eabi-addr2line -e build/light_switch/zephyr/zephyr.elf 0003f661

  • Hello Amanda,

    Thanks for the support.

    Maybe the reason is that we are using another Nordic DK as ZC, while you use a Home Assistant one.

    I had the same suspicion and ran a test previously: I paired the light_switch with the example coordinator and didn't see a crash either.

    The point is that the coordinator example doesn't implement the switch profile. I planned to implement the switch server profile in the coordinator but had other priorities; I will do it when possible.

    I suppose you ran the same test? Did you just connect the light_switch to the coordinator and check whether it crashed (I mean, without implementing the server profile)?

    What traces show is that there is a function flooding the queue, 0003f661. Can you try to find which function it is?

    arm-none-eabi-addr2line didn't return a function name. I suppose the binary might have changed since I ran the test. I am sending you another log together with the binary that was used.

    zephyr.elf.txt

    (remove the .txt extension)

    zboss_trace_20260109_104736.bin

  • 0xGael said:
    I suppose you ran the same test? Did you just connect the light_switch to the coordinator and check whether it crashed (I mean, without implementing the server profile)?

    I just flashed regular network coordinator plus light bulb samples into two other boards.

    I'm not sure if we'll be able to reproduce the issue with that project, so it would be helpful if you could provide some traces that reproduce the crash in the meantime.

  • Hello,

    In the meantime, have you been able to resolve the address of the function that floods the queue?

  • Hi, 

    • The latest traces (provided on Jan 9, 2026) don’t show the crash, so they did not reproduce the issue.
    • zboss_trace_20260108_155224.bin from Jan 8, 2026 shows the crash, with the scheduler queue full and holding mostly function address 0003f661.

    • The ELF from Jan 9, 2026 shows that the culprit function is zb_nwk_ed_send_timeout_req, but we would need to double-check this against a matching set of build files and traces that show the crash. Function pointers have bit 0 set to 1 to indicate ARM Thumb mode, so the traced value 0003f661 corresponds to the symbol at 0003f660.
      arm-none-eabi-nm build_dk/light_switch/zephyr/zephyr.elf | grep "0003f660"
      0003f660 T zb_nwk_ed_send_timeout_req
      Detailed analysis of the traces backs up this theory.

    • For now, there are no plans to provide R22 add-on updates with new libraries. Recommendation would be to move to the R23 add-on and use zboss_use_r22_behavior() if needed. May I know why you are using the R22 add-on instead of R23?

    • A workaround on the customer side could be to increase the value of ZB_SCHEDULER_Q_SIZE from 24 to 32 or even 48.
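The Thumb-bit remark above is why the traced value 0003f661 is matched against the symbol at 0003f660. A minimal sketch of that conversion, with hypothetical helper names, could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when bit 0 is set, which marks a Thumb-mode function pointer. */
bool is_thumb_pointer(uint32_t fn_ptr)
{
	return (fn_ptr & UINT32_C(1)) != 0;
}

/* Clear bit 0 to get the address to look up in the symbol table. */
uint32_t symbol_address(uint32_t fn_ptr)
{
	return fn_ptr & ~UINT32_C(1);
}
```

For example, `symbol_address(0x0003f661)` gives `0x0003f660`, the address listed by arm-none-eabi-nm for zb_nwk_ed_send_timeout_req.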

    -Amanda H.
