Zboss "crash" in zb_schedule_alarm()

Hello,

I'm using the zboss stack to implement a zigbee end device on nrf5340.

My device is able to connect to a zigbee network and to notify sensor events on zigbee but after a random time a crash happens on the stack side :

I'm quite stuck to debug this issue.

it seems that the stack stops because of an error executing zb_schedule_alarm() but I have no more clues.

How do you debug such issue ?

Environment :

* nRF5340

* NRF_SDK_VERSION=v1.0.1
  NRF_SDK_NAME=ncs-zigbee-r22
  NRF_MAIN_SDK_VERSION=v2.9.0

Regards,

Gaël

  • Hello,

    I reproduce the same issue on nRF5340DK + example light_switch + send of periodic event.

    My network controller is Home assistant + ZHA + controller  EFR32MG21 based (firmare EZSP).

    To send periodic events (every 500 ms) :

    static void alarm_timers_init(void)
    {
    	k_timer_init(&buttons_ctx.alarm, light_switch_button_handler, NULL);
    	k_timer_init(&bulb_ctx.find_alarm, find_light_bulb_alarm, NULL);
    	k_timer_init(&periodic_send_evt_ctx.alarm, test_send_event_handler, NULL);
    }
    
    //...
    
    static void test_send_event_handler(struct k_timer *timer)
    {
    	zb_ret_t zb_err_code;
    	zb_uint16_t cmd_id;
    
    	/* toogle button state */
    	if (buttons_ctx.state == BUTTON_ON) {
    		buttons_ctx.state = BUTTON_OFF;
    		cmd_id = ZB_ZCL_CMD_ON_OFF_OFF_ID;
    	} else {
    		buttons_ctx.state = BUTTON_ON;
    		cmd_id = ZB_ZCL_CMD_ON_OFF_ON_ID;
    	}
    
    	/* Allocate output buffer and send step command. */
    	zb_err_code = zb_buf_get_out_delayed_ext(light_switch_send_on_off,
    							cmd_id,
    							0);
    	if (!zb_err_code) {
    		LOG_WRN("Buffer is full");
    	}
    }
    
    //...
    void zboss_signal_handler(zb_bufid_t bufid)
    {
            //...
        case ZB_BDB_SIGNAL_STEERING:
     		ZB_ERROR_CHECK(zigbee_default_signal_handler(bufid));
     		if (status == RET_OK) {
     			/* Check the light device address. */
    #if 0
     			if (bulb_ctx.short_addr == 0xFFFF) {
     				k_timer_start(&bulb_ctx.find_alarm,
     					      MATCH_DESC_REQ_START_DELAY,
     					      MATCH_DESC_REQ_TIMEOUT);
     			}
    #endif
    			/* TEST : start timer to do periodic send over zigbee */
    			LOG_INF("Start periodic send event");
    			k_timer_start(&periodic_send_evt_ctx.alarm, BUTTON_LONG_POLL_TMO,
    			      BUTTON_LONG_POLL_TMO);
            //...
    }

    To pair with HA, the following option is enabled :

    CONFIG_ZIGBEE_CHANNEL_SELECTION_MODE_MULTI=y

  • Hello Nordic Team,

    any feedback on this issue ? Are you able to reproduce ?

    I tried to change the period of the send of event ([50ms...5s]) and the issue happens in about 20 min every time.

  • Hi Gaël

    The team has decoded your provided traces and they show the exact line in zb_schedule_alarm where the assert occurs. The reason is that the alarm queue is filled up, reaching ZB_SCHEDULER_Q_SIZE.

    A reason might be that zb_buf_get_out_delayed_ext from the provided code is being called inside interrupt context. zb_buf_get_out_delayed_ext calls zb_schedule_alarm internally and race conditions might occur.

    A safer approach would be to use ZB_SCHEDULE_APP_CALLBACK to schedule the test_send_event_handler actions using ZBOSS context.

    Regards,
    Amanda H.

  • Hello Amanda,

    thanks for your feedback.

    1. The trace I sent you was from my firmware (not the sample app) and in my firmware I already use ZB_SCHEDULE_APP_CALLBACK() to send an event on zigbee

    2. I did the change on the sample app...but the issue still happen. I get back to you with a trace log on the sample.

    static void send_step_cmd_cb(zb_bufid_t cmd_id)
    {
    	zb_ret_t zb_err_code;
    
    	/* Allocate output buffer and send step command. */
    	zb_err_code = zb_buf_get_out_delayed_ext(light_switch_send_on_off,
    							cmd_id,
    							0);
    	if (!zb_err_code) {
    		LOG_WRN("Buffer is full");
    	}
    }
    
    static void test_send_event_handler(struct k_timer *timer)
    {
            zb_uint16_t cmd_id;
    
    	/* toogle button state */
    	if (buttons_ctx.state == BUTTON_ON) {
    		buttons_ctx.state = BUTTON_OFF;
    		cmd_id = ZB_ZCL_CMD_ON_OFF_OFF_ID;
    	} else {
    		buttons_ctx.state = BUTTON_ON;
    		cmd_id = ZB_ZCL_CMD_ON_OFF_ON_ID;
    	}
    
    	zb_ret_t ret = ZB_SCHEDULE_APP_CALLBACK(send_step_cmd_cb, cmd_id);
    	if (ret != 0) {
    		LOG_WRN("Err scheduling send_step_cmd_cb");
    	}
    }

    3. In my test the send period is 5s and I get the warning "Buffer is full" (~200 times in ~20 minutes)

    3. I noticed zb_buf_get_out_delayed_ext was calling zb_schedule_alarm, has ZB_SCHEDULE_APP_CALLBACK does. And looking at the example (If I read correctly), the original call to zb_buf_get_out_delayed_ext() is done in the context of the system workqueue context, so it looks like the same context as using timer i.e. not the zboss thread).

    4. I understand an alarm queue is filled when the issue happens, can you share when does the alarm queue is emptied ? And what might prevent the queue to be emptied ?

  • 0xGael said:
    I get back to you with a trace log on the sample.

    A trace log would help investigate the issue. 

Related