Zigbee application crash - ERROR 29 [RET_OVERFLOW] at [...]zigbee_app_utils.c:849

The application crashes with the following log:

[06:45:22.727,691] <err> zigbee_app_utils: ERROR 29 [RET_OVERFLOW] at WEST_TOPDIR/nrf/subsys/zigbee/lib/zigbee_app_utils/zigbee_app_utils.c:849
[06:45:22.727,691] <err> zboss_osif: ZBOSS fatal error occurred
The relevant lines from zigbee_app_utils.c are:
 
#if defined CONFIG_ZIGBEE_ROLE_END_DEVICE
    wait_for_user_input = false;
    is_rejoin_start_scheduled = false;

    zb_ret_t zb_err_code = ZB_SCHEDULE_APP_ALARM(
        stop_network_rejoin,
        ZB_TRUE,
        ZB_MILLISECONDS_TO_BEACON_INTERVAL(
            ZB_DEV_REJOIN_TIMEOUT_MS));
    ZB_ERROR_CHECK(zb_err_code);
#endif
It seems the alarm cannot be scheduled due to RET_OVERFLOW.

- The implementation of the signal handler is here: https://github.com/Rogger/schneggi-sensor/blob/64b2408f1053a15622c618249d887ca7694eb250/src/main.c#L704; it also contains the call to the app utils library.
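
For reference, the handler follows roughly the standard pattern from the Zigbee samples; this is only a simplified sketch, not my exact code (see the linked main.c for the real implementation):

void zboss_signal_handler(zb_bufid_t bufid)
{
    /* Forward the signal to the nRF Zigbee application utilities, which
     * implement the default behaviour (joining, rejoin scheduling, etc.). */
    ZB_ERROR_CHECK(zigbee_default_signal_handler(bufid));

    /* If a buffer was passed with the signal, it has to be freed. */
    if (bufid) {
        zb_buf_free(bufid);
    }
}
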
Context:
SDK & toolchain in use: 2.5.0
Any help or guidance is highly appreciated, thank you in advance
Best Regards
Michael
  • Hi,

    Could you get me the network key for this trace?

    Regards,

    Elfving

  • Hi,

    Sure, the key: ffb12ed4228949b66f1a72143a263808 (normal byte order).

    Thanks!

    If you are getting an overflow in ZB_SCHEDULE_APP_ALARM, it would seem that too many things are being scheduled in the ZBOSS queue. It might help to implement the workaround for issue KRKNWK-12017 (even though you aren't facing that exact issue), since something strange seems to be going on with the joining: you are getting an overflow in the function that is supposed to start the rejoin.

    If that doesn't help, you could also try increasing the scheduler queue. For that you'd have to use a custom memory config file, like include/zb_mem_config_custom.h in the light switch sample. To do that, copy all the configs from the memory config file you are already using (I assume you are using either zb_mem_config_min.h or zb_mem_config_med.h). In the custom memory file, find this section:

    /* Now if you REALLY know what you do, you can study zb_mem_config_common.h and redefine some configuration parameters, like:
    #undef ZB_CONFIG_SCHEDULER_Q_SIZE
    #define ZB_CONFIG_SCHEDULER_Q_SIZE 56
    */

    Uncomment the #undef/#define lines and increase ZB_CONFIG_SCHEDULER_Q_SIZE. It can be set anywhere between 48 and 256.
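
    For reference, the custom file could end up looking roughly like this. This is only a minimal sketch; the placeholder comment stands for the role and size defines, which should be copied verbatim from the memory config file you use today:

    /* zb_mem_config_custom.h */
    #ifndef ZB_MEM_CONFIG_CUSTOM_H
    #define ZB_MEM_CONFIG_CUSTOM_H

    /* ... role and network-size defines copied from zb_mem_config_min.h
       or zb_mem_config_med.h go here ... */

    #include "zb_mem_config_common.h"

    /* Override the scheduler queue size (valid range 48..256). */
    #undef ZB_CONFIG_SCHEDULER_Q_SIZE
    #define ZB_CONFIG_SCHEDULER_Q_SIZE 128

    #endif /* ZB_MEM_CONFIG_CUSTOM_H */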

    Regards,

    Elfving

  • Hi Elfving,

    Thank you very much for your help, much appreciated. I implemented the workaround for KRKNWK-12017. I will report back if it helps.

    Best Regards
    MR

  • Hi Elfving,

    I implemented the workaround for KRKNWK-12017 and increased the Q_SIZE to 128.

    The rejoin functionality now seems to work correctly, but unfortunately the fatal error still occurs. It just takes longer to crash than before, in my case about 33 hours.

    [33:38:13.675,354] <dbg> zboss_osif: zigbee_event_poll: Received new Zigbee event: 0x01
    [33:38:13.750,335] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 67 23  7c 00 39 03 1f 00 00 00 |......g# |.9.....
                                         31 53 03 00 3a a8 78 00  00 00 00 00             |1S..:.x. ....    
    --- 117 messages dropped ---
    [33:38:13.750,396] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 68 23  7c 00 39 03 36 00 00 00 |......h# |.9.6...
                                         31 53 03 00 41 a8 78 00  00 00 00 00             |1S..A.x. ....    
    [33:38:13.750,488] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 69 23  7c 00 39 03 5e 00 00 00 |......i# |.9.^...
                                         31 53 03 00 48 a8 78 00  00 00 00 00             |1S..H.x. ....    
    [33:38:13.750,549] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 6a 23  7c 00 39 03 21 00 00 00 |......j# |.9.!...
                                         31 53 03 00 4f a8 78 00  00 00 00 00             |1S..O.x. ....    
    [33:38:13.750,640] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 6b 23  7c 00 39 03 64 00 00 00 |......k# |.9.d...
                                         31 53 03 00 57 a8 78 00  00 00 00 00             |1S..W.x. ....    
    [33:38:13.750,732] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 6c 23  7c 00 39 03 0b 00 00 00 |......l# |.9.....
                                         31 53 03 00 5e a8 78 00  00 00 00 00             |1S..^.x. ....    
    [33:38:13.750,793] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 6d 23  7c 00 39 03 51 00 00 00 |......m# |.9.Q...
                                         31 53 03 00 65 a8 78 00  00 00 00 00             |1S..e.x. ....    
    [33:38:13.750,885] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 6e 23  7c 00 39 03 2d 00 00 00 |......n# |.9.-...
                                         31 53 03 00 6c a8 78 00  00 00 00 00             |1S..l.x. ....    
    [33:38:13.750,946] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 6f 23  7c 00 39 03 01 00 00 00 |......o# |.9.....
                                         31 53 03 00 74 a8 78 00  00 00 00 00             |1S..t.x. ....    
    [33:38:13.751,037] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 70 23  7c 00 39 03 15 00 00 00 |......p# |.9.....
                                         31 53 03 00 7a a8 78 00  00 00 00 00             |1S..z.x. ....    
    [33:38:13.751,129] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 71 23  7c 00 39 03 35 00 00 00 |......q# |.9.5...
                                         31 53 03 00 82 a8 78 00  00 00 00 00             |1S....x. ....    
    [33:38:13.751,190] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 72 23  7c 00 39 03 7c 00 00 00 |......r# |.9.|...
                                         31 53 03 00 89 a8 78 00  00 00 00 00             |1S....x. ....    
    [33:38:13.751,281] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 73 23  7c 00 39 03 00 00 00 00 |......s# |.9.....
                                         01 ef 03 00 9c d9 78 00  00 00 00 00             |......x. ....    
    [33:38:13.751,342] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 1a 02 13 a5 74 23  7c 00 39 03 05 00 00 00 |......t# |.9.....
                                         29 2a 02 00 40 78 79 00  00 00 00 00             |)*..@xy. ....    
    [33:38:13.751,434] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 0a 02 13 a5 75 23  7c 00 98 03             |......u# |...    
    [33:38:13.751,495] <dbg> zboss_osif: zb_trace_msg_port_do: 
                                         de ad 12 02 13 a5 76 23  60 08 4c 00 7c 00 00 00 |......v# `.L.|...
                                         99 03 00 00                                      |....             
    [33:38:13.751,525] <err> zboss_osif: ZBOSS fatal error occurred

    I attached the Wireshark capture. Search for the "rejoin request"; this is when the device restarted after the fatal error.
    290124.zip

    What else could I try?

    Thank you very much
    MR

  • Hi Rogger, 

    I am struggling with a similar issue (still without a solution), but I took a while to look at your code.

    Modify your sensor_loop callback as follows: 

    static void sensor_loop(zb_bufid_t bufid)
    {
        static uint16_t cycles = 0;

        LOG_INF("Loop");
        update_sensor_values();
        LOG_DBG("%d", cycles % BATTERY_SLEEP_CYCLES);
        if (cycles % BATTERY_SLEEP_CYCLES == 0) {
            update_battery();
            cycles = 0;
        }
        cycles++;
        LOG_INF("Sleep for %d seconds", SLEEP_INTERVAL_SECONDS);

        /* ADDON START: cancel any previously scheduled sensor_loop alarm
         * so repeated scheduling cannot pile up in the ZBOSS queue. */
        ZB_SCHEDULE_APP_ALARM_CANCEL(sensor_loop, ZB_ALARM_ANY_PARAM);
        /* ADDON END */

        ZB_SCHEDULE_APP_ALARM(sensor_loop, bufid,
                              ZB_MILLISECONDS_TO_BEACON_INTERVAL(
                                  SLEEP_INTERVAL_SECONDS * 1000));
    }



    I do not know if it will help, but I suggest giving it a try.
    Please share the results afterwards ;) 

  • Hello pwpot,

    Have you created another DevZone case for this issue yourself?

    Regards,

    Elfving

  • Hi Elfving,

    I am going to, but at the moment I cannot provide any detailed info about the issue, as the device is already on the client's site.

    My conclusions are based on some LED blinks and on my own experience.
    Some time ago I made a mistake in code which resulted in a ZBOSS queue overflow:
    I was scheduling one of the callbacks multiple times without proper cancellation (a minimal sketch of the safe pattern is at the end of this post).
    As a result, the ZBOSS scheduler stopped working and the queue never recovered from that state.

    In my case the problem is also related to rejoin, and it happens less often than in Rogger's code: days, sometimes weeks, not hours.
    I assume this is because Rogger schedules a ZBOSS callback very often, using it as his main loop, whereas I do it mostly during rejoin.

    I am not sure whether ZBOSS deadlocks or something similar, and whether that can be detected by a classic watchdog.

    Is it possible to increase ZB_CONFIG_SCHEDULER_Q_SIZE and still be FOTA compatible?
    The Developing with ZBOSS guide says:
    "The memory configuration must not be changed between firmware versions if the firmware is going to be upgraded using the OTA DFU."
    I think this is very generic info and maybe ZB_CONFIG_SCHEDULER_Q_SIZE is not part of that.
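
    To illustrate what I mean, below is roughly the guarded re-scheduling pattern I use now; the callback name, parameter, and timeout are just placeholders for illustration:

    static void schedule_rejoin_retry(void)
    {
        /* Cancel any pending instance of the callback before scheduling a
         * new one, so repeated calls cannot pile up in the ZBOSS queue. */
        ZB_SCHEDULE_APP_ALARM_CANCEL(my_rejoin_cb, ZB_ALARM_ANY_PARAM);

        zb_ret_t err = ZB_SCHEDULE_APP_ALARM(
            my_rejoin_cb, 0,
            ZB_MILLISECONDS_TO_BEACON_INTERVAL(RETRY_TIMEOUT_MS));
        ZB_ERROR_CHECK(err);
    }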

  • Hi Pwpot,

    Thanks for the tip; unfortunately, it did not help.

  • Hello and sorry about the wait,

    Have you had any development on this case from your side, or are you still facing the same issue?

    Regards,

    Elfving
