Random ZBOSS fatal error occurs after the coordinator has been running for a while

I am running an nRF5340 as a Zigbee coordinator. After a while I noticed that the batteries of my end devices don't last as long as expected. After debugging, I found that a random ZBOSS fatal error causes the coordinator to restart once it has been running for a while. There are no other logs, just that single message and then the restart. This makes all end devices resend their on-connect messages when they rediscover the coordinator, and that drains the battery. Unfortunately, I did not find any way to enable more debug logging. All there is are the trace logs, which I cannot read myself and have to send to you so that you can decode them and maybe get something from them. Is there a straightforward way to find the likely reason for this fatal error, instead of trying one assumption after another until it stops happening? I have no idea how to begin debugging this, because I don't see an obvious way to inspect what is happening inside the ZBOSS stack.

  • Hello,

    What exactly does the log say before it restarts? 

    And please upload the trace logs. I can forward them to our Zigbee team, so that they can decode them and have a look.

    What SDK are you using? nRF5 SDK for Thread and Zigbee, or NCS? And in either case, which version?

    How long does it usually take before the issue occurs? 

    Have you tried capturing a sniffer trace of the issue? If not, that could also be a good idea. You can use the nRF Sniffer for 802.15.4. It will show if anything particular is going on before the issue occurs. 

    Best regards,

    Edvin

  • I'm using NCS. It takes about 30 minutes for the issue to happen. The logs say nothing, just "zboss fatal error occurred" and then a reset. I'm using 2.3.0 on the coordinator (an nRF5340) and 2.1.0 on the end device. The reason they don't match is that every new update of Zephyr keeps bricking all the older versions, so we stick to the same versions. I traced the issue a bit more, and shortly before the error happens the end device becomes unreachable even though it is still running perfectly. I checked, and the end device does not generate a network leave signal and does not attempt to rejoin, so the end device does not even know that the coordinator cannot reach it. The coordinator generates an NLME signal 0x06, which means packets are being dropped because the end device is unreachable. I don't know what the reason is. More annoyingly, I have not been able to debug any Zigbee-related issue since the start of this project; I have to guess what is going on. I will use a sniffer to capture what is happening.

  • Hello,

    Can you please try to add:

    CONFIG_ZBOSS_ERROR_PRINT_TO_LOG=y

    to your prj.conf, and see what the log says then?

    Is there a way for me to reproduce what you are seeing? Are you able to reproduce the issue with a couple of nRF5340 DKs?

    Best regards,

    Edvin

  • That was already enabled, and there was still no output. I found the cause and fixed it by disabling all the ZBOSS-related functions and re-enabling them one by one, which is not the fastest way to debug an issue. I found a buffer allocation call that was returning an invalid Zigbee buffer without raising an error, and the buffer still passed every if-statement check I had on it. The problem was that nothing failed the first time this happened; the code just kept going until Zigbee crashed somewhere else because of the invalid memory address. I fixed it by making this specific buffer allocation deferred instead of immediate. I still don't know why this specific buffer causes an issue. If I could go back in time, I would pick a different Zigbee backend instead of nRF.
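
For anyone landing here with the same symptom, the difference between immediate and deferred buffer allocation described above can be sketched without the ZBOSS headers. This is a stand-alone toy pool, not ZBOSS code and not the poster's actual fix: every name in it (`buf_get_now`, `buf_get_deferred`, `buf_free`, `on_buf_ready`, `BUF_INVALID`, the pool and queue sizes) is hypothetical. In ZBOSS itself the corresponding calls are, as far as I know, `zb_buf_get_out()` and `zb_buf_get_out_delayed()`, so check the API reference for your NCS version before relying on that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE   2   /* hypothetical pool size */
#define QUEUE_SIZE  4   /* hypothetical number of waiting requests */
#define BUF_INVALID 0   /* hypothetical "no buffer" id; valid ids start at 1 */

typedef uint8_t buf_id_t;
typedef void (*buf_cb_t)(buf_id_t buf);

static uint8_t  in_use[POOL_SIZE];
static buf_cb_t pending[QUEUE_SIZE];   /* FIFO of callbacks waiting for a buffer */
static size_t   pending_head, pending_count;

/* Immediate allocation: returns BUF_INVALID when the pool is exhausted.
 * The caller must check the result before touching the buffer. */
static buf_id_t buf_get_now(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return (buf_id_t)(i + 1);
        }
    }
    return BUF_INVALID;
}

/* Deferred allocation: the callback only ever receives a valid buffer.
 * It runs right away if a buffer is free (a real stack would normally
 * schedule it instead), otherwise it waits in the queue.
 * Returns 0 on success, -1 if the wait queue is full. */
static int buf_get_deferred(buf_cb_t cb)
{
    buf_id_t buf = buf_get_now();

    if (buf != BUF_INVALID) {
        cb(buf);
        return 0;
    }
    if (pending_count == QUEUE_SIZE) {
        return -1;
    }
    pending[(pending_head + pending_count) % QUEUE_SIZE] = cb;
    pending_count++;
    return 0;
}

/* Release a buffer. An invalid or already-free id fails loudly here
 * instead of silently corrupting the pool and crashing later. */
static void buf_free(buf_id_t buf)
{
    assert(buf != BUF_INVALID && buf <= POOL_SIZE && in_use[buf - 1]);

    if (pending_count > 0) {
        /* Hand the buffer straight to the oldest waiting request. */
        buf_cb_t cb = pending[pending_head];
        pending_head = (pending_head + 1) % QUEUE_SIZE;
        pending_count--;
        cb(buf);
        return;
    }
    in_use[buf - 1] = 0;
}

/* Example consumer: remembers the last buffer it was granted. */
static buf_id_t last_granted = BUF_INVALID;

static void on_buf_ready(buf_id_t buf)
{
    last_granted = buf;
}
```

With both buffers taken, `buf_get_now()` returns `BUF_INVALID`, which is easy to miss if the check on the result is wrong. `buf_get_deferred(on_buf_ready)` instead parks the request, and `on_buf_ready` runs with a valid buffer as soon as someone calls `buf_free()`.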
