
Frequent mesh device power cycle makes it unable to communicate after a while

We are using mesh SDK v3.2.0, nRF SDK 15.3, nRF52840 DK and SES as our IDE.

We had a mesh network with around 5 devices and after a while 2 of them just stopped sending data to the network, while being perfectly able to receive from it.

After debugging one of those devices, we found that the condition m_net_state.seqnum < m_net_state.seqnum_max_available in the net_state_seqnum_alloc function in net_state.c was no longer satisfied, so net_state_seqnum_alloc was returning NRF_ERROR_FORBIDDEN without any visible error.
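To make the failure mode concrete, here is a minimal sketch of the kind of guard described above. It is not the SDK source: the struct, the function name and the `SKETCH_` return codes are hypothetical stand-ins, and only the comparison itself is taken from what we saw in net_state.c.

```c
#include <stdint.h>

/* Stand-ins for the SDK return codes; the values are local to this sketch. */
#define SKETCH_SUCCESS          0u
#define SKETCH_ERROR_FORBIDDEN  1u

/* Hypothetical, simplified view of the two fields used by the check. */
typedef struct {
    uint32_t seqnum;               /* next sequence number to hand out */
    uint32_t seqnum_max_available; /* first sequence number not yet reserved */
} sketch_net_state_t;

/* Hands out one sequence number, or refuses when none is left. */
static uint32_t sketch_seqnum_alloc(sketch_net_state_t *s, uint32_t *p_seqnum)
{
    if (s->seqnum < s->seqnum_max_available) {
        *p_seqnum = s->seqnum++;
        return SKETCH_SUCCESS;
    }
    /* Nothing is logged here: unless the caller checks the return value,
     * the message is simply never sent. */
    return SKETCH_ERROR_FORBIDDEN;
}
```

Once the two fields are equal, every call fails in the same way, which would match a device that can still receive but never transmits.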

After searching around DevZone, we learned about the iv_index and the iv_update procedure. Basically, the iv_index extends the sequence number of the mesh messages leaving a sender device, so that the sequence number space is not exhausted. Under certain conditions, any device in the network can trigger an iv_update procedure, which increments the iv_index on the whole mesh network and then resets the sequence number to 0.

From there, we also understood that the sequence number is only written to each device's flash every 8192 increments, and that each time the device power cycles, it "jumps" 8192 sequence numbers ahead of the stored one. This makes sense, since it avoids wearing out the flash.
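The block behaviour described above can be modelled roughly as follows. This is a simplified sketch of our understanding, not the SDK implementation: all names are hypothetical, and the only figure taken from the stack is the block size of 8192.

```c
#include <stdint.h>

#define SKETCH_BLOCK_SIZE 8192u  /* block size mentioned above */

typedef struct {
    uint32_t flash_allocated_up_to; /* persisted: first seqnum not yet reserved */
    uint32_t seqnum;                /* RAM only: next seqnum to use */
} sketch_node_t;

/* Reserve the next block and persist the new bound (one flash write). */
static void sketch_block_allocate(sketch_node_t *n)
{
    n->flash_allocated_up_to += SKETCH_BLOCK_SIZE;
}

/* Power cycle: the RAM counter is lost, so resume at the persisted bound
 * and reserve a fresh block. The unused rest of the old block is skipped. */
static void sketch_boot(sketch_node_t *n)
{
    n->seqnum = n->flash_allocated_up_to;
    sketch_block_allocate(n);
}

/* Use one sequence number; reserve a new block when the current one is used up. */
static uint32_t sketch_send(sketch_node_t *n)
{
    uint32_t used = n->seqnum++;
    if (n->seqnum == n->flash_allocated_up_to) {
        sketch_block_allocate(n);
    }
    return used;
}
```

In this model, a node that sends three messages and then reboots resumes at 8192, so the reboot costs almost a full block no matter how little was actually sent.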

This leads to an easy-to-understand problem: power cycling devices frequently exhausts the available sequence numbers very quickly. We think that is what happened to the failing devices over 3 or 4 days (about a month ago). We needed to power cycle them once every couple of minutes during work hours (we have since changed that), so maybe around 1000 times in total.
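As a rough back-of-the-envelope check, the sketch below assumes that every reboot skips one full block of 8192 and that the sequence number is the 24-bit value defined by Bluetooth Mesh. The function names are made up for illustration, and real traffic between reboots would add to these figures.

```c
#include <stdint.h>

#define SKETCH_SEQNUM_SPACE (1u << 24) /* 24-bit sequence number: 16,777,216 values */
#define SKETCH_BLOCK_SIZE   8192u      /* sequence numbers skipped per reboot (assumed) */

/* Sequence numbers consumed by reboots alone, capped at the full space. */
static uint32_t sketch_burned_by_reboots(uint32_t reboots)
{
    uint64_t burned = (uint64_t)reboots * SKETCH_BLOCK_SIZE;
    return burned > SKETCH_SEQNUM_SPACE ? SKETCH_SEQNUM_SPACE : (uint32_t)burned;
}

/* Number of reboots that uses up the whole space with no traffic at all. */
static uint32_t sketch_reboots_until_exhausted(void)
{
    return SKETCH_SEQNUM_SPACE / SKETCH_BLOCK_SIZE;
}
```

Under these assumptions, 1000 reboots burn 8,192,000 sequence numbers, roughly half of the space, and 2048 reboots exhaust it completely within a single iv_index.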

We understand the importance of the iv_index. What we don't understand is why a device that sees its sequence number approaching the limit does not trigger an iv_update procedure, which would avoid hitting the maximum allowed sequence number right off the bat after rebooting. To me, at least, it does not make sense that a signal level device consuming less than 10 watts or so has software problems from rebooting too frequently. I feel like we're missing something, or that we misunderstood something from reading posts on this forum and the Bluetooth documentation. Shouldn't an iv_update be issued if the device reboots and loads a high enough sequence number from flash memory?

Thank you,

Rúben Marques

  • Hi,

    The sequence number can only be reset after the end of the IV update procedure. One of the requirements of the Mesh specification is "If the node is added to a network when the network is in Normal operation, then it shall operate in Normal operation for at least 96 hours." Because of this, the mesh stack will not trigger an internal IV update procedure until at least 96 hours have passed.

    Regarding your issue, there is a timer that monitors the IV update procedure and prevents it from being triggered if less than 96 hours have passed. When the device resets, this timer also resets. As a result, a node that resets often never requests an IV update procedure (which would reset the sequence number) and ends up running out of sequence numbers. This issue has been fixed in Mesh SDK v4.0, where we store the timer to flash every 30 minutes. That way, if the node gets reset, the timer continues counting from the last saved time and can still trigger an IV Index update after approximately the correct 96 hours.

    Is it possible to upgrade to our latest Mesh SDK (v4.2)?

    This might also be useful.

  • Hello Mttrinh, you've been helpful, thank you.

    We need to review the changes we made to the Nordic SDK before we can update the Mesh SDK, but it is good to know that this work will let us benefit from the fix in 4.0.
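The timer behaviour described in the first reply can be illustrated with a small simulation. This is only a model built from the figures quoted there (96 hours, saved every 30 minutes); the function and its parameters are hypothetical, and it does not reproduce how the Mesh SDK actually implements the timer.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MIN_NORMAL_MINUTES    (96u * 60u) /* 96 hours in Normal operation */
#define SKETCH_SAVE_INTERVAL_MINUTES 30u         /* save period quoted for v4.0 */

/* Simulates `cycles` power cycles of `uptime_minutes` each and reports whether
 * the timer ever reaches 96 hours. With persist == false the timer restarts
 * from zero on every boot; with persist == true it resumes from the last save. */
static bool sketch_iv_update_allowed(uint32_t uptime_minutes, uint32_t cycles, bool persist)
{
    uint32_t saved = 0u;
    for (uint32_t i = 0u; i < cycles; i++) {
        uint32_t timer = persist ? saved : 0u;
        for (uint32_t m = 0u; m < uptime_minutes; m++) {
            timer++;
            if (persist && (timer % SKETCH_SAVE_INTERVAL_MINUTES) == 0u) {
                saved = timer; /* periodic save to flash */
            }
            if (timer >= SKETCH_MIN_NORMAL_MINUTES) {
                return true;
            }
        }
    }
    return false;
}
```

In this model, a node that runs for 45 minutes between resets never reaches 96 hours without persistence, but does with it. A node that runs for only 2 minutes between resets never reaches a save point, so it would still be stuck in this model; whether the real stack behaves the same way is not something this sketch can confirm.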
