nrf mesh SDK 2.1.1

Hello, I have questions about nRF5 SDK for Mesh v2.1.1.

We have a street light in production that is based on light_switch_server_nrf52832_xxAA_s132_6_0_0 from nRF5 SDK for Mesh v2.1.1.

When we tested communication under heavy traffic, we had a problem with changing the IV index, because the timeout in the function static inline bool iv_timeout_limit_passed(uint32_t timeout) was set to 96 hours. In our tests we need the IV index to change more often, because we update the firmware remotely over mesh. So we changed this timeout to 4 minutes. This works well on a small test mesh network (3 devices): the IV index changed and communication worked properly.

But we have also deployed this software on a larger mesh network (more than 80 devices). In this network the devices communicated without any problem for the first 2-3 weeks after provisioning, but after that devices randomly stop communicating and no longer respond to commands. Commands are sent to the devices in this network permanently, so the communication traffic is heavy.

Communication is based on a command-response principle.

When a device stops communicating, the server does not communicate with the client until a power reset. After a power reset, the device starts communicating with the client again.

Is it possible that this happens because we changed the time in static inline bool iv_timeout_limit_passed(uint32_t timeout) from 96 hours to 4 minutes?
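For illustration, here is a minimal sketch of how a timeout gate of this kind could keep the production limit intact while a test build shortens it through a separate switch. This is not the SDK source: the constant MIN_IV_UPDATE_INTERVAL_MINUTES, the flag g_iv_test_mode and the function body are assumptions made for this example. The idea follows the IV Update test mode that the Bluetooth mesh specification defines for lifting the 96-hour limit during testing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: the 96-hour minimum expressed in minutes. */
#define MIN_IV_UPDATE_INTERVAL_MINUTES (96u * 60u)

/* Hypothetical flag: meant to be set in test builds only, never in production. */
static bool g_iv_test_mode = false;

/* Returns true once enough time has passed to allow an IV update state change.
 * In test mode the time limit is bypassed entirely. */
static inline bool iv_timeout_limit_passed(uint32_t timeout_minutes)
{
    return g_iv_test_mode || (timeout_minutes >= MIN_IV_UPDATE_INTERVAL_MINUTES);
}
```

With a structure like this, the 96-hour value itself is never edited, so a production image cannot accidentally ship with a 4-minute limit.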

Has anyone seen the same problem, where a device stops communicating? Any advice on how we can fix this bug? Moving to a newer SDK version is not possible at the moment, because the devices are already installed.

Parents
  • Hello, I have now tested the issue.

    The devices do not stop responding to commands because of a bad sequence number or IV index. The IV index was 0, and the sequence number was too low to trigger an IV index change.

    We had 4 devices that did not respond to commands. Every node (server) has a built-in function: if it does not receive any command for more than 5 minutes, the device becomes unprovisioned. When a device becomes unprovisioned, the light blinks (OFF-ON-OFF-ON).

    With the master (client) disconnected for 5 minutes, all lights were OFF. After 5 minutes all lights blinked and went to the unprovisioned state. But the 4 devices that did not respond to commands did not blink and did not become unprovisioned; they stayed OFF. After this I power-cycled these 4 devices and waited 5 minutes: the devices blinked and went to the unprovisioned state.

    The watchdog is enabled in our program.

    I checked every place where the device could freeze, and in all of these loops the WDT resets the device.

    The SoftDevice s132_nrf52_6.0.0_softdevice is also present. Is it possible that the device freezes somewhere in the SoftDevice?

    If the device enters the HardFault interrupt and no hardfault_handler is defined, does the device freeze?
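The 5-minute inactivity rule described in this post can be sketched roughly as follows. This is a hypothetical structure, not the code running on the lamps: node_state_t, node_on_command() and node_poll() are names invented for the example, and the millisecond timestamps are assumed to come from some free-running tick counter.

```c
#include <stdbool.h>
#include <stdint.h>

#define INACTIVITY_LIMIT_MS (5u * 60u * 1000u)  /* 5 minutes */

typedef struct {
    uint32_t last_cmd_ms;  /* timestamp of the last received command */
    bool     provisioned;  /* true while the node is part of the network */
} node_state_t;

/* Call whenever a valid command arrives. */
static void node_on_command(node_state_t *n, uint32_t now_ms)
{
    n->last_cmd_ms = now_ms;
}

/* Call periodically. Returns true only at the moment the node
 * transitions to the unprovisioned state. */
static bool node_poll(node_state_t *n, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct across tick counter wrap-around. */
    if (n->provisioned &&
        (uint32_t)(now_ms - n->last_cmd_ms) >= INACTIVITY_LIMIT_MS) {
        n->provisioned = false;
        return true;
    }
    return false;
}
```

In a design like this, the transition only happens if node_poll() keeps being called. A lamp that neither blinks nor becomes unprovisioned after 5 minutes would be consistent with the polling context itself no longer running.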

Children
  • Further questions:

    1. The WDT interrupt priority is set with WDT_CONFIG_IRQ_PRIORITY 7. The program consists of the SoftDevice, a bootloader and the main program.

    The WDT is started in the bootloader, which then jumps to the main program, where the watchdog is fed. Every loop where the main program could freeze was tested, and the WDT resets the device.

    The topic https://devzone.nordicsemi.com/f/nordic-q-a/29788/soft-device-assert/118164#118164 describes:

    4. It depends on the SoftDevice if it will continue or not (even though it really should stop). The SoftDevice will disable all interrupts, and whatever happens after the assertion is undefined behavior. ---->>> If the device stops and does not continue, the WDT is off because all interrupts are disabled, so it is not possible for the WDT to reset the device, right?

    Is there any other way to reset the device in this state if the WDT does not reset it? It would have to be by software, because a power-down or external reset is not possible.

    2. We currently use Mesh SDK 2.1.1 and s132_nrf52_6.0.0_softdevice on the nRF52832_AA chip. If we move to a newer version of the Mesh SDK and SoftDevice, is it possible that the same problem will occur there as well? If so, the only way to make it work correctly is to add an external watchdog on the PCB that checks whether the device is working or has frozen somewhere, for example in the SoftDevice.
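One common way to make watchdog feeding more meaningful is to feed only when every activity that must stay alive has recently checked in. The sketch below shows that pattern under stated assumptions: the task bits, task_checkin() and wdt_feed_if_healthy() are hypothetical names, and g_feed_count merely stands in for reloading the hardware watchdog.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task identifiers: one bit per activity that must stay alive. */
enum {
    TASK_MAIN_LOOP  = 1u << 0,
    TASK_TIMER_TICK = 1u << 1,
    TASK_ALL        = TASK_MAIN_LOOP | TASK_TIMER_TICK
};

static volatile uint32_t g_alive_mask;  /* bits set by tasks since last feed */
static uint32_t g_feed_count;           /* stand-in for the hardware reload */

/* Each monitored activity calls this when it has made progress. */
static void task_checkin(uint32_t task_bit)
{
    g_alive_mask |= task_bit;
}

/* Feeds the watchdog only if every monitored task has checked in
 * since the previous feed; otherwise the watchdog is left to expire. */
static bool wdt_feed_if_healthy(void)
{
    if ((g_alive_mask & TASK_ALL) != TASK_ALL) {
        return false;
    }
    g_alive_mask = 0;
    g_feed_count++;
    return true;
}
```

Compared with feeding unconditionally in one loop, this catches the case where one part of the firmware is stuck while another part keeps running and keeps feeding.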

  • Hi,

    1. Behaviour after SoftDevice assert depends on the type of build. If it is a debug build, then it enters an infinite loop. That makes it easy to fetch the details about the assert from a debug session. If it is a release build, then it resets the device. At least that is the default behaviour from our SDK. If you have changed that in code, then the behaviour may be different.

    2. It may be a good idea to upgrade to a later SDK+SoftDevice, as there have been bug fixes and features added since nRF5 SDK for Mesh v2.1.1. However, I think the issue that you are seeing is not something we have seen in the SDK, and is most likely related to your particular implementation.

    Regarding the changes to iv_timeout_limit_passed(), please note that this is part of the Bluetooth mesh stack. You are not supposed to change the Bluetooth mesh stack, and the qualifications we have done for the stack are only valid if you leave it unchanged. I suspect that something related to the issues was introduced by your changes to the stack. In particular, the change from a 96-hour timeout to a 4-minute timeout sounds like a recipe for disaster. What other stack changes have you made?

    The mechanism that you mention, for making the device go into unprovisioned state if it has not received any commands for five minutes, how is that one implemented?

    How are you checking the device in all possible freeze locations, and what do you do on that check?

    Freezes inside of the SoftDevice are highly unlikely. I have not yet seen any issue of freezing inside of SoftDevice, and if it happened then your watchdog should trigger. As previously mentioned, asserts inside of the SoftDevice would go to the fault handler, and the default fault handling in the SDK is to reset the device (at least in release builds).

    Regards,
    Terje
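The debug versus release behaviour described in point 1 can be sketched as a small policy function. This is a simplified illustration, not the SDK fault handler: fault_action_t and fault_policy() are hypothetical names, and on real hardware the reset branch would typically end in the CMSIS call NVIC_SystemReset().

```c
#include <stdbool.h>

/* What the fault handler does after a SoftDevice assert or application error. */
typedef enum {
    FAULT_ACTION_HALT,   /* debug build: stay in an infinite loop for inspection */
    FAULT_ACTION_RESET   /* release build: reset the device */
} fault_action_t;

/* Hypothetical policy mirroring the default behaviour described above. */
static fault_action_t fault_policy(bool debug_build)
{
    return debug_build ? FAULT_ACTION_HALT : FAULT_ACTION_RESET;
}
```

The practical consequence is that the build type flashed onto the deployed devices decides whether an assert ends in a reset or in a halted CPU that only the watchdog or a power cycle can recover.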
