Disconnect/reconnect loop when shadow messages are larger than modem 2K TLS buffer

Hi,

We're aware of the modem 2K TLS buffer limitation. It's causing us quite a bit of headache related to the device shadow.

Our application needs a number of config parameters in the shadow, but even with space-saving measures (short property keys, etc.) we are running into problems.

There are two aspects, which combine to make the situation problematic:

1. The current nRF lib implementation appears to instantly disconnect from the cloud whenever there is a shadow message that is too large. Worse, the offending shadow message is not dropped, so when the device tries again it finds itself in an endless disconnect/reconnect loop. This essentially forces the device offline because it cannot communicate any useful messages to the cloud that could help analyze or fix the situation, e.g. send debug information, do a FOTA update, etc.

2. The device cannot control the size or timing of the shadow messages the cloud sends to the device, nor can it ignore the shadow messages. So, once the cloud sends an offending shadow message, the device ends up in the above disconnect/reconnect loop that it cannot control. 

As a result, accidental or unexpectedly large shadow changes can force thousands of devices permanently offline and there is no indication as to what's wrong because the device just repeatedly disconnects before any useful information is exchanged.

The only way we have found to recover from this situation is to "blindly" nuke the shadow of every affected device and hope that this reduces the size of the next shadow message enough for the devices to re-establish and maintain a connection and get back on their feet.

We understand there are workarounds to mitigate the problem (for example, splitting shadow updates into multiple steps). But they have disadvantages and can be brittle.
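To illustrate the kind of workaround we mean, here is a minimal cloud-side sketch of splitting one large "desired" update into several smaller ones. It is only a sketch under assumptions: the function names, the byte budget, and the example keys are hypothetical, and it measures only the compact JSON of the properties themselves, not the full message the cloud eventually sends to the device (which may well be larger).

```python
import json

# Assumed budget, chosen well below the 2 KB modem TLS buffer to leave
# headroom for whatever the cloud adds around the properties it sends.
MAX_UPDATE_BYTES = 1024


def serialized_size(desired: dict) -> int:
    """Size in bytes of `desired` when encoded as compact UTF-8 JSON."""
    return len(json.dumps(desired, separators=(",", ":")).encode("utf-8"))


def split_desired(desired: dict, max_bytes: int = MAX_UPDATE_BYTES) -> list:
    """Greedily group top-level keys into batches that each fit max_bytes."""
    batches = []
    current = {}
    for key, value in desired.items():
        # A single property that is already too large cannot be split here.
        if serialized_size({key: value}) > max_bytes:
            raise ValueError(f"property {key!r} alone exceeds {max_bytes} bytes")
        candidate = {**current, key: value}
        if current and serialized_size(candidate) > max_bytes:
            batches.append(current)      # close the full batch
            current = {key: value}       # start a new one with this key
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical config: 15 entries of roughly 70 bytes each.
    config = {f"p{i:02d}": {"v": "x" * 60} for i in range(15)}
    for batch in split_desired(config, max_bytes=512):
        print(len(batch), "keys,", serialized_size(batch), "bytes")
```

This also shows why the approach is brittle: each batch presumably has to be applied and reported back by the device before the next one is sent, otherwise the pending delta keeps growing and the split achieves nothing.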

It really seems the nRF lib should be more robust in this regard. When receiving a shadow message that is too large, instead of just disconnecting, it would be better to reject the message, issue an error, and stay connected. This would allow the application to handle the error gracefully, rather than just going offline.

Curious to hear your thoughts. Maybe we are missing something?

Thanks

  • Hello,

    I have asked the Cloud team if they could provide some insight, but since we are closely approaching Christmas, this may not happen until after the New Year.

    Best regards,

    Michal

  • Hi Michal,

    No worries, thank you. We are currently working around this, but we're looking for a more robust and permanent solution, so we can wait. Happy Holidays!

  • Hello,

    Sorry to hear that you are having issues with the device shadow. I would like to understand the problem a bit better:

    - Which protocol are you using?
    - Are you able to recreate this problem locally?

    Regards,

    Pascal.

  • Hi,

    We're using the default MQTT setup for device communication:
    CONFIG_NRF_CLOUD_MQTT=y
    CONFIG_NRF_CLOUD_MQTT_SHADOW_TRANSFORMS=y

    and we handle the shadow events as per the Nordic examples:
    NRF_CLOUD_EVT_RX_DATA_SHADOW
    NRF_CLOUD_EVT_TX_DATA_SHADOW

    >Are you able to recreate this problem locally?

    We can recreate the problem reliably on every device, but not purely locally, of course, because the cloud must send a shadow update that is larger than 2K to trigger the problem.

    I think you should be able to confirm this yourself fairly easily:

    1. Extend a device shadow with an application-specific "reported" section containing a number of key/value structures (for example, 15 structures of 70 bytes each, which is about 1 KB in total).

    2. Then, on the cloud, issue a "desired" update where most of these key/value structures are changed --> this results in a large shadow delta.

    3. The cloud will send a large shadow delta update message to the device. 

    4. The device is forced to receive this message. Depending on the structure and changes in the shadow, the message will likely exceed the 2K modem TLS buffer, and the device disconnects immediately.

    5. Since the shadow message was not acknowledged, it remains in the network queue, so every time the device reconnects it will receive the message again and disconnect again.

    Thanks

  • Hello,

    Thanks for the input. I will try to recreate the problem and let you know if I manage to do so.

    Regards,

    Pascal.
