Disconnect/reconnect loop when shadow messages are larger than the modem's 2K TLS buffer

Hi,

We're aware of the modem's 2K TLS buffer limitation. It's causing us quite a few headaches with the device shadow.

Our application needs a number of config parameters in the shadow, but even with space-saving measures (short property keys, etc.) we are running into problems.

There are two aspects, which combine to make the situation problematic:

1. The current nRF lib implementation appears to instantly disconnect from the cloud whenever there is a shadow message that is too large. Worse, the offending shadow message is not dropped, so when the device tries again it finds itself in an endless disconnect/reconnect loop. This essentially forces the device offline because it cannot communicate any useful messages to the cloud that could help analyze or fix the situation, e.g. send debug information, do a FOTA update, etc.

2. The device cannot control the size or timing of the shadow messages the cloud sends to the device, nor can it ignore the shadow messages. So, once the cloud sends an offending shadow message, the device ends up in the above disconnect/reconnect loop that it cannot control. 

As a result, accidental or unexpectedly large shadow changes can force thousands of devices permanently offline and there is no indication as to what's wrong because the device just repeatedly disconnects before any useful information is exchanged.

The only way we have found to recover from this is to "blindly" nuke the shadow of every affected device and hope that this reduces the size of the next shadow message enough for the devices to re-establish and maintain a connection and get back on their feet.

We understand there are workarounds to mitigate the problem (for example, splitting shadow updates into multiple steps). But they have disadvantages and can be brittle.
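
A minimal sketch of what such splitting could look like on the cloud side, in Python. Everything here is an assumption for illustration: the `{"state": {"desired": ...}}` document layout, the 1024-byte budget (chosen to leave headroom below 2K for topic, metadata and framing), and the function names are hypothetical and not part of any nRF library.

```python
import json

# Assumed budget, well under the 2K modem TLS buffer. Tune for your setup.
MAX_UPDATE_BYTES = 1024


def wrap(desired):
    """Build a shadow update document for a partial desired state (assumed layout)."""
    return {"state": {"desired": desired}}


def encoded_size(desired):
    """Size in bytes of the compact JSON encoding of the wrapped update."""
    return len(json.dumps(wrap(desired), separators=(",", ":")).encode("utf-8"))


def split_shadow_update(desired, max_bytes=MAX_UPDATE_BYTES):
    """Greedily pack top-level keys into updates no larger than max_bytes."""
    updates = []
    current = {}
    for key, value in desired.items():
        # A single property that cannot fit on its own cannot be split here.
        if encoded_size({key: value}) > max_bytes:
            raise ValueError(f"property {key!r} alone exceeds {max_bytes} bytes")
        candidate = {**current, key: value}
        if current and encoded_size(candidate) > max_bytes:
            # Adding this key would overflow: close the current update.
            updates.append(wrap(current))
            current = {key: value}
        else:
            current = candidate
    if current:
        updates.append(wrap(current))
    return updates
```

This also shows why the approach is brittle: it only bounds the size of each update as sent. If the device has not yet acknowledged an earlier step, the pending changes may accumulate, and the next message to the device can be oversized anyway.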

It really seems the nRF lib should be more robust in this regard. When receiving a large shadow message, instead of just disconnecting, it would be better to reject the message, issue an error, and stay connected. This would allow the application to handle the error gracefully, rather than just going offline.
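
To make the requested behavior concrete, here is a rough model of it in Python. It is not nRF library code: the 4-byte length prefix is a hypothetical framing that stands in for whatever length information the real protocol provides, and `read_messages` and `on_error` are made-up names. It also says nothing about whether the modem's TLS layer would permit skipping data this way.

```python
import io
import struct

RX_BUFFER_BYTES = 2048  # the 2K limit discussed above


def read_messages(stream, max_bytes=RX_BUFFER_BYTES, on_error=print):
    """Yield payloads that fit in the buffer; skip oversized ones and keep reading."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of stream
        (length,) = struct.unpack(">I", header)  # hypothetical length prefix
        if length > max_bytes:
            # Drain the oversized payload in buffer-sized chunks without storing it.
            remaining = length
            while remaining > 0:
                chunk = stream.read(min(remaining, max_bytes))
                if not chunk:
                    return  # stream ended mid-message
                remaining -= len(chunk)
            # Report the problem to the application, then carry on.
            on_error(f"dropped oversized message: {length} bytes > {max_bytes}")
            continue
        payload = stream.read(length)
        if len(payload) < length:
            return  # stream ended mid-message
        yield payload
```

The point is the control flow: the oversized message is discarded and reported, and later messages on the same connection are still delivered, so the application can still send debug information or accept a FOTA update.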

Curious to hear your thoughts. Maybe we are missing something?

Thanks
