Disconnect/reconnect loop when shadow messages are larger than modem 2K TLS buffer

Hi,

We're aware of the modem 2K TLS buffer limitation. It's causing us quite a bit of headache related to the device shadow.

Our application needs a number of config parameters in the shadow, but even with space-saving measures (short property keys, etc) we are running into problems. 

There are two aspects, which combine to make the situation problematic:

1. The current nRF lib implementation appears to instantly disconnect from the cloud whenever there is a shadow message that is too large. Worse, the offending shadow message is not dropped, so when the device tries again it finds itself in an endless disconnect/reconnect loop. This essentially forces the device offline because it cannot communicate any useful messages to the cloud that could help analyze or fix the situation, e.g. send debug information, do a FOTA update, etc.

2. The device cannot control the size or timing of the shadow messages the cloud sends to the device, nor can it ignore the shadow messages. So, once the cloud sends an offending shadow message, the device ends up in the above disconnect/reconnect loop that it cannot control. 

As a result, accidental or unexpectedly large shadow changes can force thousands of devices permanently offline and there is no indication as to what's wrong because the device just repeatedly disconnects before any useful information is exchanged.

The only way we have found to recover the situation is to "blindly" nuke the shadow of every affected device and hope that reduces the size of the next shadow message enough to allow the devices to re-establish and maintain a connection and get back on it's feet.

We understand there are workarounds to mitigate the problem (for example, spit shadow updates into multiple steps). But they have disadvantages and can be brittle.

It really seems the nRF lib should should be more robust in this regard. When receiving large shadow messages, instead of just disconnecting, it would be better to reject the shadow message, issue an error, and stay connected. This would allow the application to handle the error gracefully, rather than just going offline.

Curious to hear your thoughts. Maybe we are missing something?

Thanks

Parents
  • Hello  

    Sorry to hear that you are having issues with device shadow. I would like to understand a bit more this problematic:

    - Which protocol are you using?
    - Are you able to recreate this problem locally?

    Regards,

    Pascal.

  • Hi  ,

    We're using the default MQTT setup for device communication:
    CONFIG_NRF_CLOUD_MQTT=y
    CONFIG_NRF_CLOUD_MQTT_SHADOW_TRANSFORMS=y

    and we handle the shadow events as per the Nordic examples:
    NRF_CLOUD_EVT_RX_DATA_SHADOW
    NRF_CLOUD_EVT_TX_DATA_SHADOW

    >Are you able to recreate this problem locally?

    We can recreate the problem reliably on every device, but not only locally of course, because the cloud must send a shadow update that is larger than 2k to trigger the problem.

    I think you should be able to confirm this yourself fairly easily:

    1. Extend a device shadow with an application-specific "reported" section containing a number of key/value structures (like 15 structures sized 70 bytes each, which is about 1 KB in size).

    2. Then, on the cloud, issue a "desired" update where most of these key/value structures are changed --> this results in a large shadow delta.

    3. The cloud will send a large shadow delta update message to the device. 

    4. The device is forced to receive this message. Depending on the structure and changes in the shadow, the message will likely exceeds the 2k TLS modem buffer and the device disconnects immediately

    5. Since the shadow message was not acknowledged it remains in the network queue, so every time the device reconnects it will re-receive the message and disconnect again.

    Thanks

  • Hello,

    Thanks for the input, I will try to recreate the problem and let you know if I managed to do it.

    Regards,

    Pascal.

  • Thank you. Please let me know how I can help. This issue is causing us a fair amount of pain, and it will get worse as we roll out more devices. 

Reply Children
  • Hello,

    I wasn't able to reproduce the issue as you mentioned. The way I tried to reproduce it was as follow:

    1. Used the multi-service sample with NCS v3.2.99-85b0c61de8ff
    2. Modified the shadow_config.c file based on your recommendations with the next code:
      /* Copyright (c) 2024 Nordic Semiconductor ASA
       *
       * SPDX-License-Identifier: LicenseRef-Nordic-5-Clause
       */
      
      #include <zephyr/kernel.h>
      #include <zephyr/logging/log.h>
      #include <zephyr/logging/log_ctrl.h>
      #include <net/nrf_cloud_defs.h>
      
      #include "shadow_config.h"
      #if defined(CONFIG_NRF_CLOUD_COAP)
      #include "shadow_support_coap.h"
      #endif
      #include "application.h"
      
      LOG_MODULE_REGISTER(shadow_config, CONFIG_MULTI_SERVICE_LOG_LEVEL);
      
      #define TEST_COUNTER_EN	"counterEnable"
      #define APP_DATA_KEY	"appData"
      
      /* Application data structure keys (15 modules, ~70 bytes each, totaling ~1 KB) */
      #define APP_MODULE_1		"module_1"
      #define APP_MODULE_2		"module_2"
      #define APP_MODULE_3		"module_3"
      #define APP_MODULE_4		"module_4"
      #define APP_MODULE_5		"module_5"
      #define APP_MODULE_6		"module_6"
      #define APP_MODULE_7		"module_7"
      #define APP_MODULE_8		"module_8"
      #define APP_MODULE_9		"module_9"
      #define APP_MODULE_10		"module_10"
      #define APP_MODULE_11		"module_11"
      #define APP_MODULE_12		"module_12"
      #define APP_MODULE_13		"module_13"
      #define APP_MODULE_14		"module_14"
      #define APP_MODULE_15		"module_15"
      
      /* Flag to indicate that the accepted shadow data has been received (MQTT only) */
      static bool accepted_rcvd;
      
      static int add_cfg_data(struct nrf_cloud_obj *const cfg_obj)
      {
      	/* Add the test counter state */
      	return nrf_cloud_obj_bool_add(cfg_obj, TEST_COUNTER_EN, test_counter_enable_get(), false);
      }
      
      static int add_app_data(struct nrf_cloud_obj *const app_obj)
      {
      	int err = 0;
      	const char *module_keys[] = {
      		APP_MODULE_1, APP_MODULE_2, APP_MODULE_3, APP_MODULE_4, APP_MODULE_5,
      		APP_MODULE_6, APP_MODULE_7, APP_MODULE_8, APP_MODULE_9, APP_MODULE_10,
      		APP_MODULE_11, APP_MODULE_12, APP_MODULE_13, APP_MODULE_14, APP_MODULE_15
      	};
      
      	/* Add 15 application-specific structures (~70 bytes per module, totaling ~1 KB) */
      	for (int i = 0; i < 15; i++) {
      		NRF_CLOUD_OBJ_JSON_DEFINE(module_obj);
      
      		if (nrf_cloud_obj_init(&module_obj)) {
      			LOG_ERR("Failed to initialize module object %d", i + 1);
      			return -ENOMEM;
      		}
      
      		/* Add module-specific key/value fields (~70 bytes per structure) */
      		err = nrf_cloud_obj_num_add(&module_obj, "id", (double)(i + 1), false);
      		if (err) {
      			nrf_cloud_obj_free(&module_obj);
      			return err;
      		}
      
      		err = nrf_cloud_obj_bool_add(&module_obj, "active", true, false);
      		if (err) {
      			nrf_cloud_obj_free(&module_obj);
      			return err;
      		}
      
      		err = nrf_cloud_obj_num_add(&module_obj, "version", (double)(100 + i), false);
      		if (err) {
      			nrf_cloud_obj_free(&module_obj);
      			return err;
      		}
      
      		err = nrf_cloud_obj_num_add(&module_obj, "status", (double)(i % 3), false);
      		if (err) {
      			nrf_cloud_obj_free(&module_obj);
      			return err;
      		}
      
      		err = nrf_cloud_obj_num_add(&module_obj, "counter", (double)(i * 100), false);
      		if (err) {
      			nrf_cloud_obj_free(&module_obj);
      			return err;
      		}
      
      		err = nrf_cloud_obj_num_add(&module_obj, "timestamp", (double)k_uptime_get(), false);
      		if (err) {
      			nrf_cloud_obj_free(&module_obj);
      			return err;
      		}
      
      		/* Add this module object to the main app data object */
      		err = nrf_cloud_obj_object_add(app_obj, module_keys[i], &module_obj, false);
      		if (err) {
      			nrf_cloud_obj_free(&module_obj);
      			return err;
      		}
      	}
      
      	return err;
      }
      
      static int process_cfg(struct nrf_cloud_obj *const cfg_obj)
      {
      	bool val;
      	int err = nrf_cloud_obj_bool_get(cfg_obj, TEST_COUNTER_EN, &val);
      
      	if (err == 0) {
      		/* The expected key/value was found, set the test counter enable state */
      		test_counter_enable_set(val);
      	} else if (err == -ENOMSG) {
      		/* The key was found, but the value was not a boolean */
      		LOG_WRN("Invalid configuration value");
      		/* Reject the config */
      		err = -EBADF;
      	} else {
      		LOG_DBG("Expected data not found in config object");
      		err = -ENOMSG;
      	}
      
      	return err;
      }
      
      static int send_config(void)
      {
      	int err;
      
      	NRF_CLOUD_OBJ_JSON_DEFINE(root_obj);
      	NRF_CLOUD_OBJ_JSON_DEFINE(state_obj);
      	NRF_CLOUD_OBJ_JSON_DEFINE(reported_obj);
      	NRF_CLOUD_OBJ_JSON_DEFINE(cfg_obj);
      	NRF_CLOUD_OBJ_JSON_DEFINE(app_obj);
      
      	if (nrf_cloud_obj_init(&cfg_obj) || nrf_cloud_obj_init(&reported_obj) ||
      	    nrf_cloud_obj_init(&state_obj) || nrf_cloud_obj_init(&root_obj) || 
      		nrf_cloud_obj_init(&app_obj)) {
      		err = -ENOMEM;
      		goto cleanup;
      	}
      
      	/* Add the supported configuration data */
      	err = add_cfg_data(&cfg_obj);
      	if (err) {
      		goto cleanup;
      	}
      
      #if defined(CONFIG_NRF_CLOUD_MQTT)
      
      	/* Add application-specific reported data */
      	err = add_app_data(&app_obj);
      	if (err) {
      		goto cleanup;
      	}
      
      	/* Add config to reported */
      	err = nrf_cloud_obj_object_add(&reported_obj, NRF_CLOUD_JSON_KEY_CFG, &cfg_obj, false);
      	if (err) {
      		goto cleanup;
      	}
      
      	/* Add app data to reported */
      	err = nrf_cloud_obj_object_add(&reported_obj, NRF_CLOUD_JSON_KEY_CFG, &app_obj, false);
      	if (err) {
      		goto cleanup;
      	}
      
      	/* Add reported to state */
      	err = nrf_cloud_obj_object_add(&state_obj, NRF_CLOUD_JSON_KEY_REP, &reported_obj, false);
      	if (err) {
      		goto cleanup;
      	}
      
      	/* Add state to the root object */
      	err = nrf_cloud_obj_object_add(&root_obj, NRF_CLOUD_JSON_KEY_STATE, &state_obj, false);
      	if (err) {
      		goto cleanup;
      	}
      
      	/* Send to the cloud */
      	err = nrf_cloud_obj_shadow_update(&root_obj);
      #else  /* CONFIG_NRF_CLOUD_COAP */
      
      	/* Add config to root */
      	err = nrf_cloud_obj_object_add(&root_obj, NRF_CLOUD_JSON_KEY_CFG, &cfg_obj, false);
      	if (err) {
      		goto cleanup;
      	}
      
      	err = shadow_support_coap_obj_send(&root_obj, false);
      #endif
      
      cleanup:
      	nrf_cloud_obj_free(&root_obj);
      	nrf_cloud_obj_free(&state_obj);
      	nrf_cloud_obj_free(&reported_obj);
      	nrf_cloud_obj_free(&cfg_obj);
      	nrf_cloud_obj_free(&app_obj);
      	return err;
      }
      
      void shadow_config_cloud_connected(void)
      {
      	accepted_rcvd = false;
      }
      
      int shadow_config_reported_send(void)
      {
      	LOG_INF("Sending reported configuration");
      
      	int err = send_config();
      
      	if (err) {
      		LOG_ERR("Failed to send configuration, error: %d", err);
      	}
      
      	return err;
      }
      
      int shadow_config_delta_process(struct nrf_cloud_obj *const delta_obj)
      {
      	if (!delta_obj) {
      		return -EINVAL;
      	}
      
      	if ((delta_obj->type != NRF_CLOUD_OBJ_TYPE_JSON) || !delta_obj->json) {
      		/* No state JSON */
      		return -ENOMSG;
      	}
      
      	/* If there is a pending delta event when the device establishes a cloud connection
      	 * it is possible that it will be received before the accepted shadow data.
      	 * Do not process a delta event until the accepted shadow data has been received.
      	 * This is only a concern for MQTT.
      	 */
      	if (!accepted_rcvd && IS_ENABLED(CONFIG_NRF_CLOUD_MQTT)) {
      		return -EAGAIN;
      	}
      
      	int err;
      
      	NRF_CLOUD_OBJ_JSON_DEFINE(cfg_obj);
      
      	/* Get the config object */
      	err = nrf_cloud_obj_object_detach(delta_obj, NRF_CLOUD_JSON_KEY_CFG, &cfg_obj);
      	if (err == -ENODEV) {
      		/* No config in the delta */
      		return -ENOMSG;
      	}
      
      	/* Process the configuration */
      	err = process_cfg(&cfg_obj);
      
      	if (err == -EBADF) {
      		/* Clear incoming config and replace it with a good one */
      		nrf_cloud_obj_free(&cfg_obj);
      		nrf_cloud_obj_init(&cfg_obj);
      		if (add_cfg_data(&cfg_obj)) {
      			LOG_ERR("Failed to create delta response");
      		}
      	}
      
      	/* Add the config object back into the state so the response can be sent */
      	if (nrf_cloud_obj_object_add(delta_obj, NRF_CLOUD_JSON_KEY_CFG, &cfg_obj, false)) {
      		nrf_cloud_obj_free(&cfg_obj);
      		return -ENOMEM;
      	}
      
      	return err;
      }
      
      int shadow_config_accepted_process(struct nrf_cloud_obj *const accepted_obj)
      {
      	if (!accepted_obj) {
      		return -EINVAL;
      	}
      
      	/* The accepted shadow has been received */
      	accepted_rcvd = true;
      
      	if ((accepted_obj->type != NRF_CLOUD_OBJ_TYPE_JSON) || !accepted_obj->json) {
      		/* No config JSON */
      		return -ENOMSG;
      	}
      
      	return process_cfg(accepted_obj);
      }
      
    3. Enabled the Shadow Transforms and debug log level for nRF_Cloud module in the prj.conf
      CONFIG_NRF_CLOUD_MQTT_SHADOW_TRANSFORMS=y
      CONFIG_NRF_CLOUD_LOG_LEVEL_DBG=y
    4. Compiled it targeted for a nRF9151DK
      west build -p -b nrf9151dk/nrf9151/ns
    5. Onboard it to my account.
    6. Changed the logLvl to a different value so it is reported to the device the desired value.

    As I couldn't see any issues neither reset in the serial log, I proceeded to update the Modem Firmware from 2.0.2 to the latest 2.0.4. The updated succeed and no issues were reported in the serial log. 

    Let me know if I missed any step to try to reproduce your issue.

    Regards,

    Pascal.

  • Hi  ,

    Thanks. When you write

    > Changed the logLvl to a different value so it is reported to the device the desired value.

    Sorry, could you explain? Did you only change logLv in the desired or any of the other app properties?

    The problem is only triggered if the cloud issues a large shadow delta update message to the device. Typically, that means multiple of the properties (APP_MODULE_1, _2, _3, etc) must be changed in desired *at the same time*. Changing only one property (like logLvl) will not create a large enough shadow delta message. Or maybe I misunderstand what you tested.

    Thanks 

  • Hello,

    The logLvl can be changed here: 

    A few questions:

    Which mechanism are you using to change the shadow properties?

    Are you using the nRF9151? Which Modem firmware? Which NCS version?

    Regards,

    Pascal.

  • Hi,

    Ok, logLvl, I see. Thanks.

    As I mentioned, you have to change several properties in desired at once to trigger this.

    You can use any method to update "desired", for example the nRF Cloud UI "View Config" functionality. There, in the "desired" change a number of properties, for example 10 properties with 4 key/value-pairs each:

            "config": {
                "cfg": {
                    "property0": {
                        "property0-key1": "value11",
                        "property0-key2": "value12",
                        "property0-key3": "value13",
                        "property0-key4": "value14"
                    },
                    "property2": {
                        "property2-key1": "value21",
                        "property2-key2": "value22",
                        "property2-key3": "value23",
                        "property2-key4": "value24"
                    },
                    ...
                    "property9": {
                        "property9-key1": "value91",
                        "property9-key2": "value92",
                        "property9-key3": "value93",
                        "property9-key4": "value94"
                    }
                }
            }

    Then, "Commit".

    This "desired" is very different from "reported", so the cloud will now issue a large shadow delta update to the device, which should trigger the problem (it doesn't really matter what the current "reported" of the device is, if "desired" is large and different the problem will trigger)

    Our current HW/SW:

    • nRF SDK 2.9.0-7787b2649840
    • nRF9151 LACA A0A
    • mfw_nrf91x1_2.0.2
      Thanks
  • Hello,

    I was able to reproduce the issue. We are investigating a solution, I will contact you as soon we know something.

    Regards,

    Pascal.

Related