LwM2M: GET Requests Experiencing Timeouts After a Certain Duration

Hello,

We are currently verifying the nRF LwM2M Client, focusing specifically on the Queue Mode functionality. During our testing, we encountered an issue.

We configured the LWM2M_QUEUE_MODE_UPTIME parameter to 90 seconds. However, we noticed that after only 30 seconds, all GET operations from the server begin to time out.

For testing we used our own code and the official Nordic sample. The results were the same.

Here are the modifications we made to the official Nordic sample:

diff --git a/samples/cellular/lwm2m_client/prj.conf b/samples/cellular/lwm2m_client/prj.conf
index 800cf3aa8..a0c1f5a2e 100644
--- a/samples/cellular/lwm2m_client/prj.conf
+++ b/samples/cellular/lwm2m_client/prj.conf
@@ -88,7 +88,7 @@ CONFIG_FOTA_DOWNLOAD=y
 CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION="1.0.0"
 
 # Set LwM2M Server IP address here
-CONFIG_LWM2M_CLIENT_UTILS_SERVER="coaps://leshan.eclipseprojects.io:5684"
+CONFIG_LWM2M_CLIENT_UTILS_SERVER="coap://51.138.77.106:5687"
 
 # Application Event Manager
 CONFIG_APP_EVENT_MANAGER=y
@@ -106,21 +106,21 @@ CONFIG_LWM2M_DTLS_CID=y
 # When DTLS CID is used, we can keep the socket open.
 # If the server is not supporting CID, CONFIG_LWM2M_RD_CLIENT_SUSPEND_SOCKET_AT_IDLE should
 # be used instead.
-CONFIG_LWM2M_RD_CLIENT_STOP_POLLING_AT_IDLE=y
-
+#CONFIG_LWM2M_RD_CLIENT_STOP_POLLING_AT_IDLE=y
+CONFIG_LWM2M_RD_CLIENT_LISTEN_AT_IDLE=y
 # Enable TLS session caching to prevent doing a full TLS handshake when recovering the session
 CONFIG_LWM2M_TLS_SESSION_CACHING=y
 
 # Sets the duration that the lwm2m engine will be polling for data after transmission before
 # the socket is closed.
 # Adjust so that we can detach from network in 30 seconds
-CONFIG_LWM2M_QUEUE_MODE_UPTIME=30
+CONFIG_LWM2M_QUEUE_MODE_UPTIME=61
 
 # Set lifetime of 12 hours
 CONFIG_LWM2M_ENGINE_DEFAULT_LIFETIME=43200
 
 # Do registration update after 5400 seconds (90 minutes)
-CONFIG_LWM2M_UPDATE_PERIOD=5400
+CONFIG_LWM2M_UPDATE_PERIOD=59
 CONFIG_LWM2M_SECONDS_TO_UPDATE_EARLY=60
 
 # Configure PSM mode
diff --git a/samples/cellular/lwm2m_client/src/main.c b/samples/cellular/lwm2m_client/src/main.c
index fe1c00010..24e936268 100644
--- a/samples/cellular/lwm2m_client/src/main.c
+++ b/samples/cellular/lwm2m_client/src/main.c
@@ -65,7 +65,7 @@ static enum client_state {
 	NETWORK_ERROR	/* Client network error handling. Client stop and modem reset */
 } client_state = START;
 
-static uint8_t endpoint_name[ENDPOINT_NAME_LEN + 1];
+static uint8_t endpoint_name[ENDPOINT_NAME_LEN + 1]={"localhost123"};
 static uint8_t imei_buf[IMEI_LEN + sizeof("\r\nOK\r\n")];
 static struct lwm2m_ctx client = {0};
 static bool reconnect;
@@ -601,9 +601,9 @@ int main(void)
 	}
 
 	/* use IMEI as unique endpoint name */
-	snprintk(endpoint_name, sizeof(endpoint_name), "%s%.*s", CONFIG_APP_ENDPOINT_PREFIX,
-		 IMEI_LEN, imei_buf);
-	LOG_INF("endpoint: %s", (char *)endpoint_name);
+	// snprintk(endpoint_name, sizeof(endpoint_name), "%s%.*s", CONFIG_APP_ENDPOINT_PREFIX,
+	// 	 IMEI_LEN, imei_buf);
+	// LOG_INF("endpoint: %s", (char *)endpoint_name);
 
 	/* Setup LwM2M */
 	ret = lwm2m_setup();

To investigate this issue further, we used Wireshark to capture traffic on both the server and client sides. Below are our observations:

In the screenshot below, we can see the Wireshark logs from the server side:

  • The client successfully registers with the LwM2M server.
  • A GET operation is performed 15 seconds after the last activity, which completes successfully.
  • A second GET operation is attempted 36 seconds after the last activity, but it times out.
  • Subsequent re-registrations with the LwM2M server are successful.

The server appears to be functioning as per the LwM2M protocol specifications.

lwm2m_issue_serverside_v4.pcapng

In the first screenshot below, we can see the client successfully registering with the LwM2M server.

In the second screenshot, the server performs a GET operation 15 seconds after the last activity, which completes successfully.

In the third screenshot, we observe the modem waking up, likely triggered by downlink messaging. However, no UDP packets are received. This event coincides with the server's transmission 36 seconds after the last activity.

For reference, here are the relevant timestamps from the server-side logs:

lwm2m_issue_clientside_v2.pcapng

The root cause of this issue is still unclear.

Initially, I considered that it might be related to CG-NAT timing out the UDP tunnel, as our operator assigns us an IP address from the CG-NAT address space (see details below).

+CGDCONT: 0,"IPV4V6","omnitel","100.85.211.247 2A00:1EB8:C1DF:C63F:0000:0000:490E:94AC",0,0
OK

However, if CG-NAT were the cause, it seems unlikely that the network would still signal to the modem that there is pending downlink traffic.
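
The IPv4 address in the +CGDCONT response can be checked against 100.64.0.0/10, the shared address space that RFC 6598 reserves for carrier-grade NAT. A minimal sketch in C follows; the helper name is_cgnat_ipv4 is hypothetical and not part of any Nordic or Zephyr API, and it expects only the IPv4 part of the field, without the IPv6 address that follows it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns true if the dotted-quad IPv4 string lies in
 * 100.64.0.0/10, the shared address space reserved for CG-NAT (RFC 6598).
 * Returns false for malformed input.
 */
bool is_cgnat_ipv4(const char *addr)
{
	unsigned int a, b, c, d;
	char trailing;

	assert(addr != NULL);

	/* Require exactly four numeric fields and nothing after them. */
	if (sscanf(addr, "%u.%u.%u.%u%c", &a, &b, &c, &d, &trailing) != 4) {
		return false;
	}
	if (a > 255 || b > 255 || c > 255 || d > 255) {
		return false;
	}

	uint32_t ip = ((uint32_t)a << 24) | ((uint32_t)b << 16) |
		      ((uint32_t)c << 8) | (uint32_t)d;

	/* A /10 prefix means the top 10 bits must match those of 100.64.0.0. */
	return (ip & 0xFFC00000u) == 0x64400000u;
}
```

Calling is_cgnat_ipv4("100.85.211.247") returns true, which matches the statement that the operator hands out an address from the CG-NAT range. It says nothing about how long the operator keeps a UDP mapping alive.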

Any thoughts on what might be causing this?

  • > likely triggered by downlink messaging.

    Could you verify that by testing without "GET 2"? If you see the wake-up only when GET 2 is sent, and not otherwise, then the assumption is verified. Otherwise it's just a NAT issue.

  • As noted above, we’ve observed that sending a message within 30 seconds of a previous operation, such as a registration or a GET request, consistently succeeds. The issue appears to be that the network interrupts the DRX cycle to signal downlink traffic, but no data actually arrives. My assumption is that if CG-NAT were dropping the tunnel, we wouldn’t be woken up at all. Could you clarify whether this interpretation is accurate?

    Alternatively, if this behavior is intentional, do you have experience regarding how long CG-NAT typically keeps the tunnel open? If the timeout is as short as 30 seconds, it would appear that LwM2M using CoAP may not be a suitable choice for Device Management; yet it seems to be quite popular in IoT.

  • > The issue appears to be that the network interrupts the DRX cycle to signal downlink traffic, but no data actually arrives. 

    That's the assumption. But what happens if the "second GET" (after the longer quiet phase) is not sent? If you still see the RRC activity without data, then it's not caused by downlink data.

  • > using CoAP may not be a suitable choice for Device Management

    That depends much more on your assumptions. In general, if a device is power constrained, it's important to have short active times and long sleep times. Also, if the device communicates in challenging radio environments, it's important to use few and short radio messages; otherwise the probability of failure rises.

    Systems that "require" pushing data to the device will be in conflict with the "long sleep". But if your device isn't power constrained (and maybe the SIM card is also not data-volume constrained), then you may use whatever protocol you want and push data to that always-awake device at that "high cost".

    But in many use cases, the trade-off between those costs (energy + data) and the ability to "push" data to the device does not favor the "push". In quite a lot of use cases it's OK that the device always initiates the communication and the server is only able to send data back for a short interval. That limit comes not from the NAT (which also imposes one) but from the energy consumption. You will need to define how long the device should wait for data before it goes to sleep, and then adapt the usage of RAI accordingly (signalling to the modem what to expect). In that scenario, waiting for more than 30 s seems to be a waste of energy in the first place.

    So, in my experience, quite a lot of systems mainly use frequent "system alive/heartbeat" messages (e.g. every hour) sent to the server to indicate the system's health. That gives the server the chance to send something back. With that, a Thingy:91 runs from its battery for a year, exchanging a message every hour.

  • > That's the assumption. But what happens if the "second GET" (after the longer quiet phase) is not sent? If you still see the RRC activity without data, then it's not caused by downlink data.

    If no data is sent from the LwM2M server, the DRX is not interrupted. We tested the following intervals: 15 seconds (data received), 31 seconds (no data), 45 seconds (no data), 60 seconds (no data), and 85 seconds (no data). In each case, DRX was interrupted, but only at the 15-second interval did we receive IP data. In all other cases no socket events were set.
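
The interval tests above bracket the point at which downlink data stops arriving. A minimal sketch in C of that bookkeeping follows. It assumes a single cutoff (every shorter idle time succeeds, every longer one fails), which the reported runs are consistent with but do not prove. The names probe, cutoff_bracket, bracket_cutoff and reported_runs are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One test run: idle time before the server's GET, and whether IP data arrived. */
struct probe {
	int idle_s;
	bool data_received;
};

/* The cutoff lies in (last_ok_s, first_fail_s]. -1 means "no such run". */
struct cutoff_bracket {
	int last_ok_s;
	int first_fail_s;
};

/* Find the longest idle time that still worked and the shortest that failed. */
struct cutoff_bracket bracket_cutoff(const struct probe *runs, size_t n)
{
	struct cutoff_bracket b = { .last_ok_s = -1, .first_fail_s = -1 };

	assert(runs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (runs[i].data_received) {
			if (runs[i].idle_s > b.last_ok_s) {
				b.last_ok_s = runs[i].idle_s;
			}
		} else if (b.first_fail_s < 0 || runs[i].idle_s < b.first_fail_s) {
			b.first_fail_s = runs[i].idle_s;
		}
	}
	return b;
}

/* Intervals reported above: only the 15 s run delivered IP data. */
static const struct probe reported_runs[] = {
	{ 15, true }, { 31, false }, { 45, false }, { 60, false }, { 85, false },
};
```

For the runs listed, bracket_cutoff(reported_runs, 5) gives last_ok_s = 15 and first_fail_s = 31, so additional tests between 15 s and 31 s would narrow down where delivery stops.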

  • So, if you don't send data, then "DRX is not interrupted". And if you send data, it's interrupted but after a quiet phase the modem doesn't receive the data?

    Then I guess you will need a modem trace (not only the IP capture), and someone from Nordic may need to check that.
