Help with LTE connection after loss of connection

We have noticed that several of our nRF9160 in production have problem with reconnecting to the LTE-M network after temporarily losing connection. We have tried reproducing this issue by deactivating the sim for about 10 seconds and then activating it again. After the default six retries it gives up and stop trying to reconnect to the network. the only way we can achieve a connection is if we do a sys_reboot() or a hard reboot with watchdog. Why is the reconnect attempts not enough? Why do we have to force a reboot? Attached are relevant logs.

  • I also tried to replicate this behaviour on nRF9160DK and with nRF Connect SDK 2.6.0. In order to cause the problem, I had to turn off the DTLS Connection-Identifier so NAT timeout will happen.

    I think I found something which I just saw on your logs.

    uart:~$ lwm2m update
    [00:03:21.185,638] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:06:20.157,104] <wrn> net_lwm2m_rd_client: Registration Update Timeout
    [00:06:20.157,226] <inf> lwm2m_rai: RAI disabled
    [00:06:20.157,257] <dbg> app_lwm2m_client: rd_client_event: Registration update failure!
    [00:06:20.166,778] <inf> net_lwm2m_rd_client: RD Client started with endpoint 'urn:imei:351358814369747' with client lifetime 43200 using server object 0
    [00:06:21.534,088] <inf> net_lwm2m_engine: Connected, sock id 2
    [00:06:21.536,285] <dbg> app_lwm2m_client: rd_client_event: Disconnected
    [00:06:21.536,407] <inf> net_lwm2m_rd_client: Stop LWM2M Client: urn:imei:351358814369747
    [00:06:21.622,833] <inf> app_lwm2m_client: Client connect to server
    [00:06:21.622,894] <inf> net_lwm2m_rd_client: Start LWM2M Client: urn:imei:351358814369747
    [00:06:21.623,443] <inf> net_lwm2m_rd_client: RD Client started with endpoint 'urn:imei:351358814369747' with client lifetime 43200 using server object 0
    [00:06:21.748,229] <inf> app_lwm2m_client: LwM2M is connecting to server
    [00:06:22.174,072] <inf> net_lwm2m_engine: Connected, sock id 2
    [00:06:22.270,721] <dbg> app_lwm2m_client: rd_client_event: Registration complete
    [00:06:22.270,874] <inf> net_lwm2m_rd_client: Registration Done (EP='rxUng0f9N8')
    

    There is no reason to stop the client.
    This has been fixed in https://github.com/nrfconnect/sdk-nrf/pull/15338

    LwM2M client recovers from update timeout.

    Here is a log of the 2.6.0 + the fix

    uart:~$ lwm2m update
    [00:01:19.919,128] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:04:22.022,033] <wrn> net_lwm2m_rd_client: Registration Update Timeout
    [00:04:22.022,155] <inf> lwm2m_rai: RAI disabled
    [00:04:22.022,155] <dbg> app_lwm2m_client: rd_client_event: Registration update failure!
    [00:04:22.030,853] <inf> net_lwm2m_rd_client: RD Client started with endpoint 'urn:imei:351358814369747' with client lifetime 43200 using server object 0
    [00:04:22.169,616] <inf> app_lwm2m_client: LwM2M is connecting to server
    [00:04:23.408,752] <inf> net_lwm2m_engine: Connected, sock id 2
    [00:04:23.501,373] <dbg> app_lwm2m_client: rd_client_event: Registration complete
    [00:04:23.501,495] <inf> net_lwm2m_rd_client: Registration Done (EP='G2ut7hCAmu')
    [00:04:23.587,158] <inf> app_lwm2m_client: LwM2M is connected to server
    [00:04:23.587,219] <inf> app_lwm2m_client: Obtained date-time from modem
    [00:04:53.411,499] <inf> lwm2m_rai: RAI enabled
    [00:04:53.411,529] <dbg> app_lwm2m_client: rd_client_event: Queue mode RX window closed
    

    2024-10-14 nRF9160 LwM2M Update failure, NAT timeout.pcapng2024-10-14 nRF9160 LwM2M Update failure, NAT timeout plus fix.pcapng

    I'll try to attach the wireshark captures for you to check if this is the behaviour that you are seeing. PSK key is e0e7f1368e66c05f3511b5fa4671f56e

  • Thank you. It is great that the client recovers from an update timeout. However, our Leshan server considers the device to have fallen asleep if there has been no messages for 93s. How can we make sure that Leshan does not have to wait for the 4min7s (timeout)?

    Which COAP config value should we change to trigger a timeout earlier than 247s?

  • Try three things:

    First, cherry-pick the fix https://github.com/nrfconnect/sdk-nrf/pull/15338 so it does not stop & start the client, but instead allows it to recover normally.

    Then modify CoAP timeouts, so that it retries a bit faster and does not back-off so much

    CONFIG_COAP_INIT_ACK_TIMEOUT_MS=4000
    CONFIG_COAP_RANDOMIZE_ACK_TIMEOUT=n
    CONFIG_COAP_BACKOFF_PERCENT=100
    CONFIG_COAP_MAX_RETRANSMIT=4

    Then, I have not seen verification whether the DTLS Connection-Identifier is used.
    Make sure that the Kconfig is set:

    CONFIG_LWM2M_DTLS_CID=y

    Then verify from the Wireshark capture that it is in fact enabled. This requires support form server side as well.

  • Also, if you insist that registration message (and update) should have shorter CoAP timeout, you can try modifying the code to set that:

    ❯ git diff
      diff --git a/subsys/net/lib/lwm2m/lwm2m_rd_client.c b/subsys/net/lib/lwm2m/lwm2m_rd_client.c
      index 22ce51e5248..6da8bac18db 100644
      --- a/subsys/net/lib/lwm2m/lwm2m_rd_client.c
      +++ b/subsys/net/lib/lwm2m/lwm2m_rd_client.c
      @@ -865,6 +865,12 @@ static int sm_send_registration(bool send_obj_support_data,
                      goto cleanup;
              }
       
      +       msg->pending->params = (struct coap_transmission_parameters) {
      +               .ack_timeout = 1000,
      +               .coap_backoff_percent = 100,
      +               .max_retransmission = CONFIG_LWM2M_RD_CLIENT_MAX_RETRIES,
      +       };
      +
              ret = coap_packet_append_option(&msg->cpkt, COAP_OPTION_URI_PATH,
                                              LWM2M_RD_CLIENT_URI,
                                              strlen(LWM2M_RD_CLIENT_URI));
    

    but I don't really recommend that. One set of CoAP configuration should work for the whole layer, once it is tuned to work for the given connection.

  •   Thank you.

    It seems like Leshan/Californium does not support DTLS CID. 

    When using the following flags, we get the following application logs:

    CONFIG_COAP_INIT_ACK_TIMEOUT_MS=4000
    CONFIG_COAP_RANDOMIZE_ACK_TIMEOUT=n
    CONFIG_COAP_BACKOFF_PERCENT=100
    CONFIG_COAP_MAX_RETRANSMIT=4

    [00:04:26.741,607] <inf> app_lwm2m_client: LwM2M is connected to server
    [00:04:26.741,668] <inf> app_lwm2m_client: Obtained date-time from modem
    [00:04:45.655,761] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:05:05.756,042] <wrn> net_lwm2m_rd_client: Registration Update Timeout
    [00:05:05.756,164] <dbg> app_lwm2m_client: rd_client_event: Registration update failure!
    [00:05:05.758,117] <inf> net_lwm2m_rd_client: RD Client started with endpoint '<MY_ENDPOINT>' with client lifetime 86400 using server object 0
    [00:05:05.877,807] <inf> app_lwm2m_client: LwM2M is connecting to server

    And these are the Wireshark logs:

    499	32.459415	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	89	Application Data
    500	43.816044	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    501	43.948368			LTE RRC DL_DCCH	26	RRCConnectionRelease [cause=other]
    502	45.203037			AT	23	Rcvd AT Command: +CSCON: 0  
    503	45.203159			AT	31	Rcvd AT Command: %XT3412: 10799999  
    504	45.237064			LTE RRC DL_SCH	60	SystemInformationBlockType1
    505	45.237614			LTE RRC DL_SCH	117	SystemInformation [ SIB2 SIB3 ]
    506	45.237797			LTE RRC DL_SCH	117	SystemInformation [ SIB5 ]
    507	53.817081			NAS-EPS	20	Service request
    508	53.817997			LTE RRC UL_CCCH	30	RRCConnectionRequest
    509	53.900516			LTE RRC DL_CCCH	71	RRCConnectionSetup
    510	53.901462			LTE RRC UL_DCCH/NAS-EPS	33	RRCConnectionSetupComplete, Service request
    511	53.902927			AT	24	Rcvd AT Command: %XT3412: 0  
    512	53.903049			AT	23	Rcvd AT Command: +CSCON: 1  
    513	53.986545			LTE RRC DL_DCCH	27	SecurityModeCommand
    514	53.988376			LTE RRC UL_DCCH	26	SecurityModeComplete
    515	53.995548			LTE RRC DL_DCCH	53	RRCConnectionReconfiguration
    516	53.996464			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    517	53.996799	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    518	54.045505			LTE RRC DL_DCCH	47	RRCConnectionReconfiguration
    519	54.045841			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    520	63.816746	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    521	63.948643			LTE RRC DL_DCCH	26	RRCConnectionRelease [cause=other]
    522	65.202518			AT	23	Rcvd AT Command: +CSCON: 0  
    523	65.202640			AT	31	Rcvd AT Command: %XT3412: 10799999  
    524	65.237552			LTE RRC DL_SCH	60	SystemInformationBlockType1
    525	65.238102			LTE RRC DL_SCH	117	SystemInformation [ SIB2 SIB3 ]
    526	65.238285			LTE RRC DL_SCH	117	SystemInformation [ SIB5 ]
    527	73.816990			NAS-EPS	20	Service request
    528	73.817936			LTE RRC UL_CCCH	30	RRCConnectionRequest
    529	73.901005			LTE RRC DL_CCCH	71	RRCConnectionSetup
    530	73.901951			LTE RRC UL_DCCH/NAS-EPS	33	RRCConnectionSetupComplete, Service request
    531	73.985020			LTE RRC DL_DCCH	27	SecurityModeCommand
    532	73.986759			LTE RRC UL_DCCH	26	SecurityModeComplete
    533	73.994053			LTE RRC DL_DCCH	53	RRCConnectionReconfiguration
    534	73.994999			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    535	73.995334	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    536	74.045994			LTE RRC DL_DCCH	47	RRCConnectionReconfiguration
    537	74.046329			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    538	83.810642	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    539	93.817447	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	71	Encrypted Alert
    540	93.818241			AT	20	Sent AT Command: AT+CMEE?
    541	93.818271			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    542	93.818851			AT	21	Sent AT Command: AT+CMEE=1
    543	93.818882			AT	16	Rcvd AT Command: OK  
    544	93.819523			AT	19	Sent AT Command: AT%CMNG
    545	93.849064			AT	21	Rcvd AT Command: Sensitive
    546	93.849552			AT	21	Sent AT Command: AT+CMEE=0
    547	93.849552			AT	16	Rcvd AT Command: OK  
    548	93.849979			AT	20	Sent AT Command: AT+CMEE?
    549	93.850010			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    550	93.850406			AT	21	Sent AT Command: AT+CMEE=1
    551	93.850437			AT	16	Rcvd AT Command: OK  
    552	93.850864			AT	19	Sent AT Command: AT%CMNG
    553	93.880253			AT	21	Rcvd AT Command: Sensitive
    554	93.880741			AT	21	Sent AT Command: AT+CMEE=0
    555	93.880741			AT	16	Rcvd AT Command: OK  
    556	93.881168			AT	20	Sent AT Command: AT+CMEE?
    557	93.881199			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    558	93.887760			AT	21	Sent AT Command: AT+CMEE=1
    559	93.887791			AT	16	Rcvd AT Command: OK  
    560	93.888187			AT	19	Sent AT Command: AT%CMNG
    561	93.917606			AT	21	Rcvd AT Command: Sensitive
    562	93.918094			AT	21	Sent AT Command: AT+CMEE=0
    563	93.918094			AT	16	Rcvd AT Command: OK  
    564	93.918522			AT	20	Sent AT Command: AT+CMEE?
    565	93.918522			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    566	93.918979			AT	21	Sent AT Command: AT+CMEE=1
    567	93.919010			AT	16	Rcvd AT Command: OK  
    568	93.919437			AT	19	Sent AT Command: AT%CMNG
    569	93.949375			LTE RRC DL_DCCH	26	RRCConnectionRelease [cause=other]
    570	93.954288			AT	21	Rcvd AT Command: Sensitive
    571	93.954716			AT	21	Sent AT Command: AT+CMEE=0
    572	93.954746			AT	16	Rcvd AT Command: OK  
    573	95.203281			AT	23	Rcvd AT Command: +CSCON: 0  
    574	95.203403			AT	31	Rcvd AT Command: %XT3412: 10799999  
    575	95.238285			LTE RRC DL_SCH	60	SystemInformationBlockType1
    576	95.238834			LTE RRC DL_SCH	117	SystemInformation [ SIB2 SIB3 ]
    577	95.239048			LTE RRC DL_SCH	117	SystemInformation [ SIB5 ]
    578	95.240787			NAS-EPS	20	Service request
    579	95.249241			LTE RRC UL_CCCH	30	RRCConnectionRequest
    580	95.280521			LTE RRC DL_CCCH	71	RRCConnectionSetup
    581	95.281467			LTE RRC UL_DCCH/NAS-EPS	33	RRCConnectionSetupComplete, Service request
    582	95.303531			AT	24	Rcvd AT Command: %XT3412: 0  
    583	95.340549			LTE RRC DL_DCCH	27	SecurityModeCommand
    584	95.342289			LTE RRC UL_DCCH	26	SecurityModeComplete
    585	95.349552			LTE RRC DL_DCCH	53	RRCConnectionReconfiguration
    586	95.350498			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    587	95.350864	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	170	Client Hello
    588	95.350986	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	170	Client Hello
    589	95.403690			AT	23	Rcvd AT Command: +CSCON: 1  
    590	95.417606			LTE RRC DL_DCCH	47	RRCConnectionReconfiguration
    591	95.417942			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    592	95.434482	<REMOTE_IP>	<LOCAL_IP>	DTLSv1.2	81	Application Data
    593	95.434635	<LOCAL_IP>	<REMOTE_IP>	ICMP	68	Destination unreachable (Port unreachable)
    594	95.483524	<REMOTE_IP>	<LOCAL_IP>	DTLSv1.2	202	Server Hello, Change Cipher Spec, Encrypted Handshake Message
    595	95.490848	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	107	Change Cipher Spec, Encrypted Handshake Message
    596	95.495365	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	605	Application Data
    597	95.496586	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	71	Encrypted Alert
    598	95.603916			AT	20	Sent AT Command: AT+CMEE?
    599	95.603947			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    600	95.604496			AT	21	Sent AT Command: AT+CMEE=1
    601	95.604526			AT	16	Rcvd AT Command: OK  
    602	95.604923			AT	19	Sent AT Command: AT%CMNG
    603	95.646122			AT	21	Rcvd AT Command: Sensitive
    604	95.646763			AT	21	Sent AT Command: AT+CMEE=0
    605	95.646793			AT	16	Rcvd AT Command: OK  
    606	95.647129			AT	20	Sent AT Command: AT+CMEE?
    607	95.647159			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    608	95.647770			AT	21	Sent AT Command: AT+CMEE=1
    609	95.647800			AT	16	Rcvd AT Command: OK  
    610	95.648441			AT	19	Sent AT Command: AT%CMNG
    611	95.651462	<REMOTE_IP>	<LOCAL_IP>	DTLSv1.2	71	Encrypted Alert
    612	95.651645	<LOCAL_IP>	<REMOTE_IP>	ICMP	68	Destination unreachable (Port unreachable)
    613	95.689792			AT	21	Rcvd AT Command: Sensitive
    614	95.690525			AT	21	Sent AT Command: AT+CMEE=0
    615	95.690525			AT	16	Rcvd AT Command: OK  

    However, when we set CONFIG_COAP_INIT_ACK_TIMEOUT_MS=10000 , the board manages to perform registration updates, whereas it fails constantly when the value is set to 4000. What would be next step forward?

Related