Help with LTE connection after loss of connection

We have noticed that several of our nRF9160 in production have problem with reconnecting to the LTE-M network after temporarily losing connection. We have tried reproducing this issue by deactivating the sim for about 10 seconds and then activating it again. After the default six retries it gives up and stop trying to reconnect to the network. the only way we can achieve a connection is if we do a sys_reboot() or a hard reboot with watchdog. Why is the reconnect attempts not enough? Why do we have to force a reboot? Attached are relevant logs.

Parents
  • More info: 

    • We are using NSC version 2.6.0 and mfw 1.3.6.
    • The sims are global (EU, Nordics, Baltics), see image of subscription details.We have mainly worked with Telia sims but some of our clients have tried using other with no improvement. 

  • Hi,

    Could you  and  please confirm that all your questions and logs are related to the same application (based on lwm2m_client in NCS v2.6.1)?

    Could you please provide more information about your application? What exactly do you try to achieve? What does your application do?

    We have noticed that several of our nRF9160 in production have problem with reconnecting to the LTE-M network after temporarily losing connection.

    Can you provide more information on how devices lose connection? Do they work normally and suddenly lose connection? How often does this happen and on how many devices? Where are your failing devices located?

    We have tried reproducing this issue by deactivating the sim for about 10 seconds and then activating it again. After the default six retries it gives up and stop trying to reconnect to the network.

    Why do you think that the issue might be related to SIM? How do you do deactivation/activation of the SIM?

    Matias Marti said:
    It looks like the EXCHANGE_LIFETIME is set to 247s (4 minutes 7s) in the code here. Is there any way we can modify this value? Or is there another way to "give up" the exchange earlier?

    Have you tried changing the value directly in the code?

    Best regards,
    Dejan

  • Try three things:

    First, cherry-pick the fix https://github.com/nrfconnect/sdk-nrf/pull/15338 so it does not stop & start the client, but instead allows it to recover normally.

    Then modify CoAP timeouts, so that it retries a bit faster and does not back-off so much

    CONFIG_COAP_INIT_ACK_TIMEOUT_MS=4000
    CONFIG_COAP_RANDOMIZE_ACK_TIMEOUT=n
    CONFIG_COAP_BACKOFF_PERCENT=100
    CONFIG_COAP_MAX_RETRANSMIT=4

    Then, I have not seen verification whether the DTLS Connection-Identifier is used.
    Make sure that the Kconfig is set:

    CONFIG_LWM2M_DTLS_CID=y

    Then verify from the Wireshark capture that it is in fact enabled. This requires support form server side as well.

  • Also, if you insist that registration message (and update) should have shorter CoAP timeout, you can try modifying the code to set that:

    ❯ git diff
      diff --git a/subsys/net/lib/lwm2m/lwm2m_rd_client.c b/subsys/net/lib/lwm2m/lwm2m_rd_client.c
      index 22ce51e5248..6da8bac18db 100644
      --- a/subsys/net/lib/lwm2m/lwm2m_rd_client.c
      +++ b/subsys/net/lib/lwm2m/lwm2m_rd_client.c
      @@ -865,6 +865,12 @@ static int sm_send_registration(bool send_obj_support_data,
                      goto cleanup;
              }
       
      +       msg->pending->params = (struct coap_transmission_parameters) {
      +               .ack_timeout = 1000,
      +               .coap_backoff_percent = 100,
      +               .max_retransmission = CONFIG_LWM2M_RD_CLIENT_MAX_RETRIES,
      +       };
      +
              ret = coap_packet_append_option(&msg->cpkt, COAP_OPTION_URI_PATH,
                                              LWM2M_RD_CLIENT_URI,
                                              strlen(LWM2M_RD_CLIENT_URI));
    

    but I don't really recommend that. One set of CoAP configuration should work for the whole layer, once it is tuned to work for the given connection.

  •   Thank you.

    It seems like Leshan/Californium does not support DTLS CID. 

    When using the following flags, we get the following application logs:

    CONFIG_COAP_INIT_ACK_TIMEOUT_MS=4000
    CONFIG_COAP_RANDOMIZE_ACK_TIMEOUT=n
    CONFIG_COAP_BACKOFF_PERCENT=100
    CONFIG_COAP_MAX_RETRANSMIT=4

    [00:04:26.741,607] <inf> app_lwm2m_client: LwM2M is connected to server
    [00:04:26.741,668] <inf> app_lwm2m_client: Obtained date-time from modem
    [00:04:45.655,761] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:05:05.756,042] <wrn> net_lwm2m_rd_client: Registration Update Timeout
    [00:05:05.756,164] <dbg> app_lwm2m_client: rd_client_event: Registration update failure!
    [00:05:05.758,117] <inf> net_lwm2m_rd_client: RD Client started with endpoint '<MY_ENDPOINT>' with client lifetime 86400 using server object 0
    [00:05:05.877,807] <inf> app_lwm2m_client: LwM2M is connecting to server

    And these are the Wireshark logs:

    499	32.459415	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	89	Application Data
    500	43.816044	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    501	43.948368			LTE RRC DL_DCCH	26	RRCConnectionRelease [cause=other]
    502	45.203037			AT	23	Rcvd AT Command: +CSCON: 0  
    503	45.203159			AT	31	Rcvd AT Command: %XT3412: 10799999  
    504	45.237064			LTE RRC DL_SCH	60	SystemInformationBlockType1
    505	45.237614			LTE RRC DL_SCH	117	SystemInformation [ SIB2 SIB3 ]
    506	45.237797			LTE RRC DL_SCH	117	SystemInformation [ SIB5 ]
    507	53.817081			NAS-EPS	20	Service request
    508	53.817997			LTE RRC UL_CCCH	30	RRCConnectionRequest
    509	53.900516			LTE RRC DL_CCCH	71	RRCConnectionSetup
    510	53.901462			LTE RRC UL_DCCH/NAS-EPS	33	RRCConnectionSetupComplete, Service request
    511	53.902927			AT	24	Rcvd AT Command: %XT3412: 0  
    512	53.903049			AT	23	Rcvd AT Command: +CSCON: 1  
    513	53.986545			LTE RRC DL_DCCH	27	SecurityModeCommand
    514	53.988376			LTE RRC UL_DCCH	26	SecurityModeComplete
    515	53.995548			LTE RRC DL_DCCH	53	RRCConnectionReconfiguration
    516	53.996464			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    517	53.996799	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    518	54.045505			LTE RRC DL_DCCH	47	RRCConnectionReconfiguration
    519	54.045841			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    520	63.816746	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    521	63.948643			LTE RRC DL_DCCH	26	RRCConnectionRelease [cause=other]
    522	65.202518			AT	23	Rcvd AT Command: +CSCON: 0  
    523	65.202640			AT	31	Rcvd AT Command: %XT3412: 10799999  
    524	65.237552			LTE RRC DL_SCH	60	SystemInformationBlockType1
    525	65.238102			LTE RRC DL_SCH	117	SystemInformation [ SIB2 SIB3 ]
    526	65.238285			LTE RRC DL_SCH	117	SystemInformation [ SIB5 ]
    527	73.816990			NAS-EPS	20	Service request
    528	73.817936			LTE RRC UL_CCCH	30	RRCConnectionRequest
    529	73.901005			LTE RRC DL_CCCH	71	RRCConnectionSetup
    530	73.901951			LTE RRC UL_DCCH/NAS-EPS	33	RRCConnectionSetupComplete, Service request
    531	73.985020			LTE RRC DL_DCCH	27	SecurityModeCommand
    532	73.986759			LTE RRC UL_DCCH	26	SecurityModeComplete
    533	73.994053			LTE RRC DL_DCCH	53	RRCConnectionReconfiguration
    534	73.994999			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    535	73.995334	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    536	74.045994			LTE RRC DL_DCCH	47	RRCConnectionReconfiguration
    537	74.046329			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    538	83.810642	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	95	Application Data
    539	93.817447	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	71	Encrypted Alert
    540	93.818241			AT	20	Sent AT Command: AT+CMEE?
    541	93.818271			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    542	93.818851			AT	21	Sent AT Command: AT+CMEE=1
    543	93.818882			AT	16	Rcvd AT Command: OK  
    544	93.819523			AT	19	Sent AT Command: AT%CMNG
    545	93.849064			AT	21	Rcvd AT Command: Sensitive
    546	93.849552			AT	21	Sent AT Command: AT+CMEE=0
    547	93.849552			AT	16	Rcvd AT Command: OK  
    548	93.849979			AT	20	Sent AT Command: AT+CMEE?
    549	93.850010			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    550	93.850406			AT	21	Sent AT Command: AT+CMEE=1
    551	93.850437			AT	16	Rcvd AT Command: OK  
    552	93.850864			AT	19	Sent AT Command: AT%CMNG
    553	93.880253			AT	21	Rcvd AT Command: Sensitive
    554	93.880741			AT	21	Sent AT Command: AT+CMEE=0
    555	93.880741			AT	16	Rcvd AT Command: OK  
    556	93.881168			AT	20	Sent AT Command: AT+CMEE?
    557	93.881199			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    558	93.887760			AT	21	Sent AT Command: AT+CMEE=1
    559	93.887791			AT	16	Rcvd AT Command: OK  
    560	93.888187			AT	19	Sent AT Command: AT%CMNG
    561	93.917606			AT	21	Rcvd AT Command: Sensitive
    562	93.918094			AT	21	Sent AT Command: AT+CMEE=0
    563	93.918094			AT	16	Rcvd AT Command: OK  
    564	93.918522			AT	20	Sent AT Command: AT+CMEE?
    565	93.918522			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    566	93.918979			AT	21	Sent AT Command: AT+CMEE=1
    567	93.919010			AT	16	Rcvd AT Command: OK  
    568	93.919437			AT	19	Sent AT Command: AT%CMNG
    569	93.949375			LTE RRC DL_DCCH	26	RRCConnectionRelease [cause=other]
    570	93.954288			AT	21	Rcvd AT Command: Sensitive
    571	93.954716			AT	21	Sent AT Command: AT+CMEE=0
    572	93.954746			AT	16	Rcvd AT Command: OK  
    573	95.203281			AT	23	Rcvd AT Command: +CSCON: 0  
    574	95.203403			AT	31	Rcvd AT Command: %XT3412: 10799999  
    575	95.238285			LTE RRC DL_SCH	60	SystemInformationBlockType1
    576	95.238834			LTE RRC DL_SCH	117	SystemInformation [ SIB2 SIB3 ]
    577	95.239048			LTE RRC DL_SCH	117	SystemInformation [ SIB5 ]
    578	95.240787			NAS-EPS	20	Service request
    579	95.249241			LTE RRC UL_CCCH	30	RRCConnectionRequest
    580	95.280521			LTE RRC DL_CCCH	71	RRCConnectionSetup
    581	95.281467			LTE RRC UL_DCCH/NAS-EPS	33	RRCConnectionSetupComplete, Service request
    582	95.303531			AT	24	Rcvd AT Command: %XT3412: 0  
    583	95.340549			LTE RRC DL_DCCH	27	SecurityModeCommand
    584	95.342289			LTE RRC UL_DCCH	26	SecurityModeComplete
    585	95.349552			LTE RRC DL_DCCH	53	RRCConnectionReconfiguration
    586	95.350498			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    587	95.350864	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	170	Client Hello
    588	95.350986	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	170	Client Hello
    589	95.403690			AT	23	Rcvd AT Command: +CSCON: 1  
    590	95.417606			LTE RRC DL_DCCH	47	RRCConnectionReconfiguration
    591	95.417942			LTE RRC UL_DCCH	26	RRCConnectionReconfigurationComplete
    592	95.434482	<REMOTE_IP>	<LOCAL_IP>	DTLSv1.2	81	Application Data
    593	95.434635	<LOCAL_IP>	<REMOTE_IP>	ICMP	68	Destination unreachable (Port unreachable)
    594	95.483524	<REMOTE_IP>	<LOCAL_IP>	DTLSv1.2	202	Server Hello, Change Cipher Spec, Encrypted Handshake Message
    595	95.490848	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	107	Change Cipher Spec, Encrypted Handshake Message
    596	95.495365	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	605	Application Data
    597	95.496586	<LOCAL_IP>	<REMOTE_IP>	DTLSv1.2	71	Encrypted Alert
    598	95.603916			AT	20	Sent AT Command: AT+CMEE?
    599	95.603947			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    600	95.604496			AT	21	Sent AT Command: AT+CMEE=1
    601	95.604526			AT	16	Rcvd AT Command: OK  
    602	95.604923			AT	19	Sent AT Command: AT%CMNG
    603	95.646122			AT	21	Rcvd AT Command: Sensitive
    604	95.646763			AT	21	Sent AT Command: AT+CMEE=0
    605	95.646793			AT	16	Rcvd AT Command: OK  
    606	95.647129			AT	20	Sent AT Command: AT+CMEE?
    607	95.647159			AT	26	Rcvd AT Command: +CMEE: 0  OK  
    608	95.647770			AT	21	Sent AT Command: AT+CMEE=1
    609	95.647800			AT	16	Rcvd AT Command: OK  
    610	95.648441			AT	19	Sent AT Command: AT%CMNG
    611	95.651462	<REMOTE_IP>	<LOCAL_IP>	DTLSv1.2	71	Encrypted Alert
    612	95.651645	<LOCAL_IP>	<REMOTE_IP>	ICMP	68	Destination unreachable (Port unreachable)
    613	95.689792			AT	21	Rcvd AT Command: Sensitive
    614	95.690525			AT	21	Sent AT Command: AT+CMEE=0
    615	95.690525			AT	16	Rcvd AT Command: OK  

    However, when we set CONFIG_COAP_INIT_ACK_TIMEOUT_MS=10000 , the board manages to perform registration updates, whereas it fails constantly when the value is set to 4000. What would be next step forward?

  • [00:00:17.177,490] <inf> app_lwm2m_client: LwM2M is connected to server
    [00:00:17.177,551] <inf> app_lwm2m_client: Obtained date-time from modem
    [00:00:36.092,285] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:01:16.382,965] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
    [00:01:16.383,026] <dbg> app_lwm2m_client: rd_client_event: Registration update complete
    [00:01:16.383,056] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:01:16.383,148] <inf> net_lwm2m_rd_client: Update Done
    [00:01:35.383,056] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:02:15.828,491] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
    [00:02:15.828,552] <dbg> app_lwm2m_client: rd_client_event: Registration update complete
    [00:02:15.828,582] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:02:15.828,643] <inf> net_lwm2m_rd_client: Update Done

    It takes up to 40s for the registration update to complete when we set these config values:

    CONFIG_COAP_RANDOMIZE_ACK_TIMEOUT=n
    CONFIG_COAP_BACKOFF_PERCENT=100
    CONFIG_COAP_MAX_RETRANSMIT=4

    We are also getting the following logs:

    [00:02:31.944,396] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:02:31.944,488] <inf> net_lwm2m_rd_client: Update Done
    [00:02:32.430,450] <dbg> app_lwm2m_client: rd_client_event: Queue mode RX window closed
    [00:02:50.944,335] <dbg> app_lwm2m_client: rd_client_event: Registration update started

    When the server is slow to respond after an initial registration update, it seems like the queue mode RX gets closed right before the next registration update starts. Could this be some sort of race condition?

    When our Leshan server responds, the logs just stay quiet until the watchdog kicks in.

  • Then if the DTLS CID is not enabled in the server side, it might cause these connection issues.

    When NAT timeout to happens, the IP&PORT pair mapping will change and it causes Leshan to ignore the LwM2M update messages because they come from unknown port. This offcourse only happens, if LwM2M engine is not instructed to do a DTLS Session resumption when doing a LwM2M Update.

    My first recommendation would be to update Leshan, but with current environment, try:

    # Do a DTLS Session resumption when coming
    # back from RX_OFF state.
    CONFIG_LWM2M_RD_CLIENT_SUSPEND_SOCKET_AT_IDLE=y
    CONFIG_LWM2M_TLS_SESSION_CACHING=y
    
    # Fine-tune the timeout to match NAT timeout on
    # current network. I have seen 20s on some LTE networks in Finland.
    # Try 10, 15, 20, 30
    CONFIG_LWM2M_QUEUE_MODE_ENABLED=y
    CONFIG_LWM2M_QUEUE_MODE_UPTIME=15

Reply
  • Then if the DTLS CID is not enabled in the server side, it might cause these connection issues.

    When NAT timeout to happens, the IP&PORT pair mapping will change and it causes Leshan to ignore the LwM2M update messages because they come from unknown port. This offcourse only happens, if LwM2M engine is not instructed to do a DTLS Session resumption when doing a LwM2M Update.

    My first recommendation would be to update Leshan, but with current environment, try:

    # Do a DTLS Session resumption when coming
    # back from RX_OFF state.
    CONFIG_LWM2M_RD_CLIENT_SUSPEND_SOCKET_AT_IDLE=y
    CONFIG_LWM2M_TLS_SESSION_CACHING=y
    
    # Fine-tune the timeout to match NAT timeout on
    # current network. I have seen 20s on some LTE networks in Finland.
    # Try 10, 15, 20, 30
    CONFIG_LWM2M_QUEUE_MODE_ENABLED=y
    CONFIG_LWM2M_QUEUE_MODE_UPTIME=15

Children
  •    thanks again.

    Is there another way to prevent a NAT timeout? By setting  CONFIG_LWM2M_QUEUE_MODE_UPTIME=15 we are using all the layers of communication here below, right?

    On which of these layers does the NAT timeout occur? Is there a way to use your SDK so that it sends a message to the LTE network on the lowest possible layer? We would like to prevent NAT timeout without communicating all the way to the Leshan server.

    Any thoughts?

  • https://en.wikipedia.org/wiki/Network_address_translation#One-to-many_NAT
    NAT happens at IP layer.

    So in order to keep the current mapping active, you must send UDP packets through in the higher frequency than what the network routers NAT timeout is.

    So in a short: There is no way to prevent NAT timeout without sending packets.



    If you want to prevent the timeout, you can configure for example CONFIG_LWM2M_UPDATE_PERIOD to be a small enough so that it does not cause NAT timeouts.

    I would still recommend to keep the configs from previous example, so if for some reason, the timeout happens, next handshake would still be using session resumption.

    There is a risk on that kind of configuration however: If we configure UPDATE_PERIOD shorter than QUEUE_MODE_UPTIME it means that LwM2M engine is never in so called RX_OFF state, or one might call it QUEUE mode. So it assumes that socket is constantly active, and does not try DTLS Session resumption.
    If we end up sending LwM2M Update message into DTLS socket where NAT timeout have happened, it causes all DTLS packets to be ignore by the server. And then all the CoAP retry logic is just wasted time.

    If we instead assume that NAT timeouts might happen, and allow those to happen, and configure UPDATE_PERIOD to a longer value, for example several minutes or hours, then LwM2M Update causes engine to "resume" the DTLS connection, which means that it actually closes the socket and does a new handshake using DTLS Session resumption. This is almost like full DTLS-handshake but shorter. It is accepted by the server because it starts with normal DTLS Client-Hello.

    So any approach you take will consume a lot of bandwidth to keep the connection up.

    With DTLS Connection-Identifier you get rid of the issues with NAT timeouts because server side uses CID to identify connections instead of IP&port pair. Then NAT re-mapping does not interrupt the DTLS session, it only block the server from communicating to the device, until it does LwM2M Update which refreshes the IP&port mapping.

Related