Help with LTE connection after loss of connection

We have noticed that several of our nRF9160 devices in production have problems reconnecting to the LTE-M network after temporarily losing connection. We have tried reproducing this issue by deactivating the SIM for about 10 seconds and then activating it again. After the default six retries, the device gives up and stops trying to reconnect to the network. The only way we can re-establish a connection is to call sys_reboot() or force a hard reboot with the watchdog. Why are the reconnect attempts not enough? Why do we have to force a reboot? Attached are relevant logs.

  • More info: 

    • We are using NCS version 2.6.0 and mfw 1.3.6.
    • The SIMs are global (EU, Nordics, Baltics); see the image of subscription details. We have mainly worked with Telia SIMs, but some of our clients have tried other SIMs with no improvement.

  • Hi,

    Could you both please confirm that all your questions and logs are related to the same application (based on lwm2m_client in NCS v2.6.1)?

    Could you please provide more information about your application? What exactly do you try to achieve? What does your application do?

    > We have noticed that several of our nRF9160 devices in production have problems reconnecting to the LTE-M network after temporarily losing connection.

    Can you provide more information on how devices lose connection? Do they work normally and suddenly lose connection? How often does this happen and on how many devices? Where are your failing devices located?

    > We have tried reproducing this issue by deactivating the SIM for about 10 seconds and then activating it again. After the default six retries, the device gives up and stops trying to reconnect to the network.

    Why do you think that the issue might be related to SIM? How do you do deactivation/activation of the SIM?

    Matias Marti said:
    It looks like the EXCHANGE_LIFETIME is set to 247s (4 minutes 7s) in the code here. Is there any way we can modify this value? Or is there another way to "give up" the exchange earlier?

    Have you tried changing the value directly in the code?

    Best regards,
    Dejan

  • [00:00:17.177,490] <inf> app_lwm2m_client: LwM2M is connected to server
    [00:00:17.177,551] <inf> app_lwm2m_client: Obtained date-time from modem
    [00:00:36.092,285] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:01:16.382,965] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
    [00:01:16.383,026] <dbg> app_lwm2m_client: rd_client_event: Registration update complete
    [00:01:16.383,056] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:01:16.383,148] <inf> net_lwm2m_rd_client: Update Done
    [00:01:35.383,056] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:02:15.828,491] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
    [00:02:15.828,552] <dbg> app_lwm2m_client: rd_client_event: Registration update complete
    [00:02:15.828,582] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:02:15.828,643] <inf> net_lwm2m_rd_client: Update Done
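
    As a quick check of the timestamps in the log above, here is a minimal sketch in Python (an offline helper, not part of the application) that measures the time from each "Registration update started" line to the following "Update callback" line. The function names are made up for illustration; the only assumption about the log format is the `[HH:MM:SS.mmm,uuu]` timestamp seen in the excerpt.

```python
import re

# Zephyr-style log timestamp as seen in the excerpt: [HH:MM:SS.mmm,uuu]
TS = re.compile(r"\[(\d+):(\d+):(\d+)\.(\d{3}),(\d{3})\]")


def to_seconds(line: str) -> float:
    """Convert the timestamp at the start of a log line to seconds."""
    h, m, s, ms, us = map(int, TS.search(line).groups())
    return h * 3600 + m * 60 + s + ms / 1e3 + us / 1e6


def update_durations(log: str) -> list[float]:
    """Seconds from each 'Registration update started' to the next 'Update callback'."""
    durations, started = [], None
    for line in log.splitlines():
        if "Registration update started" in line:
            started = to_seconds(line)
        elif "Update callback" in line and started is not None:
            durations.append(round(to_seconds(line) - started, 3))
            started = None
    return durations


# Lines copied from the log excerpt above
LOG = """\
[00:00:36.092,285] <dbg> app_lwm2m_client: rd_client_event: Registration update started
[00:01:16.382,965] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
[00:01:35.383,056] <dbg> app_lwm2m_client: rd_client_event: Registration update started
[00:02:15.828,491] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
"""

print(update_durations(LOG))
```

    For the two updates in the excerpt this gives roughly 40.3 s and 40.4 s.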

    It takes up to 40s for the registration update to complete when we set these config values:

    CONFIG_COAP_RANDOMIZE_ACK_TIMEOUT=n
    CONFIG_COAP_BACKOFF_PERCENT=100
    CONFIG_COAP_MAX_RETRANSMIT=4
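
    To see where a figure around 40 s could come from, here is a rough sketch in Python of the retransmission schedule these settings would produce. It rests on assumptions: that CONFIG_COAP_BACKOFF_PERCENT acts as a multiplier on the previous timeout (200 meaning a factor of 2.0), that randomization is off as configured, and that the initial ACK timeout in effect is 10000 ms. The prj.conf posted later in this thread sets CONFIG_COAP_INIT_ACK_TIMEOUT_MS twice (4000 and 10000), so which value applies is itself an assumption. The function name is hypothetical.

```python
def coap_send_times(init_timeout_s: float, backoff_percent: int, max_retransmit: int) -> list[float]:
    """Offsets in seconds of the first send and each retransmission.

    Assumes no ACK timeout randomization and that the backoff percentage
    scales the previous timeout (100 keeps it constant, 200 doubles it).
    """
    times = [0.0]
    elapsed = 0.0
    timeout = float(init_timeout_s)
    for _ in range(max_retransmit):
        elapsed += timeout
        times.append(elapsed)
        timeout *= backoff_percent / 100
    return times


# Values from the config above, with an assumed 10 s initial ACK timeout
print(coap_send_times(10, 100, 4))
```

    Under these assumptions the last retransmission goes out 40 s after the first send, which would be consistent with the roughly 40 s seen in the log if only that last retransmission is answered. That reading is a guess and would need a packet capture or modem trace to confirm.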

    We are also getting the following logs:

    [00:02:31.944,396] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:02:31.944,488] <inf> net_lwm2m_rd_client: Update Done
    [00:02:32.430,450] <dbg> app_lwm2m_client: rd_client_event: Queue mode RX window closed
    [00:02:50.944,335] <dbg> app_lwm2m_client: rd_client_event: Registration update started

    When the server is slow to respond after an initial registration update, it seems like the queue mode RX gets closed right before the next registration update starts. Could this be some sort of race condition?

    When our Leshan server responds, the logs just stay quiet until the watchdog kicks in.

  • Then if DTLS CID is not enabled on the server side, it might cause these connection issues.

    When a NAT timeout happens, the IP and port mapping changes, and Leshan then ignores the LwM2M Update messages because they come from an unknown port. This of course only happens if the LwM2M engine is not instructed to do a DTLS session resumption when doing an LwM2M Update.

    My first recommendation would be to update Leshan, but with your current environment, try:

    # Do a DTLS Session resumption when coming
    # back from RX_OFF state.
    CONFIG_LWM2M_RD_CLIENT_SUSPEND_SOCKET_AT_IDLE=y
    CONFIG_LWM2M_TLS_SESSION_CACHING=y
    
    # Fine-tune the timeout to match NAT timeout on
    # current network. I have seen 20s on some LTE networks in Finland.
    # Try 10, 15, 20, 30
    CONFIG_LWM2M_QUEUE_MODE_ENABLED=y
    CONFIG_LWM2M_QUEUE_MODE_UPTIME=15

  • Thanks again.

    Is there another way to prevent a NAT timeout? By setting CONFIG_LWM2M_QUEUE_MODE_UPTIME=15 we are using all the layers of communication here below, right?

    On which of these layers does the NAT timeout occur? Is there a way to use your SDK so that it sends a message to the LTE network on the lowest possible layer? We would like to prevent NAT timeout without communicating all the way to the Leshan server.

    Any thoughts?

  • https://en.wikipedia.org/wiki/Network_address_translation#One-to-many_NAT
    NAT happens at IP layer.

    So in order to keep the current mapping active, you must send UDP packets more frequently than the network router's NAT timeout.

    In short: there is no way to prevent a NAT timeout without sending packets.

    If you want to prevent the timeout, you can for example configure CONFIG_LWM2M_UPDATE_PERIOD to be small enough that NAT timeouts do not occur.

    I would still recommend keeping the configs from the previous example, so that if the timeout happens for some reason, the next handshake still uses session resumption.

    There is a risk with that kind of configuration, however: if we configure UPDATE_PERIOD shorter than QUEUE_MODE_UPTIME, the LwM2M engine is never in the so-called RX_OFF state, which one might also call queue mode. It therefore assumes that the socket is constantly active and does not try DTLS session resumption.
    If we end up sending an LwM2M Update message into a DTLS socket where a NAT timeout has happened, all DTLS packets are ignored by the server, and all the CoAP retry logic is just wasted time.

    If we instead assume that NAT timeouts might happen, allow them to happen, and configure UPDATE_PERIOD to a longer value, for example several minutes or hours, then the LwM2M Update causes the engine to "resume" the DTLS connection, which means that it actually closes the socket and does a new handshake using DTLS session resumption. This is almost like a full DTLS handshake but shorter. It is accepted by the server because it starts with a normal DTLS Client-Hello.

    So any approach you take will consume a lot of bandwidth to keep the connection up.

    With DTLS Connection-Identifier you get rid of the issues with NAT timeouts because the server side uses the CID to identify connections instead of the IP and port pair. NAT re-mapping then does not interrupt the DTLS session; it only blocks the server from communicating with the device until the device does an LwM2M Update, which refreshes the IP and port mapping.
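
    The trade-offs described in this post, for a server without DTLS CID, can be summarized in a small decision sketch in Python. This is only a restatement of the reasoning above, not SDK logic: the function and label names are hypothetical, the NAT timeout is a property of the network that has to be measured or guessed, and the "resume" case assumes CONFIG_LWM2M_RD_CLIENT_SUSPEND_SOCKET_AT_IDLE and CONFIG_LWM2M_TLS_SESSION_CACHING are enabled as in the earlier example.

```python
def classify(update_period_s: int, queue_uptime_s: int, nat_timeout_s: int) -> str:
    """Rough classification of an update-period choice when the server lacks DTLS CID."""
    if update_period_s < queue_uptime_s:
        # The engine never reaches RX_OFF, so it never attempts session resumption.
        if update_period_s < nat_timeout_s:
            # Updates arrive often enough to keep the NAT mapping alive.
            return "keepalive"
        # The mapping can expire while the engine still treats the socket as live.
        return "stale-socket-risk"
    # The engine idles between updates and resumes the DTLS session on the next one.
    return "resume"


for period in (10, 18, 3600):
    print(period, classify(period, queue_uptime_s=20, nat_timeout_s=15))
```

    With a 20 s queue mode uptime and an assumed 15 s NAT timeout, a 10 s period keeps the mapping alive at the cost of constant traffic, an 18 s period falls into the risky gap, and a one-hour period relies on session resumption.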

  • > It seems like Leshan/Californium does not support DTLS CID.

    It's supported in Californium, I added the support pretty early during the development of RFC 9146 (I'm one of the co-authors).

    Do you run Leshan on your own? Then you may need to enable CID via the configuration.

    Unfortunately, it's not only the DTLS layer that may get "mixed up" by the NAT changes. So you also need to consider Leshan's other settings to configure it properly (maybe you need to open a ticket/issue in the Leshan project).

  • Thank you.

    Yes, we are using our own Leshan server.

    https://github.com/eclipse-leshan/leshan/issues/1166

    I read through this issue, and I did not really understand how we would have to modify our Californium.properties file to support CID.

  • That ticket is from a time long ago.

    During the development of RFC 9146 the MAC calculation has changed pretty late. That caused also the usage of a new Hello Extension ID.

    Unfortunately, the mbedtls team wasn't able to adapt and update the implementation in time, hence the complicated workaround in that old issue.

    Today for Californium you only need to enable DTLS 1.2 CID with

    # DTLS connection ID length. <blank> disabled, 0 enables support without
    # active use of CID.
    DTLS.CONNECTION_ID_LENGTH=6

    But I'm not sure what is required for Leshan to handle the address changes in other layers as well, so you may want to open a ticket there.

  • Try:

      # DTLS update address using CID on newer records.
      # Default: true
      DTLS.UPDATE_ADDRESS_USING_CID_ON_NEWER_RECORDS=true
      

    I don't see that `DTLS.CONNECTION_ID_LENGTH` setting in our Leshan server, so maybe it supports it by default if the client requests Connection-ID.

    With the current Leshan, I have not seen any problems with Connection-ID. Even that "UPDATE_ADDRESS" setting seems to be on by default.

    You can verify your configuration by running one client against https://leshan.eclipseprojects.io/. It supports DTLS CID and updates the client IP when I do an LwM2M Update.

  • Thank you. We will try this. 

    Now, when the server responds to registration updates immediately, the device keeps running longer.

    However, we are consistently seeing that, after almost 50 minutes, the board just restarts itself. See logs:

    [00:48:13.524,230] <inf> net_lwm2m_rd_client: Update Done
    [00:48:32.525,146] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:48:32.892,547] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
    [00:48:32.892,608] <dbg> app_lwm2m_client: rd_client_event: Registration update complete
    [00:48:32.892,608] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:48:32.892,700] <inf> net_lwm2m_rd_client: Update Done
    [00:48:51.892,669] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:48:52.323,059] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
    [00:48:52.323,089] <dbg> app_lwm2m_client: rd_client_event: Registration update complete
    [00:48:52.323,120] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:48:52.323,211] <inf> net_lwm2m_rd_client: Update Done
    [00:49:11.324,127] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:49:11.685,546] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
    [00:49:11.685,577] <dbg> app_lwm2m_client: rd_client_event: Registration update complete
    [00:49:11.685,607] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:49:11.685,699] <inf> net_lwm2m_rd_client: Update Done
    [00:49:30.685,607] <dbg> app_lwm2m_client: rd_client_event: Registration update started
    [00:49:31.045,043] <inf> net_lwm2m_rd_client: Update callback (code:2.4)
    [00:49:31.045,074] <dbg> app_lwm2m_client: rd_client_event: Registration update complete
    [00:49:31.045,104] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:49:31.045,196] <inf> net_lwm2m_rd_client: Update Done
    uart:~$ *** Booting nRF Connect SDK v3.5.99-ncs1 ***
    I: Starting bootloader
    I: Primary image: magic=good, swap_type=0x2, copy_done=0x1, image_ok=0x1
    I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
    I: Boot source: none
    I: Image index: 0, Swap type: none
    I: Bootloader chainload address offset: 0x10000
    I: Jumping to the first image slot
    *** Booting nRF Connect SDK v3.5.99-ncs1 ***
    [00:00:00.255,310] <inf> app_lwm2m_client: Run LWM2M client
    [00:00:00.255,371] <dbg> app_lwm2m_client: watchdog_Init: Watchdog started: 0
    --- 14 messages dropped ---
    [00:00:00.495,605] <inf> app_lwm2m_client: Clearing PSK
    [00:00:00.526,641] <dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok!
    [00:00:00.805,114] <err> net_lwm2m_registry: obj field 300 not found
    [00:00:00.805,297] <err> net_lwm2m_registry: obj field 300 not found
    [00:00:00.805,450] <err> net_lwm2m_registry: obj field 300 not found

    Could there be something outside of the lwm2m client or anything else that is causing a reboot after 50 min?

    Here is our prj.conf file:

    # General config
    CONFIG_ASSERT=y
    CONFIG_REBOOT=y
    
    # Network
    CONFIG_NETWORKING=y
    CONFIG_NET_NATIVE=n
    CONFIG_NET_IPV6=n
    CONFIG_NET_IPV4=y
    CONFIG_NET_SOCKETS=y
    CONFIG_NET_SOCKETS_OFFLOAD=y
    
    # Sensors
    CONFIG_ADC=y
    CONFIG_SPI=y
    CONFIG_SPI_NRFX=y
    CONFIG_SENSOR=y
    CONFIG_I2C=y
    
    # LwM2M and IPSO
    CONFIG_LWM2M=y
    CONFIG_LWM2M_COAP_BLOCK_SIZE=1024
    CONFIG_LWM2M_COAP_MAX_MSG_SIZE=1280
    CONFIG_LWM2M_ENGINE_MAX_OBSERVER=40
    CONFIG_LWM2M_ENGINE_MAX_MESSAGES=40
    CONFIG_LWM2M_ENGINE_MAX_PENDING=40
    CONFIG_LWM2M_ENGINE_MAX_REPLIES=40
    CONFIG_LWM2M_ENGINE_STACK_SIZE=3072
    CONFIG_LWM2M_DNS_SUPPORT=y
    CONFIG_LWM2M_RW_JSON_SUPPORT=n
    CONFIG_LWM2M_CONN_MON_OBJ_SUPPORT=y
    CONFIG_LWM2M_CONN_MON_BEARER_MAX=2
    CONFIG_LWM2M_SERVER_DEFAULT_PMIN=0
    CONFIG_LWM2M_SERVER_DEFAULT_PMAX=0
    CONFIG_LWM2M_CLIENT_UTILS=y
    CONFIG_LWM2M_CLIENT_UTILS_FIRMWARE_UPDATE_OBJ_SUPPORT=y
    CONFIG_LWM2M_CLIENT_UTILS_LOCATION_OBJ_SUPPORT=n
    CONFIG_LWM2M_IPSO_SUPPORT=y
    
    
    CONFIG_COAP_INIT_ACK_TIMEOUT_MS=4000
    CONFIG_COAP_RANDOMIZE_ACK_TIMEOUT=n
    CONFIG_COAP_BACKOFF_PERCENT=100
    CONFIG_COAP_MAX_RETRANSMIT=4
    
    # UART
    CONFIG_UART_INTERRUPT_DRIVEN=y
    
    # DTLS settings
    CONFIG_LWM2M_DTLS_SUPPORT=y
    
    # Modem key management
    CONFIG_MODEM_KEY_MGMT=y
    
    # Default app to debug logging
    CONFIG_LOG=y
    CONFIG_APP_LOG_LEVEL_DBG=y
    CONFIG_LOG_DEFAULT_LEVEL=3
    CONFIG_APP_ENDPOINT_PREFIX="xxx-"
    
    # Support HEX style PSK values (double the size + NULL char)
    CONFIG_LWM2M_SECURITY_KEY_SIZE=33
    
    # extend CoAP retry timing to 10 seconds for LTE/LTE-M
    CONFIG_COAP_INIT_ACK_TIMEOUT_MS=10000
    
    # Enable CoAP extended option length
    CONFIG_COAP_EXTENDED_OPTIONS_LEN=y
    CONFIG_COAP_EXTENDED_OPTIONS_LEN_VALUE=40
    
    # Enable settings storage
    CONFIG_SETTINGS=y
    CONFIG_FCB=y
    CONFIG_SETTINGS_FCB=y
    CONFIG_FLASH_MAP=y
    CONFIG_STREAM_FLASH=y
    
    # LTE link control
    CONFIG_LTE_LINK_CONTROL=y
    CONFIG_LTE_NETWORK_MODE_LTE_M=y
    
    # Modem library
    CONFIG_NRF_MODEM_LIB=y
    
    # Modem info
    CONFIG_MODEM_INFO=y
    CONFIG_MODEM_INFO_ADD_DATE_TIME=n
    
    # Enable shell
    CONFIG_LWM2M_SHELL=y
    
    # Heap and stacks
    CONFIG_HEAP_MEM_POOL_SIZE=16384
    CONFIG_MAIN_STACK_SIZE=8192
    CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=2048
    CONFIG_AT_MONITOR_HEAP_SIZE=512
    
    # Allow FOTA downloads using download-client
    CONFIG_DOWNLOAD_CLIENT=y
    CONFIG_DOWNLOAD_CLIENT_STACK_SIZE=4096
    CONFIG_DOWNLOAD_CLIENT_HTTP_FRAG_SIZE_1024=y
    CONFIG_FOTA_DOWNLOAD=y
    
    # Application version
    CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION="1.0.0"
    
    # Set LwM2M Server IP address here
    CONFIG_LWM2M_CLIENT_UTILS_SERVER="coap://qiotb.qlocx.com:5683"
    
    # Application Event Manager
    CONFIG_APP_EVENT_MANAGER=y
    
    # Date-Time library
    CONFIG_DATE_TIME=y
    CONFIG_DATE_TIME_UPDATE_INTERVAL_SECONDS=86400
    
    # Enable LwM2M Queue Mode
    CONFIG_LWM2M_QUEUE_MODE_ENABLED=y
    
    # Use DTLS Connection ID
    CONFIG_LWM2M_DTLS_CID=y
    
    CONFIG_LWM2M_RD_CLIENT_SUPPORT_BOOTSTRAP=y
    CONFIG_LWM2M_CLIENT_UTILS_BOOTSTRAP_TLS_TAG=100
    CONFIG_LWM2M_CLIENT_UTILS_SERVER_TLS_TAG=200
    
    # When DTLS CID is used, we can keep the socket open.
    # If the server is not supporting CID, CONFIG_LWM2M_RD_CLIENT_SUSPEND_SOCKET_AT_IDLE should
    # be used instead.
    CONFIG_LWM2M_RD_CLIENT_STOP_POLLING_AT_IDLE=y
    
    # Enable TLS session caching to prevent doing a full TLS handshake when recovering the session
    CONFIG_LWM2M_TLS_SESSION_CACHING=y
    
    # Sets the duration that the lwm2m engine will be polling for data after transmission before
    # the socket is closed.
    # Adjust so that we can detach from network in 30 seconds
    CONFIG_LWM2M_QUEUE_MODE_UPTIME=20
    
    # Set lifetime of 24 hours
    CONFIG_LWM2M_ENGINE_DEFAULT_LIFETIME=86400
    
    # This is same as DEFAULT_LIFETIME - QUEUE_MODE_UPTIME + 1
    CONFIG_LWM2M_SECONDS_TO_UPDATE_EARLY=86381
    
    # Configure PSM mode
    CONFIG_LTE_PSM_REQ=y
    # Request periodic TAU of 3600 seconds (60 minutes)
    CONFIG_LTE_PSM_REQ_RPTAU="00000110"
    
    # Set Requested Active Time (RAT) to 30 seconds. Preferably same as the
    # configured LWM2M_QUEUE_MODE_UPTIME. Due to NAT/firewall UDP connections are usually
    # closed within 30-60 seconds so there is in general no point in setting a longer
    # Queue mode uptime / LTE PSM active time.
    CONFIG_LTE_PSM_REQ_RAT="00001111"
    
    # Request eDRX mode
    CONFIG_LTE_EDRX_REQ=y
    
    # Requested eDRX cycle length for LTE-M and Nb-IoT
    # This should be fine-tuned for the network and the chosen server.
    # Lowest value is  the most responsive, but uses more energy during the active eDRX period.
    # Longer period may cause more CoAP packet drops on server requests.
    # "0000" is 5.12 s
    # "0001" is 10.24 s
    # "0010" is 20.48 s.
    CONFIG_LTE_EDRX_REQ_VALUE_LTE_M="0001"
    CONFIG_LTE_EDRX_REQ_VALUE_NBIOT="0000"
    
    # Request Paging time window of 1.28 seconds for LTE-M
    CONFIG_LTE_PTW_VALUE_LTE_M="0000"
    
    # Request Paging time window of 2.56 seconds for NB-IoT
    CONFIG_LTE_PTW_VALUE_NBIOT="0000"
    
    # Get notification before Tracking Area Update (TAU). Notification triggers registration
    # update and TAU will be sent with the update which decreases power consumption.
    CONFIG_LTE_LC_TAU_PRE_WARNING_NOTIFICATIONS=y
    
    CONFIG_PM_DEVICE=y
    CONFIG_TFM_LOG_LEVEL_SILENCE=y
    # Optimize powersaving by using tickless mode in LwM2M engine
    CONFIG_NET_SOCKETPAIR=y
    CONFIG_LWM2M_TICKLESS=y
    # Enable Release Assistance Indication
    
    #LwM2M v1.1 configure
    CONFIG_LWM2M_VERSION_1_1=y
    CONFIG_LWM2M_RW_OMA_TLV_SUPPORT=y
    CONFIG_LWM2M_COMPOSITE_PATH_LIST_SIZE=20
    
    #Enable SenML-JSON content format
    CONFIG_BASE64=y
    CONFIG_JSON_LIBRARY=y
    CONFIG_LWM2M_RW_SENML_JSON_SUPPORT=y
    
    #Enable SenML-CBOR content format
    CONFIG_LWM2M_RW_SENML_CBOR_SUPPORT=y
    CONFIG_LWM2M_RW_SENML_CBOR_RECORDS=55
    CONFIG_ZCBOR_CANONICAL=y
    
    # Enable below for modem trace
    # CONFIG_NRF_MODEM_LIB_TRACE=y
    # CONFIG_NRF_MODEM_LIB_TRACE_BACKEND_RTT=y
    # CONFIG_USE_SEGGER_RTT=y
    #CONFIG_NRF_MODEM_LIB_TRACE_LEVEL_LTE_AND_IP=y
    
    ################# WATCHDOG ################################
    CONFIG_WDT_LOG_LEVEL_DBG=y
    CONFIG_WATCHDOG=y
    CONFIG_WDT_DISABLE_AT_BOOT=y
    
    # CONFIG_LWM2M_RD_CLIENT_MAX_RETRIES=
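
    The PSM timer strings in the prj.conf above are bit-encoded, which makes them easy to misread. Here is a small sketch in Python (an offline helper with made-up names, not part of the firmware) that decodes them. It assumes the usual 3GPP TS 24.008 layout of three unit bits followed by five value bits, with the unit tables as I understand them from that specification. Unit codes not listed in a table are simply rejected by this sketch.

```python
# Unit bits -> seconds per step; None means the timer is deactivated.
# Tables as understood from 3GPP TS 24.008 (GPRS Timer 3 and GPRS Timer 2).
T3412_EXT_UNITS = {
    0b000: 600,         # 10 minutes
    0b001: 3600,        # 1 hour
    0b010: 36000,       # 10 hours
    0b011: 2,           # 2 seconds
    0b100: 30,          # 30 seconds
    0b101: 60,          # 1 minute
    0b110: 320 * 3600,  # 320 hours
    0b111: None,        # deactivated
}
T3324_UNITS = {
    0b000: 2,     # 2 seconds
    0b001: 60,    # 1 minute
    0b010: 360,   # 6 minutes
    0b111: None,  # deactivated
}


def decode_timer(bits: str, units: dict):
    """Decode an 8-character bit string into seconds, or None if deactivated."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected 8 characters of 0/1")
    unit_code = int(bits[:3], 2)
    if unit_code not in units:
        raise ValueError("unit code not handled by this sketch")
    step = units[unit_code]
    return None if step is None else step * int(bits[3:], 2)


# Values from the prj.conf above
print(decode_timer("00000110", T3412_EXT_UNITS))  # CONFIG_LTE_PSM_REQ_RPTAU
print(decode_timer("00001111", T3324_UNITS))      # CONFIG_LTE_PSM_REQ_RAT
```

    Under those assumptions the two strings decode to 3600 s and 30 s, matching the comments in the config. These are requested values; the network may grant different ones.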

  • > Try:

    >  # DTLS update address using CID on newer records.
    >  # Default: true
    >  # DTLS.UPDATE_ADDRESS_USING_CID_ON_NEWER_RECORDS=true

    The more complete documentation is in the javadoc of DtlsConfig:

    /**
     * Update the ip-address from DTLS 1.2 CID records only for newer records
     * based on epoch/sequence_number.
     *
     * @see <a href="www.rfc-editor.org/.../rfc9146.html"
     *      target="_blank">RFC 9146, Connection Identifiers for DTLS 1.2, 6.
     *      Peer Address Update</a>
     */

    In general, every CID record updates the address of the DTLS context. (Whether Leshan uses that as well is outside my scope.) But since records may be received out of order, this may cause an update to an outdated address. This setting therefore updates the address only for the newest record according to the DTLS record sequence number. In the vast majority of cases it doesn't make a difference (because the record order doesn't change that frequently, nor will the address change that fast), and it is already on by default.

    > I don't see that `DTLS.CONNECTION_ID_LENGTH` setting in our Leshan server

    Therefore I recommend asking the Leshan project how they set that up. For Californium on its own, the setting is required, because the default there in v3 is "off".
