Help with LTE connection after loss of connection

We have noticed that several of our nRF9160 in production have problem with reconnecting to the LTE-M network after temporarily losing connection. We have tried reproducing this issue by deactivating the sim for about 10 seconds and then activating it again. After the default six retries it gives up and stop trying to reconnect to the network. the only way we can achieve a connection is if we do a sys_reboot() or a hard reboot with watchdog. Why is the reconnect attempts not enough? Why do we have to force a reboot? Attached are relevant logs.

Parents
  • More info: 

    • We are using NSC version 2.6.0 and mfw 1.3.6.
    • The sims are global (EU, Nordics, Baltics), see image of subscription details.We have mainly worked with Telia sims but some of our clients have tried using other with no improvement. 

  • Hi,

    Could you  and  please confirm that all your questions and logs are related to the same application (based on lwm2m_client in NCS v2.6.1)?

    Could you please provide more information about your application? What exactly do you try to achieve? What does your application do?

    We have noticed that several of our nRF9160 in production have problem with reconnecting to the LTE-M network after temporarily losing connection.

    Can you provide more information on how devices lose connection? Do they work normally and suddenly lose connection? How often does this happen and on how many devices? Where are your failing devices located?

    We have tried reproducing this issue by deactivating the sim for about 10 seconds and then activating it again. After the default six retries it gives up and stop trying to reconnect to the network.

    Why do you think that the issue might be related to SIM? How do you do deactivation/activation of the SIM?

    Matias Marti said:
    It looks like the EXCHANGE_LIFETIME is set to 247s (4 minutes 7s) in the code here. Is there any way we can modify this value? Or is there another way to "give up" the exchange earlier?

    Have you tried changing the value directly in the code?

    Best regards,
    Dejan

  • Hello  ,

    We are using ncs version 2.6.0. We will try to set up modem trace using Cellular Monitor and Wireshark and provide those logs.

    The goal of our application is to get as close to 100% uptime as possible. We have doors, lockers, and similar products across Europe. In order to be able to open these, they need to be able to receive and handle requests any time of the day and night. We want to be able to send requests to our devices 24/7, and sometimes the boards become unavailable for some minutes.

    Most of our devices are in the Nordics, and we have some devices (usually with poor connectivity) that fall asleep 15-20 times/day or even up to 50. It is almost 100% of the time that the devices sleep for 5 minutes (our watchdog triggers a powercycle if there is no successful register update for 5 min). Our IoT server considers a device to be asleep whenever there has been no interaction either way between the board and the server for 93s.

    The reason we where experimenting with the SIM (deactivating and re-activating the subscription) was to be able to try to reproduce the issue we have seen among devices that are part of our fleet. However, this deactivation triggers a different sort of error....

    Here is a screenshot of the change we now have made directly in the code. Just to force the same kind of error, We set the timeout to 1s as in the picture:

    And the resulting logs:

    [1;32muart:~$ [m[8D[J[00:52:14.181,915] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update complete[0m
    [1;32muart:~$ [m[8D[J[00:52:14.181,945] [0m<dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok![0m
    [1;32muart:~$ [m[8D[J[00:52:14.182,037] [0m<inf> net_lwm2m_rd_client: Update Done[0m
    [1;32muart:~$ [m[8D[J[1;32muart:~$ [m[8D[J[00:52:33.182,037] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update started[0m
    [1;32muart:~$ [m[8D[J[1;32muart:~$ [m[8D[J[00:52:34.284,301] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update failure![0m
    [1;32muart:~$ [m[8D[J[00:52:34.292,266] [0m<inf> net_lwm2m_rd_client: RD Client started with endpoint '<HIDDEN_DEVICE_ID>' with client lifetime 86400 using server object 0[0m
    [1;32muart:~$ [m[8D[J[00:52:34.619,720] [0m<inf> net_lwm2m_engine: Connected, sock id 2[0m
    [1;32muart:~$ [m[8D[J[00:52:34.622,833] [0m<dbg> app_lwm2m_client: rd_client_event: Disconnected[0m
    [1;32muart:~$ [m[8D[J[00:52:34.622,955] [0m<inf> net_lwm2m_rd_client: Stop LWM2M Client: <HIDDEN_DEVICE_ID>[0m
    [1;32muart:~$ [m[8D[J[00:52:34.708,404] [0m<inf> app_lwm2m_client: Client connect to server[0m
    [1;32muart:~$ [m[8D[J[00:52:34.708,465] [0m<inf> net_lwm2m_rd_client: Start LWM2M Client: <HIDDEN_DEVICE_ID>[0m
    [1;32muart:~$ [m[8D[J[00:52:34.709,106] [0m<inf> net_lwm2m_rd_client: RD Client started with endpoint '<HIDDEN_DEVICE_ID>' with client lifetime 86400 using server object 0[0m
    [1;32muart:~$ [m[8D[J[00:52:34.835,296] [0m<inf> app_lwm2m_client: LwM2M is connecting to server[0m
    [1;32muart:~$ [m[8D[J[00:52:35.023,590] [0m<inf> net_lwm2m_engine: Connected, sock id 2[0m
    [1;32muart:~$ [m[8D[J[1;32muart:~$ [m[8D[J[00:52:35.173,797] [0m<dbg> app_lwm2m_client: rd_client_event: Registration complete[0m
    [1;32muart:~$ [m[8D[J[00:52:35.173,950] [0m<inf> net_lwm2m_rd_client: Registration Done (EP='<HIDDEN>')[0m
    [1;32muart:~$ [m[8D[J[00:52:35.357,604] [0m<inf> app_lwm2m_client: LwM2M is connected to server[0m
    [1;32muart:~$ [m[8D[J[00:52:35.357,666] [0m<inf> app_lwm2m_client: Obtained date-time from modem[0m
    [1;32muart:~$ [m[8D[J[1;32muart:~$ [m[8D[J[00:52:54.173,492] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update started[0m
    [1;32muart:~$ [m[8D[J[00:52:54.586,181] [0m<inf> net_lwm2m_rd_client: Update callback (code:2.4)[0m
    [1;32muart:~$ [m[8D[J[00:52:54.586,212] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update complete[0m

    As we can see, we get the "Registration update failure!" whenever it takes longer than 1s for the registration update to complete. After the failure, the board is back to normal after a few seconds and succeeds in making a new registration update after ca 20s again as if nothing happened. This is the sort of behavior we would actually like to see whenever one of these occasional "pings" are taking too long. And by "too long" in this case, it would be 2 or 3 seconds for us and not 247s.

    A question: What is the expected behavior of the code in zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c if we were to set a sleep on our IoT server for e.g. 3 minutes? Would the client wait for a response for the whole time and therefore block other registration updates that normally occur every 20s?

Reply
  • Hello  ,

    We are using ncs version 2.6.0. We will try to set up modem trace using Cellular Monitor and Wireshark and provide those logs.

    The goal of our application is to get as close to 100% uptime as possible. We have doors, lockers, and similar products across Europe. In order to be able to open these, they need to be able to receive and handle requests any time of the day and night. We want to be able to send requests to our devices 24/7, and sometimes the boards become unavailable for some minutes.

    Most of our devices are in the Nordics, and we have some devices (usually with poor connectivity) that fall asleep 15-20 times/day or even up to 50. It is almost 100% of the time that the devices sleep for 5 minutes (our watchdog triggers a powercycle if there is no successful register update for 5 min). Our IoT server considers a device to be asleep whenever there has been no interaction either way between the board and the server for 93s.

    The reason we where experimenting with the SIM (deactivating and re-activating the subscription) was to be able to try to reproduce the issue we have seen among devices that are part of our fleet. However, this deactivation triggers a different sort of error....

    Here is a screenshot of the change we now have made directly in the code. Just to force the same kind of error, We set the timeout to 1s as in the picture:

    And the resulting logs:

    [1;32muart:~$ [m[8D[J[00:52:14.181,915] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update complete[0m
    [1;32muart:~$ [m[8D[J[00:52:14.181,945] [0m<dbg> app_lwm2m_client: watchdog_Kick: Watchdog feed ok![0m
    [1;32muart:~$ [m[8D[J[00:52:14.182,037] [0m<inf> net_lwm2m_rd_client: Update Done[0m
    [1;32muart:~$ [m[8D[J[1;32muart:~$ [m[8D[J[00:52:33.182,037] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update started[0m
    [1;32muart:~$ [m[8D[J[1;32muart:~$ [m[8D[J[00:52:34.284,301] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update failure![0m
    [1;32muart:~$ [m[8D[J[00:52:34.292,266] [0m<inf> net_lwm2m_rd_client: RD Client started with endpoint '<HIDDEN_DEVICE_ID>' with client lifetime 86400 using server object 0[0m
    [1;32muart:~$ [m[8D[J[00:52:34.619,720] [0m<inf> net_lwm2m_engine: Connected, sock id 2[0m
    [1;32muart:~$ [m[8D[J[00:52:34.622,833] [0m<dbg> app_lwm2m_client: rd_client_event: Disconnected[0m
    [1;32muart:~$ [m[8D[J[00:52:34.622,955] [0m<inf> net_lwm2m_rd_client: Stop LWM2M Client: <HIDDEN_DEVICE_ID>[0m
    [1;32muart:~$ [m[8D[J[00:52:34.708,404] [0m<inf> app_lwm2m_client: Client connect to server[0m
    [1;32muart:~$ [m[8D[J[00:52:34.708,465] [0m<inf> net_lwm2m_rd_client: Start LWM2M Client: <HIDDEN_DEVICE_ID>[0m
    [1;32muart:~$ [m[8D[J[00:52:34.709,106] [0m<inf> net_lwm2m_rd_client: RD Client started with endpoint '<HIDDEN_DEVICE_ID>' with client lifetime 86400 using server object 0[0m
    [1;32muart:~$ [m[8D[J[00:52:34.835,296] [0m<inf> app_lwm2m_client: LwM2M is connecting to server[0m
    [1;32muart:~$ [m[8D[J[00:52:35.023,590] [0m<inf> net_lwm2m_engine: Connected, sock id 2[0m
    [1;32muart:~$ [m[8D[J[1;32muart:~$ [m[8D[J[00:52:35.173,797] [0m<dbg> app_lwm2m_client: rd_client_event: Registration complete[0m
    [1;32muart:~$ [m[8D[J[00:52:35.173,950] [0m<inf> net_lwm2m_rd_client: Registration Done (EP='<HIDDEN>')[0m
    [1;32muart:~$ [m[8D[J[00:52:35.357,604] [0m<inf> app_lwm2m_client: LwM2M is connected to server[0m
    [1;32muart:~$ [m[8D[J[00:52:35.357,666] [0m<inf> app_lwm2m_client: Obtained date-time from modem[0m
    [1;32muart:~$ [m[8D[J[1;32muart:~$ [m[8D[J[00:52:54.173,492] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update started[0m
    [1;32muart:~$ [m[8D[J[00:52:54.586,181] [0m<inf> net_lwm2m_rd_client: Update callback (code:2.4)[0m
    [1;32muart:~$ [m[8D[J[00:52:54.586,212] [0m<dbg> app_lwm2m_client: rd_client_event: Registration update complete[0m

    As we can see, we get the "Registration update failure!" whenever it takes longer than 1s for the registration update to complete. After the failure, the board is back to normal after a few seconds and succeeds in making a new registration update after ca 20s again as if nothing happened. This is the sort of behavior we would actually like to see whenever one of these occasional "pings" are taking too long. And by "too long" in this case, it would be 2 or 3 seconds for us and not 247s.

    A question: What is the expected behavior of the code in zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c if we were to set a sleep on our IoT server for e.g. 3 minutes? Would the client wait for a response for the whole time and therefore block other registration updates that normally occur every 20s?

Children
  • A question: What is the expected behavior of the code in zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c if we were to set a sleep on our IoT server for e.g. 3 minutes? Would the client wait for a response for the whole time and therefore block other registration updates that normally occur every 20s?

    Can you clarify this question a bit. I really don't understand.

    What goes to sleep?

    The Update period and lifetime can be configured in the client. It is documented in https://docs.zephyrproject.org/latest/connectivity/networking/api/lwm2m.html#configuring-lifetime-and-activity-period

    Then when client is registered, what it does it documented in https://docs.zephyrproject.org/latest/connectivity/networking/api/lwm2m.html#lwm2m-engine-and-application-events

    Basically it constantly loops between those four or three states, green ones. Depending on if the QUEUE mode is used or not. If 100% connectivity is assumed, QUEUE mode is not relevant.

    When LwM2M Update message is send, the CoAP layer retries the message until response is received. If not, then it should go into NETWORK_ERROR state and start recovery but trying to re-handshake the DTLS connection.

    However, NAT timeouts can happen. If NAT timeout is 20s and your update period is 20s, it might cause IP address to change randomly. So it might break the connectivity. DTLS Connection ID helps this, but it also needs server support.

    It helps if you can capture a Wireshark log of the event. It helps to debug if you know whether the connectivity issue is in IP layer, DTLS layer or CoAP layer.

  • Thank you for your answer,  .

    Can you clarify this question a bit. I really don't understand.

    What goes to sleep?

    By sleep, I meant that we make sure our Leshan server waits for 3 minutes before it provides a response for registration updates. This is a way for us to try to reproduce the behavior we are seeing in our boards.

    The Update period and lifetime can be configured in the client. It is documented

    We have taken a look at these values, but the issue is that one single update is taking more than the usual 0.8s, so how would changing these configuration values change the behavior we are seeing?

    If 100% connectivity is assumed, QUEUE mode is not relevant.

    We are not assuming 100% connectivity. We are using QUEUE on our Leshan server to detect when a device becomes unavailable.


    When LwM2M Update message is send, the CoAP layer retries the message until response is received.

    Is there any way that we can set a timeout for registration updates? I mean, in the sense that our board should "give up" on a registration update after 3-4s instead of 247s.

    It helps if you can capture a Wireshark log of the event.

    We will try to provide you with Wireshark logs shortly.

  • OK. Now I understand. So the response for this question:

    What is the expected behavior of the code in zephyr/subsys/net/lib/lwm2m/lwm2m_rd_client.c if we were to set a sleep on our IoT server for e.g. 3 minutes?


    So in this case, the LwM2M engine goes into UPDATE_REGISTRATION state. Sends a CoAP message. Goes into UPDATE_SENT state and waits for EXCHANGE_LIFETIME seconds on that state.

    During that time lwm2m_engine.c:socket_loop() ocassionally checks if the CoAP message needs retransmission. If CONFIG_COAP_INIT_ACK_TIMEOUT_MS has passed, message is retransmitted. It will be retransmitted total of CONFIG_COAP_MAX_RETRANSMIT times. Every time the Ack timeout is increased by CONFIG_COAP_BACKOFF_PERCENT.

    If response is not received until max retries are send and ack times out, CoAP layer call message timeout callback, which in this case should be lwm2m_rd_client.c:do_update_timeout_cb().
    If you don't see this callback triggering, you might have too long CoAP timeouts configured. State transitions should never timeout. That is just a final fallback mechanism if all else fails.

    update_timeout_cb() would trigger full registration (DO_REGISTRATION state) to recover.

    We have taken a look at these values, but the issue is that one single update is taking more than the usual 0.8s, so how would changing these configuration values change the behavior we are seeing?

    If you want to react a CoAP failures faster, reconfigure those mentioned CONFIG_COAP timeout values. Just be aware that LTE network might not be fast enough to respond.

    Also, if you need to be fast to respond, make sure that you don't use PSM power saving. And if eDRX is used, configure the periods to shortest possible, or disable it. Not all networks allow you to do short eDRX periods.

    We are not assuming 100% connectivity. We are using QUEUE on our Leshan server to detect when a device becomes unavailable.

    This is unreliable as Leshan is just waiting for MAX_TRANSMIT_WAIT(93s) as specified in CoAP+LwM2M specs. This does not match cellular networks where NAT timeout will happen much sooner, sometimes in 20 seconds.

    You need device to actively "ping" the network to keep the NAT mapping active. Also you need to configure or disable LTE powersaving modes to maintain reachability in short timeout.

    Is there any way that we can set a timeout for registration updates? I mean, in the sense that our board should "give up" on a registration update after 3-4s instead of 247s.

    LwM2M engine uses those mentioned CONFIG_COAP values. So all CoAP confirmable messages use same values. I don't see why it would make sense to modify these per message type. Any CoAP confirmable message refreshes the NAT mappings and allows server to communicate.

    Also, I think it is not reasonable to expect latency less than 4s in LTE network. Sometimes if the connection is idle for more that 5 seconds, the LTE network tears down the RRC connection. When next message is to be send, modem needs to establish this RRC connection with the network first. It might take time.
    Therefore in our sample, for LTE-M we use
    CONFIG_COAP_INIT_ACK_TIMEOUT_MS=4000

    And for Nb-IOT we use
    CONFIG_COAP_INIT_ACK_TIMEOUT_MS=15000

    If you need to react faster, maybe the BACKOFF_PERCENT and MAX_RETRANSMIT can be modified.
  • Hi  ,

    Matias Marti said:
    We will try to provide you with Wireshark logs shortly.

    If needed, please provide server log and Wireshark capture for further analysis.

    Best regards,
    Dejan

  • LwM2M engine uses those mentioned CONFIG_COAP values. So all CoAP confirmable messages use same values. I don't see why it would make sense to modify these per message type. Any CoAP confirmable message refreshes the NAT mappings and allows server to communicate.

    Thank you for your answer.

    The reason it would make sense, for us at least, to modify these CONFIG_COAP values per message type is the following:

    We are satisfied with how lwm2m + COAP is working in all aspects apart from these 4min7s registation update timeouts. Here is some more concrete context:

    Let us suppose 1 out of every 1000 (0,1%) of all registration updates of a device end up reaching the default timeout. With a ping every 20s (180 pings/h), the board will become unreachable (i.e. unusable) for a whole 4min7s every 5h34min if these timeouts are evenly distributed.

    The issue we are dealing with is specific to the registration updates, and therefore we would like the solution to be specific to these requests. By changing the CONFIG_COAP values globally, we will be affecting all other lwm2m/COAP requests, which is something we rather not do.

    Maybe I am missing something, but why would we want to change the timeout for every single type of request when it is only one specific type of request that is problematic for us?

Related