Help with LTE connection after loss of connection

AnnaQ 10 months ago

We have noticed that several of our nRF9160 devices in production have problems reconnecting to the LTE-M network after temporarily losing the connection. We have tried reproducing this issue by deactivating the SIM for about 10 seconds and then activating it again. After the default six retries, the device gives up and stops trying to reconnect to the network. The only way we can re-establish a connection is to do a sys_reboot() or a hard reboot with the watchdog. Why are the reconnect attempts not enough? Why do we have to force a reboot? Attached are relevant logs.

AnnaQ 10 months ago
More info:

We are using NCS version 2.6.0 and mfw 1.3.6.

The SIMs are global (EU, Nordics, Baltics); see the image of the subscription details. We have mainly worked with Telia SIMs, but some of our clients have tried others with no improvement.

dejans 10 months ago in reply to AnnaQ

Hi,

AnnaQ and Matias Marti, could you please confirm that all your questions and logs relate to the same application (based on lwm2m_client in NCS v2.6.1)?

Could you please provide more information about your application? What exactly do you try to achieve? What does your application do?

AnnaQ said:
We have noticed that several of our nRF9160 devices in production have problems reconnecting to the LTE-M network after temporarily losing the connection.

Can you provide more information on how devices lose connection? Do they work normally and suddenly lose connection? How often does this happen and on how many devices? Where are your failing devices located?

AnnaQ said:
We have tried reproducing this issue by deactivating the SIM for about 10 seconds and then activating it again. After the default six retries, the device gives up and stops trying to reconnect to the network.

Why do you think that the issue might be related to the SIM? How do you deactivate and reactivate the SIM?

Matias Marti said:
It looks like the EXCHANGE_LIFETIME is set to 247s (4 minutes 7s) in the code here. Is there any way we can modify this value? Or is there another way to "give up" the exchange earlier?

Have you tried changing the value directly in the code?
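
For context on where 247 s comes from: it matches the default EXCHANGE_LIFETIME that RFC 7252 (section 4.8.2) derives from the CoAP transmission parameters. The sketch below only reproduces that arithmetic in plain C. The function names are made up for illustration and are not part of the SDK, and the sketch says nothing about how the SDK itself defines the constant.

```c
#include <assert.h>

/*
 * RFC 7252, section 4.8.2:
 *   MAX_TRANSMIT_SPAN = ACK_TIMEOUT * (2^MAX_RETRANSMIT - 1) * ACK_RANDOM_FACTOR
 *   PROCESSING_DELAY  = ACK_TIMEOUT
 *   EXCHANGE_LIFETIME = MAX_TRANSMIT_SPAN + 2 * MAX_LATENCY + PROCESSING_DELAY
 *
 * Hypothetical helper names, all times in seconds.
 */
double max_transmit_span_s(double ack_timeout, int max_retransmit,
                           double ack_random_factor)
{
    /* Keep the shift below well defined for a 32-bit int. */
    assert(max_retransmit >= 0 && max_retransmit < 31);

    return ack_timeout * (double)((1 << max_retransmit) - 1) * ack_random_factor;
}

double exchange_lifetime_s(double ack_timeout, int max_retransmit,
                           double ack_random_factor, double max_latency)
{
    double processing_delay = ack_timeout; /* RFC 7252 default choice */

    return max_transmit_span_s(ack_timeout, max_retransmit, ack_random_factor)
           + 2.0 * max_latency
           + processing_delay;
}
```

With the RFC defaults (ACK_TIMEOUT = 2 s, MAX_RETRANSMIT = 4, ACK_RANDOM_FACTOR = 1.5, MAX_LATENCY = 100 s), exchange_lifetime_s(2.0, 4, 1.5, 100.0) gives 45 + 200 + 2 = 247. The 2 * MAX_LATENCY term dominates, so the retransmission settings alone only account for a small part of the total.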

Best regards,
Dejan

Matias Marti 8 months ago in reply to SeppoTakalo

SeppoTakalo thanks again.

Is there another way to prevent a NAT timeout? By setting CONFIG_LWM2M_QUEUE_MODE_UPTIME=15 we are using all the layers of communication here below, right?

On which of these layers does the NAT timeout occur? Is there a way to use your SDK so that it sends a message to the LTE network on the lowest possible layer? We would like to prevent NAT timeout without communicating all the way to the Leshan server.

Any thoughts?

SeppoTakalo 8 months ago in reply to Matias Marti

https://en.wikipedia.org/wiki/Network_address_translation#One-to-many_NAT
NAT happens at the IP layer.

So in order to keep the current mapping active, you must send UDP packets through more frequently than the NAT timeout of the network's routers.

In short: there is no way to prevent a NAT timeout without sending packets.

If you want to prevent the timeout, you can, for example, configure CONFIG_LWM2M_UPDATE_PERIOD to be small enough that it does not cause NAT timeouts.

I would still recommend keeping the configs from the previous example so that, if the timeout happens for some reason, the next handshake would still use session resumption.

There is a risk with that kind of configuration, however: if we configure UPDATE_PERIOD shorter than QUEUE_MODE_UPTIME, the LwM2M engine is never in the so-called RX_OFF state (one might call it queue mode). It therefore assumes that the socket is constantly active and does not try DTLS session resumption.
If we end up sending an LwM2M Update message into a DTLS socket where a NAT timeout has happened, all DTLS packets are ignored by the server, and all the CoAP retry logic is just wasted time.

If we instead assume that NAT timeouts might happen, and allow those to happen, and configure UPDATE_PERIOD to a longer value, for example several minutes or hours, then LwM2M Update causes engine to "resume" the DTLS connection, which means that it actually closes the socket and does a new handshake using DTLS Session resumption. This is almost like full DTLS-handshake but shorter. It is accepted by the server because it starts with normal DTLS Client-Hello.
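
To make the trade-off between the two approaches concrete, here is a small sketch in plain C that classifies a configuration the way it is described above. It is only an illustration: the enum, the function name and the nat_timeout_s parameter are hypothetical (the NAT timeout is operator-dependent and not something the SDK reports), and treating an update period equal to the queue-mode uptime as "enters RX_OFF" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the outcomes described in the post above. */
enum nat_strategy {
    KEEPALIVE_HOLDS_MAPPING, /* socket stays open, updates arrive before NAT expiry */
    KEEPALIVE_STALE_RISK,    /* socket stays open, but the NAT mapping may be gone  */
    RESUME_ON_UPDATE         /* engine reaches RX_OFF, next update resumes DTLS     */
};

/*
 * update_period_s : value intended for CONFIG_LWM2M_UPDATE_PERIOD
 * queue_uptime_s  : value intended for CONFIG_LWM2M_QUEUE_MODE_UPTIME
 * nat_timeout_s   : assumed NAT timeout of the network (not known to the SDK)
 */
enum nat_strategy classify_nat_strategy(uint32_t update_period_s,
                                        uint32_t queue_uptime_s,
                                        uint32_t nat_timeout_s)
{
    assert(update_period_s > 0 && queue_uptime_s > 0 && nat_timeout_s > 0);

    if (update_period_s >= queue_uptime_s) {
        /* The engine goes idle between updates and resumes the session. */
        return RESUME_ON_UPDATE;
    }

    /* The engine never goes RX_OFF, so it never tries session resumption. */
    return (update_period_s < nat_timeout_s) ? KEEPALIVE_HOLDS_MAPPING
                                             : KEEPALIVE_STALE_RISK;
}
```

For example, classify_nat_strategy(60, 120, 30) falls into KEEPALIVE_STALE_RISK: the socket is kept open, but with an assumed 30 s NAT timeout the mapping can expire between updates, which is the case where the CoAP retries are wasted.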

So any approach you take will consume a lot of bandwidth to keep the connection up.

With DTLS Connection Identifier (CID) you get rid of the issues with NAT timeouts because the server side uses the CID to identify connections instead of the IP and port pair. NAT re-mapping then does not interrupt the DTLS session; it only blocks the server from communicating with the device until the device sends an LwM2M Update, which refreshes the IP and port mapping.
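
The difference can be sketched as a toy server-side session table in plain C. This is not how Californium or any real DTLS stack is implemented; the struct layout, function names and the 6-byte CID length are assumptions chosen for illustration. It only shows why an address-keyed lookup fails after NAT re-mapping while a CID-keyed lookup still finds the session and can refresh the stored address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CID_LEN 6 /* illustrative length only */

struct peer_addr {
    uint32_t ip;
    uint16_t port;
};

struct dtls_session {
    struct peer_addr addr;  /* last known public address of the client */
    uint8_t cid[CID_LEN];   /* connection ID the server assigned       */
};

/* Without CID: the session is identified by source IP and port only. */
struct dtls_session *find_by_addr(struct dtls_session *table, size_t n,
                                  struct peer_addr from)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].addr.ip == from.ip && table[i].addr.port == from.port) {
            return &table[i];
        }
    }
    return NULL;
}

/* With CID: the session is identified by the CID carried in the record. */
struct dtls_session *find_by_cid(struct dtls_session *table, size_t n,
                                 const uint8_t cid[CID_LEN],
                                 struct peer_addr from)
{
    for (size_t i = 0; i < n; i++) {
        if (memcmp(table[i].cid, cid, CID_LEN) == 0) {
            /* A real server would update the address only after the
             * record has been authenticated (RFC 9146); that check is
             * left out of this toy example. */
            table[i].addr = from;
            return &table[i];
        }
    }
    return NULL;
}
```

If the NAT hands the client a new source port, find_by_addr() returns NULL for the next packet, whereas find_by_cid() still returns the session and stores the new address, after which the server can reach the device again.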

Achim Kraus 8 months ago in reply to Matias Marti

Matias Marti said:
It seems like Leshan/Californium does not support DTLS CID.

It's supported in Californium, I added the support pretty early during the development of RFC 9146 (I'm one of the co-authors).

Do you run Leshan on your own? Then you may need to enable CID via the configuration.

Unfortunately, it's not only the DTLS layer that may get "mixed up" by the NAT changes. You also need to consider Leshan's other settings to configure it properly (you may need to open a ticket/issue in the Leshan project).

Matias Marti 8 months ago in reply to Achim Kraus

Thank you Achim Kraus

Yes, we are using our own Leshan server.

https://github.com/eclipse-leshan/leshan/issues/1166

I read through this issue, and I did not really understand how we would have to modify our Californium.properties file to support CID.

Achim Kraus 8 months ago in reply to Matias Marti

That ticket is from a time long ago.

During the development of RFC 9146, the MAC calculation changed pretty late. That also led to the use of a new Hello Extension ID.

Unfortunately, the mbedtls team wasn't able to adapt and update the implementation in time, hence the complicated workaround in that old issue.

Today for Californium you only need to enable DTLS 1.2 CID with

# DTLS connection ID length. <blank> disabled, 0 enables support without
# active use of CID.
DTLS.CONNECTION_ID_LENGTH=6

But I'm not sure what is required for Leshan to handle the address changes in other layers as well, so you may want to open a ticket there.