LWM2M + DTLS + NBIoT proper implementation

Hi,

I am trying to implement an energy-efficient (battery-powered) application for the nRF9160, using UDP/DTLS/LWM2M as the communication stack. I got "hello world" working, but now I am really confused about many details. I hope you can help me.

Question 1: Operator NAT/firewall rules and NRF9160 DTLS implementation

It looks like operator networks drop their UDP port mappings after about 1 minute of silence. This essentially means that the UDP source port changes between consecutive packets sent from the device to the server if communication happens, for example, every 10 minutes. This in turn means that the DTLS connection would need to be resumed for it to work. It looks like resumption does not work on the nRF9160, or is not implemented at all? The effect is dramatic: you either have to ping the server at <60 s intervals or go through a connection failure + re-register cycle every time you send anything.

Am I missing something here, or how do people in general make this work? This ruins the whole idea of power save modes and the LWM2M benefits. Is the only workaround to use the MBED TLS stack instead?

Question 2: It looks like the DTLS connection (to the LWM2M bootstrap server) always ends with a TLS ALERT sent by the nRF91. Any idea why? Is this a feature? Other clients do not do that.

Question 3: Zephyr does not support QUEUE mode. If I understood the LWM2M spec correctly, QUEUE mode is exactly what a low-power NB-IoT application needs, but the Zephyr LWM2M stack does not support it. Do you have any idea how Nordic users tend to solve this? Implement queues as a server-side application on top of LWM2M? Use a different LWM2M client?

Probably not the easiest questions, and not all directly in Nordic's domain, but I hope that you have some ideas on how to make an actual production-quality app using the nRF91. Thanks.

  • Hi.

     

    Question 1: Operator NAT/firewall rules and NRF9160 DTLS implementation

     In the DTLS header, there will be a session token indicating which DTLS session the packet belongs to.

    It is up to the server to decide if the session has timed out or not. So as long as all communication is initiated by the device, the NAT/firewall rules should not be a huge problem.

     

    It looks like resumption does not work in NRF or is not implemented at all?

    It is, but at the moment there is a bug that can cause the modem to crash when (D)TLS session caching is enabled. The bug will be fixed in the next modem release.

     

    Question 2: It looks like the DTLS connection (to the LWM2M bootstrap server) always ends with a TLS ALERT sent by the nRF91. Any idea why? Is this a feature? Other clients do not do that.

    While I cannot say for certain without more information (a modem trace would be the most helpful), I have two hypotheses:

    1. The server's certificates are too large. The nRF9160 only supports TLS frames up to 2048 bytes.

    2. The nRF9160 does not support SNI at the moment. This will be added in the next modem release.

     

    Question 3: Zephyr does not support QUEUE mode. If I understood the LWM2M spec correctly, QUEUE mode is exactly what a low-power NB-IoT application needs, but the Zephyr LWM2M stack does not support it. Do you have any idea how Nordic users tend to solve this? Implement queues as a server-side application on top of LWM2M? Use a different LWM2M client?

     Please take a look at https://github.com/NordicPlayground/fw-nrfconnect-nrf/pull/2252

    Best regards,

    Didrik

  • Thanks for the clarifications. Just to make sure that I understood this correctly:

                It is, but at the moment there is a bug that can cause the modem to crash when (D)TLS session caching is enabled. The bug will be fixed in the next modem release.

    As I am not seeing session resumption, it means that I have to explicitly enable the session caching somehow, right? What is the API? I found this post, which says that you have to use nrf_setsockopt: https://devzone.nordicsemi.com/f/nordic-q-a/55335/dtls-session-resumption-on-nrf9160-modem-fw-v1-1-0

    And the "random" crashes caused by activating the cache will be fixed in 1.3.0, which will be released at an unknown date?

    And to make this work right now, the only option is to not use the offloaded NRF sockets at all, but something else instead, right?

  • Hi, both of you.

    These are good questions, but I am not familiar enough with the fine details to give an answer myself.

    I have forwarded them to our developers and will get back to you when I have an answer.

    Best regards,

    Didrik

  • Do you see different behavior if you set the TLS_SESSION_CACHE socket option explicitly?

    https://github.com/nrfconnect/sdk-zephyr/blob/master/include/net/socket.h#L126

    Edit: Note that the option must be set using setsockopt(), not the non-offloaded nrf_setsockopt().

    Or, you can use nrf_setsockopt() and NRF_SO_SEC_SESSION_CACHE.

  • I did try to enable the cache earlier, like this:

    nrf_sec_session_cache_t session_cache = 1;
    if (nrf_setsockopt(client_ctx->sock_fd, NRF_SOL_SECURE, NRF_SO_SEC_SESSION_CACHE, &session_cache, sizeof(session_cache)) < 0) {
    	LOG_ERR("setsockopt NRF_SO_SEC_SESSION_CACHE: %d", errno);
    }

    This was placed right after this setsockopt: https://github.com/zephyrproject-rtos/zephyr/blob/358dcc1bde6a79abb3167e994fe7aec66968adaf/subsys/net/lib/lwm2m/lwm2m_engine.c#L4241

    But that had no effect. I just retried and confirmed that.

    I can see TLS_SESSION_CACHE was just added to master, so I can't use that in 1.2. Does it make any difference which way the option is set? Should I try it on master? (I assume that the problem lies on the modem side.)

  • Hi,

    I had a look at ip_port_changes_90s.pcapng, and it seems like (D)TLS caching is already enabled. At least, the session ID is provided in the client hello message, but it seems like the server side has closed that session, possibly because of the preceding alert message. You can check which nrfxlib revision you are using by executing "west status", which should print the revision of all repositories west has imported for you. That will show whether you have the revision where TLS caching is enabled by default or not.

    The alert messages from the device are curious, and look likely to be the source of trouble here. Could you try to enable CONFIG_LWM2M_LOG_LEVEL_DBG=y and see if that yields any useful information, in particular related to the alert message?

    Best regards,

    Jan Tore

  • Hi,

    Did you notice that the connection that ends with alerts is a separate connection to a different port? It is actually the bootstrap server connection. Maybe I should try to do the connection completely without the bootstrap.

    nrfxlib is at this version: https://github.com/nrfconnect/sdk-nrfxlib/commit/7d5e7624ec7fdb071ab8def3ae772e5ee5b65f21

    And yes, there is one error printed when the bootstrap operation completes:

    00> [00:00:37.000,762] <inf> net_lwm2m_rd_client: Bootstrap data transfer done!
    00> [network2.c] [0x2002302c] D Bootstrap transfer complete
    00> [network2.c] [0x2002302c] D '0/1/0': 'coaps://lwm2m.ycfwydu.com:8601' '0x00'
    00> [00:00:37.001,159] <dbg> net_lwm2m_engine.lwm2m_engine_get: path:1/0/0, buf:0x20029e0e, buflen:2
    00> [00:00:37.001,251] <dbg> net_lwm2m_engine.lwm2m_engine_get: path:1/0/1, buf:0x20029e0e, buflen:2
    00> [00:00:37.001,678] <err> net_lwm2m_engine: Err sending response: -22

  • Hi,

    I looked up that nrfxlib version, and it has TLS caching enabled by default.

    Thanks for clarifying the bootstrap at the beginning. I think the Encrypted alert is sent to notify the server that the connection can be closed. The server also responds in packet 21. I'm not sure about the error message in the application log, unfortunately. It seems to happen after the bootstrap procedure is completed, so it could be unrelated to these issues. I think we would need modem traces from the device side to get to the bottom of that one.

    I'm starting to wonder if I might have misunderstood the issue here, as I see session resumption being used in ip_port_changes_90s.pcapng:

    Looking at the next time the NAT routing has timed out (new src port), we get the second Encrypted alert in packet 41. After that, session resumption is attempted in the client hello in packet 42. The server responds that it recognizes the session ID in packet 43, the session is resumed, and both server and client notify each other to change cipher spec.
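
    As a rough illustration of what to look for in the capture: with session-ID based resumption, the client puts a cached session ID in its client hello, and a server that accepts the resumption echoes the same ID in its server hello. The helper below is a minimal sketch of that comparison. The function name and the idea of feeding it bytes copied out of the pcap are assumptions for illustration, not part of any Nordic or Zephyr API.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: given the session IDs seen in the client hello and
     * the server hello, report whether the server accepted the resumption. */
    bool session_was_resumed(const uint8_t *client_sid, size_t client_len,
                             const uint8_t *server_sid, size_t server_len)
    {
        /* An empty client session ID means the client asked for a full handshake. */
        if (client_len == 0 || client_len != server_len) {
            return false;
        }
        /* The server echoes the client's session ID only when it resumes. */
        return memcmp(client_sid, server_sid, client_len) == 0;
    }
    ```

    A differing or freshly generated ID in the server hello would instead indicate that the server fell back to a full handshake.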

    Please correct me if I'm missing the issue here.

    Thanks,

    Jan Tore

  • Indeed, I completely missed that session ID in packet #42.

    What happens in the upper layer (lwm2m) is that the registration update times out (because the server does not respond). The registration update and its retries are in packets 38, 39 and 40.

    After this, lwm2m closes the socket (which causes the Alert, packet 41) and opens a new socket to re-register. Opening this socket causes the Client hello (42), which, as you said, looks like a resume.

    I don't know what to think here. Something obviously triggers the resume, but I hope it is not only closing and opening the socket. I could try to hack lwm2m to do that anyway.

    I also need to continue working to get those modem traces...

  • The session resumption happens because the (D)TLS session caching socket option is set by default in the version of the BSD library that you're using. So if you stick to that, the modem will always try to resume the previous session and let the server decide what to do.

    So this mainly sounds like an LwM2M issue now that it looks like session resumption is working as intended. Let me know if you don't fully agree on this.
    My familiarity with LwM2M is unfortunately very limited, so I'm not much help on that front.
    I know the issue has been reported internally, which will hopefully give results on the LwM2M debugging.

    Best regards,

    Jan Tore

  • Adding a bit more to my previous answer: I didn't properly explain why I think this is an LwM2M (implementation) issue.

    It looks like LwM2M has a mechanism for retrying packet sends 3 times with a doubling back-off time, starting at 10 seconds. After that it closes and reopens the socket, which triggers DTLS resumption. So to properly work around this, there would have to be a way to detect that the NAT rules have been removed. Usually you can assume this after ~60 seconds of UDP inactivity between client and server, but this is highly network dependent.
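
    To get a feel for how long the device waits before the close-and-reopen kicks in, the back-off can be summed up with a small sketch like the one below. It assumes three transmissions in total (matching packets 38, 39 and 40 in the capture) and a timeout that exactly doubles after each one. The function name is made up for illustration, and a real CoAP stack may add a random factor to each timeout.

    ```c
    #include <stdint.h>

    /* Hypothetical estimate: total time spent waiting for ACKs before the
     * client gives up, assuming the timeout doubles after every transmission. */
    uint32_t coap_give_up_after_ms(uint32_t init_timeout_ms, unsigned transmissions)
    {
        uint32_t total = 0;
        uint32_t timeout = init_timeout_ms;

        for (unsigned i = 0; i < transmissions; i++) {
            total += timeout;   /* wait for an ACK to this transmission */
            timeout *= 2;       /* back off before the next attempt */
        }
        return total;
    }
    ```

    Under these assumptions, a 10 s initial timeout and three transmissions give 10 + 20 + 40 = 70 s of radio-on waiting before resumption is even attempted.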

    From the Leshan PCAP file, it looks like the client just assumes that the DTLS session is gone before attempting to send data. This is a viable workaround also for the nRF91 device, though I don't know if it's configurable or if you will have to make some code changes in the LwM2M engine.
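
    A minimal sketch of that "assume the session is gone" approach is shown below: before sending, compare the time since the last UDP traffic with an assumed NAT idle limit (for example 60000 ms) and reconnect first if it has been exceeded. The function name and the limit are assumptions for illustration. This is not existing LwM2M engine code, and the right limit depends on the operator network.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical check: should the client close and reopen the socket
     * (triggering DTLS resumption) before sending the next message? */
    bool should_reconnect_before_send(uint32_t last_activity_ms,
                                      uint32_t now_ms,
                                      uint32_t idle_limit_ms)
    {
        /* Unsigned subtraction stays correct across one wrap of the uptime counter. */
        uint32_t idle_ms = now_ms - last_activity_ms;

        return idle_ms >= idle_limit_ms;
    }
    ```

    The caller would record a timestamp on every packet sent or received and run this check at the start of each registration update or notification.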

    Update: I poked around a bit, and this seems to actually be a CoAP thing. You can adjust CONFIG_COAP_INIT_ACK_TIMEOUT_MS in your prj.conf to reduce the retry time and have resumption kick in faster.

  • Ok,

    I had already started to agree with you that this is an LWM2M thing, and I filed an issue with Zephyr to get some feedback on what they think:

    https://github.com/zephyrproject-rtos/zephyr/issues/25935

    I already have CONFIG_COAP_INIT_ACK_TIMEOUT_MS set to 10 s, because with very small values you get bursts of retransmissions.

    Anyway, I believe it would be better to go to the resumption immediately, and not via some fail -> retry -> fail -> close & reopen scenario.
