LwM2M client: registration SOMETIMES fails; observers not set.

We're developing an LwM2M application based on the lwm2m client sample with the goal of sending data to ThingsBoard as our main telemetry platform.

The application is based on SDK v2.4.2, running modem firmware mfw_nrf9160_1.3.4 on the nRF9160 SICA B1 on a custom board. ThingsBoard recently implemented a fix which allowed us to start testing on ThingsBoard instead of Coiote 1-2 weeks ago, and we now have a lot more logging available to us (ThingsBoard transport log and Wireshark for network traffic).

Registration generally works well; all attributes are exchanged and observers are set and later regularly updated. However, SOMETIMES (in ~15% of registrations) registration fails and the device stops sending responses to the server's requests and fails to set the defined observers.

There is no difference in the client device's logs between successful and unsuccessful attempts, except of course for the observers not being set. Note, however, that outgoing and incoming CoAP packets are currently not being logged on our LwM2M client device (I haven't looked into this yet).

Is this a configuration issue or some compatibility problem with ThingsBoard?

Here is the .config file of the build, followed by screenshots of the server and network logs. (CONFIG_LWM2M_ENGINE_MAX_MESSAGES, MAX_PENDING, MAX_REPLIES and MAX_OBSERVER were only recently increased from 15 to 30, and I've since run into other problems, but I don't think that was the issue.)

Here you can see the Wireshark log of a failed attempt:

And here a successful attempt:

The content of the messages is identical, up to the point where the failed registration stops answering.

1. The client device sends the initial message with object versions etc. to the server.

2. The server answers with the registration ID, confirming registration, and follows up with read requests for all the specified attributes (here 26).

3a. In the failed attempt, 16 requests are answered, then the device stops reacting.

3b. In the successful attempt, 18 requests are answered before the server repeats a few of the unanswered requests. Then, after a back-and-forth in which 3-4 resources at a time are requested and answered, the 4 observe requests are sent, the client confirms, the server sends pmin and pmax, the client sends a 204 response, and the registration is complete.

Server-side logs:

Unsuccessful attempt:

Last logs of a successful attempt:

I hope I have provided enough details and thank you for your help

Best Regards

Alan

  • I can see from the Wireshark capture that the ThingsBoard server is sending almost 40 requests in 3 seconds when the device registers. There seem to be many requests ongoing simultaneously.

    It is unrealistic to expect that all packets will get through, so of course you can expect those EAGAIN errors. The server should be reconfigured to send only one request at a time. This is how AVSystem works, and we have not seen any errors on that side.

    UDP is a lossy protocol and packet drops are expected when the channel is congested. The retry logic is handled at the CoAP level. Please note that it is the server that is sending the requests (CoAP CON) and the client is just sending the Ack. When an Ack packet is lost or dropped, the CON packet is going to be resent. Ack packets are not resent, just dropped.

    So recovering from this situation is not handled by the client. It should be handled by the server side when it does not see the Ack packet.

    I did not see the registration failure in the Wireshark log. How did you come to the conclusion that there was a registration failure if it was not seen in the log?

    In this scenario, registration is the only request (CoAP CONfirmable) that is sent from the client, and if the response is lost, it is going to retry.

    Another one, just a performance thing: the client here does not seem to use DTLS Connection ID, and when it does an LwM2M Update, it seems to tear down the DTLS session. This is not efficient. If you can, update the modem to 1.3.6 (and the SDK as well) to allow DTLS Connection ID. If you cannot, at least enable DTLS session resumption.
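
    As a rough illustration of the CoAP-level retry timing mentioned above, here is a minimal C sketch that uses the default transmission parameters from RFC 7252, section 4.8 (ACK_TIMEOUT = 2 s, ACK_RANDOM_FACTOR = 1.5, MAX_RETRANSMIT = 4). The helper names are made up for this example, and the server in this thread may well be configured with other values, so treat the numbers as the RFC defaults only.

    ```c
    /* Default transmission parameters from RFC 7252, section 4.8. */
    #define ACK_TIMEOUT_MS        2000u
    #define ACK_RANDOM_FACTOR_X10 15u   /* 1.5, kept as an integer ratio */
    #define MAX_RETRANSMIT        4u

    /* Hypothetical helper: shortest wait before retransmission number n
     * (n = 1 is the first retransmission). The initial timeout is at the
     * low end (ACK_TIMEOUT) and doubles after every attempt. */
    unsigned coap_min_wait_ms(unsigned n)
    {
        return ACK_TIMEOUT_MS << (n - 1);
    }

    /* Hypothetical helper: longest wait before retransmission number n.
     * The initial timeout is ACK_TIMEOUT * ACK_RANDOM_FACTOR. */
    unsigned coap_max_wait_ms(unsigned n)
    {
        return (ACK_TIMEOUT_MS * ACK_RANDOM_FACTOR_X10 / 10u) << (n - 1);
    }

    /* Worst-case time from the first transmission of a CON message to its
     * last retransmission (MAX_TRANSMIT_SPAN in the RFC). */
    unsigned coap_max_transmit_span_ms(void)
    {
        unsigned total = 0;
        for (unsigned n = 1; n <= MAX_RETRANSMIT; n++) {
            total += coap_max_wait_ms(n);
        }
        return total;
    }
    ```

    With these defaults, the sender of a CON message waits 2 to 3 seconds before the first retransmission and gives up after four retransmissions, so a lost Ack is recovered by the sender of the CON, not by the side that sent the Ack.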

  • Let me split it:

    One point is the large number of requests in sequence, which causes congestion and retransmissions. Not optimal.

    The other point is that the client (device) then stops ACKing/responding to the retransmissions. It looks like the client is not designed to handle errors other than by restarting the device. But that is obviously even less optimal.

    Just to make sure, we're on the same messages:

    5101 registration device => lwm2m server

    5135 read request lwm2m server => device, the first without direct ACK/response

    5140 read request lwm2m server => device, the first retransmission

    5141 late ACK to 5135

    5142-5144, 5163, 5181-5184, 5201-5203 are all retransmissions, which are fine according to RFC 7252, but the client (device) doesn't comply with RFC 7252.

    The device seems to be stuck on the error code "-11" (EAGAIN); it no longer ACKs or responds to the retransmissions. The client doesn't send anything at all.

    If the only way to overcome that really is a device restart, then the follow-up question of what will happen on other errors, e.g. network loss, becomes more and more important.

    So, what is the device strategy to overcome (temporary) errors?

  • 202405161621_testserver_v2.6.0.pcapng

    Thanks Seppo

    I have made a tracefile, but I haven't had the time to capture a reregistration after the 2h. I hope to get around to that tomorrow.

    Maybe you can find something in this file, else the file tomorrow should be more interesting.

  • So we've managed to observe a device just doing a normal registration instead of a full reregistration after 2 hours, which is great.

    On the other hand, we've just started to get weird registration behaviour on several of our devices that might be linked to CID.

    Here's the trace file:

    202405171546_CID_registration_problems.pcapng

    And the corresponding current measurement:

    Seppo or Achim, are you familiar with what is going on?

    Thanks for your continued support.

  • The capture shows the retransmission of the client's last flight. For PSK that mostly indicates that the pre-shared secret is different on the two sides. Please check that.

    Sometimes it's as simple as one side taking a hex string as text while the other decodes it to binary.

    e.g.

    "abcd" is either

    two bytes 0xab 0xcd or

    four bytes 0x61 0x62 0x63 0x64
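
    To make the difference concrete, here is a small C sketch. `psk_from_hex` and `psk_from_text` are hypothetical helpers written for this example, not part of any SDK or server API; they only show how the same string yields two different keys.

    ```c
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Value of one hex digit, or -1 if c is not a hex digit. */
    static int hex_nibble(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /* Hypothetical helper: decode a hex string into raw key bytes.
     * Returns the number of bytes written, or -1 on malformed input
     * or if out is too small. */
    int psk_from_hex(const char *hex, uint8_t *out, size_t out_size)
    {
        size_t len = strlen(hex);
        if (len % 2 != 0 || len / 2 > out_size) return -1;
        for (size_t i = 0; i < len / 2; i++) {
            int hi = hex_nibble(hex[2 * i]);
            int lo = hex_nibble(hex[2 * i + 1]);
            if (hi < 0 || lo < 0) return -1;
            out[i] = (uint8_t)((hi << 4) | lo);
        }
        return (int)(len / 2);
    }

    /* Hypothetical helper: use the string's characters as the key bytes.
     * Returns the number of bytes written, or -1 if out is too small. */
    int psk_from_text(const char *text, uint8_t *out, size_t out_size)
    {
        size_t len = strlen(text);
        if (len > out_size) return -1;
        memcpy(out, text, len);
        return (int)len;
    }
    ```

    Calling both helpers with "abcd" gives a 2-byte key in one case and a 4-byte key in the other, so the handshake cannot complete if client and server disagree on the interpretation.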

  • Hi Achim

    Thanks for the input.

    I managed to capture the observer update with wireshark and the joulescope.

    Here's the Wireshark capture of the modem trace (observer update from lines 191 to 204):

    202405210923_CID_key_exchange_retries_with_observer_update.pcapng

    And the joulescope output (Observer update in red):

    Update marked in red, periodic modem activity marked in green.

    edit: i think it might be user error, give me a second to investigate. maybe just wrong credentials on platform.

    edit2: yup, my bad. I'll leave it up in case it helps anyone else.

  • I managed to capture the observer update with wireshark and the joulescope.

    I don't think this is a correct capture.

    What I see there in Wireshark is just two failed DTLS handshakes and some HTTPS traffic from 172.217.168.42, to which the device is trying to reply with RST.

    So clearly the device is not registered.

    I don't know where the TLS/HTTPS traffic is coming from. Has there been a FOTA download ongoing that was not closed successfully, or is this some weird form of port scanning?

  • Yeah, my mistake, had a mixup with the firmware, devices and the 2 different servers... as you say, the credentials were not correctly defined on the server.