LwM2M client: registration SOMETIMES fails; observers not set.

We're developing an LwM2M application based on the lwm2m client sample with the goal of sending data to ThingsBoard as our main telemetry platform.

The application is based on the v2.4.2 SDK, running modem firmware mfw_nrf9160_1.3.4 on the nRF9160 SICA B1 on a custom board. ThingsBoard recently implemented a fix that allowed us to start testing on ThingsBoard instead of Coiote 1-2 weeks ago, and we now have a lot more logging available to us (ThingsBoard transport log and Wireshark for network traffic).

Registration generally works well; all attributes are exchanged and observers are set and later regularly updated. However, SOMETIMES (in ~15% of registrations) registration fails and the device stops sending responses to the server's requests and fails to set the defined observers.

There is no difference in the client device's logs between successful and unsuccessful attempts, except of course for the observers not being set. Note, however, that outgoing and incoming CoAP packets are currently not being logged on our LwM2M client device (I haven't looked into this yet).

Is this a configuration issue or some compatibility problem with ThingsBoard?

Here is the .config file of the build, followed by screenshots of the server and network logs. (CONFIG_LWM2M_ENGINE_MAX_MESSAGES, MAX_PENDING, MAX_REPLIES and MAX_OBSERVER were only recently increased from 15 to 30, and I've since run into other problems, but I don't think that was the issue?)

Here you can see the Wireshark log of a failed attempt:

And here a successful attempt:

The content of the messages is identical, up to the point where the failed registration stops answering.

1. The client device sends the initial message with object versions etc. to the server.

2. The server answers with the registration ID, confirming registration, and follows up with read requests for all the specified attributes (here 26).

3a. In the failed attempt, 16 requests are answered, then the device stops reacting.

3b. In the successful attempt, 18 requests are answered before the server repeats a few of the unanswered requests. Then, after a back and forth of 3-4 resources being requested and answered, the 4 observe requests are sent and the client confirms. The server sends pmin and pmax, the client sends a 2.04 response, and the registration is complete.

Server-side logs:

Unsuccessful attempt:

Last logs of a successful attempt:

I hope I have provided enough details, and thank you for your help.

Best Regards

Alan

  • I can see from the Wireshark capture that the ThingsBoard server is sending almost 40 requests in 3 seconds when the device registers. There seem to be many requests ongoing simultaneously.

    It is unrealistic to expect all packets to get through, so of course you can expect those EAGAIN errors. The server should be reconfigured to send only one request at a time. This is how AVSystem works, and we have not seen any errors on that side.

    UDP is a lossy protocol, and packet drops are expected when the channel is congested. The retry logic is handled at the CoAP level. Please note that it is the server that sends the requests (CoAP CON) and the client just sends the ACK. When an ACK packet is lost or dropped, the CON packet is resent. ACK packets are not resent, just dropped.

    So recovering from this situation is not handled by the client. It should be handled on the server side when it does not see the ACK packet.

    I did not see a registration failure in the Wireshark log. How did you come to the conclusion that there was a registration failure if it was not seen in the log?

    In this scenario, registration is the only request (CoAP confirmable) that is sent from the client, and if the response is lost, the client is going to retry.

    Another one, just a performance thing: the client here does not seem to use DTLS Connection ID, and when it does an LwM2M Update, it seems to tear down the DTLS session. This is not efficient. If you can, update the modem to 1.3.6 (and the SDK as well) to allow DTLS Connection ID. If you cannot, at least enable DTLS session resumption.
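
To illustrate the CoAP-level retry logic mentioned in this reply, here is a minimal sketch of the retransmission schedule for a confirmable (CON) message, assuming the default transmission parameters from RFC 7252 (ACK_TIMEOUT = 2 s, ACK_RANDOM_FACTOR = 1.5, MAX_RETRANSMIT = 4). The function names are hypothetical, and the ThingsBoard server may well be configured with different values, so treat the numbers as an upper-bound estimate rather than a description of what the capture shows.

```c
#include <assert.h>

/* RFC 7252 default transmission parameters (assumed, not taken from the thread). */
#define ACK_TIMEOUT_MS    2000
#define ACK_RANDOM_FACTOR 1.5
#define MAX_RETRANSMIT    4

/* Longest wait after transmission number `attempt` (0 = first send).
 * The initial timeout lies between ACK_TIMEOUT and
 * ACK_TIMEOUT * ACK_RANDOM_FACTOR and is doubled on every retransmission,
 * so the upper bound is the randomised maximum shifted left by `attempt`. */
static unsigned int coap_max_wait_ms(unsigned int attempt)
{
    assert(attempt <= MAX_RETRANSMIT);
    return (unsigned int)(ACK_TIMEOUT_MS * ACK_RANDOM_FACTOR) << attempt;
}

/* Worst-case time between the first transmission and the last
 * retransmission (MAX_TRANSMIT_SPAN in RFC 7252 terms). */
static unsigned int coap_max_transmit_span_ms(void)
{
    unsigned int total = 0;

    for (unsigned int i = 0; i < MAX_RETRANSMIT; i++) {
        total += coap_max_wait_ms(i);
    }
    return total;
}
```

Under these assumed defaults, the sender of an unacknowledged CON waits at most 3 s, 6 s, 12 s and 24 s between transmissions, which adds up to 45 s before the last retransmission goes out. Since the server is the sender of the read and observe requests here, it is the server that runs this schedule.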

  • Thanks for taking the time, Seppo

    It is unrealistic to expect all packets to get through, so of course you can expect those EAGAIN errors. The server should be reconfigured to send only one request at a time. This is how AVSystem works, and we have not seen any errors on that side.

    We have learned a lot since making this post, and we have realised that the server does indeed cause most of the problems, since it's missing 2-3 years' worth of updates. We are hoping that things will improve with the next update, which should bring the server up to date.

    For now, we sadly have to do without some of the central LwM2M functionality (observers, etc.). We now let the device register without setting any observers and send a ReadComposite request on every registration update, which happens at a fixed interval (lifetime - update_early). We can live with this solution for now, although it is a shame to miss out on observers and more flexible data transmission methods.

    I did not see a registration failure in the Wireshark log. How did you come to the conclusion that there was a registration failure if it was not seen in the log?

    Yea, sorry, I chose my words poorly. What I meant by a failed registration was that the client stopped answering at some point and the observers could not be set, causing it to go MIA until the Lifetime - Time_to_update_early timer ran out.

    This happened twice at the end of the Wireshark file (around registration number 12 or so), which - as you can see in the latest comments by Achim and myself - fails to load in its entirety most of the time.

    Another one, just a performance thing: the client here does not seem to use DTLS Connection ID, and when it does an LwM2M Update, it seems to tear down the DTLS session. This is not efficient. If you can, update the modem to 1.3.6 (and the SDK as well) to allow DTLS Connection ID. If you cannot, at least enable DTLS session resumption.

    I am aware of CID and that we need to update our application to a compatible toolchain for modem version 1.3.6. We will do that as soon as we can start optimising the application.
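
The fixed interval described in this reply (lifetime - update_early) can be sketched as a small helper. This is only an illustration: the function and parameter names are hypothetical, and the fallback to half the lifetime when the margin is not smaller than the lifetime is an assumption made here to avoid unsigned underflow, not something stated in the thread.

```c
#include <stdint.h>

/* Seconds to wait before sending the next registration update.
 * `lifetime_s` is the registered lifetime and `update_early_s` the safety
 * margin before expiry (both hypothetical names).
 * Assumption: if the margin is not smaller than the lifetime, fall back to
 * half the lifetime so the subtraction cannot underflow. */
static uint32_t next_update_delay_s(uint32_t lifetime_s, uint32_t update_early_s)
{
    if (lifetime_s > update_early_s) {
        return lifetime_s - update_early_s;
    }
    return lifetime_s / 2U;
}
```

For example, with a lifetime of 300 s and a margin of 30 s, the helper returns 270 s, so in the failure case described above the device would stay silent for up to that long before the next update.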

  • Yea, sorry, I chose my words poorly. What I meant by a failed registration was that the client stopped answering at some point and the observers could not be set, causing it to go MIA until the Lifetime - Time_to_update_early timer ran out.

    This happened twice at the end of the Wireshark file (around registration number 12 or so), which - as you can see in the latest comments by Achim and myself - fails to load in its entirety most of the time.

    You might be seeing an issue that is fixed in https://github.com/zephyrproject-rtos/zephyr/pull/66619

    With the Zephyr LwM2M engine, I have spent significant time running various tests against Leshan and AVSystem on Ethernet, LTE-M and NB-IoT. I have not seen any problems with CoAP transmissions, but I have seen a few problems with queue mode.

    Most significant is of course the DTLS Connection ID. That fixes the majority of issues.
    But then the queue mode RX OFF might sometimes be triggered while there is still traffic ongoing. That is fixed in the PR that I pointed to earlier.

    If you can, update your SDK base to get the latest fixes for the Zephyr LwM2M engine. Some work has been done on queue mode, DTLS and the configuration of update frequencies.

    As a separate thing: you mention that you use Composite-Read on every update?
    Why don't you have the logic on the application side? LwM2M SEND can be used if your application already knows what data you want to send to the server. Then, if needed, it will trigger an update. In your case, it may allow flexibility in the update frequency, and it also saves one READ command.
