Blocking socket calls in MQTT library

We've been chasing a bug that appears at seemingly random times: a call into the Zephyr MQTT library sometimes blocks indefinitely. More specifically, the call that hangs appears to be zsock_send inside mqtt_transport_tls_write. We are using socket offloading, so I assume the socket implementation in use is the one from the nrf_modem library. The top-level MQTT library function that we call, and which leads to that socket call, is mqtt_ping. Our application runs just fine for anywhere between 30 minutes and 20 hours, so there are many successful calls before the one that hangs. We are using the MQTT library from two different threads, but I assume that's fine since the MQTT library functions use a mutex.

In addition to the MQTT traffic over LTE, we also use the GPS driver to get position fixes. Because of how our application logic works, we've caught cases where we try to stop the GPS after the MQTT library socket call has hung. In those cases the nrf_setsockopt call inside the GPS driver either fails (it returns -1, which eventually reaches our application as -EIO) or hangs as well. The function that stops the GPS seems to fail or hang only if the MQTT socket call has already hung before it.

Another symptom is that we don't receive PUBACK MQTT messages for some of the packets sent just before the hang happens. We are listening for those packets, but for some reason they don't seem to reach the MQTT library. This leads us to believe that the failure is somehow related to the network connection. We've tried manually disconnecting the network during MQTT sending. In that case the MQTT library calls do not hang and return errors correctly, and our application recovers after reconnecting.

The hardware we're using is a board with NRF52840 and NRF9160 revision 1 with the 1.3.0 modem firmware. The software versions are sdk-nrf 1.6.0, sdk-zephyr 2.6.0-rc1-ncs1, nrfxlib 1.6.0 and nrfx 2.5.0.

Right now we would be interested in finding out the root cause of this issue, but more importantly we would like to prevent these seemingly indefinite hangs. Is there something we could do to make the socket calls not hang?
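One general way to keep a blocking send from waiting forever is a send timeout on the socket. The sketch below uses plain POSIX sockets and a hypothetical helper name, set_send_timeout; it is not taken from the MQTT library or from this thread. With Zephyr the analogous call would be zsock_setsockopt with SO_SNDTIMEO on the socket owned by the MQTT client, and whether the offloaded nrf_modem TLS socket honors that option in this SDK version is an assumption that would need to be verified.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper (not part of any library): limit how long a blocking
 * send() on `fd` may wait before it gives up and returns an error.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
static int set_send_timeout(int fd, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```

With a timeout in place, a send that cannot make progress is expected to return an error instead of blocking, so the application can treat it like any other failed send and reconnect.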

  • Hey!

    Sorry for not posting any updates but I've set aside this project for now in order to work on other stuff. I did implement/apply some of the fixes you suggested such as using the MQTT library from only one thread and using those config options to increase the sendmsg_buf/heap. I also made the MQTT control much more "sensitive" to failures in that I immediately stop trying to send if anything fails. I also switched to using QoS 0 for the sends instead of QoS 1.

    Now some combination of those fixes did improve the run time greatly and it seems we get 100+ hours of run time before something breaks and a reset happens. I haven't looked at the logs from those resets so it might be that they are not even related to this. I'll continue on this issue when I can allocate some time for it and maybe discover some more answers. I'll post any updates here.

    Btw, could you elaborate why mutexes don't guarantee synchronization in the use case I described before? That is, two threads of equal (preemptible) priority using the MQTT library functions which all seem to use the same mutex.

  • Hi, I will handle your case as Carl Richard is currently out of office.

    Riku Karttunen said:
    I'll continue on this issue when I can allocate some time for it and maybe discover some more answers. I'll post any updates here.

    Yes, please keep us updated if you find anything.

    Riku Karttunen said:
    Btw, could you elaborate why mutexes don't guarantee synchronization in the use case I described before? That is, two threads of equal (preemptible) priority using the MQTT library functions which all seem to use the same mutex.

    The previous comment was based on an observation and on the information available at the time, according to the developer. He further responds: Not all system functions in Zephyr are thread safe, so it was one thing worth looking into. I don't know if the customer is using their own mutexes for the call, but from what they say it seems they are. In that case multi-thread usage should be fine, but this information wasn't available one month ago.
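For illustration, application-level locking around library calls could look like the sketch below. It uses POSIX threads so that it is self-contained; the names mqtt_lock, mqtt_op, mqtt_call_locked and count_op are hypothetical and do not come from the MQTT library. In a Zephyr application the same idea would use a k_mutex with k_mutex_lock and k_mutex_unlock.

```c
#include <pthread.h>

/* One application-wide lock guarding every call into the shared client. */
static pthread_mutex_t mqtt_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for a library call such as a ping or a publish. */
typedef int (*mqtt_op)(void *arg);

/* Run `op` while holding the lock, so that two threads can never be
 * inside the library at the same time. Returns whatever `op` returns. */
static int mqtt_call_locked(mqtt_op op, void *arg)
{
    pthread_mutex_lock(&mqtt_lock);
    int rc = op(arg);
    pthread_mutex_unlock(&mqtt_lock);
    return rc;
}

/* Example operation used only to demonstrate the wrapper:
 * increments the integer that `arg` points to. */
static int count_op(void *arg)
{
    int *n = arg;

    (*n)++;
    return 0;
}
```

Note that holding a lock across a call that can block indefinitely also stalls every other thread waiting on that lock, which is one reason to combine this with bounded waits.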

    Kind regards,
    Øyvind
