We've been chasing a bug that appears seemingly at random: a call into the Zephyr MQTT library sometimes blocks indefinitely. More specifically, the hanging call seems to be zsock_send inside mqtt_transport_tls_write. We are using socket offloading, so I assume the socket implementation in use is the one from the nrf_modem library. The top-level MQTT library function we call, and which leads to that socket call, is mqtt_ping. Our application runs fine for anywhere between 30 minutes and 20 hours, so there are many successful calls before the one that hangs. We use the MQTT library from two different threads, but I assume that's fine since the MQTT library functions use a mutex.
In addition to the MQTT traffic over LTE, we also use the GPS driver to get position fixes. Our application logic is such that we've caught cases where we try to stop the GPS after the MQTT library socket call has hung. In those cases the nrf_setsockopt call inside the GPS driver either fails (it returns -1, which eventually reaches our application as -EIO) or hangs as well. Stopping the GPS seems to fail or hang only if the MQTT socket call has already hung.
Another symptom is that we don't receive MQTT PUBACK messages for some of the packets sent just before the hang. We are listening for them, but for some reason they don't seem to reach the MQTT library. This leads us to believe that the failure is somehow related to the network connection. We've tried manually disconnecting the network during MQTT sending. In that case the MQTT library calls do not hang; they return errors correctly, and our application recovers after reconnecting.
The hardware we're using is a board with an nRF52840 and an nRF9160 revision 1 running the 1.3.0 modem firmware. The software versions are sdk-nrf 1.6.0, sdk-zephyr 2.6.0-rc1-ncs1, nrfxlib 1.6.0 and nrfx 2.5.0.
We would like to find the root cause of this issue, but more importantly we would like to prevent these seemingly indefinite hangs. Is there something we could do to keep the socket calls from hanging?
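One avenue that might bound the wait is a send timeout on the socket. The following is only a minimal sketch, written against plain POSIX sockets so that it can be compiled and tried on a desktop. The helper names set_send_timeout_ms and send_all_bounded are hypothetical and not part of Zephyr or nrf_modem. On Zephyr the equivalent call would be zsock_setsockopt with SO_SNDTIMEO, presumably applied to the MQTT client's TLS socket after connecting; whether the offloaded nrf_modem socket honors that option in the SDK versions listed above is an assumption that would need to be verified.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: limit how long a blocking send() may wait.
 * Returns 0 on success, -1 with errno set on failure. */
int set_send_timeout_ms(int sock, long timeout_ms)
{
    struct timeval tv;

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    return setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: send the whole buffer, or give up with a
 * negative errno value. With SO_SNDTIMEO set, a stalled connection
 * surfaces as -EAGAIN / -EWOULDBLOCK instead of blocking forever. */
int send_all_bounded(int sock, const unsigned char *buf, size_t len)
{
    size_t offset = 0;

    while (offset < len) {
        ssize_t n = send(sock, buf + offset, len - offset, 0);

        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal, retry */
            }
            return -errno; /* timeout or hard error */
        }
        offset += (size_t)n;
    }

    return 0;
}
```

With something like this in place, the application could treat a timeout error the same way it already treats the errors returned after a manual network disconnect, that is, by tearing down and reconnecting.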