This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Blocking socket calls in MQTT library

We've been chasing a seemingly randomly appearing bug, where a call to the Zephyr MQTT library seems to sometimes block indefinitely. More specifically the hanging function call seems to be zsock_send inside mqtt_transport_tls_write. We are using socket offloading so I assume the socket implementation in use is from the nrf_modem library. The top level MQTT library function which we call and which leads to that socket call is mqtt_ping. Our application runs just fine for anywhere between 30 minutes to 20 hours so there are many successful calls before the hanging one. We are using the MQTT library from 2 different threads but I assume that's fine since the MQTT library functions use a mutex.

In addition to the MQTT traffic via LTE we are also using the GPS driver to get position fixes. Our application logic is such that we've caught cases where we try to stop the gps after the MQTT library socket call hangs. In that case the function call nrf_setsockopt inside the gps driver also fails (returns -1 and eventually -EIO to our application) or it hangs too. The function to stop the gps seems to fail/hang only if the MQTT socket call has gone hanging before it.

Another symptom is that we don't receive PUBACK MQTT messages for some of the the sent packets just before the hang happens. We are listening for those packets but for some reason they don't reach the MQTT library it seems. This then leads us to believe that the fail case is somehow related to the network connection. We've tried manually disconnecting the network during MQTT sending. In that case the MQTT library calls do not hang and return errors correctly which leads to our application recovering after reconnecting.

The hardware we're using is a board with NRF52840 and NRF9160 revision 1 with the 1.3.0 modem firmware. The software versions are sdk-nrf 1.6.0, sdk-zephyr 2.6.0-rc1-ncs1, nrfxlib 1.6.0 and nrfx 2.5.0.

Right now we would be interested in finding out the root cause of this issue, but more importably we would like to prevent these seemingly indefinite hangs. Is there something we could do in order to make the socket calls not hang?

  • I managed transfer our general logging to use RTT and modify the modem trace code to use one of our available UARTs. I managed to capture a modem trace that contains at least 5 occasions of the bug, one being right at the end of the tracing session. The run times before the bug happens were between 1 hour and 10 minutes for this tracing session.

    trace-2021-08-10T17-27-40.389Z.bin

    Also a question related to the modem traces; does the proprietary format allow for other UART prints (LOG_*/printk plaintext prints) to appear mixed in with it? The trace I've collected doesn't contain any but I suppose it could be useful to have some mixed in to help with troubleshooting in future cases.

    The modem library memory diagnostic prints with the patch applied looked pretty much the same as without it. I have not yet tried increasing the heap for the modem library but I'll give it a go, thanks!

  • Hi again!

    Sounds great. Thanks for the trace! I've sent it in for analysis. 

    The modem traces does not currently allow for LOG/printk statements, unfortunately. I agree that it would be a good addition, but as for now we must stick to matching logs with traces based on timing.  

    A colleague suggested the following values for modem library configuration:
     

    CONFIG_NRF_MODEM_LIB_HEAP_SIZE=4096
    CONFIG_NRF_MODEM_LIB_SENDMSG_BUF_SIZE=2048


    Please report back whether they improve the situation or not!

    Best regards,
    Carl Richard

  • Thank you, I'll give those config options a try. I suppose the accurate timings of the failures could be useful to you so I'll put them here. These logs are the last thing printed in the application log before the freeze occurs. I think I started the modem trace roughly at the same time so the timing should match pretty well. The timestamp shows the run time after each boot. These should be all the failures from the tracing session.

    [00:55:31.000,885] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x2001053c

    [00:08:26.968,750] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:15:38.345,062] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:20:05.382,843] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:02:01.370,574] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x2001053c

    [00:12:31.493,865] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:50:16.728,607] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:05:32.337,921] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

  • Thanks for the additional logs! The developers suggest trying to only call the MQTT library from a single thread rather than two separate. A mutex won't guarantee perfect synchronization unfortunately, which again may cause a deadlock. You could for example use work items or something similar to do the MQTT operations.

    Best regards,
    Carl Richard

  • Hello again!

    Did my suggestions from earlier work for you? 

    Best regards,
    Carl Richard

Related