This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Blocking socket calls in MQTT library

We've been chasing a seemingly randomly appearing bug, where a call to the Zephyr MQTT library seems to sometimes block indefinitely. More specifically the hanging function call seems to be zsock_send inside mqtt_transport_tls_write. We are using socket offloading so I assume the socket implementation in use is from the nrf_modem library. The top level MQTT library function which we call and which leads to that socket call is mqtt_ping. Our application runs just fine for anywhere between 30 minutes to 20 hours so there are many successful calls before the hanging one. We are using the MQTT library from 2 different threads but I assume that's fine since the MQTT library functions use a mutex.

In addition to the MQTT traffic via LTE we are also using the GPS driver to get position fixes. Our application logic is such that we've caught cases where we try to stop the gps after the MQTT library socket call hangs. In that case the function call nrf_setsockopt inside the gps driver also fails (returns -1 and eventually -EIO to our application) or it hangs too. The function to stop the gps seems to fail/hang only if the MQTT socket call has gone hanging before it.

Another symptom is that we don't receive PUBACK MQTT messages for some of the the sent packets just before the hang happens. We are listening for those packets but for some reason they don't reach the MQTT library it seems. This then leads us to believe that the fail case is somehow related to the network connection. We've tried manually disconnecting the network during MQTT sending. In that case the MQTT library calls do not hang and return errors correctly which leads to our application recovering after reconnecting.

The hardware we're using is a board with NRF52840 and NRF9160 revision 1 with the 1.3.0 modem firmware. The software versions are sdk-nrf 1.6.0, sdk-zephyr 2.6.0-rc1-ncs1, nrfxlib 1.6.0 and nrfx 2.5.0.

Right now we would be interested in finding out the root cause of this issue, but more importably we would like to prevent these seemingly indefinite hangs. Is there something we could do in order to make the socket calls not hang?

Parents
  • Hello, Riku!

    Thanks for reaching out. This could seem like the modem is rendered unresponsive. How large is the typical payload size of your messages? Please try to enable the modem library Diagnostic functions as well so that we can see if there are any memory issues.

    Best regards,
    Carl Richard

  • Our payload sizes are always between 100 and 140 bytes. We send about 10 of those consecutively every 1 minute or so. I enabled the modem diagnostics and printed those out every time before the procedures leading up to the blocking call.

    Here is a log snippet that nicely shows the diagnostic prints before a successful sequence of operations and also before the failure case which happens 1 minute later and leads to a 3+ hour hang. Those "entry/exit" prints are something I added in the beginning and end of the functions that contain the suspected blocking socket calls (inside MQTT and GPS libraries).

    nrf_modem heap dump:
    Heap at 0x20016dc8 contains 127 units in 7 buckets
    
      bucket#    min units        total      largest      largest
                 threshold       chunks      (units)      (bytes)
      -----------------------------------------------------------
            1            2            1            3           20
            6           64            1           95          756
    
    776 free bytes, 148 allocated bytes, overhead = 96 bytes (9.4%)
    Failed allocations: 0
    nrf_modem tx dump:
    Heap at 0x200104e8 contains 1023 units in 10 buckets
    
      bucket#    min units        total      largest      largest
                 threshold       chunks      (units)      (bytes)
      -----------------------------------------------------------
            9          512            1         1016         8124
    
    8124 free bytes, 0 allocated bytes, overhead = 64 bytes (0.8%)
    Failed allocations: 0
    +CSCON: 1
    [03:07:12.839,538] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:12.839,660] <inf> net_mqtt_sock_tls: mqtt_client_tls_write entry
    [03:07:12.839,691] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x20010524
    [03:07:12.840,515] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.840,576] <inf> net_mqtt_sock_tls: mqtt_client_tls_write exit
    [03:07:12.945,953] <dbg> cgw_main.data_send_entry: Stopping GPS...
    [03:07:12.945,983] <inf> nrf9160_gps: stop_gps entry
    [03:07:12.946,014] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:12.946,502] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.946,563] <inf> nrf9160_gps: stop_gps exit
    [03:07:12.946,624] <dbg> cgw_gps.gps_handler: GPS_EVT_SEARCH_STOPPED
    [03:07:12.950,958] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1831
    [03:07:12.950,988] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.951,019] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.951,599] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.951,751] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.952,484] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.952,636] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.953,033] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1832
    [03:07:12.953,063] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.953,094] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.953,674] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.953,826] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:12.954,650] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.954,772] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.955,230] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1833
    [03:07:12.955,261] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.955,322] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.955,871] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.956,024] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.957,122] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.957,275] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.957,672] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1834
    [03:07:12.957,702] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.957,733] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.958,282] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.958,435] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:12.959,320] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.959,442] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.959,899] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1835
    [03:07:12.959,930] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.959,991] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.960,937] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.961,090] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.962,036] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.962,188] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.962,615] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1836
    [03:07:12.962,646] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.962,707] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.963,256] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.963,409] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.964,202] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.964,324] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.964,721] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1837
    [03:07:12.964,782] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.964,813] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.965,362] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.965,515] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:12.966,186] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.966,308] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.966,766] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1838
    [03:07:12.966,796] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.966,827] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.967,376] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.967,529] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.968,383] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.156,341] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:13.156,616] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.156,921] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.156,982] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:13.157,409] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1839
    [03:07:13.157,562] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:13.157,623] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:13.158,050] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.189,483] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:13.189,971] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.190,093] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:13.191,040] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.191,131] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:13.191,558] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1840
    [03:07:13.191,619] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:13.191,680] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:13.192,230] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.192,321] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:13.192,993] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.193,054] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:13.193,115] <dbg> cgw_main.data_send_entry: GPS data sent
    [03:07:13.193,115] <dbg> cgw_main.data_send_entry: Starting GPS...
    [03:07:13.193,206] <inf> nrf_modem_lib: shm_tx_alloc(15) -> 0x20010524
    [03:07:13.193,298] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.194,213] <inf> nrf_modem_lib: shm_tx_alloc(8) -> 0x20010524
    [03:07:13.194,458] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.195,220] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:13.196,533] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.196,685] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:13.198,120] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.198,242] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:13.199,615] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.199,737] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:13.200,256] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.391,998] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.392,150] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1831
    [03:07:13.392,333] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.392,547] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1832
    [03:07:13.392,669] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.392,822] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1833
    [03:07:13.392,974] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.393,218] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1834
    [03:07:13.393,310] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.393,493] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1835
    [03:07:13.573,394] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.573,638] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1836
    [03:07:13.573,730] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.573,913] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1837
    [03:07:13.574,249] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.574,401] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1838
    [03:07:14.270,538] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:14.270,721] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1839
    [03:07:14.270,935] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:14.271,087] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1840
    +CSCON: 0
    [03:07:34.448,791] <dbg> cgw_gps.gps_handler: GPS_EVT_OPERATION_UNBLOCKED
    [03:07:38.605,621] <dbg> cgw_gps.gps_handler: GPS_EVT_OPERATION_BLOCKED
    [03:07:48.605,743] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:48.606,292] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:48.606,384] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:48.606,597] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:53.333,496] <dbg> cgw_gps.gps_handler: GPS_EVT_SEARCH_TIMEOUT
    nrf_modem heap dump:
    Heap at 0x20016dc8 contains 127 units in 7 buckets
    
      bucket#    min units        total      largest      largest
                 threshold       chunks      (units)      (bytes)
      -----------------------------------------------------------
            1            2            1            3           20
            6           64            1           95          756
    
    776 free bytes, 148 allocated bytes, overhead = 96 bytes (9.4%)
    Failed allocations: 0
    nrf_modem tx dump:
    Heap at 0x200104e8 contains 1023 units in 10 buckets
    
      bucket#    min units        total      largest      largest
                 threshold       chunks      (units)      (bytes)
      -----------------------------------------------------------
            9          512            1         1015         8116
    
    8116 free bytes, 4 allocated bytes, overhead = 68 bytes (0.8%)
    Failed allocations: 0
    [03:08:13.193,786] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:08:13.193,908] <inf> net_mqtt_sock_tls: mqtt_client_tls_write entry
    [03:08:13.193,969] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x20010524
    [03:08:13.300,964] <dbg> cgw_main.data_send_entry: Stopping GPS...
    [03:08:13.300,994] <inf> nrf9160_gps: stop_gps entry
    [03:08:13.301,025] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x2001052c
    <end of log prints until I did a manual hard reset 3 hours later>

    We have 2 of our own threads running here. A main thread that polls the MQTT socket and calls mqtt_live/mqtt_input and a separate thread that stops/starts the GPS and calls mqtt_publish to send data periodically. These threads have no synchronization between them. The timestamps show the run time which was about 3 hours at the time of failure.

  • Thank you, I'll give those config options a try. I suppose the accurate timings of the failures could be useful to you so I'll put them here. These logs are the last thing printed in the application log before the freeze occurs. I think I started the modem trace roughly at the same time so the timing should match pretty well. The timestamp shows the run time after each boot. These should be all the failures from the tracing session.

    [00:55:31.000,885] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x2001053c

    [00:08:26.968,750] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:15:38.345,062] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:20:05.382,843] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:02:01.370,574] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x2001053c

    [00:12:31.493,865] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:50:16.728,607] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:05:32.337,921] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

  • Thanks for the additional logs! The developers suggest trying to only call the MQTT library from a single thread rather than two separate. A mutex won't guarantee perfect synchronization unfortunately, which again may cause a deadlock. You could for example use work items or something similar to do the MQTT operations.

    Best regards,
    Carl Richard

  • Hello again!

    Did my suggestions from earlier work for you? 

    Best regards,
    Carl Richard

  • Hey!

    Sorry for not posting any updates but I've set aside this project for now in order to work on other stuff. I did implement/apply some of the fixes you suggested such as using the MQTT library from only one thread and using those config options to increase the sendmsg_buf/heap. I also made the MQTT control much more "sensitive" to failures in that I immediately stop trying to send if anything fails. I also switched to using QoS 0 for the sends instead of QoS 1.

    Now some combination of those fixes did improve the run time greatly and it seems we get 100+ hours of run time before something breaks and a reset happens. I haven't looked at the logs from those resets so it might be that they are not even related to this. I'll continue on this issue when I can allocate some time for it and maybe discover some more answers. I'll post any updates here.

    Btw, could you elaborate why mutexes don't guarantee synchronization in the use case I described before? That is, two threads of equal (preemptible) priority using the MQTT library functions which all seem to use the same mutex.

  • Hi, I willl handle your case as Carl Richard is currently out of office. 

    Riku Karttunen said:
    I'll continue on this issue when I can allocate some time for it and maybe discover some more answers. I'll post any updates here.

    Yes, please keep us updated if you find anything.

     

    Riku Karttunen said:
    Btw, could you elaborate why mutexes don't guarantee synchronization in the use case I described before? That is, two threads of equal (preemptible) priority using the MQTT library functions which all seem to use the same mutex.

     The previous comment was based on an observation and the available information according to the developer. Further he responds: Not all system functions in Zephyr are thread safe so it was one thing worth looking into. I don't know if the customer is using their own mutexes for the call, but from what they say it seems they are. In that case then multi-thread usage should be fine but this information wasn't available one month ago.

    Kind regards,
    Øyvind

Reply
  • Hi, I willl handle your case as Carl Richard is currently out of office. 

    Riku Karttunen said:
    I'll continue on this issue when I can allocate some time for it and maybe discover some more answers. I'll post any updates here.

    Yes, please keep us updated if you find anything.

     

    Riku Karttunen said:
    Btw, could you elaborate why mutexes don't guarantee synchronization in the use case I described before? That is, two threads of equal (preemptible) priority using the MQTT library functions which all seem to use the same mutex.

     The previous comment was based on an observation and the available information according to the developer. Further he responds: Not all system functions in Zephyr are thread safe so it was one thing worth looking into. I don't know if the customer is using their own mutexes for the call, but from what they say it seems they are. In that case then multi-thread usage should be fine but this information wasn't available one month ago.

    Kind regards,
    Øyvind

Children
No Data
Related