This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Blocking socket calls in MQTT library

We've been chasing a seemingly randomly appearing bug, where a call to the Zephyr MQTT library seems to sometimes block indefinitely. More specifically the hanging function call seems to be zsock_send inside mqtt_transport_tls_write. We are using socket offloading so I assume the socket implementation in use is from the nrf_modem library. The top level MQTT library function which we call and which leads to that socket call is mqtt_ping. Our application runs just fine for anywhere between 30 minutes to 20 hours so there are many successful calls before the hanging one. We are using the MQTT library from 2 different threads but I assume that's fine since the MQTT library functions use a mutex.

In addition to the MQTT traffic via LTE we are also using the GPS driver to get position fixes. Our application logic is such that we've caught cases where we try to stop the gps after the MQTT library socket call hangs. In that case the function call nrf_setsockopt inside the gps driver also fails (returns -1 and eventually -EIO to our application) or it hangs too. The function to stop the gps seems to fail/hang only if the MQTT socket call has gone hanging before it.

Another symptom is that we don't receive PUBACK MQTT messages for some of the the sent packets just before the hang happens. We are listening for those packets but for some reason they don't reach the MQTT library it seems. This then leads us to believe that the fail case is somehow related to the network connection. We've tried manually disconnecting the network during MQTT sending. In that case the MQTT library calls do not hang and return errors correctly which leads to our application recovering after reconnecting.

The hardware we're using is a board with NRF52840 and NRF9160 revision 1 with the 1.3.0 modem firmware. The software versions are sdk-nrf 1.6.0, sdk-zephyr 2.6.0-rc1-ncs1, nrfxlib 1.6.0 and nrfx 2.5.0.

Right now we would be interested in finding out the root cause of this issue, but more importably we would like to prevent these seemingly indefinite hangs. Is there something we could do in order to make the socket calls not hang?

Parents
  • Hello, Riku!

    Thanks for reaching out. This could seem like the modem is rendered unresponsive. How large is the typical payload size of your messages? Please try to enable the modem library Diagnostic functions as well so that we can see if there are any memory issues.

    Best regards,
    Carl Richard

  • Our payload sizes are always between 100 and 140 bytes. We send about 10 of those consecutively every 1 minute or so. I enabled the modem diagnostics and printed those out every time before the procedures leading up to the blocking call.

    Here is a log snippet that nicely shows the diagnostic prints before a successful sequence of operations and also before the failure case which happens 1 minute later and leads to a 3+ hour hang. Those "entry/exit" prints are something I added in the beginning and end of the functions that contain the suspected blocking socket calls (inside MQTT and GPS libraries).

    nrf_modem heap dump:
    Heap at 0x20016dc8 contains 127 units in 7 buckets
    
      bucket#    min units        total      largest      largest
                 threshold       chunks      (units)      (bytes)
      -----------------------------------------------------------
            1            2            1            3           20
            6           64            1           95          756
    
    776 free bytes, 148 allocated bytes, overhead = 96 bytes (9.4%)
    Failed allocations: 0
    nrf_modem tx dump:
    Heap at 0x200104e8 contains 1023 units in 10 buckets
    
      bucket#    min units        total      largest      largest
                 threshold       chunks      (units)      (bytes)
      -----------------------------------------------------------
            9          512            1         1016         8124
    
    8124 free bytes, 0 allocated bytes, overhead = 64 bytes (0.8%)
    Failed allocations: 0
    +CSCON: 1
    [03:07:12.839,538] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:12.839,660] <inf> net_mqtt_sock_tls: mqtt_client_tls_write entry
    [03:07:12.839,691] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x20010524
    [03:07:12.840,515] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.840,576] <inf> net_mqtt_sock_tls: mqtt_client_tls_write exit
    [03:07:12.945,953] <dbg> cgw_main.data_send_entry: Stopping GPS...
    [03:07:12.945,983] <inf> nrf9160_gps: stop_gps entry
    [03:07:12.946,014] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:12.946,502] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.946,563] <inf> nrf9160_gps: stop_gps exit
    [03:07:12.946,624] <dbg> cgw_gps.gps_handler: GPS_EVT_SEARCH_STOPPED
    [03:07:12.950,958] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1831
    [03:07:12.950,988] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.951,019] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.951,599] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.951,751] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.952,484] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.952,636] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.953,033] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1832
    [03:07:12.953,063] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.953,094] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.953,674] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.953,826] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:12.954,650] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.954,772] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.955,230] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1833
    [03:07:12.955,261] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.955,322] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.955,871] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.956,024] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.957,122] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.957,275] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.957,672] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1834
    [03:07:12.957,702] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.957,733] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.958,282] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.958,435] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:12.959,320] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.959,442] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.959,899] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1835
    [03:07:12.959,930] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.959,991] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.960,937] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.961,090] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.962,036] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.962,188] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.962,615] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1836
    [03:07:12.962,646] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.962,707] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.963,256] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.963,409] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.964,202] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.964,324] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.964,721] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1837
    [03:07:12.964,782] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.964,813] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.965,362] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.965,515] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:12.966,186] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.966,308] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:12.966,766] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1838
    [03:07:12.966,796] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:12.966,827] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:12.967,376] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:12.967,529] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:12.968,383] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.156,341] <inf> nrf_modem_lib: shm_tx_alloc(129) -> 0x20010524
    [03:07:13.156,616] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.156,921] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.156,982] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:13.157,409] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1839
    [03:07:13.157,562] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:13.157,623] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:13.158,050] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.189,483] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:13.189,971] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.190,093] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:13.191,040] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.191,131] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:13.191,558] <dbg> cgw_mqtt.data_publish: Calling mqtt_publish for message id 1840
    [03:07:13.191,619] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg entry
    [03:07:13.191,680] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524
    [03:07:13.192,230] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.192,321] <inf> nrf_modem_lib: shm_tx_alloc(106) -> 0x20010524
    [03:07:13.192,993] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.193,054] <inf> net_mqtt_sock_tls: mqtt_client_tls_write_msg exit
    [03:07:13.193,115] <dbg> cgw_main.data_send_entry: GPS data sent
    [03:07:13.193,115] <dbg> cgw_main.data_send_entry: Starting GPS...
    [03:07:13.193,206] <inf> nrf_modem_lib: shm_tx_alloc(15) -> 0x20010524
    [03:07:13.193,298] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.194,213] <inf> nrf_modem_lib: shm_tx_alloc(8) -> 0x20010524
    [03:07:13.194,458] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.195,220] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:13.196,533] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.196,685] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:13.198,120] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.198,242] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:13.199,615] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.199,737] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:13.200,256] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:13.391,998] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.392,150] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1831
    [03:07:13.392,333] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.392,547] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1832
    [03:07:13.392,669] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.392,822] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1833
    [03:07:13.392,974] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.393,218] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1834
    [03:07:13.393,310] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.393,493] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1835
    [03:07:13.573,394] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.573,638] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1836
    [03:07:13.573,730] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.573,913] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1837
    [03:07:13.574,249] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:13.574,401] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1838
    [03:07:14.270,538] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:14.270,721] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1839
    [03:07:14.270,935] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:07:14.271,087] <dbg> cgw_main.mqtt_evt_handler: PUBACK packet id: 1840
    +CSCON: 0
    [03:07:34.448,791] <dbg> cgw_gps.gps_handler: GPS_EVT_OPERATION_UNBLOCKED
    [03:07:38.605,621] <dbg> cgw_gps.gps_handler: GPS_EVT_OPERATION_BLOCKED
    [03:07:48.605,743] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:48.606,292] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:48.606,384] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x20010524
    [03:07:48.606,597] <inf> nrf_modem_lib: shm_tx_free(0x20010524)
    [03:07:53.333,496] <dbg> cgw_gps.gps_handler: GPS_EVT_SEARCH_TIMEOUT
    nrf_modem heap dump:
    Heap at 0x20016dc8 contains 127 units in 7 buckets
    
      bucket#    min units        total      largest      largest
                 threshold       chunks      (units)      (bytes)
      -----------------------------------------------------------
            1            2            1            3           20
            6           64            1           95          756
    
    776 free bytes, 148 allocated bytes, overhead = 96 bytes (9.4%)
    Failed allocations: 0
    nrf_modem tx dump:
    Heap at 0x200104e8 contains 1023 units in 10 buckets
    
      bucket#    min units        total      largest      largest
                 threshold       chunks      (units)      (bytes)
      -----------------------------------------------------------
            9          512            1         1015         8116
    
    8116 free bytes, 4 allocated bytes, overhead = 68 bytes (0.8%)
    Failed allocations: 0
    [03:08:13.193,786] <dbg> cgw_main.main: Calling mqtt_input/mqtt_live...
    [03:08:13.193,908] <inf> net_mqtt_sock_tls: mqtt_client_tls_write entry
    [03:08:13.193,969] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x20010524
    [03:08:13.300,964] <dbg> cgw_main.data_send_entry: Stopping GPS...
    [03:08:13.300,994] <inf> nrf9160_gps: stop_gps entry
    [03:08:13.301,025] <inf> nrf_modem_lib: shm_tx_alloc(20) -> 0x2001052c
    <end of log prints until I did a manual hard reset 3 hours later>

    We have 2 of our own threads running here. A main thread that polls the MQTT socket and calls mqtt_live/mqtt_input and a separate thread that stops/starts the GPS and calls mqtt_publish to send data periodically. These threads have no synchronization between them. The timestamps show the run time which was about 3 hours at the time of failure.

  • Hello again!

    Apologies for the delayed answer here. A bug leading to a behavior similar to the one you describe was recently discovered in the modem library. Could you try to apply the following pull request (remember west update afterwards) and see if the issue persists?

    If that doesn't help please try to do a modem trace for me to investigate.

    Best regards,
    Carl Richard

  • I applied the pull request you gave me and updated everything. The problem didn't seem to go away but may actually have gotten worse Smiley in that it's appearing more frequently now. My run times are now between 5-20 minutes before the problem appears.

    I'm now looking into collecting the modem trace. We have a slight complication in that we are using a custom board which has peripherals enabled in such a way that uart1 has to be disabled. It seems that the modem tracing uses exactly that uart for the traces by default. Is there a recommended way to collect the traces in this situation? We are using uart0 for general logging and uart2 for the interface to nrf52840.

Reply
  • I applied the pull request you gave me and updated everything. The problem didn't seem to go away but may actually have gotten worse Smiley in that it's appearing more frequently now. My run times are now between 5-20 minutes before the problem appears.

    I'm now looking into collecting the modem trace. We have a slight complication in that we are using a custom board which has peripherals enabled in such a way that uart1 has to be disabled. It seems that the modem tracing uses exactly that uart for the traces by default. Is there a recommended way to collect the traces in this situation? We are using uart0 for general logging and uart2 for the interface to nrf52840.

Children
  • Hi again, Riku!

    That was unfortunate. Sounds like there is some kind of memory leak happening here. Does anything improve if you increase CONFIG_NRF_MODEM_LIB_HEAP_SIZE to, say, 4096?

    Given your limitations I see two alternatives for enabling modem trace. Both will require logging to be disabled while the trace is captured. 

    - Move any peripheral that requires UART1 to be disabled to use the <0> instance instead. Then move logging to UART1 and disable it when doing the modem trace. This will not require any change in SDK source code, but can of course be tedious as it requires some changes to your application.
    - Change nrf_modem_os.c to use UART0 instead of UART1 like described by Simon in this thread and disable logging. Then the trace can be read from UART0. 

    We are working on making RTT a possible medium for modem tracing, but it's not in a working state yet. 

    Could you also do another run with the modem library diagnostics enabled (with the pull request applied). Then I could have a look at the allocations again to see if there are any differences from your earlier logs. 

    Best regards,
    Carl Richard

  • I managed transfer our general logging to use RTT and modify the modem trace code to use one of our available UARTs. I managed to capture a modem trace that contains at least 5 occasions of the bug, one being right at the end of the tracing session. The run times before the bug happens were between 1 hour and 10 minutes for this tracing session.

    trace-2021-08-10T17-27-40.389Z.bin

    Also a question related to the modem traces; does the proprietary format allow for other UART prints (LOG_*/printk plaintext prints) to appear mixed in with it? The trace I've collected doesn't contain any but I suppose it could be useful to have some mixed in to help with troubleshooting in future cases.

    The modem library memory diagnostic prints with the patch applied looked pretty much the same as without it. I have not yet tried increasing the heap for the modem library but I'll give it a go, thanks!

  • Hi again!

    Sounds great. Thanks for the trace! I've sent it in for analysis. 

    The modem traces does not currently allow for LOG/printk statements, unfortunately. I agree that it would be a good addition, but as for now we must stick to matching logs with traces based on timing.  

    A colleague suggested the following values for modem library configuration:
     

    CONFIG_NRF_MODEM_LIB_HEAP_SIZE=4096
    CONFIG_NRF_MODEM_LIB_SENDMSG_BUF_SIZE=2048


    Please report back whether they improve the situation or not!

    Best regards,
    Carl Richard

  • Thank you, I'll give those config options a try. I suppose the accurate timings of the failures could be useful to you so I'll put them here. These logs are the last thing printed in the application log before the freeze occurs. I think I started the modem trace roughly at the same time so the timing should match pretty well. The timestamp shows the run time after each boot. These should be all the failures from the tracing session.

    [00:55:31.000,885] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x2001053c

    [00:08:26.968,750] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:15:38.345,062] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:20:05.382,843] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:02:01.370,574] <inf> nrf_modem_lib: shm_tx_alloc(2) -> 0x2001053c

    [00:12:31.493,865] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:50:16.728,607] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

    [00:05:32.337,921] <inf> nrf_modem_lib: shm_tx_alloc(36) -> 0x20010524

  • Thanks for the additional logs! The developers suggest trying to only call the MQTT library from a single thread rather than two separate. A mutex won't guarantee perfect synchronization unfortunately, which again may cause a deadlock. You could for example use work items or something similar to do the MQTT operations.

    Best regards,
    Carl Richard

Related