Calls to MQTT library hang indefinitely when client is not connected.

Hello!

I'm writing an application that uses MQTT to post some sensor data to a server. My networking code is basically a duplicate of this sample, except I set up the LTE connection before starting the network loop, have a larger volume of data to transfer, and read sensor values from serial in a different thread.

Occasionally the MQTT event handler function receives MQTT_EVT_DISCONNECT with reason code 113 (ECONNABORTED) right before making a network call (mqtt_ping(), mqtt_connect(), &c.). In that case, the network call seems to hang indefinitely. I can't tell how the example application copes with this—do I have to re-run try_to_connect() immediately before every network operation? I'd be happy if the library calls just timed out, instead of blocking indefinitely. Is there a config option for that?

As far as I can see, the documentation doesn't specify any invariants I need to uphold beside "mqtt_client struct's memory shall be resident throughout the program's lifetime." Is there some condition I'm missing? I never access any part of the MQTT data structures or library from any thread besides my networking thread, and I've verified there are no deadlocks, &c. I'm certain it is the library functions that are hanging.

Thanks in advance!

—EMB

Parents
  • Hi,

    Do you get any errors when this "hanging" takes place? 

    Occasionally the MQTT event handler function receives MQTT_EVT_DISCONNECT with reason code 113 (ECONNABORTED) right before making a network call (mqtt_ping(), mqtt_connect(), &c.).

    Can you provide a complete log which shows ECONNABORTED?

    How does your application recover from indefinite hanging?

    Best regards,
    Dejan

  • 00> [00:04:28.296,356] <inf> tracing: ### CALLING MQTT_CONNECT().
    00> [00:04:34.280,670] <inf> tracing: ### CALLED MQTT_CONNECT().
    00> [00:04:34.281,036] <inf> tracing: ###### PASSED MQTT_CONNECT()!
    00> [00:04:34.281,402] <inf> tracing: ###### EXITED PREPARE_FDS().
    00> [00:04:34.281,738] <inf> tracing: ### CALLING WAIT().
    00> [00:04:34.282,073] <inf> tracing: ###### NFDS > 0.
    00> [00:04:36.282,806] <inf> tracing: ###### POLLED.
    00> [00:04:38.289,764] <inf> tracing: ### * MQTT DISCONNECTED WITH ERROR -113! *
    00> [00:04:38.290,588] <inf> tracing: ### CALLING MQTT_CONNECT().

    This is a sample of the logs. The last line you see here was the last line logged—the call to mqtt_connect() blocked for hours, until I reset the board. As far as other errors, no, I don't get any. I've examined modem traces, and the network calls always seem to block after a TCP retransmit. However, the application may make many successful retransmissions before any blocking or hanging happens. The MQTT logging hasn't told me anything yet.

    My application doesn't recover from this—that's the issue. The library calls sometimes block indefinitely, and I can't "recover" in any way, unless I preempt that operation and kill it. I'd rather not have to kill my networking thread, though.

  • Hi,

    Do you see the same issue when using unmodified Zephyr mqtt_publisher sample?

    Can you upload your project so that I can try to reproduce the issue?

    Best regards,
    Dejan

  • Hello Dejan,

    I've concluded the issue is with the broker—it is ignoring the connection when too many are attempted within a short time. Occasionally I see the ETIMEDOUT (-116) error before it attempts another connection, and it's that connection which will hang indefinitely. I've solved the issue by adding exponential backoff on the retries.

    Not sure why another ETIMEDOUT error wouldn't be raised—that seems like strange behavior for the library—but it's no longer affecting me, so I'll move on.

    Thank you for your help and time!

  • Hi,

    I am glad to hear that your issue has been resolved.
    Thank you for the update.

    Best regards,
    Dejan

  • It looks like the issue has resurfaced. Over the last week, I've been digging through the library code, and it seems the blocking call is zsock_sendmsg() in mqtt_client_tls_write_msg(). If a socket does not receive a close signal, or encounters some other issue, it would remain in a waiting state indefinitely according to the BSD sockets spec.

    However, this will never return control to the application, and in an environment where the network is guaranteed to be unreliable, this feels like an issue.

    How do you recommend I proceed? Should I open an issue in GitHub? I've attached the MQTT library's log below. The last line "Sending data..." is a log statement I added right before the call to zsock_sendmsg(). After 13 hours, it had still not returned.

    [00:11:07.202,514] <dbg> net_mqtt: mqtt_publish: (net_thread): [CID 0x2000d2d8]:[State 0x06]: >> Topic size 0x0000001b, Data size 0x0000046b
    [00:11:07.216,430] <dbg> net_mqtt_enc: pack_utf8_str: (net_thread): >> str_size:0000001d cur:0x20013d25, end:0x20014520
    [00:11:07.228,210] <dbg> net_mqtt_enc: pack_uint16: (net_thread): >> val:001b cur:0x20013d25, end:0x20014520
    [00:11:07.238,983] <dbg> net_mqtt_enc: mqtt_encode_fixed_header: (net_thread): << msg type:0x31 length:0x00000488
    [00:11:07.250,213] <dbg> net_mqtt_enc: packet_length_encode: (net_thread): >> length:0x00000488 cur:0, end:0
    [00:11:07.260,986] <dbg> net_mqtt_enc: mqtt_encode_fixed_header: (net_thread): Fixed header length = 03
    [00:11:07.271,301] <dbg> net_mqtt_enc: pack_uint8: (net_thread): >> val:31 cur:0x20013d22, end:0x20014520
    [00:11:07.281,829] <dbg> net_mqtt_enc: packet_length_encode: (net_thread): >> length:0x00000488 cur:0x20013d23, end:0x20014520
    [00:11:07.294,219] <dbg> net_mqtt: client_write_msg: (net_thread): [0x2000d2d8]: Transport writing message.
    [00:11:07.304,901] <dbg> net_mqtt_sock_tls: mqtt_client_tls_write_msg: (net_thread): Sending data...

  • I should also note I'm using 
    CONFIG_NET_SOCKETS_OFFLOAD=y
    and
    CONFIG_MODEM_SOCKET=y.

Reply Children
No Data
Related