MQTT Connection Resets - Keepalive

I'm hoping someone can look at the modem traces attached here and help me understand why the MQTT connection is being reset. I'd like to test a keep-alive on the order of hours, but the connection is being reset when PINGREQs are sent.

I'm using an nRF9160 DK with MFW 1.3.5 and custom MQTT code based on the mqtt_helper library from NCS 2.4.0.

I'm able to connect to both HiveMQ Cloud (free tier) and Eclipse Mosquitto brokers, and adjust the keep-alive timeouts up to a point. However, with the HiveMQ broker, if a PINGREQ is sent at >= 5 minutes, it looks like the broker resets the connection. The same behavior is seen with Eclipse Mosquitto, but there the reset happens at >= 1 hour.
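For context, the keepalive interval ultimately comes from the Zephyr MQTT client that mqtt_helper wraps (CONFIG_MQTT_KEEPALIVE, in seconds). Below is a rough sketch of the kind of loop that ends up sending the PINGREQs, for illustration only (this is not my exact code; connection setup and error handling are omitted):

```c
/* Illustrative only: the kind of loop that drives the keepalive in the
 * Zephyr MQTT client underneath mqtt_helper. The interval itself comes
 * from CONFIG_MQTT_KEEPALIVE (seconds); mqtt_live() sends the PINGREQ
 * once that interval has elapsed.
 */
#include <zephyr/net/mqtt.h>
#include <zephyr/net/socket.h>

static void mqtt_poll_loop(struct mqtt_client *client, int sock)
{
	struct zsock_pollfd fds = {
		.fd = sock,
		.events = ZSOCK_POLLIN,
	};

	while (1) {
		/* Wake up in time to send the next PINGREQ. */
		int timeout_ms = mqtt_keepalive_time_left(client);

		int ret = zsock_poll(&fds, 1, timeout_ms);

		if (ret > 0 && (fds.revents & ZSOCK_POLLIN)) {
			/* Handles incoming PINGRESP, PUBLISH, etc. */
			mqtt_input(client);
		}

		/* Sends PINGREQ if the keepalive interval has expired. */
		mqtt_live(client);
	}
}
```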

Below are the modem traces for the HiveMQ broker (trace_ka_600_fail.*) and the Mosquitto broker (mosq_trace_KA_3600_fail.*).

I zipped the two trace files, as DevZone was throwing an error about the file extension.

trace_files.zip

  • Hi, 

    So from the logs it seems that the issue is likely the keepalive. 

    In both cases a RST, ACK is sent from the server, terminating the TCP connection. This is likely due to the MQTT connection timing out: even though the client requests a certain MQTT keepalive, it doesn't mean that the server supports that value.
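    As an illustration only (not taken from your code): when the TCP session is reset by the server or a middlebox like this, the Zephyr MQTT client typically surfaces it to the application as an error from mqtt_input()/mqtt_live() followed by an MQTT_EVT_DISCONNECT event, and the usual reaction is to reconnect, ideally with a keepalive shorter than the NAT idle timeout. A rough sketch:

```c
/* Sketch of handling the disconnect in the Zephyr MQTT event callback.
 * MQTT_EVT_DISCONNECT is reached both on a clean disconnect and when the
 * TCP connection is torn down underneath the client, as in the traces.
 */
#include <zephyr/net/mqtt.h>

static void mqtt_evt_handler(struct mqtt_client *client,
			     const struct mqtt_evt *evt)
{
	switch (evt->type) {
	case MQTT_EVT_CONNACK:
		/* Connected; the keepalive timer starts from here. */
		break;
	case MQTT_EVT_DISCONNECT:
		/* Schedule a reconnect here, preferably with a keepalive
		 * below the NAT binding timeout on the network side.
		 */
		break;
	default:
		break;
	}
}
```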


    And to add from our expert on the matter:

    My current conclusion (a strong guess, without seeing server-side traces or knowing the server-side architecture) is that the NAT binding times out on the carrier network side (AT&T). As a result, the load balancer / firewall on the HiveMQ cloud side (FQDN c59829c0af344d2aa83346efbde5e10c.s1.eu.hivemq.cloud) no longer recognizes the TCP session (or the session has timed out), or the PINGREQ packet gets forwarded to the wrong MQTT server because the source IP/port has changed, breaking the existing session mappings. In all cases the cloud-side server / load balancer / firewall responds with RST. Business as usual.

    Why do I refer to a load balancer / firewall? That is an educated guess based on the IP header Identification and TTL fields.

    • During the 3-way handshake the server side uses Identification values from 0 onwards and the TTL is 234.

    • Once the connection moves to TLS, the Identification field jumps to e.g. 0x9ad7 onwards → high chance this is a different node behind the load balancer, as the “IP session” is completely different. The TTL is still the same, 234.

    • When the server side responds with an RST, the IP Identification is copied from the source packet, but the TTL is 208, i.e. not the expected 234 → high chance the RST comes from a different server or load balancer.

    There are a few obvious ways to tackle the issue:

    • A better choice of protocol.

    • Use of private/enterprise APNs → no NAT issues, or the NATs are under the control of the customer.

    • Use a more aggressive keepalive, closer to what the NAT timeouts are on the network side.

    • Use of IPv6 instead of IPv4, since there are no NAT issues with IPv6 (see the sketch after this list).
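    As a rough illustration of the last point (the hostname and port below are placeholders, not taken from your traces): resolving the broker with an AF_INET6 hint and using the resulting address as the MQTT broker endpoint keeps the session off the carrier-grade NAT, assuming your SIM/APN provides IPv6 connectivity.

```c
/* Sketch: resolve the broker over IPv6 and prepare the address that is
 * later assigned to client.broker before mqtt_connect(). Hostname and
 * port are placeholders; error handling is minimal.
 */
#include <zephyr/net/socket.h>
#include <zephyr/net/mqtt.h>
#include <string.h>
#include <errno.h>

static struct sockaddr_storage broker;

static int resolve_broker_ipv6(const char *host, uint16_t port)
{
	struct zsock_addrinfo hints = {
		.ai_family = AF_INET6,     /* Ask explicitly for an IPv6 address */
		.ai_socktype = SOCK_STREAM,
	};
	struct zsock_addrinfo *res;

	int err = zsock_getaddrinfo(host, NULL, &hints, &res);
	if (err) {
		return -EIO;
	}

	struct sockaddr_in6 *b6 = (struct sockaddr_in6 *)&broker;

	memcpy(b6, res->ai_addr, sizeof(struct sockaddr_in6));
	b6->sin6_port = htons(port);

	zsock_freeaddrinfo(res);

	/* Later: client.broker = &broker; before calling mqtt_connect(). */
	return 0;
}
```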




    Regards,
    Jonathan
