MQTT Connection Resets - Keepalive

I'm hoping someone can look at the modem traces attached here and help me understand why the MQTT connection is being reset. I'd like to test a keep-alive on the order of hours, but the connection is being reset at the PINGREQs.

I'm using an nRF9160 DK with MFW 1.3.5 and custom MQTT code based on the mqtt_helper library from NCS 2.4.0.

I'm able to connect to both HiveMQ Cloud (free tier) and Eclipse Mosquitto brokers and adjust the keep-alive timeout up to a point. However, with the HiveMQ broker, if a PINGREQ is sent at >= 5 minutes, it looks like the broker resets the connection. The same behavior is seen with Eclipse Mosquitto, but there the reset happens at >= 1 hour.
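
For reference, mqtt_helper sits on top of the Zephyr MQTT client, where (as far as I understand) the keep-alive requested in CONNECT comes from CONFIG_MQTT_KEEPALIVE. A trimmed connect sketch, with illustrative names, buffer sizes, and a 600 s (10 min) keep-alive:

```c
/* Trimmed, illustrative sketch of the client setup. The keep-alive the
 * client requests in CONNECT defaults to CONFIG_MQTT_KEEPALIVE seconds,
 * e.g. CONFIG_MQTT_KEEPALIVE=600 in prj.conf for the 10-minute test.
 */
#include <zephyr/kernel.h>
#include <zephyr/net/socket.h>
#include <zephyr/net/mqtt.h>

static struct mqtt_client client;
static uint8_t rx_buf[512];
static uint8_t tx_buf[512];

static int client_setup(struct sockaddr_storage *broker, mqtt_evt_cb_t cb)
{
	mqtt_client_init(&client);

	client.broker = broker;
	client.evt_cb = cb;
	client.client_id.utf8 = (uint8_t *)"nrf9160-ka-test";  /* placeholder */
	client.client_id.size = sizeof("nrf9160-ka-test") - 1;
	client.protocol_version = MQTT_VERSION_3_1_1;
	client.transport.type = MQTT_TRANSPORT_NON_SECURE;   /* TLS in the real setup */
	client.rx_buf = rx_buf;
	client.rx_buf_size = sizeof(rx_buf);
	client.tx_buf = tx_buf;
	client.tx_buf_size = sizeof(tx_buf);

	/* CONNECT carries the requested keep-alive; PINGREQs are then sent
	 * by the application loop each time that interval is about to lapse.
	 */
	return mqtt_connect(&client);
}
```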

Below are the modem traces for the HiveMQ broker (trace_ka_600_fail.*) and the Mosquitto broker (mosq_trace_KA_3600_fail.*).

I zipped the two trace files, as DevZone was throwing an error about the file extension.

trace_files.zip

  • Hi, 

    I will look at the traces and provide some feedback; an update will follow.

    What network provider are you using?

    And are you using the public cloud for these brokers, or do you host your own servers with HiveMQ?

    Regards,
    Jonathan

  • Thanks Jonathan. The network is AT&T, and I'm using the public cloud for the HiveMQ broker.

  • Hi, 

    So from the logs it seems that the issue is likely the keepalive. 

    In both cases a RST, ACK is sent from the server, terminating the TCP connection; this is likely due to the MQTT connection timing out. Even though the client requests a certain MQTT keepalive, that doesn't mean the server supports that value.
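
    To illustrate the client side of this: with MQTT 3.1.1 the broker has no way to negotiate the keepalive down in CONNACK, so the PINGREQ cadence is driven entirely by the value the client asked for; the broker either tolerates it or eventually drops the connection. A rough sketch of the usual servicing loop around the Zephyr MQTT client (not the exact mqtt_helper internals, just the idea):

    ```c
    /* Rough sketch of a Zephyr MQTT servicing loop. mqtt_keepalive_time_left()
     * reports how long (in ms) until a PINGREQ is due, and mqtt_live() sends it.
     * If the broker, or a NAT box on the path, has already given up on the idle
     * connection, that PINGREQ is what ends up answered with a TCP RST.
     */
    #include <errno.h>
    #include <zephyr/net/mqtt.h>
    #include <zephyr/net/socket.h>

    void mqtt_service_loop(struct mqtt_client *client)
    {
        struct zsock_pollfd fds = {
            /* For a TLS transport the socket is client->transport.tls.sock. */
            .fd = client->transport.tcp.sock,
            .events = ZSOCK_POLLIN,
        };

        while (1) {
            /* Sleep until data arrives or the keepalive interval expires. */
            int ret = zsock_poll(&fds, 1, mqtt_keepalive_time_left(client));

            if (ret < 0) {
                break;
            }

            if (ret > 0 && (fds.revents & ZSOCK_POLLIN)) {
                if (mqtt_input(client) < 0) {   /* PINGRESP, PUBLISH, ... */
                    break;                      /* e.g. -ECONNRESET after the RST */
                }
            }

            /* Sends a PINGREQ once the keepalive interval has elapsed. */
            ret = mqtt_live(client);
            if (ret != 0 && ret != -EAGAIN) {
                break;
            }
        }
    }
    ```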


    And to add from our expert on the matter:

    My current conclusion (a strong guess, without seeing server-side traces and without knowing the server-side architecture) is that the NAT binding times out on the carrier network side (AT&T). The load balancer / firewall on the HiveMQ Cloud side (FQDN c59829c0af344d2aa83346efbde5e10c.s1.eu.hivemq.cloud) then no longer recognizes the TCP session (or the session has timed out), or the PINGREQ packet gets forwarded to the wrong MQTT server because the source IP/port has changed and the existing session mappings no longer match. In all of these cases the cloud-side server / load balancer / firewall responds with an RST. Business as usual.

    Why do I refer to a load balancer / firewall? That is an educated guess based on the IP header identification and TTL fields.

    • During the 3-way handshake the server side uses IP identification values from 0 onwards, and the TTL is 234.

    • Once the connection moves to TLS, the identification field jumps to e.g. 0x9ad7 onwards → high chance this is a different node behind the load balancer, as the "IP session" is completely different. The TTL is still the same 234.

    • When the server side responds with an RST, the IP identification is copied from the source packet, but the TTL is 208, i.e. not the expected 234 → high chance the RST comes from a different server or load balancer.

    There are a few obvious ways to tackle the issue:

    • A better choice of protocol.

    • Use private/enterprise APNs → no issues with NATs, or the NATs are under the customer's control.

    • Use a more aggressive keepalive that is closer to the NAT timeouts on the network side.

    • Use IPv6 instead of IPv4, since there are no NAT issues with IPv6 (a rough lookup sketch follows below).
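
    On the last point: whether IPv6 is actually usable depends on both the AT&T APN and the broker offering it, but forcing an IPv6 lookup is straightforward. A rough sketch (hostname and port are placeholders) using the socket API available in NCS:

    ```c
    /* Rough sketch: ask the resolver for an IPv6 (AAAA) address only, so the
     * MQTT connection runs over IPv6 and no NAT binding is involved. Assumes
     * POSIX socket names and IPv6 are enabled; hostname/port are placeholders.
     */
    #include <errno.h>
    #include <string.h>
    #include <zephyr/net/socket.h>

    static int resolve_broker_ipv6(struct sockaddr_in6 *addr)
    {
        struct addrinfo hints = {
            .ai_family = AF_INET6,        /* AAAA records only */
            .ai_socktype = SOCK_STREAM,
        };
        struct addrinfo *result;

        int err = getaddrinfo("broker.example.com", "8883", &hints, &result);
        if (err != 0) {
            return -EIO;   /* no IPv6 address available for the broker */
        }

        memcpy(addr, result->ai_addr, sizeof(struct sockaddr_in6));
        freeaddrinfo(result);

        return 0;
    }
    ```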




    Regards,
    Jonathan

