Bad BSD socket behavior - possibly triggered by unexpected mobile-terminated TCP FIN?

I'm still doing TCP stress testing, similar to this inquiry. Now, however, I'm getting highly repeatable lockups on the node I'm testing with.

Application code that has been working reliably for months is now locking up within seconds of boot.  The only change is that I was doing "bad things" on the network before rebooting (lost traffic, slow traffic, keeping sockets quiet for an abusively long time w/o traffic, abandoning sockets w/o warning the host or network, etc.).

It appears that the thread that handles the BSD sockets works fine after reboot while starting up a new connection to the server, but then locks up soon after.  The failure seems to happen the next time the application tries to send/receive TCP traffic.

Here's the sequence of what I think is happening:

  1. Modem is online and has a good TCP connection to server X port 1883
  2. Without warning the cellular network or server X, I reboot the modem [NAT mapping and server both remember connection]
  3. Modem goes back online and opens a new TCP connection to server X port 1883
  4. Server X recognizes a new connection from the same client, and sends a FIN to the old port
  5. The FIN packet is routed by the cellular network back to the modem
  6. The BSD socket locks up on the next TCP action by the application

I can't prove step 5 is happening, but I've got strong evidence of all the other steps.  And this lockup is HIGHLY repeatable, while it never used to happen, ever.  The only difference I can think of is the FIN packets are actually reaching the modem.  Our normal reboot/reset procedures do their best to close the TCP connection before rebooting, which would cause the NAT mapping to disappear soon after.
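For reference, this is roughly what "closing the TCP connection before rebooting" looks like with plain BSD/POSIX socket calls. It is only a sketch: `close_before_reboot` is a made-up helper name, and whether `shutdown()` is supported by the modem's socket layer is an assumption (on some stacks `close()` alone is what sends the FIN).

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: tear down a TCP socket cleanly before a planned reboot
 * so the peer (and, eventually, the NAT) can forget the connection.
 * Returns 0 if both calls succeeded, -1 if either failed.
 */
int close_before_reboot(int fd)
{
    int rc = 0;

    /* Signal that we are done sending and receiving; for TCP this sends a FIN. */
    if (shutdown(fd, SHUT_RDWR) < 0) {
        rc = -1;
    }

    /* Release the descriptor even if shutdown() reported an error. */
    if (close(fd) < 0) {
        rc = -1;
    }

    return rc;
}
```

The failing case in this post is exactly the one where this path is skipped, so the server and NAT still hold the old connection when the modem comes back.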

On a reboot, if the application doesn't request a specific source port number for an outbound TCP connection, will the nRF9160 use the same port number every time?  If so, I strongly suspect that is part of the issue, since the FIN received in step 5 would show up on the same port number as the valid connection we just opened in step 3, with the correct IP address and port for the server side of the connection as well.
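A minimal sketch of the workaround I have in mind, using standard BSD socket calls: pick a source port that differs from boot to boot and `bind()` to it before calling `connect()`. The helper names `pick_source_port` and `open_tcp_with_source_port` are hypothetical, the seed is assumed to come from something that changes across reboots (a hardware RNG or a persisted boot counter, for example), and whether the modem's TCP stack honors an explicit `bind()` on a client socket is an assumption I have not verified.

```c
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define EPHEMERAL_MIN 49152u  /* start of the IANA dynamic/private port range */
#define EPHEMERAL_MAX 65535u  /* end of the range */

/* Map an arbitrary seed onto the dynamic port range (hypothetical helper). */
uint16_t pick_source_port(uint32_t seed)
{
    uint32_t span = EPHEMERAL_MAX - EPHEMERAL_MIN + 1u;  /* 16384 ports */
    return (uint16_t)(EPHEMERAL_MIN + (seed % span));
}

/*
 * Create a TCP socket bound to an explicit local port.
 * Port 0 lets the stack choose. Returns the descriptor, or -1 on failure.
 * The caller then connect()s to the server as usual.
 */
int open_tcp_with_source_port(uint16_t port)
{
    struct sockaddr_in local;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        return -1;
    }

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(fd);
        return -1;
    }

    return fd;
}
```

With a different source port after each reboot, a stale FIN for the old connection would no longer match the 4-tuple of the new one. How to get the MQTT library to open its socket this way is a separate question and is not shown here.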

I'm going to go look if I can make the MQTT socket use a changing port number when I open the socket and see if that stops this lockup...
