Persistent (-1) error in POSIX connect() function for TCP reconnection

I am sending HTTP POSTs and GETs over TCP from my nRF52840-based device to a Java Spring Boot HTTP server running on a laptop. The device communicates with a border router via OpenThread. The border router and my laptop are both on my local network (the laptop's IP is 192.168.1.27). When the device boots up and the server is running, I can do POSTs and GETs without issue. When I turn off the server to test the device's reconnection logic, I see SYN messages from the device trying to establish a connection with the server, and connect() subsequently returns -1. If I bring the server back up while the device is still making these SYN attempts, the connection is re-established and everything resumes properly.

However, if I do not bring up the server after some time (usually a few minutes), the device stops sending SYN messages and only does DNS lookups of the server hostname. The connect() function still returns error (-1) but no longer actually tries to establish a connection. If I bring the server back up, the device never re-establishes connection with the server due to the lack of SYN messages.

What is going on with the device's TCP connection that causes it to stop SYN reconnection attempts?

OpenThread Wireshark traces:

   

Relevant Code:

#define HTTP_POST_MESSAGE_FORMAT \
"POST /%s HTTP/1.1\r\n" \
"Host: %s\r\n" \
"Connection: close\r\n" \
"Accept: application/json\r\n" \
"Content-Type: application/json\r\n" \
"Content-Length: %d\r\n" \
"\r\n" \
"%s"
static void otHTTPDNScallback(otError aError, const otDnsAddressResponse *aResponse, void *aContext)
{
    otError oterr = OT_ERROR_NONE;
    char tmpbuf[50];
    uint16_t i = 0;

    if (aError == OT_ERROR_NONE)
    {
        while (oterr != OT_ERROR_NOT_FOUND && i < 40) // set an i-limit to avoid an endless loop
        {
            oterr = otDnsAddressResponseGetAddress(aResponse, i, aContext, NULL);
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Relevant configs:

CONFIG_ENABLE_THREAD_NETWORK=y
CONFIG_OPENTHREAD_THREAD_VERSION_1_2=y
CONFIG_OPENTHREAD_NORDIC_LIBRARY_MTD=y
CONFIG_OPENTHREAD_FTD=n
CONFIG_OPENTHREAD_MTD=y
CONFIG_OPENTHREAD_MTD_SED=y
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=4096
CONFIG_OPENTHREAD_SHELL=n
CONFIG_OT_CHANNEL=19
CONFIG_NET_IPV6=y
CONFIG_NET_IPV4=n
CONFIG_NET_CONFIG_SETTINGS=y
CONFIG_NET_CONFIG_NEED_IPV4=n
CONFIG_NET_CONFIG_NEED_IPV6=y
# Configure dependencies
CONFIG_IEEE802154_2015=y
CONFIG_IEEE802154_NRF5_RX_STACK_SIZE=800
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  • Hello,

    I need some more information from you. Please answer all of the questions below.

    Which nRF Connect SDK version are you using?

    Is your project based on a sample?

    Have you set a maximum number of repetitions for the SYN message on your nRF52840 device? You could try adding logic to the application that restarts SYN message sending after a timeout or on an external input, for example.
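
    As a rough illustration of the timeout-triggered restart suggested above, the delay between reconnection attempts could be computed as follows. This is only a sketch: reconnect_delay_ms() and its parameters are hypothetical names, not part of the SDK or of the original application, and the doubling-with-cap policy is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an SDK function): delay to wait before retry
 * number `attempt` (0-based). The delay starts at base_ms, doubles on each
 * further attempt, and never exceeds cap_ms. */
static uint32_t reconnect_delay_ms(unsigned attempt, uint32_t base_ms, uint32_t cap_ms)
{
    assert(base_ms > 0 && cap_ms >= base_ms);

    uint32_t delay = base_ms;

    while (attempt-- > 0 && delay < cap_ms) {
        /* Clamp before doubling so the value cannot overflow past cap_ms. */
        delay = (delay > cap_ms / 2) ? cap_ms : delay * 2;
    }
    return delay;
}
```

    On the device, the application could sleep for this long between attempts, for instance with k_sleep(K_MSEC(reconnect_delay_ms(n, 1000, 60000))), and reset n to 0 after a successful connect().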

    How is your Thread border router configured?

    You have CONFIG_OPENTHREAD_TCP_ENABLE=n in your .conf file. Is your intention to use the Zephyr implementation of TCP instead of the TCP implementation in OpenThread? Thread 1.3 introduced support for TCP, which makes TCP more efficient in an IEEE 802.15.4 network.

    Best regards,

    Maria

  • Hello,

    Were you able to look into this further? This is one of the final issues in our development process.

    I just observed an interesting transmission sequence that could help with troubleshooting. The device successfully sent a large number (~650) of locally stored messages to the server. In the middle of clearing this local message backlog, the device started to exhibit the behavior in my original post. This is interesting because the server never went offline and messages were being sent successfully until the error appeared abruptly. There were a few failed transmissions in the middle of the successful sends, but those messages were always sent successfully on the first retry. Maybe I have a buffer that is filling up or sockets that aren't being closed?

    ^See original post for code that generates this error message. This error occurs for each of the DNS lookups in the bottom-most blue block of Wireshark messages

  • I believe my client is restarting the connection request completely each time.

    After connect() returns the 116 timeout error, the socket is closed via "close(sock)". The next time a connection is attempted, the application starts from the top with "sock = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);" followed by "err = connect(sock, &brok, sizeof(struct sockaddr_in6));".
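
    The sequence described above can be sketched roughly as follows. This is not my actual application code: connect_once() is a hypothetical helper, and the headers are the generic POSIX ones so the snippet can be tried on a host (the include paths on Zephyr may differ). The point is that every failed attempt closes its socket, so repeated failures should not leak descriptors.

```c
#include <arpa/inet.h>   /* htons(), for callers that fill in the port */
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: one complete connection attempt.
 * Returns a connected socket (>= 0) or a negative errno value.
 * On failure the socket is always closed before returning. */
static int connect_once(const struct sockaddr_in6 *server)
{
    assert(server != NULL);

    /* Start from scratch each time: a fresh socket per attempt. */
    int sock = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
    if (sock < 0) {
        return -errno;
    }

    if (connect(sock, (const struct sockaddr *)server, sizeof(*server)) < 0) {
        int saved = errno;   /* close() may overwrite errno */
        close(sock);         /* never keep a half-open socket around */
        return -saved;
    }
    return sock;
}
```

    Returning the negated errno keeps the failure reason (such as the 116 timeout) available to the caller after the socket has been closed.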

  • Hello,

    One of our developers has looked into this and has given some feedback.

    Your close/open/connect sequence is done correctly.

    We need to check for packet leaks:

    Enable CONFIG_NET_SHELL=y and CONFIG_NET_DEBUG_NET_PKT_ALLOC=y

    Use the net allocs shell command to check for packet leaks.

    Use the net conn shell command to check for net_context leaks.

    The following is mostly for your information, because it is hard to say whether it is related without more information from the shell commands. There was a recent fix for a bug in the TCP handshake (handling of RST packets). This was fixed in NCS v2.5.0, and the relevant commit is found here: https://github.com/nrfconnect/sdk-zephyr/commit/1907327297f59993d41f2f7f5af50968b0316289. The bug could cause unexpected behaviour, so it could be worth switching to NCS v2.5.0 or cherry-picking the commit I linked.

    Best regards,

    Maria

  • Thank you, I have switched to NCS v2.5.0 and Zephyr SDK v0.16.1 based on your recommendation. I am able to compile and flash the device, but it gets stuck at "Jumping to the first image slot".

    This seems to be a common issue involving TF-M and SPM, but I have been unable to resolve it. Do you have any recommendations for resolution of this hang-up? Thanks.

  • Are you using MCUBoot? In the image trailer section of the MCUBoot Bootloader documentation there is an overview of what values the magic, image-ok, and copy-done fields can have to enter different states. Your case falls into State IV.

    I'm not very familiar with MCUBoot, so my recommendation is that you create a new ticket where you share your setup and another engineer can help you more efficiently.

    Another reason to open a new ticket is that your original issue was resolved by the upgrade to NCS v2.5.0.

    I will help you to the best of my ability if it's not an option for you to create a new ticket.

    Best regards,

    Maria

  • Yes, I am using MCUBoot. I will be on leave shortly, so I will create a new ticket for the v2.5.0 issue when I return in a few weeks.
