Unexpected behaviour during TCP communication

Hey team,

We're having trouble with our device. It's been working fine, sending messages to our backend without a hitch. But after a while, it starts having problems connecting to the socket. We're seeing error 116.

<err> rest_client: Failed to connect socket, error: 116
<err> rest_client: rest_client_do_api_call() failed, err -111

#define ETIMEDOUT 116		/* Connection timed out */
#define ECONNREFUSED 111	/* Connection refused */
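
The log above prints the socket error as a positive value (116) and the REST client result as a negative one (-111). As a minimal sketch, a helper like the one below could normalize the sign and turn both values into readable text. The numeric values are the ones quoted above; they are not the same on every platform (on Linux, for example, ETIMEDOUT is 110), so the sketch uses its own constants. The `DEV_` prefix and the function name `dev_err_to_str` are hypothetical, chosen only for this illustration.

```c
#include <stdlib.h>

/* Values copied from the defines quoted above. The DEV_ prefix is a
 * hypothetical naming choice so they do not clash with a host <errno.h>. */
#define DEV_ETIMEDOUT    116 /* Connection timed out */
#define DEV_ECONNREFUSED 111 /* Connection refused */

/* Map an error value from the device log to a short description.
 * Accepts either sign, since the log shows both 116 and -111. */
const char *dev_err_to_str(int err)
{
    switch (abs(err)) {
    case 0:
        return "success";
    case DEV_ETIMEDOUT:
        return "connection timed out";
    case DEV_ECONNREFUSED:
        return "connection refused";
    default:
        return "unknown error";
    }
}
```

With this, `dev_err_to_str(116)` and `dev_err_to_str(-111)` give the two descriptions from the defines, whichever sign the caller logged.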

In our firmware, we've tried implementing a retry cycle, but it hasn't been effective. The modem keeps encountering this error repeatedly. The only solution seems to be rebooting the device.
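
For reference, here is a rough sketch of how a retry cycle with an escalation step could be structured. It is not our actual firmware code. All names (`retry_with_recovery`, `retry_policy`, the callback types and the `demo_*` stand-ins) are hypothetical, and the recovery callback is a placeholder: what it would do on a real device is left open. The demo attempt fails with -116 until one recovery has run, which only imitates the behaviour described above.

```c
#include <stdint.h>

/* Hypothetical callback types: the application supplies the real work. */
typedef int (*attempt_fn)(void *ctx);             /* 0 on success, negative error otherwise */
typedef void (*delay_fn)(uint32_t ms, void *ctx); /* wait between attempts */
typedef int (*recover_fn)(void *ctx);             /* placeholder recovery step, 0 on success */

struct retry_policy {
    unsigned int attempts_per_round; /* tries before the recovery step runs */
    unsigned int max_rounds;         /* rounds before giving up */
    uint32_t base_delay_ms;          /* delay after the first failure */
    uint32_t max_delay_ms;           /* upper bound for the doubled delay */
};

/* Try, back off with a doubling delay, and run the recovery step between rounds.
 * Returns 0 on success, otherwise the error from the last attempt. */
int retry_with_recovery(const struct retry_policy *p, attempt_fn attempt,
                        delay_fn delay, recover_fn recover, void *ctx)
{
    int err = -1;

    for (unsigned int round = 0; round < p->max_rounds; round++) {
        uint32_t wait_ms = p->base_delay_ms < p->max_delay_ms
                               ? p->base_delay_ms : p->max_delay_ms;

        for (unsigned int i = 0; i < p->attempts_per_round; i++) {
            err = attempt(ctx);
            if (err == 0) {
                return 0;
            }
            delay(wait_ms, ctx);
            /* Double the delay, capped at max_delay_ms. */
            wait_ms = (wait_ms > p->max_delay_ms / 2) ? p->max_delay_ms
                                                      : wait_ms * 2;
        }

        /* Every attempt in this round failed: recover before the next round. */
        if (round + 1 < p->max_rounds && recover(ctx) != 0) {
            break;
        }
    }
    return err;
}

/* Demo stand-ins (hypothetical): count calls instead of touching hardware. */
struct demo_ctx {
    unsigned int attempts;
    unsigned int recoveries;
    uint32_t total_delay_ms;
};

int demo_attempt(void *ctx)
{
    struct demo_ctx *d = ctx;
    d->attempts++;
    return d->recoveries > 0 ? 0 : -116; /* fails until one recovery has run */
}

void demo_delay(uint32_t ms, void *ctx)
{
    struct demo_ctx *d = ctx;
    d->total_delay_ms += ms; /* record the delay instead of sleeping */
}

int demo_recover(void *ctx)
{
    struct demo_ctx *d = ctx;
    d->recoveries++;
    return 0;
}
```

Passing the callbacks in keeps the retry logic independent of the socket and modem code, so it can be exercised on a host before it runs on the device.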

We don't think the problem lies with our backend because we've encountered the same error when trying to connect to api.nfrcloud.com.

This issue has occurred on two different networks in two different countries, and the key point is that it always disappears after a reboot.

Could you please review the trace collected from the modem? Thank you.

# Networking
CONFIG_NETWORKING=y
CONFIG_NET_SOCKETS=y
CONFIG_NET_SOCKETS_OFFLOAD=y
CONFIG_NET_SOCKETS_POSIX_NAMES=y
CONFIG_NET_NATIVE=n

# REST client config
CONFIG_REST_CLIENT=y

CONFIG_HTTP_CLIENT=y


The modem firmware is 1.3.4
SDK 2.0.2

  • Hi Roman,

    I have forwarded the query to the modem team.

    Meanwhile, can you answer the following questions about this statement:

    "We're having trouble with our device. It's been working fine,"

    1) Are these custom boards? Did you get them reviewed?

    2) Have they (your devices) been working for long?

    3) When did the problem start to occur? Did it appear after an update (firmware, mfw, SDK), or under some other circumstances?

    4) You said that the "only solution is rebooting". After rebooting, do you still see the problem? If so, after how long does it occur?

    5) Is this problem happening with all devices, or just two devices? How many devices do you have, by the way? You also mentioned two locations and two networks. In how many different locations are your devices deployed, and are some working fine on other networks?

    I will provide more input to the modem team after your response and elaboration.

    Regards, 

    Naeem

  • Hi Roman,

    I have noticed that the log.zip (the trace) that you attached is broken.

    If I click on the link then I get this message:

    If I click and save it, I cannot open the zip file.

    Please upload the trace again.

  • Hi Naeem,

    I have reattached the trace file; please check.

     5228.log.zip

    1) Are these custom boards? Did you get them reviewed?

    Yes, this is a custom board. I am not sure if it was reviewed.

    2) Have they (your devices) been working for long?

    Yes. The issue appears randomly: sometimes it happens on the second message, sometimes after half a day.

    3) When did the problem start to occur? Did it appear after an update (firmware, mfw, SDK), or under some other circumstances?

    Our device periodically sends an update message to the backend, once every hour. After some time, it is no longer able to send because of this error.

    4) You said that the "only solution is rebooting". After rebooting, do you still see the problem? If so, after how long does it occur?

    It is random. The only thing I am sure of is that it has never happened with the first message after a reboot.

    5) Is this problem happening with all devices, or just two devices? How many devices do you have, by the way? You also mentioned two locations and two networks. In how many different locations are your devices deployed, and are some working fine on other networks?

    This is happening on all the devices we are testing, approximately 5 devices. We are testing in 2 locations, the Netherlands and Ukraine. The devices are configured to support both NB-IoT and LTE-M, and the issue appears on both. We also use different operators in UA and NL.

  • Thank you for the update.

    I have forwarded the bin file and the details you have provided.

    Regards,

    Naeem

  • Hi Roman,

    Below is the response I have received from the modem team.

    ----------------------------------------------------------

    It is very difficult to say anything meaningful because the modem log was taken with an incomplete trace set. The Modem Socket API and AT command traces are missing, so we can’t verify how the application is communicating with the modem.

    On the other hand, a lot of traces seem to have been dropped. There are 20 TCP streams in the modem log and none of them is complete. Wireshark shows a lot of missing or duplicate packets, but that might just be because of the missing traces. Some of the TCP streams are closed at the TCP level and others at the TLS level. We can’t tell whether the connection is closed by the modem or by the application.

    A modem log with the default trace set is needed. This should provide information about the application-modem communication. Would it be possible to get a pcap from the server side? That would help if we need to study whether dropped IP packets are the reason for closing the connection.

    A few questions, since you are sending data to the cloud at 1-hour intervals:

    1. Are you opening and closing the TLS connection for each send?

    2. Does the error happen when opening the TLS connection or when sending data over TLS?

    3. Does the server keep the connection open for a required time?

    ----------------------------------------------------------

    with regards,

    Naeem
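
Questions 1 and 2 from the modem team come down to whether a fresh connection is opened for every send, and whether a failure happens while opening it or while sending over it. Below is a minimal host-side sketch of that pattern, assuming plain TCP over POSIX sockets. TLS and the device's REST client are deliberately left out, and the names `send_once`, `send_result` and the `STAGE_*` values are hypothetical. It opens a connection, sends one payload, closes, and reports the stage at which it stopped.

```c
#include <errno.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MSG_NOSIGNAL
#define MSG_NOSIGNAL 0 /* not available on every platform */
#endif

enum send_stage { STAGE_DONE, STAGE_OPEN, STAGE_SEND };

struct send_result {
    enum send_stage stage; /* where the attempt stopped */
    int err;               /* errno at that point, 0 on success */
};

/* Open a TCP connection, send one payload, close the connection. */
struct send_result send_once(const char *host, const char *port,
                             const void *buf, size_t len)
{
    struct send_result res = { STAGE_OPEN, 0 };
    struct addrinfo hints, *ai = NULL;
    const char *p = buf;
    int fd;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &ai) != 0 || ai == NULL) {
        /* Resolver errors are not errno values; use a generic one here. */
        res.err = EHOSTUNREACH;
        return res;
    }

    fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
    if (fd < 0) {
        res.err = errno;
        freeaddrinfo(ai);
        return res;
    }

    if (connect(fd, ai->ai_addr, ai->ai_addrlen) < 0) {
        res.err = errno; /* failed while opening the connection */
        goto out;
    }

    res.stage = STAGE_SEND;
    while (len > 0) {
        ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
        if (n <= 0) {
            res.err = (n < 0) ? errno : EIO; /* failed while sending */
            goto out;
        }
        p += n;
        len -= (size_t)n;
    }
    res.stage = STAGE_DONE;

out:
    close(fd);
    freeaddrinfo(ai);
    return res;
}
```

Logging `stage` together with `err` for every hourly send would make it possible to tell from the application log alone whether the failure occurs during the open step or during the send step.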
