Unexpected behaviour during TCP communication

Hey team,

We're having trouble with our device. It's been working fine, sending messages to our backend without a hitch. But after a while, it starts having problems connecting to the socket. We're seeing error 116.

<err> rest_client: Failed to connect socket, error: 116
<err> rest_client: rest_client_do_api_call() failed, err -111

#define ETIMEDOUT 116		/* Connection timed out */
#define ECONNREFUSED 111	/* Connection refused */

In our firmware, we've tried implementing a retry cycle, but it hasn't been effective. The modem keeps encountering this error repeatedly. The only solution seems to be rebooting the device.

We don't think the problem lies with our backend because we've encountered the same error when trying to connect to api.nfrcloud.com.

This issue has occurred on two different networks in two different countries and the main point it always disappear after rebooting. 

Could you please review the trace collected from the modem? Thank you.

# Networking
CONFIG_NETWORKING=y
CONFIG_NET_SOCKETS=y
CONFIG_NET_SOCKETS_OFFLOAD=y
CONFIG_NET_SOCKETS_POSIX_NAMES=y
CONFIG_NET_NATIVE=n

#Rest client config
CONFIG_REST_CLIENT=y

CONFIG_HTTP_CLIENT=y


The modem firmware is 1.3.4
SDK 2.0.2

Parents
  • Hi Roman,

    I have noticed that that the log.zip (the trace) that you have attached is broken

    If I click on the link then I get this message:

    If I click and save it, then I can not open the zip file

    Please upload the trace again.

  • Hi Naeem,

    I reattach the trace file, please check.

     5228.log.zip

    1) Are these custom boards? Did you get them reviewed?

    Yes this is custom board, not sure if it was reviewed.

    2) Have they (your devices) been working for long?

    Yes, the issue is randomly appear - some time it may be second message, some time after half a day.

    3) When the problem started to occur? Did you do some updates (in the firmware, mfw, sdk) and then it occurred, or what?

    Our device periodically send update message to the backend, each 1 h, after some time it is not able to send because of this error.

    4) As you said that "only solution is rebooting". After rebooting, do you still see the problem? after how long the problem occurs?

    It is random - the only point which I am sure, it is never happened with the first message after reboot.

    5) Is this problem happening with all devices, or just two devices? How many devices you have by the way. You have also mentioned about two locations and two networks. How many different locations your devices are employed in and are some working fine with other networks?

    This happening on all the devices we testing, approximately 5 devices. We testing in 2 location Netherlands and Ukraine, devices configured to support both networks NB-IoT and LTE-M, issue appears on both. Also we use different operators for UA and NL.  

  • Thank you for the update.

    I have forwarded the bin file and the details you have provided.

    Regards,

    Naeem

  • Hi Roman,

    Below is the response I have received from the modem team.

    ----------------------------------------------------------

    It is very difficult to comment anything meaningful because the modem log is taken with incomplete trace set. Modem Socket API and AT command traces are missing so we can’t verify how application is communicating with modem.

    One the other hand, a lot of traces seem to be dropped. There are 20 TCP streams in the modem log and none of those are complete. Wireshark shows a lot of missing or duplicate packets but that might be just because of missing traces. Some of the TCP streams are closed in TCP level and others in TLS level. We can’t tell if the connection is closed by the modem or application.

    A modem log with default trace set is needed. This should bring information to application-modem communication. Would it be possible to get pcap from server side? That would help if we must study is the dropped IP packets the reason for closing the connection.

     Few questions as you are sending data to the cloud in 1 hour periods.

    1. Are you opening and closing the TLS connection for each send?

    2. Does the error happen when opening the TLS connection or when sending data over TLS?

    3. Does the server keep the connection open for a required time?

    ----------------------------------------------------------

    with regards,

    Naeem

  • Hi  

    I will try to collect the full modem traces but the issue is that I am sure when I enable the 

    RF_MODEM_LIB_TRACE_LEVEL_FULL - it missed huge amount of the traces not sure why.

    Here is my debug configuration:

    CONFIG_LOG=y
    CONFIG_LOG_MODE_IMMEDIATE=y
    CONFIG_LOG_BACKEND_RTT=y
    
    CONFIG_CONSOLE=y
    CONFIG_RTT_CONSOLE=y
    CONFIG_UART_CONSOLE=n
    
    CONFIG_USE_SEGGER_RTT=y
    
    CONFIG_NRF_MODEM_LIB_TRACE_ENABLED=y
    CONFIG_NRF_MODEM_LIB_TRACE_THREAD_PROCESSING=y
    CONFIG_NRF_MODEM_LIB_TRACE_LEVEL_LTE_AND_IP=y
    CONFIG_NRF_MODEM_LIB_TRACE_MEDIUM_RTT=y
    CONFIG_NRF_MODEM_LIB_TRACE_MEDIUM_RTT_BUF_SIZE=2048
    CONFIG_NRF_MODEM_LIB_SENDMSG_BUF_SIZE=1024
    We collecting traces over RTT. 
    1. Are you opening and closing the TLS connection for each send?

    I am using the rest_client library from sdk as I understood it should be done on the library side. 

    Here is part of the code responsible for the requests:

    static struct cloud_rest_context rest_ctx = {
        .connect_socket = REST_CLIENT_SCKT_CONNECT,
        .keep_alive = false,
        .rx_buf = rx_buf,
        .rx_buf_len = sizeof(rx_buf),
        .fragment_size = 0};
    
    static void init_rest_client_request(
        struct cloud_rest_context const *const rest_ctx,
        struct rest_client_req_context *const req, const enum http_method meth)
    {
        memset(req, 0, sizeof(*req));
    
        req->connect_socket = rest_ctx->connect_socket;
        req->keep_alive = rest_ctx->keep_alive;
    
        req->resp_buff = rest_ctx->rx_buf;
        req->resp_buff_len = rest_ctx->rx_buf_len;
    
        req->sec_tag = 1;
        req->port = sPort;
        req->host = sHost;
        req->tls_peer_verify = REST_CLIENT_TLS_DEFAULT_PEER_VERIFY;
        req->timeout_ms = sDataTimeout;
    
        req->http_method = meth;
    
        (void)cloud_codec_init();
    }
    
    static void sync_rest_client_data(struct cloud_rest_context *const rest_ctx,
                                      struct rest_client_req_context const *const req,
                                      struct rest_client_resp_context const *const resp)
    {
        rest_ctx->status = resp->http_status_code;
        rest_ctx->response = resp->response;
        rest_ctx->response_len = resp->response_len;
        rest_ctx->total_response_len = resp->total_response_len;
    
        rest_ctx->connect_socket = req->connect_socket;
    }
    
    static int do_rest_client_request(struct cloud_rest_context *const rest_ctx,
                                      struct rest_client_req_context *const req,
                                      struct rest_client_resp_context *const resp,
                                      bool check_status_good, bool expect_body)
    {
        int64_t timing;
    
        timing = k_uptime_get();
        int ret = rest_client_request(req, resp);
        timing = k_uptime_get() - timing;
        DLOG_I("Request time:%d sec", timing / 1000);
    
        sync_rest_client_data(rest_ctx, req, resp);
    
        if (ret) {
            DLOG_E("REST client request failed with error code %d", ret);
            return ret;
        } else if (rest_ctx->status == MONTR_CLOUD_HTTP_STATUS_NONE) {
            DLOG_E("REST request endpoint closed connection without reply.");
            return -ESHUTDOWN;
        } else if (check_status_good &&
                   (rest_ctx->status != MONTR_CLOUD_HTTP_STATUS_OK) &&
                   (rest_ctx->status != MONTR_CLOUD_HTTP_STATUS_ACCEPTED) &&
                   (rest_ctx->status != MONTR_CLOUD_HTTP_STATUS_UPDATE_AVALIABLE)) {
            DLOG_E("REST request was rejected. Response status: %d", rest_ctx->status);
            return -EBADMSG;
        } else if (expect_body && (!rest_ctx->response || !rest_ctx->response_len)) {
            DLOG_E("Empty response body: %d", rest_ctx->status);
            return -ENODATA;
        }
    
        return 0;
    }
    
    static void close_connection(struct cloud_rest_context *const rest_ctx)
    {
        int err;
        if (!rest_ctx->keep_alive) {
            (void)cloud_rest_disconnect(rest_ctx);
        }
    }
    
    int cloud_rest_disconnect(struct cloud_rest_context *const rest_ctx)
    {
        if (!rest_ctx) {
            return -EINVAL;
        } else if (rest_ctx->connect_socket < 0) {
            return -ENOTCONN;
        }
    
        int err = close(rest_ctx->connect_socket);
    
        if (err) {
            DLOG_E("Failed to close socket, error: %d", errno);
            err = -EIO;
        } else {
            rest_ctx->connect_socket = -1;
        }
    
        return err;
    }
    
    int rest_send_message(struct cloud_rest_context *const rest_ctx,
                                const char *const json_msg, const char *const url)
    {
        assert(rest_ctx != NULL);
        assert(json_msg != NULL);
    
        int ret;
        struct rest_client_req_context req;
        struct rest_client_resp_context resp;
    
        memset(&resp, 0, sizeof(resp));
        init_rest_client_request(rest_ctx, &req, HTTP_POST);
    
        /* Set payload */
        req.body = json_msg;
        /*Set url*/
        req.url = url;
    
        char *const headers[] = {CONTENT_TYPE_APP_JSON, NULL};
    
        req.header_fields = (const char **)headers;
    
        /* Make REST call */
        ret = do_rest_client_request(rest_ctx, &req, &resp, true, true);
    
        close_connection(rest_ctx);
    
        return ret;
    }


    2. Does the error happen when opening the TLS connection or when sending data over TLS?

    As I understood this log means that it happened during connection 

    <err> rest_client: Failed to connect socket, error: 116

    3. Does the server keep the connection open for a required time?

    I will figure this out but what should be the correct value? 
    Also the one point - some time it also happens when we send request to the nrfCloud for cell tower location. 

    err = nrf_cloud_rest_cell_pos_get(&rest_ctx, &cell_pos_req, &cell_pos_result);

Reply
  • Hi  

    I will try to collect the full modem traces but the issue is that I am sure when I enable the 

    RF_MODEM_LIB_TRACE_LEVEL_FULL - it missed huge amount of the traces not sure why.

    Here is my debug configuration:

    CONFIG_LOG=y
    CONFIG_LOG_MODE_IMMEDIATE=y
    CONFIG_LOG_BACKEND_RTT=y
    
    CONFIG_CONSOLE=y
    CONFIG_RTT_CONSOLE=y
    CONFIG_UART_CONSOLE=n
    
    CONFIG_USE_SEGGER_RTT=y
    
    CONFIG_NRF_MODEM_LIB_TRACE_ENABLED=y
    CONFIG_NRF_MODEM_LIB_TRACE_THREAD_PROCESSING=y
    CONFIG_NRF_MODEM_LIB_TRACE_LEVEL_LTE_AND_IP=y
    CONFIG_NRF_MODEM_LIB_TRACE_MEDIUM_RTT=y
    CONFIG_NRF_MODEM_LIB_TRACE_MEDIUM_RTT_BUF_SIZE=2048
    CONFIG_NRF_MODEM_LIB_SENDMSG_BUF_SIZE=1024
    We collecting traces over RTT. 
    1. Are you opening and closing the TLS connection for each send?

    I am using the rest_client library from sdk as I understood it should be done on the library side. 

    Here is part of the code responsible for the requests:

    static struct cloud_rest_context rest_ctx = {
        .connect_socket = REST_CLIENT_SCKT_CONNECT,
        .keep_alive = false,
        .rx_buf = rx_buf,
        .rx_buf_len = sizeof(rx_buf),
        .fragment_size = 0};
    
    static void init_rest_client_request(
        struct cloud_rest_context const *const rest_ctx,
        struct rest_client_req_context *const req, const enum http_method meth)
    {
        memset(req, 0, sizeof(*req));
    
        req->connect_socket = rest_ctx->connect_socket;
        req->keep_alive = rest_ctx->keep_alive;
    
        req->resp_buff = rest_ctx->rx_buf;
        req->resp_buff_len = rest_ctx->rx_buf_len;
    
        req->sec_tag = 1;
        req->port = sPort;
        req->host = sHost;
        req->tls_peer_verify = REST_CLIENT_TLS_DEFAULT_PEER_VERIFY;
        req->timeout_ms = sDataTimeout;
    
        req->http_method = meth;
    
        (void)cloud_codec_init();
    }
    
    static void sync_rest_client_data(struct cloud_rest_context *const rest_ctx,
                                      struct rest_client_req_context const *const req,
                                      struct rest_client_resp_context const *const resp)
    {
        rest_ctx->status = resp->http_status_code;
        rest_ctx->response = resp->response;
        rest_ctx->response_len = resp->response_len;
        rest_ctx->total_response_len = resp->total_response_len;
    
        rest_ctx->connect_socket = req->connect_socket;
    }
    
    static int do_rest_client_request(struct cloud_rest_context *const rest_ctx,
                                      struct rest_client_req_context *const req,
                                      struct rest_client_resp_context *const resp,
                                      bool check_status_good, bool expect_body)
    {
        int64_t timing;
    
        timing = k_uptime_get();
        int ret = rest_client_request(req, resp);
        timing = k_uptime_get() - timing;
        DLOG_I("Request time:%d sec", timing / 1000);
    
        sync_rest_client_data(rest_ctx, req, resp);
    
        if (ret) {
            DLOG_E("REST client request failed with error code %d", ret);
            return ret;
        } else if (rest_ctx->status == MONTR_CLOUD_HTTP_STATUS_NONE) {
            DLOG_E("REST request endpoint closed connection without reply.");
            return -ESHUTDOWN;
        } else if (check_status_good &&
                   (rest_ctx->status != MONTR_CLOUD_HTTP_STATUS_OK) &&
                   (rest_ctx->status != MONTR_CLOUD_HTTP_STATUS_ACCEPTED) &&
                   (rest_ctx->status != MONTR_CLOUD_HTTP_STATUS_UPDATE_AVALIABLE)) {
            DLOG_E("REST request was rejected. Response status: %d", rest_ctx->status);
            return -EBADMSG;
        } else if (expect_body && (!rest_ctx->response || !rest_ctx->response_len)) {
            DLOG_E("Empty response body: %d", rest_ctx->status);
            return -ENODATA;
        }
    
        return 0;
    }
    
    static void close_connection(struct cloud_rest_context *const rest_ctx)
    {
        int err;
        if (!rest_ctx->keep_alive) {
            (void)cloud_rest_disconnect(rest_ctx);
        }
    }
    
    int cloud_rest_disconnect(struct cloud_rest_context *const rest_ctx)
    {
        if (!rest_ctx) {
            return -EINVAL;
        } else if (rest_ctx->connect_socket < 0) {
            return -ENOTCONN;
        }
    
        int err = close(rest_ctx->connect_socket);
    
        if (err) {
            DLOG_E("Failed to close socket, error: %d", errno);
            err = -EIO;
        } else {
            rest_ctx->connect_socket = -1;
        }
    
        return err;
    }
    
    int rest_send_message(struct cloud_rest_context *const rest_ctx,
                                const char *const json_msg, const char *const url)
    {
        assert(rest_ctx != NULL);
        assert(json_msg != NULL);
    
        int ret;
        struct rest_client_req_context req;
        struct rest_client_resp_context resp;
    
        memset(&resp, 0, sizeof(resp));
        init_rest_client_request(rest_ctx, &req, HTTP_POST);
    
        /* Set payload */
        req.body = json_msg;
        /*Set url*/
        req.url = url;
    
        char *const headers[] = {CONTENT_TYPE_APP_JSON, NULL};
    
        req.header_fields = (const char **)headers;
    
        /* Make REST call */
        ret = do_rest_client_request(rest_ctx, &req, &resp, true, true);
    
        close_connection(rest_ctx);
    
        return ret;
    }


    2. Does the error happen when opening the TLS connection or when sending data over TLS?

    As I understood this log means that it happened during connection 

    <err> rest_client: Failed to connect socket, error: 116

    3. Does the server keep the connection open for a required time?

    I will figure this out but what should be the correct value? 
    Also the one point - some time it also happens when we send request to the nrfCloud for cell tower location. 

    err = nrf_cloud_rest_cell_pos_get(&rest_ctx, &cell_pos_req, &cell_pos_result);

Children
Related