Why might zsock_connect() work fine for days, then suddenly begin to fail every time with NRF_ETIMEDOUT?

Using the nRF9151 with nRF Connect SDK v3.1.0 and modem firmware 2.0.2, we're seeing an issue where the modem works fine for days, then it is suddenly unable to open any sockets, with attempts failing with NRF_ETIMEDOUT.  We've seen this happen on two out of a handful of devices in initial trials.

Specifically, we connect to a cloud server every 10 minutes and send a report.  This starts with a call to zsock_connect().  In problem devices, this begins to return -1, with errno set to NRF_ETIMEDOUT.

In tracing this call, we found that zsock_connect() calls nrf9x_socket_offload_connect(), as the sockets are offloaded to the modem.  Then, this function calls nrf_connect() on line 462 of nrf9x_sockets.c (as it is an IPv4 address we're connecting to).  It is this function that sets errno to NRF_ETIMEDOUT, but as this function is in the modem firmware we cannot inspect it.  The function waits approximately 25 seconds before returning, so it does seem like the modem is trying to do something before timing out.

Do you have any theories as to why this would happen?  If the connection were poor, we'd expect occasional timeouts, but in this specific issue, everything is fine, until all of a sudden, every connection attempt times out.  Power-cycling the system seems to cause it to start working again, so it doesn't seem to be an issue of poor signal.

For reference, lte_lc_conn_eval_params_get() returns the following on a unit in the problem state:

struct lte_lc_conn_eval_params {
    enum lte_lc_rrc_mode rrc_state = LTE_LC_RRC_MODE_IDLE;
    enum lte_lc_energy_estimate energy_estimate = LTE_LC_ENERGY_CONSUMPTION_NORMAL;
    enum lte_lc_tau_triggered tau_trig = LTE_LC_CELL_IN_TAI_LIST;
    enum lte_lc_ce_level ce_level = LTE_LC_CE_LEVEL_0;
    int earfcn = 5035;
    int16_t dl_pathloss = 105;  // 105 dB
    int16_t rsrp = 54;  // -87 dBm
    int16_t rsrq = 14;  // -13.0 dB
    int16_t tx_rep = 1;
    int16_t rx_rep = 8;
    int16_t phy_cid = 204;
    int16_t band = 12;
    int16_t snr = 25;  // 1 dB
    int16_t tx_power = 1;  // 1 dBm
    int mcc = 310;
    int mnc = 260;
    uint32_t cell_id = <redacted>;
};
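
For readers decoding the raw index values in the dump above, here is a minimal sketch of the index-to-unit conversions. The offsets are assumptions inferred from the annotations in the dump (54 is -87 dBm, 14 is -13.0 dB, 25 is 1 dB), and the helper names are hypothetical, not SDK functions. Check the modem's AT command reference before relying on them, especially for indices outside the normal range.

```c
/* Hypothetical helpers: the offsets below are assumptions inferred from
 * the annotated values in the dump, not taken from the SDK headers. */

/* RSRP index to dBm (assumed offset: index - 141). */
static int rsrp_idx_to_dbm(int idx)
{
    return idx - 141;
}

/* RSRQ index to dB (assumed: 0.5 dB steps, offset -20 dB). */
static double rsrq_idx_to_db(int idx)
{
    return idx * 0.5 - 20.0;
}

/* SNR index to dB (assumed offset: index - 24). */
static int snr_idx_to_db(int idx)
{
    return idx - 24;
}
```

Applied to the values above, these give an RSRP of -87 dBm, an RSRQ of -13.0 dB and an SNR of 1 dB, which is consistent with the link not being obviously poor while the unit is in the problem state.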

We'd appreciate any insight you can provide into the causes of this error.

  • Hi,

    Thanks for the detailed description. Since the nRF9151 uses modem-offloaded sockets, the NRF_ETIMEDOUT returned by zsock_connect() is coming directly from the modem. As a first step, could you please try the same test with a newer nRF91x1 modem firmware (v2.0.4), which includes service-reliability improvements? If the issue persists, could you please collect the output of the following AT commands while the device is in the failing state (before rebooting):

    AT+CEREG?
    AT+CGATT?
    AT+CGACT?
    AT+CGPADDR[=<cid>]

    This will help us verify whether the modem is still registered, packet-service attached, has an active PDN, and an IP address when the NRF_ETIMEDOUT starts occurring.
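
As an illustration of the first check in the list above, here is a minimal sketch of how firmware could interpret an AT+CEREG? response once it has the response string. In an application the string would typically come from the modem library's AT interface (for example nrf_modem_at_cmd()). The function name cereg_is_registered() is hypothetical, and the sketch assumes the response begins with "+CEREG: <n>,<stat>", where a <stat> of 1 means registered on the home network and 5 means registered while roaming.

```c
#include <stdio.h>

/* Hypothetical helper: interprets an AT+CEREG? response string.
 * Returns 1 if registered (home or roaming), 0 if not registered,
 * and -1 if the response cannot be parsed. */
static int cereg_is_registered(const char *resp)
{
    int n;
    int stat;

    /* Expected prefix: "+CEREG: <n>,<stat>"; later fields are ignored. */
    if (resp == NULL || sscanf(resp, "+CEREG: %d,%d", &n, &stat) != 2) {
        return -1;
    }

    /* <stat> 1 = registered, home network; 5 = registered, roaming. */
    return (stat == 1 || stat == 5) ? 1 : 0;
}
```

A registered result on its own does not show that the PDN is still active or that an IP address is assigned, so the AT+CGATT?, AT+CGACT? and AT+CGPADDR results are still needed alongside it.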

    Additionally, please reproduce the issue with modem tracing enabled and share the trace and application logs. This will help us determine whether the network deactivated the data session or whether the modem entered a stuck internal state. You can enable and capture modem traces as described in the nRF91 modem tracing documentation.

    Best Regards,
    Syed Maysum

  • Thanks for the recommendations! Unfortunately, we don't have the ability to run AT commands on our current firmware, and we don't have modem trace enabled, otherwise we would have tried more things! We did have a way to call lte_lc_conn_eval_params_get(), so that's why we did that.

    We'll make a new version with these abilities enabled and see if the issue pops up again. If it does, then we'll have the tools to investigate further. As the issue is rare, it could be a while, but we'll run another test and let you know if anything happens.

  • Hi,

    That sounds good. Adding AT-command access and modem tracing in the next firmware version is a good approach. Since the issue is rare, having these available will really help if it reproduces.

    Please let us know once you’ve rerun the test or if the issue appears again.

    Best Regards,
    Syed Maysum
