nRF9160 crash on assert after going offline then online

Hello, I'm developing firmware for our custom board with a nRF9160.

The problem I'm having is that the modem crashes with error code 0x10 sometimes (read: about every 3rd or 4th time) after its turned off and on again and we try to connect to mqtt.

Because it's supposed to be battery powered we take the modem offline when the application has collected its sensordata and sent to our cloud.
The board is then woken up after a set time or after one of two external interupts are triggered.

We used asset tracker v2 as a base for developing the application and the way we do the sleep is like following:

static void do_sleep(k_tid_t sleep_thread_id)
{
    printk("modem::do_sleep(): disconnecting lte modem\n");
    //lte_lc_power_off(); // dont power off, it should use about the same amp when in flight mode
    lte_lc_offline();
    
    DEFINE_DO_SLEEP(modem, MODEM_EVT_ERROR, sleep_thread_id);
    
    state_set(STATE_DISCONNECTED);
    
    int err = lte_connect();
    if (err) {
        LOG_ERR("Failed connecting to LTE, error: %d", err);
        SEND_ERROR(modem, MODEM_EVT_ERROR, err);
    }
}

#define DEFINE_DO_SLEEP(_mod, _evt, _id) SEND_SLEEP_ACK(self.id); \
    state_set(STATE_SLEEP); \
    printk("%s::do_sleep(): sleeping %s thread\n", STRINGIFY(_mod), STRINGIFY(_mod)); \
    printk("%s::do_sleep(): sleep thread id: %p\n", STRINGIFY(_mod), _id); \
    if (_id == NULL) \
    { \
    printk("Sleep thread was null, something went wrong in the sleep sequence.\n"); \
    SEND_ERROR(_mod, _evt, -1); \
    return; \
    } \
    k_thread_join(_id, K_FOREVER); \
    module_wakeup(&self); \
    printk("%s::do_sleep(): woke up\n", STRINGIFY(_mod));

static int lte_connect(void)
{
    int err;
    
    err = lte_lc_connect_async(lte_evt_handler);
    if (err) {
        LOG_ERR("lte_lc_connect_async, error: %d", err);
        
        return err;
    }
    
    SEND_EVENT(modem, MODEM_EVT_LTE_CONNECTING);
    
    LOG_INF("LTE attempting to connect... waiting %d seconds before timeout", MODEM_LTE_TIMEOUT_SEC);
    k_work_reschedule(&connect_check_work, K_SECONDS(MODEM_LTE_TIMEOUT_SEC));
    
    return 0;
}

The error message we get is the following:
[00:07:03.246,337] <err> modem_module: Modem error: 0x10, PC: 0xda418

Here it is in context:

modem::do_sleep(): woke up
[00:06:59.014,251] <inf> app_event_manager: MODEM_EVT_LTE_CONNECTING
[00:06:59.014,434] <inf> modem_module: LTE attempting to connect... waiting 120 seconds before timeout
[00:06:59.902,587] <inf> app_event_manager: MODEM_EVT_LTE_RSRP_UPDATE
[00:06:59.902,740] <inf> app_event_manager: MODEM_EVT_LTE_DISCONNECTED
[00:07:00.582,489] <inf> app_event_manager: MODEM_EVT_LTE_CONNECTED
[00:07:00.582,733] <wrn> cloud_module: Cloud connection establishment in progress
[00:07:00.582,733] <wrn> cloud_module: New connection attempt in 32 seconds if not successful
[00:07:00.582,824] <inf> app_event_manager: CLOUD_EVT_CONNECTING
[00:07:00.583,435] <inf> pp_cloud: Enabling TLS
[00:07:00.594,451] <inf> app_event_manager: MODEM_EVT_LTE_RSRP_UPDATE
[00:07:03.246,337] <err> modem_module: Modem error: 0x10, PC: 0xda418
[00:07:03.246,459] <err> pp_cloud: mqtt_connect, error: -110
[00:07:03.246,459] <err> pp_cloud: Cloud MQTT broker connect failed -110
[00:07:03.246,520] <inf> app_event_manager: CLOUD_EVT_DISCONNECTED
[00:07:03.246,795] <inf> app_event_manager: CLOUD_EVT_CONNECTION_TIMEOUT
[00:07:03.247,009] <wrn> cloud_module: Cloud connection establishment in progress
[00:07:03.247,039] <wrn> cloud_module: New connection attempt in 32 seconds if not successful
[00:07:03.247,161] <inf> app_event_manager: CLOUD_EVT_CONNECTING
[00:07:03.247,406] <err> pp_cloud: getaddrinfo, error -1
[00:07:03.247,436] <err> pp_cloud: client_broker_init, error: -10
[00:07:03.247,497] <inf> app_event_manager: CLOUD_EVT_DISCONNECTED
[00:07:03.481,109] <inf> modem_module: nRF9160 modem firmware version: mfw_nrf9160_1.2.7
[00:07:03.520,080] <inf> modem_module: LTE attempting to connect... waiting 120 seconds before timeout

When we first got the error it triggered the assert in nrf_to_z_dns_error_code assert because it got -1 from the modem (which is according to modem fault handling documentation).
That leads me to believe that the error happends in the nrf_getaddrinfo function (which is offloaded to the modem).

Can anyone make sense of the error based on the error code and PC? I'm kind of lost on where the debug at the moment :)

EDIT: I forgot to say that I'm experiencing this problem on a nRF9160 revision 0 using modem firmware 1.2.7, this also happends on 1.1.0.

Thanks in advance

  • We would also recommend to make the move to mfw1.3.2, unless there is any specific reason to use mfw1.2.x

  • Our product currently uses B0 (rev1) silicon so we have to use 'mfw1.2.x'.

  • Hi, we found the issue in our code that triggers the modem crash.

    We had a race condition that sometimes put the modem in flight mode before disconnecting mqtt (and therefore the socket). I figured this out because i spun up another MQTT server and on that server i instead get -12 (ENOMEM) when calling zsock_connect (inside of mqtt_connect) which according to some devzone post means that no sockets are available. (It's not in any docs I can find but it makes sense)

    That means that the modem dosent clear sockets when in flight mode (which I assumed it did for some reason). It also seems that nrf_modem_lib_shutdown and nrf_modem_lib_init dosen't clear sockets either because I get continued errors after one modem crash.

    But the modem crash is clearly some bug that triggers because the modem thinks it has sockets available but it actually dosent. I don't have time to figure out what causes this but it presumably is because of combination of TLS cipher suites/certs and maybe port/hostname because that is all that differed between the mqtt servers i spun up.

    But you can assume our issue is fixed but I would give it to your modem team as a bug report ^^

  • Thank you for the update! I will let the modem team comment and get back to you.

  • I now have feed back from the modem team:

    Memory full causes assert:
    00:01:58.639038 TR_ERROR_MEMORY_FULL Failed allocation: size 2028, from heap 0080a0e8 
    00:01:58.639068 MDM_ERROR_ASSERT @PC:0x000d90bc 

    The modem supports up to three simultaneous TLS sockets. In this case the opening of a third TLS socket causes assert due to memory full. You have somehow hit an issue which consumes more memory than expected.

    You would need to go over the application logic. It leaves TLS sockets open thus all reservations for memory stays there.

    The application should close all unused sockets to free modem resources. The modem is indicating that PDN connection is closed (SOCKET_RPC_CLOSED_NTF_MSG) but it is the application's responsibility to close the socket.

Related