nRF9160 crash on assert after going offline then online

Hello, I'm developing firmware for our custom board with a nRF9160.

The problem I'm having is that the modem crashes with error code 0x10 sometimes (read: about every 3rd or 4th time) after its turned off and on again and we try to connect to mqtt.

Because it's supposed to be battery powered we take the modem offline when the application has collected its sensordata and sent to our cloud.
The board is then woken up after a set time or after one of two external interupts are triggered.

We used asset tracker v2 as a base for developing the application and the way we do the sleep is like following:

static void do_sleep(k_tid_t sleep_thread_id)
{
    printk("modem::do_sleep(): disconnecting lte modem\n");
    //lte_lc_power_off(); // dont power off, it should use about the same amp when in flight mode
    lte_lc_offline();
    
    DEFINE_DO_SLEEP(modem, MODEM_EVT_ERROR, sleep_thread_id);
    
    state_set(STATE_DISCONNECTED);
    
    int err = lte_connect();
    if (err) {
        LOG_ERR("Failed connecting to LTE, error: %d", err);
        SEND_ERROR(modem, MODEM_EVT_ERROR, err);
    }
}

#define DEFINE_DO_SLEEP(_mod, _evt, _id) SEND_SLEEP_ACK(self.id); \
    state_set(STATE_SLEEP); \
    printk("%s::do_sleep(): sleeping %s thread\n", STRINGIFY(_mod), STRINGIFY(_mod)); \
    printk("%s::do_sleep(): sleep thread id: %p\n", STRINGIFY(_mod), _id); \
    if (_id == NULL) \
    { \
    printk("Sleep thread was null, something went wrong in the sleep sequence.\n"); \
    SEND_ERROR(_mod, _evt, -1); \
    return; \
    } \
    k_thread_join(_id, K_FOREVER); \
    module_wakeup(&self); \
    printk("%s::do_sleep(): woke up\n", STRINGIFY(_mod));

static int lte_connect(void)
{
    int err;
    
    err = lte_lc_connect_async(lte_evt_handler);
    if (err) {
        LOG_ERR("lte_lc_connect_async, error: %d", err);
        
        return err;
    }
    
    SEND_EVENT(modem, MODEM_EVT_LTE_CONNECTING);
    
    LOG_INF("LTE attempting to connect... waiting %d seconds before timeout", MODEM_LTE_TIMEOUT_SEC);
    k_work_reschedule(&connect_check_work, K_SECONDS(MODEM_LTE_TIMEOUT_SEC));
    
    return 0;
}

The error message we get is the following:
[00:07:03.246,337] <err> modem_module: Modem error: 0x10, PC: 0xda418

Here it is in context:

modem::do_sleep(): woke up
[00:06:59.014,251] <inf> app_event_manager: MODEM_EVT_LTE_CONNECTING
[00:06:59.014,434] <inf> modem_module: LTE attempting to connect... waiting 120 seconds before timeout
[00:06:59.902,587] <inf> app_event_manager: MODEM_EVT_LTE_RSRP_UPDATE
[00:06:59.902,740] <inf> app_event_manager: MODEM_EVT_LTE_DISCONNECTED
[00:07:00.582,489] <inf> app_event_manager: MODEM_EVT_LTE_CONNECTED
[00:07:00.582,733] <wrn> cloud_module: Cloud connection establishment in progress
[00:07:00.582,733] <wrn> cloud_module: New connection attempt in 32 seconds if not successful
[00:07:00.582,824] <inf> app_event_manager: CLOUD_EVT_CONNECTING
[00:07:00.583,435] <inf> pp_cloud: Enabling TLS
[00:07:00.594,451] <inf> app_event_manager: MODEM_EVT_LTE_RSRP_UPDATE
[00:07:03.246,337] <err> modem_module: Modem error: 0x10, PC: 0xda418
[00:07:03.246,459] <err> pp_cloud: mqtt_connect, error: -110
[00:07:03.246,459] <err> pp_cloud: Cloud MQTT broker connect failed -110
[00:07:03.246,520] <inf> app_event_manager: CLOUD_EVT_DISCONNECTED
[00:07:03.246,795] <inf> app_event_manager: CLOUD_EVT_CONNECTION_TIMEOUT
[00:07:03.247,009] <wrn> cloud_module: Cloud connection establishment in progress
[00:07:03.247,039] <wrn> cloud_module: New connection attempt in 32 seconds if not successful
[00:07:03.247,161] <inf> app_event_manager: CLOUD_EVT_CONNECTING
[00:07:03.247,406] <err> pp_cloud: getaddrinfo, error -1
[00:07:03.247,436] <err> pp_cloud: client_broker_init, error: -10
[00:07:03.247,497] <inf> app_event_manager: CLOUD_EVT_DISCONNECTED
[00:07:03.481,109] <inf> modem_module: nRF9160 modem firmware version: mfw_nrf9160_1.2.7
[00:07:03.520,080] <inf> modem_module: LTE attempting to connect... waiting 120 seconds before timeout

When we first got the error it triggered the assert in nrf_to_z_dns_error_code assert because it got -1 from the modem (which is according to modem fault handling documentation).
That leads me to believe that the error happends in the nrf_getaddrinfo function (which is offloaded to the modem).

Can anyone make sense of the error based on the error code and PC? I'm kind of lost on where the debug at the moment :)

EDIT: I forgot to say that I'm experiencing this problem on a nRF9160 revision 0 using modem firmware 1.2.7, this also happends on 1.1.0.

Thanks in advance

Parents
  • Hi Ludwig,

    What is the goal with this approach for entering sleep? Have you considered PSM (Power Saving Mode)? PSM is efficient unless you would like to sleep for a very long time. This post covers important aspects when considering PSM. (The post is three years old, so it might contain some outdated information.)

    Also check out this post commenting on sleep mode in Zephyr.

    Which nRF Connect SDK are you using?

  • We are using v2.0.2 of the nRF Connect SDK.

    The problem with PSM is that not all networks support it and we need to support most networks. We would be offering a very poor service if we have to pick and choose customers based on what networks are near them.

    For example even the network which im testing against is not supporting PSM:

    [00:00:00.802,154] <dbg> modem_module: configure_low_power: PSM requested
    [00:00:00.809,906] <inf> modem_module: nRF9160 modem firmware version: mfw_nrf9160_1.2.7
    [00:00:00.810,485] <inf> app_event_manager: MODEM_EVT_INITIALIZED
    [00:00:00.861,511] <inf> app_event_manager: MODEM_EVT_LTE_CONNECTING
    [00:00:00.861,694] <inf> modem_module: LTE attempting to connect... waiting 120 seconds before timeout
    [00:00:00.861,755] <dbg> modem_module: state_set: State transition STATE_DISCONNECTED --> STATE_CONNECTING
    [00:00:02.867,492] <dbg> modem_module: modem_rsrp_handler: Incoming RSRP status message, RSRP value is -96
    [00:00:02.867,584] <dbg> lte_lc: at_handler_cereg: +CEREG notification: +CEREG: 2,"2724","01ADDA0B",7,0,0,"11100000","11100000"
    
    [00:00:02.867,858] <dbg> lte_lc_helpers: parse_cereg: Network registration status: 2
    [00:00:02.867,889] <dbg> lte_lc_helpers: parse_cereg: LTE mode: 7
    [00:00:02.867,980] <dbg> lte_lc_helpers: event_handler_list_dispatch: Dispatching events:
    [00:00:02.867,980] <dbg> lte_lc_helpers: event_handler_list_dispatch:  - handler=0x000587B9
    [00:00:02.868,377] <dbg> lte_lc_helpers: event_handler_list_dispatch:  - handler=0x0003BF09
    [00:00:02.868,408] <dbg> lte_lc_helpers: event_handler_list_dispatch: Done
    [00:00:02.868,438] <dbg> lte_lc_helpers: event_handler_list_dispatch: Dispatching events:
    [00:00:02.868,438] <dbg> lte_lc_helpers: event_handler_list_dispatch:  - handler=0x000587B9
    [00:00:02.868,469] <dbg> lte_lc_helpers: event_handler_list_dispatch:  - handler=0x0003BF09
    [00:00:02.868,499] <dbg> modem_module: lte_evt_handler: LTE cell changed: Cell ID: 28170763, Tracking area: 10020
    [00:00:02.868,499] <dbg> lte_lc_helpers: event_handler_list_dispatch: Done
    [00:00:02.868,530] <dbg> lte_lc_helpers: event_handler_list_dispatch: Dispatching events:
    [00:00:02.868,530] <dbg> lte_lc_helpers: event_handler_list_dispatch:  - handler=0x000587B9
    [00:00:02.868,560] <dbg> lte_lc_helpers: event_handler_list_dispatch:  - handler=0x0003BF09
    [00:00:02.868,560] <dbg> lte_lc_helpers: event_handler_list_dispatch: Done
    [00:00:02.868,682] <inf> app_event_manager: MODEM_EVT_LTE_RSRP_UPDATE
    [00:00:02.868,835] <inf> app_event_manager: MODEM_EVT_LTE_DISCONNECTED
    [00:00:02.953,704] <dbg> lte_lc: at_handler_cscon: +CSCON notification
    [00:00:02.953,857] <dbg> lte_lc_helpers: event_handler_list_dispatch: Dispatching events:
    [00:00:02.953,857] <dbg> lte_lc_helpers: event_handler_list_dispatch:  - handler=0x000587B9
    [00:00:02.953,887] <dbg> lte_lc_helpers: event_handler_list_dispatch:  - handler=0x0003BF09
    [00:00:02.953,887] <dbg> modem_module: lte_evt_handler: RRC mode: Connected
    [00:00:02.953,918] <dbg> lte_lc_helpers: event_handler_list_dispatch: Done
    [00:00:06.714,813] <dbg> lte_lc: at_handler_cereg: +CEREG notification: +CEREG: 5,"2724","01ADDA0B",7,,,"11100000","11100000"

    Either way, putting the modem in flight mode should be supported, right? And for some reason the modem is crashing because of that.

    I collected a modem trace on the nrf9160dk with firmware 1.2.3. There i get Modem error: 0x10, PC: 0xd90bc

    I've noticed that the modem crash happends during TLS negotiation. Specifically when the modem is supposed to send client key exchange. I've verified traffic logs from our operator.

    Attached is a modem trace from a nrf9160dk running fw1.2.3. The last thing i can see when converting to pcap is the "Certificate" sending to our server and then the ACK back. After that it crashes. Hopefully you can use the modem error and the trace to locate why?

    trace-2022-09-05T16-56-31.872Z.bin

  • Hi Ludwig,

    Would you be able to provide logs/modem trace from when using mfw1.3.2?

  • We would also recommend to make the move to mfw1.3.2, unless there is any specific reason to use mfw1.2.x

  • Our product currently uses B0 (rev1) silicon so we have to use 'mfw1.2.x'.

  • Hi, we found the issue in our code that triggers the modem crash.

    We had a race condition that sometimes put the modem in flight mode before disconnecting mqtt (and therefore the socket). I figured this out because i spun up another MQTT server and on that server i instead get -12 (ENOMEM) when calling zsock_connect (inside of mqtt_connect) which according to some devzone post means that no sockets are available. (It's not in any docs I can find but it makes sense)

    That means that the modem dosent clear sockets when in flight mode (which I assumed it did for some reason). It also seems that nrf_modem_lib_shutdown and nrf_modem_lib_init dosen't clear sockets either because I get continued errors after one modem crash.

    But the modem crash is clearly some bug that triggers because the modem thinks it has sockets available but it actually dosent. I don't have time to figure out what causes this but it presumably is because of combination of TLS cipher suites/certs and maybe port/hostname because that is all that differed between the mqtt servers i spun up.

    But you can assume our issue is fixed but I would give it to your modem team as a bug report ^^

  • Thank you for the update! I will let the modem team comment and get back to you.

Reply Children
  • I now have feed back from the modem team:

    Memory full causes assert:
    00:01:58.639038 TR_ERROR_MEMORY_FULL Failed allocation: size 2028, from heap 0080a0e8 
    00:01:58.639068 MDM_ERROR_ASSERT @PC:0x000d90bc 

    The modem supports up to three simultaneous TLS sockets. In this case the opening of a third TLS socket causes assert due to memory full. You have somehow hit an issue which consumes more memory than expected.

    You would need to go over the application logic. It leaves TLS sockets open thus all reservations for memory stays there.

    The application should close all unused sockets to free modem resources. The modem is indicating that PDN connection is closed (SOCKET_RPC_CLOSED_NTF_MSG) but it is the application's responsibility to close the socket.

Related