nRF91 modem watchdog fault when re-activating LTE

I have an application with a shipping mode that calls `conn_mgr_all_if_disconnect` when enabled and `conn_mgr_all_if_connect` when disabled.
These functions eventually end up in `lte_net_if.c`, running `lte_lc_func_mode_set` with either `LTE_LC_FUNC_MODE_ACTIVATE_LTE` or `LTE_LC_FUNC_MODE_DEACTIVATE_LTE`.

Connecting to the LTE network on boot works fine, as does entering shipping mode.
The modem fault occurs when attempting to re-activate the modem after it has already been active.
If less than a minute has passed since the modem was de-activated, it re-activates fine and the application continues as per normal.
If over a minute has passed, the modem faults more or less immediately with:

[00:01:26.754,638] <inf> app: Device activated, enable LTE
[00:01:26.982,513] <err> nrf_modem: Modem has crashed, reason 0x2, PC: 0x468de
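
To put a number on "more or less immediately", the gap between the two log lines above can be computed from their timestamps. The following is a minimal standalone C sketch, not part of the application: `log_ts_to_us` and `log_delta_us` are hypothetical helper names, and the sketch assumes the `[HH:MM:SS.mmm,uuu]` timestamp format seen in these logs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Convert the leading "[HH:MM:SS.mmm,uuu]" timestamp of a log line to
 * microseconds. Returns 0 on success, -1 if the line does not start with
 * a timestamp in that format.
 */
static int log_ts_to_us(const char *line, uint64_t *out_us)
{
	unsigned int h, m, s, ms, us;

	assert(line != NULL && out_us != NULL);

	if (sscanf(line, "[%u:%u:%u.%u,%u]", &h, &m, &s, &ms, &us) != 5) {
		return -1;
	}

	*out_us = (((uint64_t)h * 60u + m) * 60u + s) * 1000000u
		  + (uint64_t)ms * 1000u + us;
	return 0;
}

/*
 * Microseconds elapsed between two log lines, or -1 if either line fails
 * to parse. Assumes `later` is not timestamped before `earlier`.
 */
static int64_t log_delta_us(const char *earlier, const char *later)
{
	uint64_t t0, t1;

	if (log_ts_to_us(earlier, &t0) != 0 || log_ts_to_us(later, &t1) != 0) {
		return -1;
	}
	return (int64_t)(t1 - t0);
}
```

For the two lines above, `log_delta_us` returns 227875, so the fault is logged roughly 228 ms after LTE is enabled.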

This issue is 100% reproducible (happens every time), always with the same error code and program counter.
The error code corresponds to `NRF_MODEM_FAULT_HW_WD_RESET`.
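
When comparing many captured logs, the "same error code and program counter" check can be automated. A minimal standalone C sketch follows. `parse_modem_crash`, `is_hw_wd_reset`, `same_modem_crash`, and `struct modem_crash` are hypothetical names, and the value 0x2 for the watchdog reset is hard-coded from this thread rather than taken from the nrfxlib headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reason code reported in this thread for NRF_MODEM_FAULT_HW_WD_RESET.
 * Hard-coded (assumption) so the sketch builds without nrfxlib headers.
 */
#define CRASH_REASON_HW_WD_RESET 0x2u

/* Hypothetical container for the two values printed in the crash line. */
struct modem_crash {
	uint32_t reason;
	uint32_t pc;
};

/* Extract reason and PC from a "Modem has crashed" log line. */
static bool parse_modem_crash(const char *line, struct modem_crash *out)
{
	static const char marker[] = "Modem has crashed, reason ";
	unsigned int reason, pc;
	const char *p;

	assert(line != NULL && out != NULL);

	p = strstr(line, marker);
	if (p == NULL) {
		return false;
	}
	/* %x accepts an optional "0x" prefix, as strtoul() with base 16 does. */
	if (sscanf(p + strlen(marker), "%x, PC: %x", &reason, &pc) != 2) {
		return false;
	}
	out->reason = reason;
	out->pc = pc;
	return true;
}

/* True if the line is a crash line whose reason is the watchdog reset. */
static bool is_hw_wd_reset(const char *line)
{
	struct modem_crash c;

	return parse_modem_crash(line, &c) &&
	       c.reason == CRASH_REASON_HW_WD_RESET;
}

/* True if both lines are crash lines reporting the same reason and PC. */
static bool same_modem_crash(const char *a, const char *b)
{
	struct modem_crash ca, cb;

	return parse_modem_crash(a, &ca) && parse_modem_crash(b, &cb) &&
	       ca.reason == cb.reason && ca.pc == cb.pc;
}
```

Applied to the crash lines quoted in this thread, the helpers report the same reason (0x2) and PC (0x468de) each time.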

Replacing `LTE_LC_FUNC_MODE_ACTIVATE_LTE` and `LTE_LC_FUNC_MODE_DEACTIVATE_LTE` with `LTE_LC_FUNC_MODE_NORMAL` and `LTE_LC_FUNC_MODE_OFFLINE` does not change the behavior of the fault.

nrfxlib version: Tag v3.1.1 (Commit 3b210a24d3bc7ecfc268e0feab6436306b11e7cb)
Modem firmware: mfw_nrf91x1_2.0.4
Modem model: nRF9151-LACA

  • Hi,

    Thanks for the detailed report. NRF_MODEM_FAULT_HW_WD_RESET (0x2) means the modem firmware became unresponsive and its internal hardware watchdog reset the modem. From your description, this is triggered when LTE is deactivated for a longer period and then re-activated. Could you try fully shutting down and re-initializing the modem as follows, as described in the documentation:

    • When entering shipping mode: stop LTE users and call nrf_modem_shutdown()
    • When exiting shipping mode: call nrf_modem_init() and then reconnect LTE

    If the issue still reproduces after using this sequence, please enable modem traces and share them so it can be investigated further.

    Best Regards,
    Syed Maysum

  • > Can you try fully shutting down and re-initializing the modem in the following way as mentioned in the documentation:

    I unintentionally left out some information in the original post. The activation and de-activation code paths look like this:

    		LOG_INF("Device activated, enable LTE");
    		conn_mgr_all_if_up(false);
    		conn_mgr_all_if_connect(false);
    		...
    		LOG_INF("Device de-activated, disable LTE");
    		conn_mgr_all_if_disconnect(false);
    		conn_mgr_all_if_down(false);

    The connect/disconnect calls do indeed call the functional modes I mentioned, but the `if_up` and `if_down` calls get routed down to lte_net_if_enable and lte_net_if_disable, which are already fully shutting down and re-initialising the modem.

    This was confirmed by adding extra logging inside those functions, but also by the trace module logging (see below).

    > If the issue still reproduces after using this sequence, please enable modem traces and share them so it can be investigated further.

    I have enabled modem traces (attached below), however I am doubtful the trace will be useful in practice, since the tracing module suspends in response to the modem powering down, and there is no logic to re-enable it once it comes back up.

    [00:00:01.783,386] <inf> modem_trace_backend: Modem_trace RTT backend channel 1
    [00:00:01.783,416] <inf> nrf_modem_lib_trace: Trace thread ready
    [00:00:01.784,942] <inf> nrf_modem_lib_trace: Trace level override: 2
    ...
    [00:00:26.409,027] <inf> app: Device de-activated, disable LTE
    [00:00:27.173,004] <inf> epacket_udp: Network disconnected
    [00:00:27.189,422] <inf> nrf_modem_lib_trace: Modem was turned off, no more traces
    ...
    [00:02:02.850,982] <inf> app: Device activated, enable LTE
    [00:02:03.280,090] <err> nrf_modem: Modem has crashed, reason 0x2, PC: 0x468de
    [00:02:03.280,120] <err> modem_monitor: Modem fault, rebooting in 2 seconds...


     1770250638_nrf_modem_trace.zip

    After removing the calls to `conn_mgr_all_if_up/down` (and therefore not shutting down the modem entirely), I am able to transition between shipping and active modes without faults. However this doesn't explain the root cause of this issue, which is how a hardware watchdog can expire less than a second after booting from the shutdown state.

  • Hi,

    Thanks for the clarification. From your testing, the watchdog reset only occurs when the modem is fully shut down and re-initialized (conn_mgr_all_if_down/up), and not when simply disconnecting and reconnecting LTE. This suggests the issue is tied to the modem restart path rather than LTE reactivation itself.

    To help us reproduce and investigate this further, could you please share the nRF Connect SDK version you are using and a minimal code example or sample project that reproduces the issue? With that, we can try to reproduce this locally and determine the next steps.

    Best Regards,
    Syed Maysum

  • > could you please share the nRF Connect SDK version you are using

    I do not use nRF Connect SDK directly, but the modem libraries are extracted from NCS v3.1.0.

    > and a minimal code example or sample project that reproduces the issue?

    The following sample application reproduces the issue on a nRF9161DK.

    https://github.com/Embeint/infuse-sdk/tree/example/nrf91_modem_fault/samples/nrf91_modem_fault

    west build -b nrf9161dk/nrf9161/ns infuse-sdk/samples/nrf91_modem_fault/
    west flash -d build/nrf9161dk/nrf9161/ns/nrf91_modem_fault/
    ...
    *** Booting Zephyr OS build v4.3.0-168-g7eab92b0458b ***
    [00:00:00.401,092] <inf> infuse:        Version: 0.0.0+00000000
    [00:00:00.401,153] <inf> infuse:          Board: [email protected]/nrf9161/ns
    [00:00:02.079,193] <inf> app: Enable LTE for first time
    [00:00:02.121,887] <inf> app: Waiting 15 seconds to enter shipping mode...
    [00:00:17.122,009] <inf> app: Bring down LTE for shipping mode
    [00:00:17.176,696] <inf> app: 65 seconds to exit shipping mode
    [00:01:22.177,886] <inf> app: Exiting shipping mode
    [00:01:22.405,639] <err> nrf_modem: Modem has crashed, reason 0x2, PC: 0x468de
    [00:01:22.405,639] <err> modem_monitor: Modem fault, rebooting in 2 seconds...

    Note that the example is set up to output logs over RTT, not the serial port.

  • Hi,

    Thanks for sharing the sample and detailed logs. We’ll set this up and try reproducing the behavior on our side, and will get back to you early next week with an update.

    Best Regards,
    Syed Maysum

  • Hi,

    We tested the same shutdown, wait and re-initialization sequence in a clean nRF Connect SDK v3.1.0 environment on nRF9161DK. We have attached the main.c and prj.conf used for this test for your reference. The sequence was to connect LTE, bring LTE offline, call nrf_modem_lib_shutdown(), wait for more than 60 seconds, then call nrf_modem_lib_init() and reconnect LTE. In this setup, the modem reliably re-initializes and reconnects after the wait period, and we do not see an NRF_MODEM_FAULT_HW_WD_RESET (0x2).

    *** Booting nRF Connect SDK v3.1.0-6c6e5b32496e ***
    *** Using Zephyr OS v4.1.99-1612683d4010 ***
    [00:00:00.406,127] <inf> app: Init modem library (boot)
    [00:00:00.653,900] <inf> app: nrf_modem_lib_init() err=0
    [00:00:00.653,930] <inf> app: Enable LTE for first time
    [00:00:07.089,691] <inf> app: lte_lc_connect() err=0
    [00:00:07.089,721] <inf> app: Waiting 15 seconds to enter shipping mode...
    [00:00:22.089,782] <inf> app: Bring down LTE for shipping mode
    [00:00:22.549,072] <inf> app: lte_lc_offline() err=0
    [00:00:22.549,072] <inf> app: Shutting down modem library...
    [00:00:22.566,162] <inf> app: nrf_modem_lib_shutdown() err=0
    [00:00:22.566,162] <inf> app: 65 seconds to exit shipping mode
    [00:00:27.566,253] <inf> app: 60 seconds to exit shipping mode
    [00:00:32.566,345] <inf> app: 55 seconds to exit shipping mode
    [00:00:37.566,436] <inf> app: 50 seconds to exit shipping mode
    [00:00:42.566,528] <inf> app: 45 seconds to exit shipping mode
    [00:00:47.566,619] <inf> app: 40 seconds to exit shipping mode
    [00:00:52.566,711] <inf> app: 35 seconds to exit shipping mode
    [00:00:57.566,802] <inf> app: 30 seconds to exit shipping mode
    [00:01:02.566,894] <inf> app: 25 seconds to exit shipping mode
    [00:01:07.566,986] <inf> app: 20 seconds to exit shipping mode
    [00:01:12.567,077] <inf> app: 15 seconds to exit shipping mode
    [00:01:17.567,169] <inf> app: 10 seconds to exit shipping mode
    [00:01:22.567,260] <inf> app: 5 seconds to exit shipping mode
    [00:01:27.567,352] <inf> app: Re-init modem library...
    [00:01:27.815,948] <inf> app: nrf_modem_lib_init() err=0
    [00:01:28.016,052] <inf> app: Exiting shipping mode: reconnect LTE
    [00:01:34.214,813] <inf> app: lte_lc_connect() err=0

    This indicates the watchdog reset is not caused by the modem firmware or the shutdown delay itself, but is likely related to how LTE and the modem are being controlled in the application when using the connection-manager path.

    As next steps to isolate the issue in your setup, we recommend first ensuring that all components that may access the modem are fully stopped before calling conn_mgr_all_if_down(). In particular, please verify that all LTE sockets are closed, any DNS/REST/HTTP clients are stopped, no AT commands are in progress, and modem tracing (if enabled) is idle.

    We also recommend running the LTE down/up sequence from a dedicated application context (for example an application thread or work item), rather than directly from a network or connection-manager callback. As a diagnostic step, you can temporarily bypass conn_mgr_all_if_*() and directly call lte_lc_offline(), nrf_modem_lib_shutdown(), nrf_modem_lib_init(), and lte_lc_connect() to confirm whether the watchdog reset is specific to the connection-manager integration.

    Best Regards,
    Syed Maysum

    4403.prj.conf

    #include <zephyr/kernel.h>
    #include <zephyr/logging/log.h>
    #include <modem/lte_lc.h>
    #include <modem/nrf_modem_lib.h>
    
    LOG_MODULE_REGISTER(app, LOG_LEVEL_INF);
    
    static void lte_connect_once(void)
    {
    	int err = lte_lc_connect();
    	LOG_INF("lte_lc_connect() err=%d", err);
    }
    
    static void lte_go_offline(void)
    {
    	int err = lte_lc_offline();
    	LOG_INF("lte_lc_offline() err=%d", err);
    }
    
    int main(void)
    {
    	int err;
    
    	LOG_INF("Init modem library (boot)");
    	err = nrf_modem_lib_init();
    	LOG_INF("nrf_modem_lib_init() err=%d", err);
    	if (err) {
    		LOG_ERR("Modem lib init failed: %d", err);
    		return 0;
    	}
    
    	LOG_INF("Enable LTE for first time");
    	lte_connect_once();
    
    	LOG_INF("Waiting 15 seconds to enter shipping mode...");
    	k_sleep(K_SECONDS(15));
    
    	LOG_INF("Bring down LTE for shipping mode");
    	lte_go_offline();
    
    	LOG_INF("Shutting down modem library...");
    	err = nrf_modem_lib_shutdown();
    	LOG_INF("nrf_modem_lib_shutdown() err=%d", err);
    
    	int countdown = 65;
    	while (countdown > 0) {
    		LOG_INF("%d seconds to exit shipping mode", countdown);
    		k_sleep(K_SECONDS(5));
    		countdown -= 5;
    	}
    
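    	/* Re-initialize the modem library after the shutdown window of more than 60 s */
    	LOG_INF("Re-init modem library...");
    	err = nrf_modem_lib_init();
    	LOG_INF("nrf_modem_lib_init() err=%d", err);
    	if (err) {
    		LOG_ERR("Modem lib re-init failed: %d", err);
    		return 0;
    	}
    
    	/* Brief pause before requesting LTE again */
    	k_sleep(K_MSEC(200));
    
    	LOG_INF("Exiting shipping mode: reconnect LTE");
    	lte_connect_once();
    
    	k_sleep(K_FOREVER);
    	return 0;
    }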
    

  • > We tested the same shutdown, wait and re-initialization sequence in a clean nRF Connect SDK v3.1.0 environment on nRF9161DK. In this setup, the modem reliably re-initializes and reconnects after the wait period, and we do not see an NRF_MODEM_FAULT_HW_WD_RESET (0x2).

    Was any attempt made to run the minimal reproducible sample that you asked for and I provided?

    > This indicates the watchdog reset is not caused by the modem firmware or the shutdown delay itself, but is likely related to how LTE and the modem are being controlled in the application when using the connection-manager path.

    That is not a surprise (in that extra work needs to happen to trigger the fault, e.g. UDP packets).

    > As next steps to isolate the issue in your setup, we recommend first ensuring that all components that may access the modem are fully stopped before calling conn_mgr_all_if_down(). In particular, please verify that all LTE sockets are closed, any DNS/REST/HTTP clients are stopped, no AT commands are in progress, and modem tracing (if enabled) is idle.

    So the suggestion is to throw out all the integrations with Zephyr APIs that Nordic has written? I have little interest in re-architecting my networking flow to diagnose a fault in a component I don't control and can't debug, just because you haven't run the sample I have already spent time providing.

    > We also recommend running the LTE down/up sequence from a dedicated application context (for example an application thread or work item), rather than directly from a network or connection-manager callback.

    You can already see this is being done in the provided example.

    > to confirm whether the watchdog reset is specific to the connection-manager integration.

    This is a watchdog failure inside the modem core. How could it have anything to do with the connection manager? The modem literally faults before `nrf_modem_lib_init` even returns.

  • Hi,

    Thanks for the detailed feedback. I understand your point, and I agree that reworking application architecture just to debug a modem fault isn’t ideal.

    I am currently trying to build and run the exact minimal sample you shared, using your west workspace. While setting this up in a clean environment, I am hitting a TF-M build error on nrf9161dk/nrf9161/ns. I am checking this internally, and once we have the sample building, we'll proceed with reproducing the modem watchdog issue and report back.

    Best Regards,
    Syed Maysum
