nRF91 modem watchdog fault when re-activating LTE

I have an application with a shipping mode. Entering shipping mode calls `conn_mgr_all_if_disconnect`, and leaving it calls `conn_mgr_all_if_connect`.
These functions eventually end up in `lte_net_if.c`, running `lte_lc_func_mode_set` with either `LTE_LC_FUNC_MODE_ACTIVATE_LTE` or `LTE_LC_FUNC_MODE_DEACTIVATE_LTE`.

Connecting to the LTE network on boot works fine, as does entering shipping mode.
The modem fault occurs when attempting to re-activate the modem after it has already been active.
If less than a minute has passed since the modem was de-activated, it re-activates fine and the application continues as per normal.
If over a minute has passed, the modem faults more or less immediately with:

[00:01:26.754,638] <inf> app: Device activated, enable LTE
[00:01:26.982,513] <err> nrf_modem: Modem has crashed, reason 0x2, PC: 0x468de

This issue is 100% reproducible, always with the same error code and program counter.
The error code corresponds to `NRF_MODEM_FAULT_HW_WD_RESET`.
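
For comparing the fault line across runs, a small host-side helper can pull the reason and PC out of the log. This is a minimal sketch that assumes the log format shown above. `parse_modem_fault`, `struct modem_fault` and `fault_reason_name` are hypothetical names, and only the 0x2 mapping is taken from this thread.

```c
#include <stdio.h>
#include <string.h>

/* Only this value is confirmed in the thread; other reasons are not mapped. */
#define FAULT_REASON_HW_WD_RESET 0x2u

/* Hypothetical container for the two fields of the fault line. */
struct modem_fault {
	unsigned int reason;
	unsigned int pc;
};

/* Returns 0 if the line contains a modem fault report, -1 otherwise. */
int parse_modem_fault(const char *line, struct modem_fault *out)
{
	const char *p = strstr(line, "Modem has crashed, reason ");

	if (p == NULL) {
		return -1;
	}
	/* %x accepts the optional 0x prefix used in the log. */
	if (sscanf(p, "Modem has crashed, reason %x, PC: %x",
		   &out->reason, &out->pc) != 2) {
		return -1;
	}
	return 0;
}

/* Maps the one reason code discussed here to its symbolic name. */
const char *fault_reason_name(unsigned int reason)
{
	return (reason == FAULT_REASON_HW_WD_RESET)
		       ? "NRF_MODEM_FAULT_HW_WD_RESET"
		       : "unknown (not covered by this sketch)";
}
```

Feeding each captured log line through `parse_modem_fault` makes it easy to check whether the reason and PC really are identical between runs.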

Replacing `LTE_LC_FUNC_MODE_ACTIVATE_LTE` and `LTE_LC_FUNC_MODE_DEACTIVATE_LTE` with `LTE_LC_FUNC_MODE_NORMAL` and `LTE_LC_FUNC_MODE_OFFLINE` does not change the behavior of the fault.

nrfxlib version: Tag v3.1.1 (Commit 3b210a24d3bc7ecfc268e0feab6436306b11e7cb)
Modem firmware: mfw_nrf91x1_2.0.4
Modem model: nRF9151-LACA

  • Hi,

    Thanks for the detailed report. NRF_MODEM_FAULT_HW_WD_RESET (0x2) means the modem firmware has become unresponsive and its internal hardware watchdog has reset the modem. From your description, this is triggered when LTE has been deactivated for a longer period and is then re-activated. Can you try fully shutting down and re-initializing the modem as described in the documentation:

    • When entering shipping mode: stop LTE users and call nrf_modem_shutdown()
    • When exiting shipping mode: call nrf_modem_init() and then reconnect LTE

    If the issue still reproduces after using this sequence, please enable modem traces and share them so it can be investigated further.

    Best Regards,
    Syed Maysum

  • > Can you try fully shutting down and re-initializing the modem in the following way as mentioned in the documentation:

    I unintentionally left out some information in the original post. The activation and de-activation code paths look like this:

    LOG_INF("Device activated, enable LTE");
    conn_mgr_all_if_up(false);
    conn_mgr_all_if_connect(false);
    ...
    LOG_INF("Device de-activated, disable LTE");
    conn_mgr_all_if_disconnect(false);
    conn_mgr_all_if_down(false);

    The connect/disconnect calls do indeed set the functional modes I mentioned, but the `if_up` and `if_down` calls get routed down to `lte_net_if_enable` and `lte_net_if_disable`, which already fully shut down and re-initialise the modem.

    This was confirmed by adding extra logging inside those functions, but also by the trace module logging (see below).

    > If the issue still reproduces after using this sequence, please enable modem traces and share them so it can be investigated further.

    I have enabled modem traces (attached below), however I am doubtful the trace will be useful in practice, since the tracing module suspends in response to the modem powering down, and there is no logic to re-enable it once it comes back up.

    [00:00:01.783,386] <inf> modem_trace_backend: Modem_trace RTT backend channel 1
    [00:00:01.783,416] <inf> nrf_modem_lib_trace: Trace thread ready
    [00:00:01.784,942] <inf> nrf_modem_lib_trace: Trace level override: 2
    ...
    [00:00:26.409,027] <inf> app: Device de-activated, disable LTE
    [00:00:27.173,004] <inf> epacket_udp: Network disconnected
    [00:00:27.189,422] <inf> nrf_modem_lib_trace: Modem was turned off, no more traces
    ...
    [00:02:02.850,982] <inf> app: Device activated, enable LTE
    [00:02:03.280,090] <err> nrf_modem: Modem has crashed, reason 0x2, PC: 0x468de
    [00:02:03.280,120] <err> modem_monitor: Modem fault, rebooting in 2 seconds...


     1770250638_nrf_modem_trace.zip

    After removing the calls to `conn_mgr_all_if_up/down` (and therefore not shutting down the modem entirely), I am able to transition between shipping and active modes without faults. However, this doesn't explain the root cause, namely how a hardware watchdog can expire less than a second after the modem boots from the shutdown state.
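
    The fault-free variant can be sketched roughly as follows. This is a host-buildable illustration rather than the application code: `shipping_mode_set`, `struct shipping_ctx`, `struct lte_link_ops` and the stub hooks are hypothetical names, and in a Zephyr build the hooks would presumably wrap `conn_mgr_all_if_connect(false)` and `conn_mgr_all_if_disconnect(false)`. The point is that `conn_mgr_all_if_up/down` are never called, so the modem is never fully shut down.

    ```c
    #include <stdbool.h>

    /* Hypothetical hooks. In a Zephyr build these would wrap
     * conn_mgr_all_if_connect(false) and conn_mgr_all_if_disconnect(false).
     */
    struct lte_link_ops {
    	int (*connect)(void);
    	int (*disconnect)(void);
    };

    /* Hypothetical application state. */
    struct shipping_ctx {
    	const struct lte_link_ops *ops;
    	bool shipping; /* true while LTE is de-activated */
    };

    /* Host-side stand-ins so the sketch builds without Zephyr. */
    int connect_calls;
    int disconnect_calls;
    int stub_connect(void) { connect_calls++; return 0; }
    int stub_disconnect(void) { disconnect_calls++; return 0; }
    const struct lte_link_ops stub_ops = { stub_connect, stub_disconnect };

    /* Enter (enable = true) or leave (enable = false) shipping mode by
     * changing only the connection state. Returns 0 on success or the
     * error code from the hook; the state is updated only on success.
     */
    int shipping_mode_set(struct shipping_ctx *ctx, bool enable)
    {
    	int rc;

    	if (ctx->shipping == enable) {
    		return 0; /* already in the requested state */
    	}
    	rc = enable ? ctx->ops->disconnect() : ctx->ops->connect();
    	if (rc == 0) {
    		ctx->shipping = enable;
    	}
    	return rc;
    }
    ```

    This only captures the workaround. It says nothing about why the full shutdown and re-initialisation path faults.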

  • Hi,

    Thanks for the detailed feedback. I understand your point, and I agree that reworking application architecture just to debug a modem fault isn’t ideal.

    I am currently trying to build and run the exact minimal sample you shared, using your west workspace. While setting this up in a clean environment, I am hitting a TF-M build error on nrf9161dk/nrf9161/ns. I am checking it internally, and once we have the sample building, we’ll proceed with reproducing the modem watchdog issue and report back.

    Best Regards,
    Syed Maysum

    > I am currently trying to build and run the exact minimal sample you shared, using your west workspace. While setting this up in a clean environment, I am hitting a TF-M build error on nrf9161dk/nrf9161/ns.

    Can you share the build error? A clean environment build works fine locally for me.

    (.zephyr_venv) jordan@TAURUS:~/code$ mkdir fault_demo
    (.zephyr_venv) jordan@TAURUS:~/code$ cd fault_demo/
    (.zephyr_venv) jordan@TAURUS:~/code/fault_demo$ west init -m git@github.com:Embeint/infuse-sdk.git
    (.zephyr_venv) jordan@TAURUS:~/code/fault_demo$ cd infuse-sdk/
    (.zephyr_venv) jordan@TAURUS:~/code/fault_demo/infuse-sdk$ git checkout example/nrf91_modem_fault
    (.zephyr_venv) jordan@TAURUS:~/code/fault_demo/infuse-sdk$ west update
    (.zephyr_venv) jordan@TAURUS:~/code/fault_demo/infuse-sdk$ west build -b nrf9161dk/nrf9161/ns ./samples/nrf91_modem_fault/
    ...
    [506/506] Linking C executable zephyr/zephyr.elf
    Memory region         Used Size  Region Size  %age Used
               FLASH:      193308 B       704 KB     26.81%
                 RAM:       79684 B       168 KB     46.32%
         RetainedMem:          0 GB        256 B      0.00%
            IDT_LIST:          0 GB        32 KB      0.00%
    Generating files from /home/jordan/code/fault_demo/infuse-sdk/build/zephyr/zephyr.elf for board: nrf9161dk

  • Hi,

    Thanks for sharing your build steps.

    On my side (Windows environment), I set up a clean workspace and followed essentially the same flow as you:

    mkdir fault_demo
    cd fault_demo
    west init -m https://github.com/Embeint/infuse-sdk.git
    west update
    cd infuse-sdk
    git checkout example/nrf91_modem_fault
    west update
    west build -b nrf9161dk/nrf9161/ns samples/nrf91_modem_fault -p always

    However, the build fails during the TF-M stage with:

    zephyr/modules/trusted-firmware-m/nordic/include/RTE_Device.h:14:10:
    fatal error: zephyr/devicetree.h: No such file or directory

    The file does exist in the workspace, so it appears that the TF-M compile step is not picking up Zephyr’s include path on my host. I’m checking this internally to proceed with reproducing the watchdog issue and will get back to you by tomorrow.

    Best Regards,
    Syed Maysum

  • Hi,

    I’ve now been able to reproduce the behavior on our side using the provided nrf91_modem_fault example on nrf9161dk.

    (For transparency: on Windows we initially hit a TF-M build issue related to symlink handling in the Zephyr workspace. After enabling Windows Developer Mode and ensuring Git symlinks were properly supported, the project built correctly.)

    The sequence behaves as you described:

    • LTE connects correctly on boot, and entering shipping mode works fine
    • If LTE is reactivated within ~1 minute, it reconnects successfully
    • If LTE is reactivated after ~60+ seconds of deactivation, the modem consistently crashes with: 
      [00:01:23.234,283] <err> nrf_modem: Modem has crashed, reason 0x2, PC: 0x467da

    The issue is fully reproducible and occurs consistently with the same fault reason and program counter. Since this appears to be modem-side behavior during functional mode reactivation, we are escalating this internally for further investigation.

    I will update you as soon as we have feedback from the modem team.

    Best Regards,
    Syed Maysum

  • Hi Syed,

    Thank you for persisting with getting the reproducing sample working and for escalating internally.

    The Windows error is interesting. It would be amazing if you could submit a PR upstream to update https://docs.zephyrproject.org/latest/services/tfm/overview.html with the error that can occur on Windows and how to fix it, as I am certain you will not be the only person to run into it.
