NCS v3.3.0: OPENTHREAD_BORDER_ROUTING + OPENTHREAD_TREL wedges MLE re-attach

Summary

With CONFIG_OPENTHREAD_BORDER_ROUTING=y (selected by default by CONFIG_OPENTHREAD_ZEPHYR_BORDER_ROUTER=y) and CONFIG_OPENTHREAD_TREL=y (also the default), the OpenThread thread blocks indefinitely on the very first MAC TX during MLE re-attach. The device boots, reads its persisted dataset, sends a Link Request, and then makes no further progress in OT. The mesh never attaches; ot state stays disabled/detached indefinitely.

The Thread radio path itself is fine: disabling TREL removes the wedge completely, with no other change.

Environment

  • nRF Connect SDK: v3.3.0 (v3.3.0-ba167d9f3db4)
  • Zephyr: v4.3.99-fd9204a02d52 (NCS bundle)
  • OpenThread: pinned to upstream commit a03011cf7 via the app's west override (OPENTHREAD_SOURCES=y is required because the prebuilt Nordic OT libraries don't expose the BR API surface; that is a separate issue, see DevZone #127950).
  • Board: nrf54lm20dk/nrf54lm20a/cpuapp (also reproducible on nrf5340dk/nrf5340/cpuapp per earlier bring-up).
  • App: a Border Router on top of a W5500 SPI Ethernet module (CONFIG_NET_L2_ETHERNET=y, CONFIG_ETH_W5500=y, CONFIG_NET_IPV6=y, CONFIG_NET_IPV4=n).

Relevant Kconfig (failing case):

CONFIG_OPENTHREAD=y
CONFIG_OPENTHREAD_SOURCES=y
CONFIG_OPENTHREAD_THREAD_VERSION_1_4=y
CONFIG_OPENTHREAD_FTD=y
CONFIG_OPENTHREAD_ZEPHYR_BORDER_ROUTER=y
CONFIG_OPENTHREAD_BORDER_ROUTING=y
CONFIG_OPENTHREAD_TREL=y                 # default
CONFIG_OPENTHREAD_TREL_MANAGE_DNSSD=y    # default


Symptom

Boot log (trimmed) up to the wedge:

[I] Mle---: Send Link Request (ff02::2)
[I] MeshForwarder-: Sent IPv6 UDP msg, len:82, ... to:0xffff
                                                                <-- nothing further from OT

Stack trace at the point of stall (taken with THREAD_ANALYZER and a J-Link halt; innermost frame first) shows the OT thread parked inside the multi-link MAC TX path:

otPlatTrelSend           (zephyr/modules/openthread/platform/trel.c)
  -> Mac::Links::Send (...)
  -> MeshForwarder::HandleSentFrame
  -> ot_task_main

otPlatTrelSend() in Zephyr's TREL platform glue performs network setup work (socket bind, setsockopt(SO_BINDTODEVICE), net_mgmt registration) inline on the OT thread. At startup these calls block waiting for the AIL (adjacent infrastructure link) netif and IPv6 to become ready. The OT thread is itself the one driving the net-iface bring-up that they are waiting on, so the system deadlocks.

Why TREL is even on the TX path here

With OPENTHREAD_BORDER_ROUTING=y OpenThread compiles Mac::Links (the multi-link MAC) instead of the single-link 802.15.4-only path. Every MAC TX iterates both the 802.15.4 link and the TREL link. So even on a leaf-but-BR-capable device with no TREL peers, every TX attempt goes through otPlatTrelSend, which is enough to block the thread on the very first packet.
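
To make that fan-out concrete, here is a simplified model in C. It is not OpenThread source, and every name in it (link_id, link_tx_fn, multi_link_tx, trel_ready) is hypothetical. It only shows the shape of the problem: one TX call visits every compiled-in link on the same thread, so a hook that blocks stalls the whole call. The TREL stand-in below refuses immediately when its interface is not ready, so the call returns and the caller keeps going.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model only; none of these names come from OpenThread. */
enum link_id { LINK_154 = 0, LINK_TREL = 1, LINK_COUNT };

/* A link's TX hook returns true if it accepted the frame. */
typedef bool (*link_tx_fn)(const uint8_t *frame, size_t len);

static bool trel_ready;               /* models "AIL netif is up" */
static unsigned tx_count[LINK_COUNT]; /* frames accepted per link */

static bool tx_154(const uint8_t *frame, size_t len)
{
    (void)frame;
    (void)len;
    tx_count[LINK_154]++;
    return true;
}

static bool tx_trel(const uint8_t *frame, size_t len)
{
    (void)frame;
    (void)len;
    if (!trel_ready) {
        return false; /* refuse immediately instead of waiting */
    }
    tx_count[LINK_TREL]++;
    return true;
}

static const link_tx_fn links[LINK_COUNT] = { tx_154, tx_trel };

/* Visit every link; return a bitmask of the links that accepted the frame. */
static unsigned multi_link_tx(const uint8_t *frame, size_t len)
{
    unsigned accepted = 0;

    for (int i = 0; i < LINK_COUNT; i++) {
        if (links[i](frame, len)) {
            accepted |= 1u << i;
        }
    }
    return accepted;
}
```

In this model, a TX before the TREL side is ready is accepted by the 802.15.4 link only; once trel_ready is set, both links accept the frame. If tx_trel waited for readiness instead of returning, multi_link_tx would never return, which matches the stall we see.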

Workaround we are running

Disabling TREL eliminates the wedge:

CONFIG_OPENTHREAD_TREL=n
CONFIG_OPENTHREAD_TREL_MANAGE_DNSSD=n

Two extra pieces are needed to make this build & link cleanly with OPENTHREAD_ZEPHYR_BORDER_ROUTER=y — those are the subject of a separate DevZone post (devzone-trel-build-broken.md):

  1. Provide a stub otPlatTrelHandleReceived and a no-op trel_plat_init returning OT_ERROR_NONE.
  2. Drop zephyr/modules/openthread/platform/trel.c from the build via the app's CMakeLists.txt.

After that, MLE attaches in <1 s on every boot, BR services come up, OMR / on-link prefixes are advertised, etc.

Suggested upstream fixes (not mutually exclusive)

  1. Defer TREL platform init. Make trel_plat_init and the first otPlatTrelSend non-blocking with respect to AIL netif readiness. Either queue outbound TREL packets until the AIL is up, or return OT_ERROR_INVALID_STATE on TX-before-init so OT keeps moving on the 802.15.4 link.
  2. Decouple "BR feature group" from OPENTHREAD_TREL. OPENTHREAD_BORDER_ROUTING=y should not implicitly force the multi-link MAC path through TREL when no TREL peers exist. A device should be able to be a BR-on-Ethernet without participating in TREL.
  3. Better diagnostics. Even if the deadlock cannot be eliminated immediately, otPlatTrelSend should log the offending blocking call so this presents as a clear error rather than a silent stall.
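
A minimal sketch of the "queue until the AIL is up" variant of fix 1 follows. It is a standalone C model under our own assumptions, not the Zephyr TREL glue: trel_send_nonblocking, trel_on_ail_ready, wire_send, and the queue sizes are all hypothetical names and values chosen for illustration. The point is that the TX path never waits; packets sent before the interface is ready are stored (or dropped when the queue is full) and flushed from the interface-up event.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TREL_Q_DEPTH 4   /* hypothetical: packets held while the AIL is down */
#define TREL_MAX_LEN 128 /* hypothetical: max payload stored per slot */

struct trel_pkt {
    uint16_t len;
    uint8_t data[TREL_MAX_LEN];
};

static struct trel_pkt trel_q[TREL_Q_DEPTH]; /* ring buffer of pending packets */
static unsigned trel_q_head;
static unsigned trel_q_count;
static bool ail_ready;          /* set once the AIL netif and IPv6 are up */
static unsigned wire_tx_count;  /* counts packets handed to the "socket" */
static unsigned dropped_count;  /* counts packets dropped before readiness */

/* Stand-in for the real socket send; only counts calls in this model. */
static void wire_send(const uint8_t *buf, uint16_t len)
{
    (void)buf;
    (void)len;
    wire_tx_count++;
}

/* Non-blocking TX: sends, queues, or drops, but never waits. */
static bool trel_send_nonblocking(const uint8_t *buf, uint16_t len)
{
    if (ail_ready) {
        wire_send(buf, len);
        return true;
    }
    if (len > TREL_MAX_LEN || trel_q_count == TREL_Q_DEPTH) {
        dropped_count++;
        return false;
    }
    struct trel_pkt *slot = &trel_q[(trel_q_head + trel_q_count) % TREL_Q_DEPTH];
    slot->len = len;
    memcpy(slot->data, buf, len);
    trel_q_count++;
    return true;
}

/* Called from the interface-up event, outside the TX path. */
static void trel_on_ail_ready(void)
{
    ail_ready = true;
    while (trel_q_count > 0) {
        struct trel_pkt *slot = &trel_q[trel_q_head];
        wire_send(slot->data, slot->len);
        trel_q_head = (trel_q_head + 1) % TREL_Q_DEPTH;
        trel_q_count--;
    }
}
```

Dropping instead of queueing is the simpler option and is probably acceptable here, since a device with no TREL peers loses nothing by discarding early TREL packets. Either way, the 802.15.4 link is never held up by AIL readiness.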

Reproducer

Build with west build -b nrf54lm20dk/nrf54lm20a/cpuapp <app> using the failing Kconfig above, provision a dataset, and reset. MLE never attaches.

Set CONFIG_OPENTHREAD_TREL=n (plus the build/link workarounds above) and the wedge is gone.
