Summary
On NCS v3.3.0 with CONFIG_OPENTHREAD_ZEPHYR_BORDER_ROUTER=y and a single Ethernet AIL interface (W5500), the Border Router never initialises:
- ot br state → uninitialized (forever)
- ot br omrprefix → Error 13: InvalidState
- ot br onlinkprefix → Error 13: InvalidState
The mesh comes up fine — the OTBR becomes leader, the child node attaches, MLE advertisements flow both ways. The problem is purely that the BR side of the device never initialises.
The proximate cause is that the Zephyr L2 BR glue (zephyr/subsys/net/l2/openthread/openthread_border_router.c) calls openthread_start_border_router_services() from its NET_EVENT_IF_UP handler, and that function returns -EIO. The function contains no logging at all, so the failure is silent and nothing indicates which inner step failed.
Replaying the same inner platform-init sequence from app context after the eth IF_UP event succeeds reliably. So the steps themselves are fine; something about how they are invoked from the net_mgmt callback context is wrong.
Environment
- nRF Connect SDK: v3.3.0 (v3.3.0-ba167d9f3db4)
- Zephyr: v4.3.99-fd9204a02d52 (NCS bundle)
- OpenThread: OPENTHREAD_SOURCES=y, upstream pin a03011cf7
- Board: nrf54lm20dk/nrf54lm20a/cpuapp
- AIL: W5500 SPI Ethernet, IPv6-only (CONFIG_NET_IPV4=n)
- BR feature group: BORDER_ROUTING, BORDER_AGENT, MULTICAST_DNS, DNSSD_SERVER, BACKBONE_ROUTER, SRP_SERVER, NETDATA_PUBLISHER, DHCP6_PD, DHCP6_PD_CLIENT
- TREL disabled (separate DevZone post: devzone-trel-mle-wedge.md)
Symptom
Boot log (trimmed):
[I] BorderAgent---: Border Agent start listening on port 49153
[I] BorderRouting-: RIO Preference changed: low -> medium
...mesh forms, OTBR is leader, child attaches...
[I] eth_w5500: w5500@0: Link up
[I] eth_w5500: w5500@0: Link speed 100 Mb, full duplex
<-- nothing else BR-related
Shell:
uart:~$ ot br state
uninitialized
Done
Diagnosis
We added an app-level NET_EVENT_IF_UP listener as a probe and observed:
br_probe: net_mgmt IF_UP on iface 2 (ethernet) admin=up carrier=on dormant=off
br_probe: calling openthread_start_border_router_services()
br_probe: openthread_start_border_router_services() = -5
So NET_EVENT_IF_UP does fire on eth, the L2 BR glue's ail_connection_handler does receive it, and openthread_start_border_router_services() returns -EIO. The wrapper in openthread_border_router.c looks like:
int openthread_start_border_router_services(struct net_if *ot_iface, struct net_if *ail_iface)
{
    ...
    if (otMdnsSetLocalHostName(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (trel_plat_init(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (infra_if_init(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (udp_plat_init(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (mdns_plat_socket_init(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (dhcpv6_pd_client_init(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (border_agent_init(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (otBorderRoutingInit(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    ...
}
There are zero LOG_* calls in this function. So the actual failing step is unobservable without modifying NCS sources.
To find it, we re-implemented the same sequence in app code, called each step explicitly, and logged the per-step otError. All steps return OT_ERROR_NONE when invoked from app context after the eth iface IF_UP event:
br_probe step: otMdnsSetLocalHostName(...) -> 0
br_probe step: trel_plat_init(...) -> 0
br_probe step: infra_if_init(...) -> 0
br_probe step: udp_plat_init(...) -> 0
br_probe step: mdns_plat_socket_init(...) -> 0
br_probe step: dhcpv6_pd_client_init(...) -> 0
br_probe step: border_agent_init(...) -> 0
br_probe step: otBorderRoutingInit(...) -> 0
br_probe step: otBorderRoutingSetEnabled(..., true) -> 0
br_probe step: otPlatInfraIfStateChanged(..., true) -> 0
After this sequence runs, ot br state reports running, the OMR and on-link prefixes are generated and advertised, RAs flow on iface 2, the fc00::/7 external route is published into Thread netdata, and the child SLAAC-configures an OMR address. Everything works.
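For readers who want to reproduce the per-step log, the replay can be organised as a small table of named steps. The following is only a host-side sketch of that pattern, not our actual probe source: struct br_step, br_probe_run() and the stub_* functions are hypothetical names, and the parameterless thunk signature is an assumption (the real helpers take arguments, so each thunk would wrap one call). The sketch deliberately keeps running after a failure so that a single pass reports every step; whether that is safe on real hardware depends on the steps involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for otError; only the "0 means success" convention is relied on. */
typedef int step_result_t;

/* One bring-up step: a printable name plus a thunk that performs the call. */
struct br_step {
    const char *name;
    step_result_t (*run)(void);
};

/*
 * Run every step in order and log each result in the
 * "br_probe step: <name> -> <code>" form used in the log above.
 * Returns the index of the first failing step, or -1 if all succeeded.
 */
static int br_probe_run(const struct br_step *steps, size_t count)
{
    int first_failure = -1;

    assert(steps != NULL);

    for (size_t i = 0; i < count; i++) {
        step_result_t rc = steps[i].run();

        printf("br_probe step: %s -> %d\n", steps[i].name, rc);
        if (rc != 0 && first_failure < 0) {
            first_failure = (int)i;
        }
    }

    return first_failure;
}

/* Hypothetical stubs; a real probe would wrap infra_if_init() and friends. */
static step_result_t stub_ok(void)   { return 0; }
static step_result_t stub_fail(void) { return 7; }
```

In the real probe each table entry would name one of the calls listed above, so the log line identifies the failing step directly.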
Hypothesis
The wrapper's failure on the very first call but success on app re-entry suggests one of:
- Iface index race. The wrapper takes ail_iface_index = net_if_get_by_iface(ail_iface) very early. At the moment the L2 callback fires there may be transient state (carrier just came up, but some net_if bookkeeping still in progress) that causes one of the inner socket binds to return non-zero.
- Mutex re-entry. The wrapper calls openthread_mutex_lock() and several inner steps invoke OT APIs that also lock. From a Zephyr net_mgmt callback, the locking nesting may differ from app context.
- border_agent_init re-entry. By the time the L2 IF_UP fires, the OT core has already started its own Border Agent listener (we see Border Agent start listening on port 49153 ~1 s before eth IF_UP). A second call to border_agent_init() on a running BA could reasonably return non-OT_ERROR_NONE.
We were unable to bisect further without modifying NCS sources.
Workaround we are running
src/br_start.c: an app-level NET_EVENT_IF_UP listener that, on the first eth IF_UP, calls each platform-init helper directly and then otBorderRoutingInit/SetEnabled/otPlatInfraIfStateChanged. This makes BR come up reliably and ot br state reach running within ~50 ms of the eth carrier coming up.
This is brittle because we are linking against private NCS symbols (infra_if_init, udp_plat_init, mdns_plat_socket_init, border_agent_init, dhcpv6_pd_client_init) by re-declaring them in our own source — they are non-static but only declared in the private platform-zephyr.h.
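The "first eth IF_UP only" behaviour of the listener can be expressed with a one-shot guard, so that link flaps do not re-run the bring-up. The sketch below is an illustration under assumptions, not the contents of src/br_start.c: struct br_once, br_once_init(), br_once_on_if_up() and counting_start() are hypothetical names, and C11 atomic_flag stands in for whatever atomic primitive the application actually uses. A failed attempt is not retried in this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bring-up hook; returns 0 on success. */
typedef int (*br_start_fn)(void *ctx);

struct br_once {
    atomic_flag started; /* set once the first IF_UP has claimed the bring-up */
    br_start_fn start;
    void *ctx;
};

static void br_once_init(struct br_once *o, br_start_fn start, void *ctx)
{
    assert(o != NULL && start != NULL);
    atomic_flag_clear(&o->started);
    o->start = start;
    o->ctx = ctx;
}

/*
 * Call from the IF_UP handler. Returns true if this call ran the bring-up
 * (its result is stored in *rc_out when rc_out is non-NULL), false if an
 * earlier event had already claimed it.
 */
static bool br_once_on_if_up(struct br_once *o, int *rc_out)
{
    assert(o != NULL);

    if (atomic_flag_test_and_set(&o->started)) {
        return false; /* a previous IF_UP already started the services */
    }

    int rc = o->start(o->ctx);

    if (rc_out != NULL) {
        *rc_out = rc;
    }
    return true;
}

/* Hypothetical stand-in for the real bring-up: counts how often it ran. */
static int counting_start(void *ctx)
{
    (*(int *)ctx)++;
    return 0;
}
```

The test-and-set makes the claim atomic, so two IF_UP events delivered close together still result in a single bring-up attempt.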
Suggested fixes
- Add per-step logging in openthread_start_border_router_services(). This single change would have made the diagnosis trivial. Replace each silent goto exit with LOG_ERR("step X failed: %u", err); goto exit;.
- Investigate the wrapper's -EIO from net_mgmt context. Our reproducer is small and we are happy to share it. The likely candidates are listed in the Hypothesis section above.
- Optionally export the platform-init helpers as a public, stable API, or add a public retry / kick API (openthread_border_router_kick(struct net_if *ail)) so apps that need to work around bring-up timing can do so without linking against private symbols.
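As a rough illustration of the first suggestion, the repeated check-and-goto pattern could be folded into one macro that logs the failing call before jumping to exit. This is a host-side sketch only: BR_STEP, the step_* stubs and start_services_sketch() are hypothetical, LOG_ERR is redefined here with fprintf so the snippet builds outside Zephyr, and the real wrapper would compare against OT_ERROR_NONE rather than 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Host stand-in for Zephyr's LOG_ERR(); in-tree code would use the real macro. */
#define LOG_ERR(...)                      \
    do {                                  \
        fprintf(stderr, __VA_ARGS__);     \
        fputc('\n', stderr);              \
    } while (0)

/*
 * Run one step. On a non-zero result, log the text of the failing call and
 * its return code, set error to -EIO and jump to exit. Relies on a local
 * variable named "error" and a label named "exit" in the calling function.
 */
#define BR_STEP(call)                                 \
    do {                                              \
        int rc_ = (call);                             \
        if (rc_ != 0) {                               \
            LOG_ERR("%s failed: %d", #call, rc_);     \
            error = -EIO;                             \
            goto exit;                                \
        }                                             \
    } while (0)

/* Hypothetical stubs standing in for the platform-init helpers. */
static int step_a(void)     { return 0; }
static int step_b(int fail) { return fail ? 13 : 0; }
static int step_c(int *ran) { *ran = 1; return 0; }

/* Same control flow as the wrapper: stop at the first failing step. */
static int start_services_sketch(int fail_b, int *c_ran)
{
    int error = 0;

    assert(c_ran != NULL);
    *c_ran = 0;

    BR_STEP(step_a());
    BR_STEP(step_b(fail_b));
    BR_STEP(step_c(c_ran));

exit:
    return error;
}
```

With fail_b set, the sketch logs "step_b(fail_b) failed: 13" and returns -EIO without running step_c, which is the information that was missing when the wrapper failed on our board.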