NCS v3.3.0: openthread_start_border_router_services() fails silently → ot br state stuck uninitialized

Summary

On NCS v3.3.0 with CONFIG_OPENTHREAD_ZEPHYR_BORDER_ROUTER=y and a single Ethernet AIL interface (W5500), the Border Router never initialises:

  • ot br state → uninitialized (forever)
  • ot br omrprefix → Error 13: InvalidState
  • ot br onlinkprefix → Error 13: InvalidState

The mesh comes up fine — the OTBR becomes leader, the child node attaches, MLE advertisements flow both ways. The problem is purely that the BR side of the device never initialises.

The immediate cause is that the Zephyr L2 BR glue (zephyr/subsys/net/l2/openthread/openthread_border_router.c) calls openthread_start_border_router_services() from its NET_EVENT_IF_UP handler, and that function returns -EIO. The function contains no logging at all, so the failure is silent and nothing indicates which inner step failed.

Replaying the same inner platform-init sequence from app context after the eth IF_UP event succeeds reliably. So the steps themselves are fine; something about how they are invoked from the net_mgmt callback context is wrong.

Environment

  • nRF Connect SDK: v3.3.0 (v3.3.0-ba167d9f3db4)
  • Zephyr: v4.3.99-fd9204a02d52 (NCS bundle)
  • OpenThread: OPENTHREAD_SOURCES=y, upstream pin a03011cf7
  • Board: nrf54lm20dk/nrf54lm20a/cpuapp
  • AIL: W5500 SPI Ethernet, IPv6-only (CONFIG_NET_IPV4=n)
  • BR feature group: BORDER_ROUTING, BORDER_AGENT, MULTICAST_DNS, DNSSD_SERVER, BACKBONE_ROUTER, SRP_SERVER, NETDATA_PUBLISHER, DHCP6_PD, DHCP6_PD_CLIENT
  • TREL disabled (separate DevZone post — devzone-trel-mle-wedge.md)

Symptom

Boot log (trimmed):

[I] BorderAgent---: Border Agent start listening on port 49153
[I] BorderRouting-: RIO Preference changed: low -> medium
...mesh forms, OTBR is leader, child attaches...
[I] eth_w5500: w5500@0: Link up
[I] eth_w5500: w5500@0: Link speed 100 Mb, full duplex
                                                          <-- nothing else BR-related

Shell:

uart:~$ ot br state
uninitialized
Done



Diagnosis

We added an app-level NET_EVENT_IF_UP listener as a probe and observed:

br_probe: net_mgmt IF_UP on iface 2 (ethernet) admin=up carrier=on dormant=off
br_probe: calling openthread_start_border_router_services()
br_probe: openthread_start_border_router_services() = -5

So NET_EVENT_IF_UP does fire on eth, the L2 BR glue's ail_connection_handler does receive it, and openthread_start_border_router_services() returns -EIO. The wrapper in openthread_border_router.c looks like:

int openthread_start_border_router_services(struct net_if *ot_iface,
                                             struct net_if *ail_iface)
{
    ...
    if (otMdnsSetLocalHostName(...) != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (trel_plat_init(...)         != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (infra_if_init(...)          != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (udp_plat_init(...)          != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (mdns_plat_socket_init(...)  != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (dhcpv6_pd_client_init(...)  != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (border_agent_init(...)      != OT_ERROR_NONE) { error = -EIO; goto exit; }
    if (otBorderRoutingInit(...)    != OT_ERROR_NONE) { error = -EIO; goto exit; }
    ...
}

There are zero LOG_* calls in this function. So the actual failing step is unobservable without modifying NCS sources.

To find it, we re-implemented the same sequence in app code, called each step explicitly, and logged the per-step otError. All steps return OT_ERROR_NONE when invoked from app context after the eth iface IF_UP event:

br_probe step: otMdnsSetLocalHostName(...)             -> 0
br_probe step: trel_plat_init(...)                     -> 0
br_probe step: infra_if_init(...)                      -> 0
br_probe step: udp_plat_init(...)                      -> 0
br_probe step: mdns_plat_socket_init(...)              -> 0
br_probe step: dhcpv6_pd_client_init(...)              -> 0
br_probe step: border_agent_init(...)                  -> 0
br_probe step: otBorderRoutingInit(...)                -> 0
br_probe step: otBorderRoutingSetEnabled(..., true)    -> 0
br_probe step: otPlatInfraIfStateChanged(..., true)    -> 0
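
The per-step probe can be sketched roughly as below. This is a host-buildable illustration, not our actual br_probe source: probe_err_t stands in for otError, and probe_run, struct probe_step, step_ok and step_fail are hypothetical names. In the real probe each table entry would wrap one of the calls listed above.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for otError; 0 plays the role of OT_ERROR_NONE (assumption). */
typedef int probe_err_t;

typedef probe_err_t (*probe_step_fn)(void);

struct probe_step {
    const char *name;   /* label printed in the log line */
    probe_step_fn fn;   /* wrapper around one init call */
};

/*
 * Run the steps in order and log every result.
 * Stops at the first non-zero result and returns its index,
 * or returns -1 when every step succeeded.
 */
static int probe_run(const struct probe_step *steps, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        probe_err_t err = steps[i].fn();

        printf("br_probe step: %-40s -> %d\n", steps[i].name, (int)err);
        if (err != 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Hypothetical stubs so the sketch can be exercised off-target. */
static probe_err_t step_ok(void)   { return 0; }
static probe_err_t step_fail(void) { return 7; }
```

Keeping the steps in a table means the log label and the call cannot drift apart, and the returned index identifies the failing step without parsing the log.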

After this sequence runs, ot br state reports running, OMR and on-link prefixes are generated and advertised, RAs flow on iface 2, the fc00::/7 external route is published into Thread netdata, and the child SLAAC-configures an OMR address. Everything works.

Hypothesis

The wrapper's failure on the very first call but success on app re-entry suggests one of:

  1. Iface index race. The wrapper takes ail_iface_index = net_if_get_by_iface(ail_iface) very early. At the moment the L2 callback fires there may be transient state (the carrier has just come up, but some net_if bookkeeping is still in progress) that causes one of the inner socket binds to fail.
  2. Mutex re-entry. The wrapper calls openthread_mutex_lock() and several inner steps invoke OT APIs that also lock. From a Zephyr net_mgmt callback, the locking nesting may differ from app context.
  3. border_agent_init re-entry. By the time the L2 IF_UP fires, the OT core has already started its own Border Agent listener (we see Border Agent start listening on port 49153 ~1s before eth IF_UP). A second call to border_agent_init() on a running BA could reasonably return non-OT_ERROR_NONE.

We were unable to bisect further without modifying NCS sources.

Workaround we are running

src/br_start.c: an app-level NET_EVENT_IF_UP listener that, on the first eth IF_UP, calls each platform-init helper directly and then otBorderRoutingInit(), otBorderRoutingSetEnabled() and otPlatInfraIfStateChanged(). With this in place the BR comes up reliably and ot br state reaches running within ~50 ms of the eth carrier coming up.

This is brittle because we are linking against private NCS symbols (infra_if_init, udp_plat_init, mdns_plat_socket_init, border_agent_init, dhcpv6_pd_client_init) by re-declaring them in our own source — they are non-static but only declared in the private platform-zephyr.h.
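
The "first eth IF_UP only" behaviour needs a one-shot guard, since IF_UP can fire again on every link flap. A minimal sketch using C11 atomics follows. br_try_start and the bring_up callback are hypothetical names, and releasing the latch on failure so that a later IF_UP can retry is our assumption about a sensible policy, not something the NCS glue does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set once the bring-up sequence has been claimed by a caller. */
static atomic_flag br_started = ATOMIC_FLAG_INIT;

/*
 * Call on every Ethernet IF_UP. Runs bring_up() only if no earlier call
 * has already succeeded or is still in progress. Returns true only for
 * the call that ran it successfully. On failure the latch is released
 * so that a later IF_UP can retry.
 */
static bool br_try_start(int (*bring_up)(void))
{
    if (atomic_flag_test_and_set(&br_started)) {
        return false;   /* already started, or another caller is in progress */
    }
    if (bring_up() != 0) {
        atomic_flag_clear(&br_started);
        return false;
    }
    return true;
}

/* Hypothetical stubs so the sketch can be exercised off-target. */
static int bring_up_calls;
static int bring_up_ok(void)   { bring_up_calls++; return 0; }
static int bring_up_fail(void) { bring_up_calls++; return -5; }
```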

Suggested fixes

  1. Add per-step logging in openthread_start_border_router_services(). This single change would have made the diagnosis trivial. Replace each silent goto exit with LOG_ERR("step X failed: %d", err); goto exit;, where err is the otError returned by that step.
  2. Investigate the wrapper's -EIO from net_mgmt context. Our reproducer is small and we are happy to share it. Likely candidates per the hypothesis section above.
  3. Optionally export the platform-init helpers as a public, stable API, or add a public retry / kick API (openthread_border_router_kick(struct net_if *ail)) so apps that need to work around bring-up timing can do so without linking against private symbols.
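
Fix 1 could take roughly the following shape. This is a sketch under assumptions, not a patch against the NCS source: LOG_ERR is redefined here on top of fprintf so the block builds on a host (in-tree it would be Zephyr's own LOG_ERR), BR_STEP and start_services_sketch are hypothetical names, and the step functions are stubs.

```c
#include <errno.h>
#include <stdio.h>

/* Host stand-in for Zephyr's LOG_ERR (assumption for off-target builds). */
#define LOG_ERR(fmt, ...) fprintf(stderr, "E: " fmt "\n", __VA_ARGS__)

/*
 * Evaluate one init call; on a non-zero result, log the call text and
 * the error code, set error to -EIO and jump to the exit label.
 * Expects `int error` and an `exit:` label in the enclosing function.
 */
#define BR_STEP(call)                                         \
    do {                                                      \
        int step_err_ = (int)(call);                          \
        if (step_err_ != 0) {                                 \
            LOG_ERR("%s failed: %d", #call, step_err_);       \
            error = -EIO;                                     \
            goto exit;                                        \
        }                                                     \
    } while (0)

/* Hypothetical stubs standing in for the real init helpers. */
static int stub_ok(void)   { return 0; }
static int stub_fail(void) { return 13; }

static int start_services_sketch(int (*first)(void), int (*second)(void))
{
    int error = 0;

    BR_STEP(first());
    BR_STEP(second());

exit:
    return error;
}
```

Stringifying the call with #call keeps the log message tied to the exact expression that failed, so adding or reordering steps does not require updating message strings by hand.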

Related