nRF5340/nRF7002 wifi stack: how to recover when connecting to the AP fails?

I am using an nRF5340 with an nRF7002 on a custom board and trying to get stable wifi operation, including in situations where AP availability varies (this is a mobile device).

Having experienced problems with the wifi driver ending up 'stuck', I updated from NCS 2.6 to 2.8 (what a pain that was). It is more stable now, but it still sometimes fails to find my local AP (even without movement!). I get logs like this (even though I do not use the wifi credentials system):
[00:03:23.501,251] <err> wpa_supp: Line 0: invalid key_mgmt 'SAE'
and more relevantly this:
[00:00:29.706,390] <err> wifi_nrf: nrf_wifi_wpa_supp_scan_abort: Timedout waiting for scan abort response, ret = -11
and
[00:01:17.710,052] <err> wpa_supp: wpa_drv_zep_get_scan_results2: Timed out waiting for scan results

Sometimes it ends up connecting even with the wpa_supp logs, but the wifi_nrf one seems to be bad....

After requesting a connect I start a 12 s timeout; when it expires, I request a disconnect:
int status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, ctx->iface, NULL, 0);

After 4 attempts that end like this, I attempt to recover by setting the interface down then up again.

// Reset wifi by putting interface down then up
static bool _wifi_reset(struct _netwifi_ctx* ctx) {
  // take the interface down first, then bring it back up
  int ret = 0;
  ret = net_if_down(ctx->iface);
  if (ret==0 || ret==-EALREADY) {
    log_info("netwifi:iface is down!");
    // Wait a little bit
    k_msleep(100);
    ret = net_if_up(ctx->iface);
    if (ret==0 || ret==-EALREADY) {
      log_info("netwifi: iface is up!");
      return true;
    }
    log_warn("netwifi: iface failed to become up (%d)",ret);
  } else {
    log_warn("netwifi: iface failed to become down (%d)",ret);
  }
  return false;
}
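For reference, the retry policy described above (disconnect on each connect timeout, interface reset after 4 consecutive timeouts) can be sketched as a small counter that is independent of the Zephyr calls. This is only an illustration: `retry_policy`, `retry_on_timeout` and the other names are hypothetical and not part of NCS, and the constants are simply the values quoted in this post.

```c
#include <stdint.h>

/* Values taken from the post; all names here are hypothetical. */
#define CONNECT_TIMEOUT_MS        12000u
#define MAX_TIMEOUTS_BEFORE_RESET 4u

enum retry_action {
    RETRY_DISCONNECT_AND_RETRY, /* request disconnect, then connect again */
    RETRY_RESET_INTERFACE       /* take the interface down and up again */
};

struct retry_policy {
    uint32_t consecutive_timeouts;
};

static void retry_init(struct retry_policy *p)
{
    p->consecutive_timeouts = 0;
}

/* Call when the connect timer expires without a connection. */
static enum retry_action retry_on_timeout(struct retry_policy *p)
{
    p->consecutive_timeouts++;
    if (p->consecutive_timeouts >= MAX_TIMEOUTS_BEFORE_RESET) {
        /* Start counting afresh after escalating to a reset. */
        p->consecutive_timeouts = 0;
        return RETRY_RESET_INTERFACE;
    }
    return RETRY_DISCONNECT_AND_RETRY;
}

/* Call when a connection succeeds, so old failures are forgotten. */
static void retry_on_connected(struct retry_policy *p)
{
    p->consecutive_timeouts = 0;
}
```

In the application, the timer handler would call `retry_on_timeout()` and then either issue the disconnect request or call `_wifi_reset()`, depending on the returned action.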

This systematically results in a bus fault:

[00:09:23.517,608] <wrn> app: netwifi: connect check timer pops, connect() retry ongoing...
[00:09:35.517,639] <wrn> app: netwifi: connect timeout, too many (4), trying wifi reset
[00:09:46.209,045] <err> wifi_nrf: nrf_wifi_fmac_chg_vif_state: RPU is unresponsive for 10 sec
[00:09:46.218,444] <err> wifi_nrf: nrf_wifi_if_stop_zep: nrf_wifi_fmac_chg_vif_state failed
[00:09:46.229,095] <inf> app: netwifi:iface is down!
[00:09:46.337,493] <inf> wifi_nrf_bus: SPIM spi@a000: freq = 24 MHz
[00:09:46.344,268] <inf> wifi_nrf_bus: SPIM spi@a000: latency = 1
[00:09:46.529,388] <err> wpa_supp: zephyr_get_handle_by_ifname: Unable to get wpa_s handle for wlan0
[00:09:46.539,520] <err> wpa_supp: Interface wlan0 not found
[00:09:46.544,555] <inf> app: netwifi: iface is up!
[00:09:58.530,639] <err> wpa_supp: wpa_drv_zep_scan_timeout: Scan timeout - try to abort it
[00:09:58.539,733] <err> os: ***** BUS FAULT *****
[00:09:58.545,257] <err> os: Precise data bus error
[00:09:58.551,055] <err> os: BFAR Address: 0x11f3ef53
[00:09:58.557,037] <err> os: r0/a1: 0x20059a80 r1/a2: 0x00000000 r2/a3: 0x00000000
[00:09:58.565,826] <err> os: r3/a4: 0x11f3ef47 r12/ip: 0x00000000 r14/lr: 0x00037af9
[00:09:58.574,615] <err> os: xpsr: 0x61000000
[00:09:58.579,895] <err> os: Faulting instruction address (r15/pc): 0x00038588
[00:09:58.587,890] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:09:58.595,886] <err> os: Current thread: 0x200061a8 (unknown)
[00:09:58.602,722] <err> os: Halting system

The debugger points to this code in NCS (modules/lib/hostap/src/drivers/driver_zephyr.c), which faults when it tries to call dev_ops->scan_abort:

static int wpa_drv_zep_abort_scan(void *priv,
   u64 scan_cookie)
{
  struct zep_drv_if_ctx *if_ctx = NULL;
  const struct zep_wpa_supp_dev_ops *dev_ops;
  int ret = -1;

  if_ctx = priv;

  dev_ops = get_dev_ops(if_ctx->dev_ctx);
  if (!dev_ops->scan_abort) {
    wpa_printf(MSG_ERROR,
      "%s: No op registered for scan_abort",
      __func__);
    goto out;
  }

  ret = dev_ops->scan_abort(if_ctx->dev_priv);
out:
  return ret;
}

dev_ops points to a structure where all the pointers are NULL, but I think even that pointer is bad (0x11f3ef47 is neither flash nor RAM?)....
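To sanity-check the addresses in the fault dump, a small helper like the one below can name the region an address falls in. This is a rough sketch, not a diagnosis: the address ranges are my assumption of the nRF5340 application-core memory map (1 MB flash at 0x00000000, 512 KB RAM at 0x20000000, QSPI XIP window at 0x10000000) and should be checked against the product specification, and `classify_addr` and `fault_offset` are hypothetical names.

```c
#include <stdint.h>

/* ASSUMPTION: nRF5340 application-core address map; verify against the
 * product specification before relying on these values. */
#define FLASH_BASE 0x00000000u
#define FLASH_SIZE 0x00100000u /* 1 MB on-chip flash */
#define XIP_BASE   0x10000000u
#define XIP_SIZE   0x10000000u /* window reserved for QSPI execute-in-place */
#define SRAM_BASE  0x20000000u
#define SRAM_SIZE  0x00080000u /* 512 KB on-chip RAM */

enum mem_region { MEM_FLASH, MEM_XIP, MEM_SRAM, MEM_OTHER };

/* Hypothetical helper: name the region an address falls in.
 * Unsigned subtraction makes each test a single range check. */
static enum mem_region classify_addr(uint32_t addr)
{
    if (addr - FLASH_BASE < FLASH_SIZE) {
        return MEM_FLASH;
    }
    if (addr - XIP_BASE < XIP_SIZE) {
        return MEM_XIP;
    }
    if (addr - SRAM_BASE < SRAM_SIZE) {
        return MEM_SRAM;
    }
    return MEM_OTHER;
}

/* Distance between the faulting address (BFAR) and a base register value. */
static uint32_t fault_offset(uint32_t bfar, uint32_t base)
{
    return bfar - base;
}
```

Under these assumptions, r3 (0x11f3ef47) classifies as the XIP window rather than on-chip flash or RAM, and BFAR minus r3 is 12, which would be consistent with a load of a member at offset 12 from whatever r3 points to. Neither observation confirms the root cause; an odd, unaligned value like this still looks more like a stale or corrupted pointer than a real structure address.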

Presumably wpa_supp is trying to abort the scan from the previous connection attempt, but hasn't dealt with the if-down/if-up restart correctly, so is holding on to a device context that is no longer valid...

I note the 'RPU is unresponsive' log... I have CONFIG_NRF_WIFI_RPU_RECOVERY=y as per a prior ticket about the wifi instability on NCS 2.6.x, which is why I updated to NCS 2.8.0....

Q: what could be causing the scan results errors that are causing it to be stuck?

Q: How to correctly stop/restart the wifi interface to recover from it being 'stuck'?

  • You can still use parent-child images with NCS 2.8 and 2.9 by adding --no-sysbuild (sysbuild is now the default option), but this may be removed in a future release.

    Ok, good. This text in 2.7 made me assume parent-child would be removed in 2.9

    The deprecated methods are scheduled for removal after the next release.

    (since 2.8 was the next release, 2.9 would be the one after it)

  • and of course the cmake build scripts in 2.8 are pretty direct about it:

    ---------------------------------------------------------------------
    --- WARNING: Child and parent image functionality is deprecated ---
    --- and should be replaced with sysbuild. Child and parent image ---
    --- support remains only to allow existing customer applications ---
    --- to build and allow porting to sysbuild, it is no longer ---
    --- receiving updates or new features and it will not be possible ---
    --- to build using child/parent image at all in nRF Connect SDK ---
    --- version 2.9 onwards. ---
    ---------------------------------------------------------------------

  • Hi Brian,

    Your understanding is correct. Parent-child builds with "--no-sysbuild" can still be used in NCS 2.9.0 for most samples, but we have stopped maintaining them.

    This is why we encourage developers to use sysbuild from NCS 2.8.0 onwards, to avoid potential struggles in the future when they encounter issues with parent-child builds and no support is available.

    Best regards,

    Charlie

  • Well, I now need to attempt the move to sysbuild anyway, as I need to push the wifi firmware patches out to external XIP flash to get the application to fit in the on-chip flash slot for mcuboot to be able to do DFU with it! Ran into this lovely issue (existing since v1.3.0) due to that:

    NCSDK-20567: When building an application for MCUboot, the build system does not check whether the compiled application is too big for being an update image

  • Hi Charlie

    So, tried to move to 2.9

    NCS 2.9.0 actually has limited changes compared to NCS 2.8.0

    So, installed 2.9.0, updated the toolchain with the toolchain manager, and gave it a try:

    1/ I got a bunch of prj.conf warnings like this one:

    warning: The choice symbol NET_ARP_LOG_LEVEL_ERR (defined at
    subsys/net/Kconfig.template.log_config.net:21) was selected (set =y), but no symbol ended up as the
    choice selection. See docs.zephyrproject.org/.../kconfig.html
    and/or look up NET_ARP_LOG_LEVEL_ERR in the menuconfig/guiconfig interface. The Application
    Development Primer, Setting Configuration Values, and Kconfig - Tips and Best Practices sections of
    the manual might be helpful too.

    For all of these lines in my prj.conf

    CONFIG_NET_CORE_LOG_LEVEL_ERR=y
    CONFIG_NET_PKT_LOG_LEVEL_ERR=y
    CONFIG_NET_IF_LOG_LEVEL_ERR=y
    CONFIG_NET_TC_LOG_LEVEL_ERR=y
    CONFIG_NET_UTILS_LOG_LEVEL_ERR=y
    CONFIG_NET_CONTEXT_LOG_LEVEL_ERR=y
    CONFIG_NET_CONN_LOG_LEVEL_ERR=y
    CONFIG_NET_ROUTE_LOG_LEVEL_ERR=y
    CONFIG_NET_SOCKETS_LOG_LEVEL_ERR=y
    CONFIG_NET_HTTP_LOG_LEVEL_ERR=y
    CONFIG_NET_DHCPV4_LOG_LEVEL_ERR=y
    CONFIG_NET_IPV4_LOG_LEVEL_ERR=y
    CONFIG_NET_TCP_LOG_LEVEL_ERR=y
    CONFIG_NET_UDP_LOG_LEVEL_ERR=y
    CONFIG_NET_ARP_LOG_LEVEL_ERR=y
    CONFIG_NET_L2_WIFI_MGMT_LOG_LEVEL_ERR=y
    CONFIG_NET_MGMT_EVENT_LOG_LEVEL_ERR=y
    CONFIG_DNS_RESOLVER_LOG_LEVEL_ERR=y
    CONFIG_MQTT_LOG_LEVEL_ERR=y
    CONFIG_TLS_CREDENTIALS_LOG_LEVEL_ERR=y
    What is the correct way in 2.9 to set the log level for these modules? (necessary to get the image to fit in the flash)
    [By the way, I built wifi_sta for NCS 2.8 (for another ticket): this most basic wifi sample, which literally just connects to a wifi access point and nothing else, uses 550 kB of the 1 MB available. This is not great...]
    2/ fatal build error:
    -- Configuring done
    -- Generating done
    -- Build files have been written to: C:/work/dev/if-device-nrf53/cc1-med/build
    -- west build: building application
    ninja: error: '/index', needed by 'zephyr/include/generated/ncs_commit.h', missing and no known rule to make it
    FATAL ERROR: command exited with status 1: 'C:\ncs\toolchains\b620d30767\opt\bin\cmake.EXE' --build 'C:\work\dev\if-device-nrf53\cc1-med\build'
    Any idea what this is and how to fix it?
    And this is before I even TRY to use sysbuild....