nRF5340/nRF7002 Wi-Fi stack: how to recover when connecting to the AP fails?

I am using an nRF5340 with an nRF7002 on a custom board and trying to get stable Wi-Fi operation, including in situations where AP availability varies, as this is a mobile device.

Having experienced problems with the wifi driver ending up 'stuck', I updated from NCS 2.6 to 2.8 (what a pain that was). It is more stable now, but it still sometimes fails to find my local AP (even without movement!). I get logs like this (even though I do not use the wifi credentials system):
[00:03:23.501,251] <err> wpa_supp: Line 0: invalid key_mgmt 'SAE'
and more relevantly this:
[00:00:29.706,390] <err> wifi_nrf: nrf_wifi_wpa_supp_scan_abort: Timedout waiting for scan abort response, ret = -11
and
[00:01:17.710,052] <err> wpa_supp: wpa_drv_zep_get_scan_results2: Timed out waiting for scan results

Sometimes it still ends up connecting despite the wpa_supp errors, but the wifi_nrf one seems to indicate a real problem.

After requesting connect I set a timeout (12s) - when this pops, I request a disconnect.
int status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, ctx->iface, NULL, 0);

After 4 attempts that end like this, I attempt to recover by setting the interface down then up again.

// Reset wifi by putting interface down then up
static bool _wifi_reset(struct _netwifi_ctx* ctx) {
  // take the interface down first
  int ret = 0;
  ret = net_if_down(ctx->iface);
  if (ret==0 || ret==-EALREADY) {
    log_info("netwifi:iface is down!");
    // Wait a little bit
    k_msleep(100);
    ret = net_if_up(ctx->iface);
    if (ret==0 || ret==-EALREADY) {
      log_info("netwifi: iface is up!");
      return true;
    }
    log_warn("netwifi: iface failed to become up (%d)",ret);
  } else {
    log_warn("netwifi: iface failed to become down (%d)",ret);
  }
  return false;
}
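
For reference, the escalation policy described above (12 s connect timeout, disconnect and retry, interface reset on the 4th consecutive timeout) can be sketched as a small counter-based state machine. This is only an illustration of the policy, not the actual application code: the names `retry_state`, `on_connect_timeout` and `on_connected` are hypothetical, and the caller is assumed to issue the disconnect request or call `_wifi_reset()` depending on the returned action.

```c
#include <stdbool.h>

#define CONNECT_TIMEOUT_S 12 /* from the post: 12 s connect timeout */
#define MAX_TIMEOUTS 4       /* from the post: reset on the 4th timeout */

/* What the caller should do after a connect timeout (hypothetical names). */
enum recovery_action {
    ACTION_DISCONNECT_AND_RETRY, /* request disconnect, then try connect again */
    ACTION_RESET_IFACE           /* take the interface down and up again */
};

struct retry_state {
    unsigned int timeouts; /* consecutive connect timeouts so far */
};

/* Called each time the connect timer pops without a connection. */
static enum recovery_action on_connect_timeout(struct retry_state *s)
{
    s->timeouts++;
    if (s->timeouts >= MAX_TIMEOUTS) {
        s->timeouts = 0; /* start counting afresh after the reset */
        return ACTION_RESET_IFACE;
    }
    return ACTION_DISCONNECT_AND_RETRY;
}

/* Called when a connection succeeds: clears the failure history. */
static void on_connected(struct retry_state *s)
{
    s->timeouts = 0;
}
```

With this shape, three timeouts in a row yield a disconnect and retry each time, and the fourth yields the reset, which matches the "too many (4)" line in the log below.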

The reset systematically results in a bus fault about 12 s later, when the next scan times out:

[00:09:23.517,608] <wrn> app: netwifi: connect check timer pops, connect() retry ongoing...
[00:09:35.517,639] <wrn> app: netwifi: connect timeout, too many (4), trying wifi reset
[00:09:46.209,045] <err> wifi_nrf: nrf_wifi_fmac_chg_vif_state: RPU is unresponsive for 10 sec
[00:09:46.218,444] <err> wifi_nrf: nrf_wifi_if_stop_zep: nrf_wifi_fmac_chg_vif_state failed
[00:09:46.229,095] <inf> app: netwifi:iface is down!
[00:09:46.337,493] <inf> wifi_nrf_bus: SPIM spi@a000: freq = 24 MHz
[00:09:46.344,268] <inf> wifi_nrf_bus: SPIM spi@a000: latency = 1
[00:09:46.529,388] <err> wpa_supp: zephyr_get_handle_by_ifname: Unable to get wpa_s handle for wlan0
[00:09:46.539,520] <err> wpa_supp: Interface wlan0 not found
[00:09:46.544,555] <inf> app: netwifi: iface is up!
[00:09:58.530,639] <err> wpa_supp: wpa_drv_zep_scan_timeout: Scan timeout - try to abort it
[00:09:58.539,733] <err> os: ***** BUS FAULT *****
[00:09:58.545,257] <err> os: Precise data bus error
[00:09:58.551,055] <err> os: BFAR Address: 0x11f3ef53
[00:09:58.557,037] <err> os: r0/a1: 0x20059a80 r1/a2: 0x00000000 r2/a3: 0x00000000
[00:09:58.565,826] <err> os: r3/a4: 0x11f3ef47 r12/ip: 0x00000000 r14/lr: 0x00037af9
[00:09:58.574,615] <err> os: xpsr: 0x61000000
[00:09:58.579,895] <err> os: Faulting instruction address (r15/pc): 0x00038588
[00:09:58.587,890] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:09:58.595,886] <err> os: Current thread: 0x200061a8 (unknown)
[00:09:58.602,722] <err> os: Halting system

The debugger says this code in NCS modules/lib/hostap/src/drivers/driver_zephyr.c is at fault, when it tries to call dev_ops->scan_abort.

static int wpa_drv_zep_abort_scan(void *priv,
   u64 scan_cookie)
{
  struct zep_drv_if_ctx *if_ctx = NULL;
  const struct zep_wpa_supp_dev_ops *dev_ops;
  int ret = -1;

  if_ctx = priv;

  dev_ops = get_dev_ops(if_ctx->dev_ctx);
  if (!dev_ops->scan_abort) {
    wpa_printf(MSG_ERROR,
      "%s: No op registered for scan_abort",
      __func__);
    goto out;
  }

  ret = dev_ops->scan_abort(if_ctx->dev_priv);
out:
  return ret;
}

dev_ops points to a structure where all the pointers are NULL, but I think even that pointer is bad (0x11f3ef47 is neither flash nor RAM?)....
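
As a rough sanity check on the register dump, the addresses can be classified against the memory map. The sketch below is an illustration only: the ranges are assumptions based on my understanding of the nRF5340 application core (1 MB internal flash at 0x00000000, 512 KB SRAM at 0x20000000, QSPI XIP window at 0x10000000), and `classify_addr` and `fault_offset` are hypothetical helper names.

```c
#include <stdint.h>

/* Assumed nRF5340 application-core regions (check against the product spec). */
enum mem_region {
    REGION_FLASH,    /* internal flash, assumed 0x00000000..0x000FFFFF */
    REGION_QSPI_XIP, /* external flash XIP window, assumed 0x10000000..0x1FFFFFFF */
    REGION_SRAM,     /* internal SRAM, assumed 0x20000000..0x2007FFFF */
    REGION_OTHER     /* anything else */
};

static enum mem_region classify_addr(uint32_t addr)
{
    if (addr < 0x00100000u) {
        return REGION_FLASH;
    }
    if (addr >= 0x10000000u && addr < 0x20000000u) {
        return REGION_QSPI_XIP;
    }
    if (addr >= 0x20000000u && addr < 0x20080000u) {
        return REGION_SRAM;
    }
    return REGION_OTHER;
}

/* Byte offset of the faulting access relative to a base register value. */
static uint32_t fault_offset(uint32_t bfar, uint32_t base)
{
    return bfar - base;
}
```

Under those assumptions, r3 (0x11f3ef47) is outside both internal flash and SRAM, while r0 (0x20059a80) is in SRAM and the faulting PC (0x00038588) is in flash. The BFAR value 0x11f3ef53 is r3 + 12, which would be consistent with a load of a member 12 bytes into whatever r3 is taken to point at; whether that member is scan_abort is not confirmed here.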

Presumably wpa_supp is trying to abort the scan from the previous connection attempt, but hasn't dealt with the if-down/if-up restart correctly, so is holding on to a device context that is no longer valid...

I note the 'RPU is unresponsive' log... I have CONFIG_NRF_WIFI_RPU_RECOVERY=y as per a prior ticket about the wifi instability on NCS 2.6.x, which is why I updated to NCS 2.8.0....

Q: what could be causing the scan results errors that are causing it to be stuck?

Q: How to correctly stop/restart the wifi interface to recover from it being 'stuck'?

  • and of course the cmake build scripts in 2.8 are pretty direct about it:

    ---------------------------------------------------------------------
    --- WARNING: Child and parent image functionality is deprecated ---
    --- and should be replaced with sysbuild. Child and parent image ---
    --- support remains only to allow existing customer applications ---
    --- to build and allow porting to sysbuild, it is no longer ---
    --- receiving updates or new features and it will not be possible ---
    --- to build using child/parent image at all in nRF Connect SDK ---
    --- version 2.9 onwards. ---
    ---------------------------------------------------------------------

  • Hi Brian,

    Your understanding is correct. Parent-child builds with "--no-sysbuild" can still be used in NCS 2.9.0 for most samples, but we have stopped maintenance.

    This is why we encourage developers to use sysbuild from NCS 2.8.0 onwards: it avoids potential struggles in the future, when they encounter issues with parent-child builds and no support is available.

    Best regards,

    Charlie

  • Well, I now need to attempt the move to sysbuild anyway, as I need to push the wifi firmware patches out to external XIP flash to get the application to fit in the on-chip flash slot for mcuboot to be able to do DFU with it! Ran into this lovely issue (existing since v1.3.0) due to that:

    NCSDK-20567: When building an application for MCUboot, the build system does not check whether the compiled application is too big for being an update image
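
Since the build does not flag this, the check has to be done by hand, and it boils down to a size comparison. The sketch below is a minimal illustration with hypothetical names: `image_fits_slot` and the `reserved` parameter (space that must stay free at the end of the slot for bootloader metadata such as the image trailer) are assumptions, and the real sizes have to come from the partition layout of the actual build.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if an image of image_size bytes fits into a slot of
 * slot_size bytes while leaving `reserved` bytes free at the end.
 * All three values are supplied by the caller (hypothetical helper).
 */
static bool image_fits_slot(size_t image_size, size_t slot_size, size_t reserved)
{
    if (reserved > slot_size) {
        return false; /* slot cannot even hold the reserved area */
    }
    return image_size <= slot_size - reserved;
}
```

Subtracting before comparing avoids an unsigned overflow in `image_size + reserved` for very large inputs.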
