nRF5340/nRF7002 Wi-Fi stack: how to recover from problems connecting to the AP?

I am using an nRF5340 with an nRF7002 on a custom board and trying to get stable Wi-Fi operation, including in situations where AP availability varies (this is a mobile device).

Having experienced problems with the Wi-Fi driver ending up 'stuck', I updated from NCS 2.6 to 2.8 (what a pain that was). It is more stable now, but it still sometimes fails to find my local AP (even without movement!). I get logs like this (even though I do not use the Wi-Fi credentials system):
[00:03:23.501,251] <err> wpa_supp: Line 0: invalid key_mgmt 'SAE'
and more relevantly this:
[00:00:29.706,390] <err> wifi_nrf: nrf_wifi_wpa_supp_scan_abort: Timedout waiting for scan abort response, ret = -11
and
[00:01:17.710,052] <err> wpa_supp: wpa_drv_zep_get_scan_results2: Timed out waiting for scan results

Sometimes it ends up connecting even with the wpa_supp logs, but the wifi_nrf one seems to be bad....

After requesting a connect I start a 12 s timer; when it pops, I request a disconnect:
int status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, ctx->iface, NULL, 0);

After 4 attempts that end like this, I attempt to recover by setting the interface down then up again.

// Reset wifi by putting interface down then up
static bool _wifi_reset(struct _netwifi_ctx* ctx) {
  // take the interface down, wait briefly, then bring it back up
  int ret = 0;
  ret = net_if_down(ctx->iface);
  if (ret==0 || ret==-EALREADY) {
    log_info("netwifi:iface is down!");
    // Wait a little bit
    k_msleep(100);
    ret = net_if_up(ctx->iface);
    if (ret==0 || ret==-EALREADY) {
      log_info("netwifi: iface is up!");
      return true;
    }
    log_warn("netwifi: iface failed to become up (%d)",ret);
  } else {
    log_warn("netwifi: iface failed to become down (%d)",ret);
  }
  return false;
}
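
The retry policy around this reset (disconnect on each 12 s timeout, interface reset on the fourth) can be sketched roughly as follows. This is a simplified illustration, not the actual application code: every name in it is hypothetical, and it assumes the counter restarts after a reset or a successful connection.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; only the limit of 4 timeouts comes from the post. */
#define MAX_CONNECT_TIMEOUTS 4

enum wifi_recovery_action {
    WIFI_ACTION_DISCONNECT_AND_RETRY, /* request a disconnect, then connect again */
    WIFI_ACTION_RESET_IFACE           /* take the interface down, then up */
};

struct wifi_retry_policy {
    int timeouts; /* consecutive connect timeouts since the last success or reset */
};

/* Call when the connect timer pops without a connection.
 * Returns the recovery step the caller should perform next. */
static enum wifi_recovery_action
wifi_policy_on_timeout(struct wifi_retry_policy *p)
{
    assert(p != NULL);
    p->timeouts++;
    if (p->timeouts >= MAX_CONNECT_TIMEOUTS) {
        /* Assumption: the counter restarts once a reset is requested. */
        p->timeouts = 0;
        return WIFI_ACTION_RESET_IFACE;
    }
    return WIFI_ACTION_DISCONNECT_AND_RETRY;
}

/* Call when a connection succeeds, so old failures do not count
 * towards a later reset. */
static void wifi_policy_on_connected(struct wifi_retry_policy *p)
{
    assert(p != NULL);
    p->timeouts = 0;
}
```

Keeping the decision separate from the net_mgmt() and net_if_*() calls makes it possible to check the policy on a host machine, independently of the driver behaviour.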

The interface reset systematically results in a bus fault about 12 s after the interface comes back up:

[00:09:23.517,608] <wrn> app: netwifi: connect check timer pops, connect() retry ongoing...
[00:09:35.517,639] <wrn> app: netwifi: connect timeout, too many (4), trying wifi reset
[00:09:46.209,045] <err> wifi_nrf: nrf_wifi_fmac_chg_vif_state: RPU is unresponsive for 10 sec
[00:09:46.218,444] <err> wifi_nrf: nrf_wifi_if_stop_zep: nrf_wifi_fmac_chg_vif_state failed
[00:09:46.229,095] <inf> app: netwifi:iface is down!
[00:09:46.337,493] <inf> wifi_nrf_bus: SPIM spi@a000: freq = 24 MHz
[00:09:46.344,268] <inf> wifi_nrf_bus: SPIM spi@a000: latency = 1
[00:09:46.529,388] <err> wpa_supp: zephyr_get_handle_by_ifname: Unable to get wpa_s handle for wlan0
[00:09:46.539,520] <err> wpa_supp: Interface wlan0 not found
[00:09:46.544,555] <inf> app: netwifi: iface is up!
[00:09:58.530,639] <err> wpa_supp: wpa_drv_zep_scan_timeout: Scan timeout - try to abort it
[00:09:58.539,733] <err> os: ***** BUS FAULT *****
[00:09:58.545,257] <err> os: Precise data bus error
[00:09:58.551,055] <err> os: BFAR Address: 0x11f3ef53
[00:09:58.557,037] <err> os: r0/a1: 0x20059a80 r1/a2: 0x00000000 r2/a3: 0x00000000
[00:09:58.565,826] <err> os: r3/a4: 0x11f3ef47 r12/ip: 0x00000000 r14/lr: 0x00037af9
[00:09:58.574,615] <err> os: xpsr: 0x61000000
[00:09:58.579,895] <err> os: Faulting instruction address (r15/pc): 0x00038588
[00:09:58.587,890] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:09:58.595,886] <err> os: Current thread: 0x200061a8 (unknown)
[00:09:58.602,722] <err> os: Halting system

The debugger says the fault is in this code in NCS modules/lib/hostap/src/drivers/driver_zephyr.c, at the point where it tries to call dev_ops->scan_abort:

static int wpa_drv_zep_abort_scan(void *priv,
   u64 scan_cookie)
{
  struct zep_drv_if_ctx *if_ctx = NULL;
  const struct zep_wpa_supp_dev_ops *dev_ops;
  int ret = -1;

  if_ctx = priv;

  dev_ops = get_dev_ops(if_ctx->dev_ctx);
  if (!dev_ops->scan_abort) {
    wpa_printf(MSG_ERROR,
      "%s: No op registered for scan_abort",
      __func__);
    goto out;
  }

  ret = dev_ops->scan_abort(if_ctx->dev_priv);
out:
  return ret;
}

The debugger shows dev_ops pointing to a structure where all the pointers are NULL, but I think even the dev_ops pointer itself is bad (0x11f3ef47, which is also r3 in the fault dump, is neither flash nor RAM?)....
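
As a sanity check on the addresses in the fault dump, here is a minimal sketch that tests a value against the internal memory of the nRF5340 application core (1 MB flash at 0x00000000, 512 KB SRAM at 0x20000000). The helper names are hypothetical, and peripheral and external-memory regions are deliberately not checked.

```c
#include <stdbool.h>
#include <stdint.h>

/* nRF5340 application core internal memory: 1 MB flash, 512 KB SRAM. */
#define APP_FLASH_BASE 0x00000000u
#define APP_FLASH_SIZE 0x00100000u
#define APP_SRAM_BASE  0x20000000u
#define APP_SRAM_SIZE  0x00080000u

/* True if addr lies in [base, base + size). */
static bool addr_in_range(uint32_t addr, uint32_t base, uint32_t size)
{
    return addr >= base && (addr - base) < size;
}

/* True if addr is in internal flash or internal SRAM.
 * Peripherals and external (XIP) memory are not considered here. */
static bool addr_is_internal_flash_or_sram(uint32_t addr)
{
    return addr_in_range(addr, APP_FLASH_BASE, APP_FLASH_SIZE) ||
           addr_in_range(addr, APP_SRAM_BASE, APP_SRAM_SIZE);
}

/* A pointer to a struct of function pointers should be 4-byte aligned. */
static bool addr_is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0u;
}

/* Distance between the faulting data address (BFAR) and a base register,
 * i.e. the offset of the member that was being read. */
static uint32_t fault_offset(uint32_t bfar, uint32_t base_reg)
{
    return bfar - base_reg;
}
```

With the values from the dump above, r0 (0x20059a80) and the faulting PC (0x00038588) are valid internal addresses, while 0x11f3ef47 is neither in internal flash nor SRAM and is not even word aligned. BFAR minus r3 is 0xC, which is consistent with a read of a member 12 bytes into whatever r3 points at.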

Presumably wpa_supp is trying to abort the scan from the previous connection attempt, but hasn't dealt with the if-down/if-up restart correctly, so is holding on to a device context that is no longer valid...

I note the 'RPU is unresponsive' log... I have CONFIG_NRF_WIFI_RPU_RECOVERY=y as per a prior ticket about the wifi instability on NCS 2.6.x, which is why I updated to NCS 2.8.0....

Q: What could be causing the scan result errors that leave it stuck?

Q: How to correctly stop/restart the wifi interface to recover from it being 'stuck'?
