I'm using an nRF5340 with an nRF7002 on a custom board and trying to get stable wifi operation, including when wifi AP availability varies (this is a mobile device).
After the wifi driver kept ending up 'stuck', I updated from NCS 2.6 to 2.8 (what a pain that was). It is more stable now, but it still sometimes fails to find my local AP, even when the device is not moving. I get logs like this (even though I do not use the wifi credentials system):
[00:03:23.501,251] <err> wpa_supp: Line 0: invalid key_mgmt 'SAE'
and more relevantly this:
[00:00:29.706,390] <err> wifi_nrf: nrf_wifi_wpa_supp_scan_abort: Timedout waiting for scan abort response, ret = -11
and
[00:01:17.710,052] <err> wpa_supp: wpa_drv_zep_get_scan_results2: Timed out waiting for scan results
Sometimes it still connects despite the wpa_supp errors, but the wifi_nrf error seems to be the serious one.
After requesting a connect I start a 12 s timeout; when it expires, I request a disconnect:
int status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, ctx->iface, NULL, 0);
After 4 attempts that end like this, I attempt to recover by setting the interface down and then up again:
// Reset wifi by putting interface down then up
static bool _wifi_reset(struct _netwifi_ctx* ctx) {
    // Take the interface down first
    int ret = 0;
    ret = net_if_down(ctx->iface);
    if (ret==0 || ret==-EALREADY) {
        log_info("netwifi:iface is down!");
        // Wait a little bit before bringing it back up
        k_msleep(100);
        ret = net_if_up(ctx->iface);
        if (ret==0 || ret==-EALREADY) {
            log_info("netwifi: iface is up!");
            return true;
        }
        log_warn("netwifi: iface failed to become up (%d)",ret);
    } else {
        log_warn("netwifi: iface failed to become down (%d)",ret);
    }
    return false;
}
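For context, the timeout/retry policy that decides when this reset gets called can be sketched roughly as below. This is a simplified, standalone illustration and not the actual application code: all names (retry_state, retry_poll, and so on) are hypothetical, and it assumes the reset is triggered on the 4th consecutive timeout, which is how the log further down reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; the values come from the description above. */
#define CONNECT_TIMEOUT_MS  12000u  /* 12 s connect timeout */
#define MAX_TIMEOUTS        4u      /* consecutive timeouts before a reset */

enum retry_action {
    RETRY_WAIT,        /* timeout not reached yet, keep waiting */
    RETRY_DISCONNECT,  /* timed out: request a disconnect, then retry connect */
    RETRY_RESET_IFACE, /* too many timeouts: take the interface down and up */
};

struct retry_state {
    uint32_t connect_started_ms; /* when the current connect was requested */
    uint32_t timeouts;           /* consecutive timed-out attempts */
};

/* Call when a connect request is issued. */
static void retry_on_connect_request(struct retry_state *s, uint32_t now_ms)
{
    s->connect_started_ms = now_ms;
}

/* Call when the connection succeeds. */
static void retry_on_connected(struct retry_state *s)
{
    s->timeouts = 0;
}

/* Call periodically; returns what the caller should do next. */
static enum retry_action retry_poll(struct retry_state *s, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct across uptime wrap-around. */
    if ((uint32_t)(now_ms - s->connect_started_ms) < CONNECT_TIMEOUT_MS) {
        return RETRY_WAIT;
    }
    s->timeouts++;
    if (s->timeouts >= MAX_TIMEOUTS) {
        s->timeouts = 0;
        return RETRY_RESET_IFACE;
    }
    return RETRY_DISCONNECT;
}
```

In this sketch RETRY_DISCONNECT corresponds to the NET_REQUEST_WIFI_DISCONNECT call and RETRY_RESET_IFACE to the _wifi_reset() function above.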
Calling this reset systematically results in a bus fault:
[00:09:23.517,608] <wrn> app: netwifi: connect check timer pops, connect() retry ongoing...
[00:09:35.517,639] <wrn> app: netwifi: connect timeout, too many (4), trying wifi reset
[00:09:46.209,045] <err> wifi_nrf: nrf_wifi_fmac_chg_vif_state: RPU is unresponsive for 10 sec
[00:09:46.218,444] <err> wifi_nrf: nrf_wifi_if_stop_zep: nrf_wifi_fmac_chg_vif_state failed
[00:09:46.229,095] <inf> app: netwifi:iface is down!
[00:09:46.337,493] <inf> wifi_nrf_bus: SPIM spi@a000: freq = 24 MHz
[00:09:46.344,268] <inf> wifi_nrf_bus: SPIM spi@a000: latency = 1
[00:09:46.529,388] <err> wpa_supp: zephyr_get_handle_by_ifname: Unable to get wpa_s handle for wlan0
[00:09:46.539,520] <err> wpa_supp: Interface wlan0 not found
[00:09:46.544,555] <inf> app: netwifi: iface is up!
[00:09:58.530,639] <err> wpa_supp: wpa_drv_zep_scan_timeout: Scan timeout - try to abort it
[00:09:58.539,733] <err> os: ***** BUS FAULT *****
[00:09:58.545,257] <err> os: Precise data bus error
[00:09:58.551,055] <err> os: BFAR Address: 0x11f3ef53
[00:09:58.557,037] <err> os: r0/a1: 0x20059a80 r1/a2: 0x00000000 r2/a3: 0x00000000
[00:09:58.565,826] <err> os: r3/a4: 0x11f3ef47 r12/ip: 0x00000000 r14/lr: 0x00037af9
[00:09:58.574,615] <err> os: xpsr: 0x61000000
[00:09:58.579,895] <err> os: Faulting instruction address (r15/pc): 0x00038588
[00:09:58.587,890] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:09:58.595,886] <err> os: Current thread: 0x200061a8 (unknown)
[00:09:58.602,722] <err> os: Halting system
The debugger says this code in NCS modules/lib/hostap/src/drivers/driver_zephyr.c is at fault, when it tries to call dev_ops->scan_abort.
static int wpa_drv_zep_abort_scan(void *priv,
                                  u64 scan_cookie)
{
    struct zep_drv_if_ctx *if_ctx = NULL;
    const struct zep_wpa_supp_dev_ops *dev_ops;
    int ret = -1;

    if_ctx = priv;
    dev_ops = get_dev_ops(if_ctx->dev_ctx);
    if (!dev_ops->scan_abort) {
        wpa_printf(MSG_ERROR,
                   "%s: No op registered for scan_abort",
                   __func__);
        goto out;
    }
    ret = dev_ops->scan_abort(if_ctx->dev_priv);
out:
    return ret;
}
dev_ops points to a structure where all the pointers are NULL, but I think even that pointer itself is bad: as far as I can tell, 0x11f3ef47 (the value in r3; the BFAR address 0x11f3ef53 is 0xC above it) is in neither flash nor RAM.
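As a sanity check on that address, here is a small standalone sketch that classifies the addresses from the fault dump. The region boundaries are assumptions based on my understanding of the nRF5340 application core memory map (1 MB internal flash at 0x00000000, 512 KB SRAM at 0x20000000, QSPI XIP window at 0x10000000) and should be verified against the product specification; the function and enum names are hypothetical.

```c
#include <stdint.h>

/* Assumed nRF5340 application-core regions; verify against the datasheet. */
enum mem_region {
    REGION_FLASH, /* internal flash, assumed 0x00000000..0x000FFFFF */
    REGION_XIP,   /* external flash XIP window, assumed 0x10000000..0x1FFFFFFF */
    REGION_SRAM,  /* internal SRAM, assumed 0x20000000..0x2007FFFF */
    REGION_OTHER, /* peripherals, system space, or unmapped */
};

static enum mem_region classify_addr(uint32_t addr)
{
    if (addr < 0x00100000u) {
        return REGION_FLASH;
    }
    if (addr >= 0x10000000u && addr < 0x20000000u) {
        return REGION_XIP;
    }
    if (addr >= 0x20000000u && addr < 0x20080000u) {
        return REGION_SRAM;
    }
    return REGION_OTHER;
}
```

Under these assumptions the PC (0x00038588) is in internal flash and r0 (0x20059a80) is in SRAM, while 0x11f3ef47 lands in the XIP window, which would explain a precise bus fault on a board with no external flash mapped there.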
Presumably wpa_supp is trying to abort the scan from the previous connection attempt but has not handled the if-down/if-up restart correctly, so it is holding on to a device context that is no longer valid.
I also note the 'RPU is unresponsive' log. I have CONFIG_NRF_WIFI_RPU_RECOVERY=y as per a prior ticket about the wifi instability on NCS 2.6.x, which is why I updated to NCS 2.8.0.
Q: What could be causing the scan result errors that leave it stuck?
Q: How do I correctly stop/restart the wifi interface to recover when it is 'stuck'?