We are experiencing an issue with the nRF7002 that appears to occur only when coexistence is enabled: the nRF7002 becomes unresponsive, seemingly at random, multiple times a day. The issue occurs on both our custom hardware and the nRF7002 DK running our application. With coexistence disabled, I have yet to see it occur.
In my logs, I see that the RPU goes to sleep but does not return until CONFIG_NRF_WIFI_RPU_RECOVERY_PS_ACTIVE_TIMEOUT_MS has elapsed, at which point the watchdog triggers an RPU recovery.
[01:26:22.920,410] <inf> wifi_nrf: hal_rpu_ps_wake: RPU PS state is AWAKE
[01:26:22.931,304] <inf> wifi_nrf: hal_rpu_ps_sleep: RPU PS state is ASLEEP
[01:26:23.034,851] <inf> wifi_nrf: hal_rpu_ps_wake: RPU PS state is AWAKE
[01:26:23.045,562] <inf> wifi_nrf: hal_rpu_ps_sleep: RPU PS state is ASLEEP
[01:26:23.055,572] <inf> wifi_nrf: hal_rpu_ps_wake: RPU PS state is AWAKE
[01:26:23.066,467] <inf> wifi_nrf: hal_rpu_ps_sleep: RPU PS state is ASLEEP
...
...
...
[01:27:12.849,395] <inf> wifi_nrf: hal_rpu_ps_wake: RPU PS state is AWAKE
[01:27:12.850,463] <inf> wifi_nrf: Received watchdog interrupt
[01:27:12.850,952] <inf> wifi_nrf: Processing watchdog interrupt
[01:27:12.851,501] <inf> wifi_nrf: RPU sleep opp diff: 49785 ms, last RPU sleep opp time: 5183066
[01:27:12.852,142] <inf> wifi_nrf: RPU recovery needed
[01:27:12.853,424] <err> wifi_nrf: nrf_wifi_rpu_recovery_work_handler: Starting RPU recovery
[01:27:12.854,064] <err> wifi_nrf: nrf_wifi_rpu_recovery_work_handler: Bringing the interface down
Sometimes this happens eight minutes after powering on; other times it happens after 30 minutes. In this instance it took nearly 1.5 hours. I have tried preventing the nRF7002 from going to sleep with CONFIG_NRF_WIFI_LOW_POWER=n, and I reduced CONFIG_BT_CTLR_SDC_MAX_CONN_EVENT_LEN_DEFAULT from 7500 to 4000. Nothing I have done so far prevents the issue from occurring while coexistence is enabled. While I can tune the watchdog timeout value to minimise the time that the nRF7002 is "offline", this is not an ideal solution given how frequently the issue happens.
Our application currently scans for BLE peripherals while connected to a peripheral that sends one small packet of data once a second. This data is sent to our backend via a TLS connection as well as to a device connected locally via a TCP connection. Our application also broadcasts small packets via UDP once a second on the local network and has a TCP socket listener open.
We are using NCS v3.1.1, and upgrading to v3.3.0 did not resolve the issue. Coexistence is configured as described in Nordic's documentation, but for 4-wire mode. Our application is not overloading the system (plenty of idle time). BLE scanning is configured with a 100 ms interval and a 20 ms window. Our hardware uses the shared antenna configuration.
Have I missed something? I'm all out of ideas at the moment.
Regards,
Jon