nRF7002 randomly unable to connect to Wifi

I have an application that syncs via the internet using an nRF5340 and nRF7002. I am using the Thingy:53 with an nRF7002EB for now while the custom circuit boards are being manufactured. The application connects via Wifi every 4 minutes to transmit data, then disconnects from the Wifi network to save power between sync intervals. It currently uses nRF Connect SDK v3.0.1 and is based on the Azure IoT sample.

The application is in early field testing with around 20 users.  It works well most of the time, but randomly, after 1-2 weeks the system will stop being able to connect to the Wifi network.  A Wifi scan is initiated, but a connection event is never received.  I have also discovered that the system will start working again if I issue the new diagnostic command nrf70 util rpu_recovery_test via a shell connection.

My suspicion is that this is a bug in the nRF7002 where it can somehow get into a semi-unresponsive state.  The only way to recover is to externally reset the nRF7002.  It is difficult to know for sure because this issue takes a week or two to reproduce.  I do have CONFIG_NRF_WIFI_RPU_RECOVERY set, but this doesn't seem to be working.  I am considering updating the application firmware to add a custom nRF7002 watchdog timer that can reset it if it hangs.  However, I wanted to check with the Nordic team to make sure there aren't any known issues or other suggested troubleshooting steps.
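A minimal sketch of the watchdog idea described above, in plain C so the counting logic can be checked off-target. All names here (`wifi_watchdog`, `wifi_watchdog_on_connect_timeout`, and so on) are hypothetical and not part of the nRF Connect SDK, and the threshold is an assumption. The sketch only decides when a reset is warranted; the reset action itself is deliberately left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-level watchdog; not an nRF Connect SDK API. */
struct wifi_watchdog {
    uint32_t missed_connects; /* consecutive sync intervals with no connect event */
    uint32_t max_missed;      /* assumed threshold before a reset is requested */
};

static void wifi_watchdog_init(struct wifi_watchdog *wd, uint32_t max_missed)
{
    assert(wd != NULL && max_missed > 0);
    wd->missed_connects = 0;
    wd->max_missed = max_missed;
}

/* Call when a connect event reports success: the nRF7002 is clearly alive. */
static void wifi_watchdog_on_connected(struct wifi_watchdog *wd)
{
    assert(wd != NULL);
    wd->missed_connects = 0;
}

/*
 * Call when a connect request ends without a connect event.
 * Returns true when the caller should reset the nRF7002; the counter
 * restarts so the next reset needs a fresh run of failures.
 */
static bool wifi_watchdog_on_connect_timeout(struct wifi_watchdog *wd)
{
    assert(wd != NULL);
    wd->missed_connects++;
    if (wd->missed_connects >= wd->max_missed) {
        wd->missed_connects = 0;
        return true;
    }
    return false;
}
```

With a 4 minute sync interval and a threshold of 3, this would request a reset after roughly 12 minutes without a connect event. A single successful connection clears the count, so an occasional failed attempt does not trigger a reset.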

  • I have the wifi ready feature implemented, along with the RPU recovery feature.  See wifi.c above.  It seems that the error wasn't detected by the wifi ready and RPU recovery logic.  However, manually issuing the test command to test the RPU recovery via the shell resolved the issue.

    This application does have logging enabled (backed to flash file system).  I also have the shell enabled for extracting the logs.  So, I could add additional logging if you think it is useful.  It might take a week or two for the issue to occur again.  The last time I captured the error, I didn't see any error messages logged.  I could see that a wifi connect request was being issued every 4 minutes (wifi ready), but there was no connection event received.  I also checked the stacks and heap (both looked fine).

    If I add application layer logic to reset the RPU, what is the best API call to use to trigger the reset (similar to what the rpu_recovery_test did)?

  • Sorry about the lack of response from me. I'm a bit uncertain here, so I've forwarded this to the relevant R&D team. I'll let you know once I hear from them. 

    Regards, and have a good weekend,

    Elfving

  • Hi, sorry to hijack the post. I am trying what you recommended in my ticket (the one you forwarded above, three days ago), but I didn't get a response about which propagation delay to use. I tried 200 ms and it has been working for 48 hours now, but I would like some advice (see my ticket). Also, maybe you could try tweaking the propagation delay as well to see if it solves our issue (which is exactly the same).

    Best regards

  • We had a couple more hang-ups over the weekend that have some interesting log snippets. These users recovered by performing a full cold start of the system after it had been running fine for many hours. I was hoping the wifi ready and RPU recovery would catch and fix these issues automatically without needing a cold start. I haven't experimented with the propagation delay yet, but please let me know if these errors provide any clues.

    [34:20:16.434,112] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [34:20:16.434,204] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [34:20:17.434,112] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [34:20:17.434,234] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [34:20:17.434,295] <err> wifi_nrf: nrf_wifi_wpa_supp_scan2: Scan trigger failed
    [34:20:17.434,356] <err> wpa_supp: wpa_drv_zep_scan2: scan2 op failed
    [34:20:18.435,150] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [34:20:18.435,211] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff

    [07:46:44.967,102] <err> wifi_nrf: hal_rpu_ps_wake: RPU is not ready for more than 1 sec,reg_val = 0x2 rpu_ps_state_mask = 0x6
    [07:46:44.967,193] <err> wifi_nrf: hal_rpu_reg_read: RPU wake failed
    [07:46:44.967,285] <err> wifi_nrf: hal_rpu_hpq_is_empty: Read from dequeue address failed, val (0x0)
    [07:46:44.967,346] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [07:46:44.967,437] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [07:46:44.967,498] <err> wifi_nrf: nrf_wifi_wpa_supp_scan_results_get: nrf_wifi_sys_fmac_scan_res_get failed
    [07:46:44.967,559] <err> wpa_supp: wpa_drv_zep_get_scan_results2: get_scan_results2 op failed
    [07:46:45.968,139] <err> wifi_nrf: hal_rpu_ps_wake: RPU is not ready for more than 1 sec,reg_val = 0x2 rpu_ps_state_mask = 0x6
    [07:46:45.968,231] <err> wifi_nrf: hal_rpu_reg_read: RPU wake failed
    [07:46:45.968,353] <err> wifi_nrf: hal_rpu_hpq_is_empty: Read from dequeue address failed, val (0x0)
    [07:46:45.968,444] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [07:46:45.968,536] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [07:46:45.968,627] <err> wifi_nrf: nrf_wifi_if_stop_zep: nrf_wifi_sys_fmac_set_power_save failed
    [07:46:46.968,139] <err> wifi_nrf: hal_rpu_ps_wake: RPU is not ready for more than 1 sec,reg_val = 0x2 rpu_ps_state_mask = 0x6
    [07:46:46.968,231] <err> wifi_nrf: hal_rpu_reg_read: RPU wake failed
    [07:46:46.968,353] <err> wifi_nrf: hal_rpu_hpq_is_empty: Read from dequeue address failed, val (0x0)
    [07:46:46.968,444] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [07:46:46.968,536] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [07:46:57.656,616] <err> wifi_nrf: nrf_wifi_sys_fmac_chg_vif_state: RPU is unresponsive for 10 sec
    [07:46:57.656,677] <err> wifi_nrf: nrf_wifi_if_stop_zep: nrf_wifi_sys_fmac_chg_vif_state failed
    [07:46:57.671,691] <inf> wifi: nRF7002 ready?: no
    [07:46:59.665,191] <inf> wifi_nrf_bus: SPIM spi@a000: freq = 8 MHz
    [07:46:59.665,252] <inf> wifi_nrf_bus: SPIM spi@a000: latency = 0
    [07:47:10.490,234] <err> wifi_nrf: nrf_wifi_sys_fmac_chg_vif_state: RPU is unresponsive for 10 sec
    [07:47:10.490,325] <err> wifi_nrf: nrf_wifi_if_start_zep: nrf_wifi_sys_fmac_chg_vif_state failed
    [07:47:10.492,584] <err> wifi_nrf: nrf_wifi_rpu_recovery_work_handler: rpu_ctx_zep is NULL
    [07:47:10.494,598] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.494,659] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.494,720] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.494,781] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.494,812] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.494,873] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.494,934] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.494,995] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.495,056] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.495,117] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.495,147] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.495,208] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.496,734] <inf> wifi: nRF7002 ready?: yes

  • Got a reply from the relevant team:

    rpu_recovery_test is purely for testing, but the fact that it helps recover Wi-Fi means that the nRF70 is stuck and, for some reason, the auto-recovery has not kicked in. Could you try to collect stats using nrf70 util rpu_stats_mem? The command needs to be repeated at least 5 times to see the diff.

    Regarding the code shared in wifi.c, a few comments based on limited understanding:

    • A connection timeout of 500 s is excessive; we typically use 60 s as the worst-case time. Is that a typo?
    • On successful connection, the connection timeout is restarted, and the code will disconnect after the timeout. Is this the desired behaviour? In a typical use case, the connection timer should be stopped upon successful connection.
    • In the connection timeout handler, there is no need to issue a disconnect: the connection did not succeed, so the disconnect would be a no-op.
    • Wi-Fi ready doesn't seem to be integrated properly. When the application receives a Wi-Fi not ready event, it has to stop using Wi-Fi (not send any traffic), wait for the Wi-Fi ready event, trigger a connection explicitly, and only start using Wi-Fi again after a successful connection. The code checks Wi-Fi readiness only before issuing a connect, which is incomplete. Please see the event loop for reference (it should be customized for your use case).
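The points above can be sketched as a small event-driven state machine. This is an illustration in plain C, not the referenced event loop and not the nRF Connect SDK API: every type, event name, and function below is hypothetical, and the periodic `WIFI_APP_EV_SYNC_DUE` event is an assumption based on the 4 minute sync interval described in the original post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application states and events; not SDK definitions. */
enum wifi_app_state {
    WIFI_APP_NOT_READY,
    WIFI_APP_IDLE,
    WIFI_APP_CONNECTING,
    WIFI_APP_CONNECTED,
};

enum wifi_app_event {
    WIFI_APP_EV_READY,
    WIFI_APP_EV_NOT_READY,
    WIFI_APP_EV_SYNC_DUE,
    WIFI_APP_EV_CONNECT_OK,
    WIFI_APP_EV_CONNECT_TIMEOUT,
    WIFI_APP_EV_DISCONNECTED,
};

struct wifi_app {
    enum wifi_app_state state;
    bool connect_timer_running; /* models the connection timeout timer */
    bool traffic_allowed;       /* application may send data only when true */
};

static void wifi_app_init(struct wifi_app *app)
{
    assert(app != NULL);
    app->state = WIFI_APP_NOT_READY;
    app->connect_timer_running = false;
    app->traffic_allowed = false;
}

static void wifi_app_start_connect(struct wifi_app *app)
{
    /* A real application would issue its connect request here. */
    app->state = WIFI_APP_CONNECTING;
    app->connect_timer_running = true;
}

static void wifi_app_handle(struct wifi_app *app, enum wifi_app_event ev)
{
    assert(app != NULL);

    switch (ev) {
    case WIFI_APP_EV_NOT_READY:
        /* Stop using Wi-Fi in every state, not only before a connect. */
        app->state = WIFI_APP_NOT_READY;
        app->connect_timer_running = false;
        app->traffic_allowed = false;
        break;
    case WIFI_APP_EV_READY:
        /* Wi-Fi is back: trigger a connection explicitly. */
        if (app->state == WIFI_APP_NOT_READY) {
            wifi_app_start_connect(app);
        }
        break;
    case WIFI_APP_EV_SYNC_DUE:
        /* Ignored while not ready or while a connection is in progress. */
        if (app->state == WIFI_APP_IDLE) {
            wifi_app_start_connect(app);
        }
        break;
    case WIFI_APP_EV_CONNECT_OK:
        if (app->state == WIFI_APP_CONNECTING) {
            /* Stop the connection timer instead of restarting it. */
            app->state = WIFI_APP_CONNECTED;
            app->connect_timer_running = false;
            app->traffic_allowed = true;
        }
        break;
    case WIFI_APP_EV_CONNECT_TIMEOUT:
        if (app->state == WIFI_APP_CONNECTING) {
            /* No disconnect is issued: nothing was connected. */
            app->state = WIFI_APP_IDLE;
            app->connect_timer_running = false;
        }
        break;
    case WIFI_APP_EV_DISCONNECTED:
        if (app->state == WIFI_APP_CONNECTED) {
            app->state = WIFI_APP_IDLE;
            app->traffic_allowed = false;
        }
        break;
    }
}
```

The key design choice is that the not-ready event is handled in every state, so traffic stops immediately, and a sync request that arrives while Wi-Fi is not ready is dropped rather than turned into a connect attempt.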

    Regards,

    Elfving
