nRF7002 randomly unable to connect to Wifi

I have an application that syncs via the internet using an nRF5340 and nRF7002.  I am using the Thingy:53 with an nRF7002EB for now while the custom circuit boards are being manufactured.  The application connects via wifi every 4 minutes to transmit data.  It then disconnects from the Wifi network to save power between sync intervals.  The application is currently using nRF Connect SDK v3.0.1 and is based on the Azure IoT sample.

The application is in early field testing with around 20 users.  It works well most of the time, but randomly, after 1-2 weeks the system will stop being able to connect to the Wifi network.  A Wifi scan is initiated, but a connection event is never received.  I have also discovered that the system will start working again if I issue the new diagnostic command nrf70 util rpu_recovery_test via a shell connection.

My suspicion is that this is a bug in the nRF7002 where it can somehow get into a semi-unresponsive state.  The only way to recover is to externally reset the nRF7002.  It is difficult to know for sure because this issue takes a week or two to reproduce.  I do have CONFIG_NRF_WIFI_RPU_RECOVERY set, but this doesn't seem to be working.  I am considering updating the application firmware to add a custom nRF7002 watchdog timer that can reset it if it hangs.  However, I wanted to check with the Nordic team to make sure there aren't any known issues or other suggested troubleshooting steps.
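The watchdog idea above could be sketched roughly as follows. This is a minimal, hardware-independent sketch and not Nordic's recovery mechanism: every name in it (wifi_watchdog, wifi_watchdog_feed, wifi_watchdog_check) is hypothetical, and the actual recovery action (for example resetting the nRF7002) is left to the caller. It assumes the application feeds the watchdog after each successful connection and polls it periodically with a millisecond uptime counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-level watchdog; not part of the nRF Connect SDK. */
struct wifi_watchdog {
    uint32_t timeout_ms;   /* longest tolerated gap between successful connections */
    uint32_t last_feed_ms; /* uptime at the last successful connection (or re-arm) */
    uint32_t resets;       /* number of times the watchdog has expired */
};

/* Arm the watchdog starting at now_ms. */
void wifi_watchdog_init(struct wifi_watchdog *wd, uint32_t timeout_ms,
                        uint32_t now_ms)
{
    assert(wd != NULL);
    wd->timeout_ms = timeout_ms;
    wd->last_feed_ms = now_ms;
    wd->resets = 0u;
}

/* Call after every successful Wi-Fi connection. */
void wifi_watchdog_feed(struct wifi_watchdog *wd, uint32_t now_ms)
{
    assert(wd != NULL);
    wd->last_feed_ms = now_ms;
}

/*
 * Poll periodically. Returns true when no successful connection has been
 * seen for timeout_ms; the caller should then run its recovery action.
 * The watchdog re-arms itself so recovery is not triggered on every poll.
 * Unsigned subtraction keeps the comparison correct across counter wrap.
 */
bool wifi_watchdog_check(struct wifi_watchdog *wd, uint32_t now_ms)
{
    assert(wd != NULL);
    if ((uint32_t)(now_ms - wd->last_feed_ms) < wd->timeout_ms) {
        return false;
    }
    wd->last_feed_ms = now_ms;
    wd->resets++;
    return true;
}
```

With a 4-minute sync interval, a timeout of a few intervals (say 12 minutes) would tolerate an occasional failed sync while still catching a hang that lasts indefinitely.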

  • Hi,

    Too bad we likely won't be able to get logs. But the issue looks a bit similar to this one. Could you have a look at my suggestions for that customer?

    Regards,

    Elfving

  • We had a couple more hangs over the weekend that have some interesting log snippets.  These users recovered by performing a full cold start of the system after running fine for many hours.  I was hoping the wifi ready and RPU recovery would catch/fix these issues automatically without needing a cold start.  I haven't experimented with the propagation delay yet, but please let me know if these errors provide any clues.

    [34:20:16.434,112] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [34:20:16.434,204] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [34:20:17.434,112] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [34:20:17.434,234] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [34:20:17.434,295] <err> wifi_nrf: nrf_wifi_wpa_supp_scan2: Scan trigger failed
    [34:20:17.434,356] <err> wpa_supp: wpa_drv_zep_scan2: scan2 op failed
    [34:20:18.435,150] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [34:20:18.435,211] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff

    [07:46:44.967,102] <err> wifi_nrf: hal_rpu_ps_wake: RPU is not ready for more than 1 sec,reg_val = 0x2 rpu_ps_state_mask = 0x6
    [07:46:44.967,193] <err> wifi_nrf: hal_rpu_reg_read: RPU wake failed
    [07:46:44.967,285] <err> wifi_nrf: hal_rpu_hpq_is_empty: Read from dequeue address failed, val (0x0)
    [07:46:44.967,346] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [07:46:44.967,437] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [07:46:44.967,498] <err> wifi_nrf: nrf_wifi_wpa_supp_scan_results_get: nrf_wifi_sys_fmac_scan_res_get failed
    [07:46:44.967,559] <err> wpa_supp: wpa_drv_zep_get_scan_results2: get_scan_results2 op failed
    [07:46:45.968,139] <err> wifi_nrf: hal_rpu_ps_wake: RPU is not ready for more than 1 sec,reg_val = 0x2 rpu_ps_state_mask = 0x6
    [07:46:45.968,231] <err> wifi_nrf: hal_rpu_reg_read: RPU wake failed
    [07:46:45.968,353] <err> wifi_nrf: hal_rpu_hpq_is_empty: Read from dequeue address failed, val (0x0)
    [07:46:45.968,444] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [07:46:45.968,536] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [07:46:45.968,627] <err> wifi_nrf: nrf_wifi_if_stop_zep: nrf_wifi_sys_fmac_set_power_save failed
    [07:46:46.968,139] <err> wifi_nrf: hal_rpu_ps_wake: RPU is not ready for more than 1 sec,reg_val = 0x2 rpu_ps_state_mask = 0x6
    [07:46:46.968,231] <err> wifi_nrf: hal_rpu_reg_read: RPU wake failed
    [07:46:46.968,353] <err> wifi_nrf: hal_rpu_hpq_is_empty: Read from dequeue address failed, val (0x0)
    [07:46:46.968,444] <err> wifi_nrf: hal_rpu_ready_wait: Timed out waiting (msg_type = 0)
    [07:46:46.968,536] <err> wifi_nrf: hal_rpu_cmd_process_queue: Timeout waiting to get free cmd buff from RPU
    [07:46:57.656,616] <err> wifi_nrf: nrf_wifi_sys_fmac_chg_vif_state: RPU is unresponsive for 10 sec
    [07:46:57.656,677] <err> wifi_nrf: nrf_wifi_if_stop_zep: nrf_wifi_sys_fmac_chg_vif_state failed
    [07:46:57.671,691] <inf> wifi: nRF7002 ready?: no
    [07:46:59.665,191] <inf> wifi_nrf_bus: SPIM spi@a000: freq = 8 MHz
    [07:46:59.665,252] <inf> wifi_nrf_bus: SPIM spi@a000: latency = 0
    [07:47:10.490,234] <err> wifi_nrf: nrf_wifi_sys_fmac_chg_vif_state: RPU is unresponsive for 10 sec
    [07:47:10.490,325] <err> wifi_nrf: nrf_wifi_if_start_zep: nrf_wifi_sys_fmac_chg_vif_state failed
    [07:47:10.492,584] <err> wifi_nrf: nrf_wifi_rpu_recovery_work_handler: rpu_ctx_zep is NULL
    [07:47:10.494,598] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.494,659] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.494,720] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.494,781] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.494,812] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.494,873] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.494,934] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.494,995] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.495,056] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.495,117] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.495,147] <err> wifi_nrf: nrf_wifi_wpa_supp_set_key: rpu_ctx_zep is NULL
    [07:47:10.495,208] <err> wpa_supp: _wpa_drv_zep_set_key: set_key op failed
    [07:47:10.496,734] <inf> wifi: nRF7002 ready?: yes

  • Application details

    • Data transfer is to Azure IoT Hub (TCP/MQTT).  Sync attempts are once every 4 minutes and the payload is around 512 bytes.  As far as I can tell, the error occurs when trying to make a new wifi connection (not during a connection or while syncing data).
    • Issue has been duplicated in multiple environments with multiple routers.  I am using Wifi 6, 2.4 GHz (channel 3) with WPA2-PSK.
    • Yes, the issue can be duplicated on both the nRF7002DK and the nRF5340+nRF7002EK

    I will update the application logging with the RPU information so we have it the next time we reproduce the issue.  Some of the other diagnostic requests are more difficult, but I will look into them.

    Additional update - I created a modified version of the STA sample to see if I can duplicate the issue with shareable code on a DK board.  It hasn't failed yet, but this might be because I also updated the firmware to bring down the Wifi interface (net_if_down()) when not in use or if it can't connect within 60 seconds.  I will test some more, but bringing down the interface might be a good way to work around the issue.
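The connect cycle described in this reply (interface up only while syncing, and brought down again if no connection arrives within 60 seconds) could be organised along these lines. This is a sketch under stated assumptions, not the code from the modified STA sample: the wifi_cycle names are hypothetical, and the functions only return the action to take. In a Zephyr application the caller would presumably map WIFI_CYCLE_BRING_UP_AND_CONNECT to net_if_up() followed by a connect request, and WIFI_CYCLE_BRING_DOWN to a disconnect (if connected) followed by net_if_down().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the caller maps actions to the real network calls. */
enum wifi_cycle_state {
    WIFI_CYCLE_DOWN,       /* interface is down between syncs */
    WIFI_CYCLE_CONNECTING, /* interface is up, connect request issued */
    WIFI_CYCLE_CONNECTED,  /* connected, sync in progress */
};

enum wifi_cycle_action {
    WIFI_CYCLE_NONE,
    WIFI_CYCLE_BRING_UP_AND_CONNECT,
    WIFI_CYCLE_BRING_DOWN,
};

struct wifi_cycle {
    enum wifi_cycle_state state;
    uint32_t connect_start_ms;   /* uptime when the connect attempt began */
    uint32_t connect_timeout_ms; /* e.g. 60000 for the 60 s limit */
};

void wifi_cycle_init(struct wifi_cycle *c, uint32_t connect_timeout_ms)
{
    assert(c != NULL);
    c->state = WIFI_CYCLE_DOWN;
    c->connect_start_ms = 0u;
    c->connect_timeout_ms = connect_timeout_ms;
}

/* Call when a sync is due. Only starts an attempt if the interface is down. */
enum wifi_cycle_action wifi_cycle_start_sync(struct wifi_cycle *c, uint32_t now_ms)
{
    assert(c != NULL);
    if (c->state != WIFI_CYCLE_DOWN) {
        return WIFI_CYCLE_NONE;
    }
    c->state = WIFI_CYCLE_CONNECTING;
    c->connect_start_ms = now_ms;
    return WIFI_CYCLE_BRING_UP_AND_CONNECT;
}

/* Call from the handler that reports a successful connection. */
void wifi_cycle_on_connected(struct wifi_cycle *c)
{
    assert(c != NULL);
    if (c->state == WIFI_CYCLE_CONNECTING) {
        c->state = WIFI_CYCLE_CONNECTED;
    }
}

/* Call when the data transfer has finished. */
enum wifi_cycle_action wifi_cycle_sync_done(struct wifi_cycle *c)
{
    assert(c != NULL);
    if (c->state != WIFI_CYCLE_CONNECTED) {
        return WIFI_CYCLE_NONE;
    }
    c->state = WIFI_CYCLE_DOWN;
    return WIFI_CYCLE_BRING_DOWN;
}

/* Call periodically; gives up on a connect attempt that exceeds the timeout. */
enum wifi_cycle_action wifi_cycle_tick(struct wifi_cycle *c, uint32_t now_ms)
{
    assert(c != NULL);
    if (c->state == WIFI_CYCLE_CONNECTING &&
        (uint32_t)(now_ms - c->connect_start_ms) >= c->connect_timeout_ms) {
        c->state = WIFI_CYCLE_DOWN;
        return WIFI_CYCLE_BRING_DOWN;
    }
    return WIFI_CYCLE_NONE;
}
```

Returning an action instead of calling the network stack directly keeps the timing logic testable on a host machine, independent of the board.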

  • Please see the attached sample program and log file that can be used to duplicate the issue using an nRF7002 DK board.  The application continuously connects and disconnects wifi.  If it can't connect for 60 seconds, it tries again and logs the message "Wifi stuck".

    In my test, it took about 6 hours for wifi to get stuck in a loop where it could no longer connect.  I tried printing the rpu stats again, but they were all 0's.  I tried again and got the message "<err> wifi_nrf: nrf_wifi_sys_fmac_stats_get: Stats request already pending".  I tried to do a wifi scan, but no devices were found.  Lastly, I issued the rpu_recovery_test command and things started working again.

    I think the problem is caused by calling net_mgmt(NET_REQUEST_WIFI_CONNECT_STORED, net_if_get_default(), NULL, 0) when a previous connect request is in progress (or failed).  Also, it appears that bringing down the wifi interface between connect/disconnect cycles avoids the issue.
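If overlapping connect requests really are the trigger (this is only the poster's hypothesis and has not been confirmed as the root cause), one way to rule them out is a small in-flight guard around the connect call. The sketch below uses standard C11 atomics and hypothetical names (connect_guard, connect_guard_try_begin, connect_guard_end). It assumes the application calls connect_guard_end() from whatever handler reports the outcome of the attempt; in Zephyr that would likely be the NET_EVENT_WIFI_CONNECT_RESULT handler, or the point where the application gives up and brings the interface down.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical guard that allows only one outstanding connect request. */
struct connect_guard {
    atomic_bool in_flight;
};

void connect_guard_init(struct connect_guard *g)
{
    assert(g != NULL);
    atomic_init(&g->in_flight, false);
}

/*
 * Returns true if the caller may issue a connect request now.
 * Returns false if an earlier request has not completed yet, in which
 * case the caller should skip the request instead of stacking another.
 */
bool connect_guard_try_begin(struct connect_guard *g)
{
    bool expected = false;

    assert(g != NULL);
    return atomic_compare_exchange_strong(&g->in_flight, &expected, true);
}

/* Call when the attempt has finished, whether it succeeded, failed,
 * or was abandoned. */
void connect_guard_end(struct connect_guard *g)
{
    assert(g != NULL);
    atomic_store(&g->in_flight, false);
}
```

The application would call connect_guard_try_begin() immediately before net_mgmt(NET_REQUEST_WIFI_CONNECT_STORED, ...) and issue the request only when it returns true.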

    wifi.log

    0042.sta.zip

  • Thanks for the sample program and logs, we are looking into it.

    Regards,

    Elfving

  • Hi again,

    We have been able to reproduce a similar issue on an nRF7002 DK after ~1 week of continuous testing on multiple boards. This requires more analysis though.

    What is interesting is that we have not yet been able to reproduce the issue using the sta.zip you provided. If you are able to reproduce it quickly, could you try running "nrf70 util rpu_stats_mem" multiple times (with the issue reproduced) and share the logs? That would be great.

    Regards,

    Elfving

  • The customer asked if we could make this ticket private.  We would like to renew focus on this and have some additional information, but wanted to do it in a private ticket.  Is that possible or should I create a new private ticket?

  • Additional update from our end...

    I have deployed a workaround that brings down the wifi interface between connections.  Since I did this 2 weeks ago, I haven't seen any lockups in our field test units (tested on ~8 devices).  My tentative conclusion from this is that there is a bug in the system that can occur randomly over time after multiple connect/disconnect cycles.  The workaround for this is to bring down the wifi interface between connections.
