nRF7002dk nrf5340 bus fault after shutting down interface

Hello,

I am working on a low-power sensor logging project using the nRF7002 DK. So far, I have modified the main loop of the wifi/sta sample. In place of:

k_sleep(K_FOREVER);

I have added:

status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, iface, NULL, 0);
status = net_if_down(iface); 
k_sleep(K_SECONDS(2)); // will be longer in production, shortened for testing
status = net_if_up(iface);
k_sleep(K_SECONDS(2)); // allow the interface time to come up

I also have code verifying the statuses are returned as 0, but haven't posted that here to simplify my post.

I find that after a few connections, there is a kernel panic when trying to bring the interface back up.

[00:04:56.560,913] <inf> sta: State: SCANNING
[00:04:56.861,053] <inf> sta: ==================
[00:04:56.861,083] <inf> sta: State: SCANNING
[00:04:57.150,177] <err> os: ***** BUS FAULT *****
[00:04:57.150,177] <err> os: Precise data bus error
[00:04:57.150,207] <err> os: BFAR Address: 0x40000b08
[00:04:57.150,207] <err> os: r0/a1: 0x20000200 r1/a2: 0x20000580 r2/a3: 0x20000289
[00:04:57.150,207] <err> os: r3/a4: 0x20000588 r12/ip: 0x2003cca0 r14/lr: 0x200006a8
[00:04:57.150,238] <err> os: xpsr: 0x01000000
[00:04:57.150,238] <err> os: Faulting instruction address (r15/pc): 0x0003e97a
[00:04:57.150,268] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:04:57.150,299] <err> os: Current thread: 0x20003470 (unknown)

The kernel panics stop if I add a 1 second delay between the net_mgmt and net_if_down calls.

It appears that I cannot shut down the interface immediately after disconnecting, or that something is not cleaned up, which causes problems when bringing the interface back up. In production code, I would like to replace the delay with a check that the disconnect has completed, in case it takes longer than 1 second. How can I determine when it is safe to shut down the interface?
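
A minimal sketch of one way to replace the fixed delay: wait, with an upper bound, until the disconnect result has been reported before calling net_if_down(). It is plain C so that it can be compiled off-target. The names on_disconnect_result, wait_for_disconnect, and the fake_* stand-ins are hypothetical and are not part of Zephyr or the sample. The assumption is that the flag would be set from the handler that the sta sample registers for NET_EVENT_WIFI_DISCONNECT_RESULT; whether that event alone guarantees that net_if_down() is safe is not confirmed in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: assumed to be set when the disconnect result arrives. */
static volatile bool disconnect_done;

/* Hypothetical hook: call this from the disconnect-result event handler. */
void on_disconnect_result(void)
{
	disconnect_done = true;
}

/* Sleep function supplied by the caller (on target, a wrapper around a
 * kernel sleep call; here it is injected so the logic can run on a host). */
typedef void (*sleep_ms_fn)(uint32_t ms);

/* Poll until the disconnect has been reported or timeout_ms has elapsed.
 * Returns true if the disconnect was reported, false on timeout. */
bool wait_for_disconnect(sleep_ms_fn sleep_ms, uint32_t timeout_ms,
			 uint32_t step_ms)
{
	uint32_t waited = 0;

	if (step_ms == 0) {
		step_ms = 1; /* avoid spinning forever with a zero step */
	}

	while (!disconnect_done) {
		if (waited >= timeout_ms) {
			return false;
		}
		sleep_ms(step_ms);
		waited += step_ms;
	}

	disconnect_done = false; /* re-arm for the next connect/disconnect cycle */
	return true;
}

/* Host-side stand-ins for demonstration only: the fake sleep advances a
 * counter and reports the disconnect once fake_signal_at_ms is reached. */
static uint32_t fake_elapsed_ms;
static uint32_t fake_signal_at_ms = 300;

static void fake_sleep_ms(uint32_t ms)
{
	fake_elapsed_ms += ms;
	if (fake_elapsed_ms >= fake_signal_at_ms) {
		on_disconnect_result();
	}
}
```

On target, sleep_ms would wrap k_msleep(), and net_if_down() would only be called when wait_for_disconnect() returns true. A semaphore given from the event handler and taken with a timeout would achieve the same thing without polling.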

  • Hi,

    I have added your modifications to the Wi-Fi sta sample from NCS v2.4.2. I tested the modified sample on an nRF7002 DK board, but I could not reproduce your issue. In my case, the sample behaved as expected.

    If you experience some congestion in your environment, you could try to increase the value of CONFIG_NET_MGMT_EVENT_QUEUE_TIMEOUT in the prj.conf file.

    Best regards,
    Dejan

  • Hello,

    Thank you for looking into this so quickly. I apologize, I forgot to mention that I am using the main branch. I have not tested this issue specifically in the NCS v2.4.2 release.

    Idle power consumption is extremely high in the NCS v2.4.2 release, so it is not suitable for our application.

  • > Based on your comment, is it correct that OK/FAIL is not coming from your internal server?

    That is correct. 

    > For further debugging, we would need a sniffer trace of the Wi-Fi 4 communication. Could you provide it?

    I will work on gathering that today and will provide the sniffer trace and console logs once they are available.

  • > For further debugging, we would need a sniffer trace of the Wi-Fi 4 communication. Could you provide it?

    Here ( dl.defelsko.com/.../nordic_logs_nov2.zip ) is the Wireshark capture and log output from a test I ran today. The board main loop stopped about 5 minutes before I noticed and stopped the Wireshark capture. The code I ran is the same as yesterday so the ELF file is available in that download if needed.

    The only device on the "dlink" network during testing was the Nordic development board. 

  • Hi,

    Thank you for providing additional files. 
    We will continue looking into the issues you reported. I expect to get back to you with new information next week.

    Best regards,
    Dejan

  • Hi, 

    I am sorry for the delayed reply. I was out of the office.

    With regard to your Wi-Fi 4 issue, we have done some testing using the Wi-Fi shell sample, and there were no issues on our side.

    Regarding the Wi-Fi 6 problems, you could try the following pull requests: data vs control path races (which should resolve the Wi-Fi crash) and nrf70 patch update (for resolving the data stall). As the Wi-Fi 6 data stall (no IP) fix has been merged into the latest NCS, you could also try checking out the latest NCS.

    Best regards,
    Dejan

  • Hello,

    Thank you for getting back to me.

    I have rebased onto the latest development branch and the behavior has changed. After 15 minutes on Wi-Fi 6, I get an endless stream of "State: SCANNING" messages:
    [2023-11-20 14:12:16] [02:34:24.108,917] <inf> sta: State: SCANNING
    [2023-11-20 14:12:16] [02:34:24.409,057] <inf> sta: ==================
    [2023-11-20 14:12:16] [02:34:24.409,088] <inf> sta: State: SCANNING
    [2023-11-20 14:12:16] [02:34:24.709,197] <inf> sta: ==================
    [2023-11-20 14:12:16] [02:34:24.709,228] <inf> sta: State: SCANNING
    [2023-11-20 14:12:16] [02:34:25.009,338] <inf> sta: ==================
    [2023-11-20 14:12:16] [02:34:25.009,368] <inf> sta: State: SCANNING
    [2023-11-20 14:12:17] [02:34:25.309,478] <inf> sta: ==================
    [2023-11-20 14:12:17] [02:34:25.309,509] <inf> sta: State: SCANNING
    [2023-11-20 14:12:17] [02:34:25.610,168] <inf> sta: ==================
    [2023-11-20 14:12:17] [02:34:25.610,198] <inf> sta: State: SCANNING
    [2023-11-20 14:12:17] [02:34:25.910,339] <inf> sta: ==================
    [2023-11-20 14:12:17] [02:34:25.910,369] <inf> sta: State: SCANNING
    [2023-11-20 14:12:17] [02:34:26.210,479] <inf> sta: ==================
    [2023-11-20 14:12:17] [02:34:26.210,510] <inf> sta: State: SCANNING
    [2023-11-20 14:12:18] [02:34:26.510,620] <inf> sta: ==================
    [2023-11-20 14:12:18] [02:34:26.510,650] <inf> sta: State: SCANNING
    [2023-11-20 14:12:18] [02:34:26.810,791] <inf> sta: ==================
    [2023-11-20 14:12:18] [02:34:26.810,821] <inf> sta: State: SCANNING

    I tested multiple times today and the issue was very repeatable.

    What are we doing wrong? Other customers must be using the nrf7002. In our testing, the drivers are completely unstable.

    > With regard to your Wi-Fi 4 issue, we have done some testing using the Wi-Fi shell sample, and there were no issues on our side.

    Could you elaborate on how you tested? Our use case seems like an extremely basic test case:
    * connect to wifi
    * get http end point
    * disconnect from wifi
    * power off wifi
    * wait for period of time
    * power on wifi
    * repeat
    Do you have any suggestions on how to prove that this mode of operation is possible for extended periods of time?
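
The cycle listed above could be exercised by a small soak-test harness that stops at the first failing step and records how many full cycles passed. The following is only a sketch in plain C so that it can be run off-target; run_soak, struct cycle_step, struct soak_result, and the demo_* stubs are hypothetical names, and on the board each step would wrap the corresponding call (for example the net_mgmt, net_if_down, and net_if_up calls from the first post) and return its status.

```c
#include <stddef.h>

/* One step of the duty cycle; returns 0 on success, matching the
 * "status == 0" checks described in the first post. */
typedef int (*step_fn)(void);

struct cycle_step {
	const char *name;
	step_fn run;
};

struct soak_result {
	unsigned long cycles_completed; /* full cycles that passed */
	const char *failed_step;        /* NULL if no step failed */
	int status;                     /* status of the failing step, else 0 */
};

/* Run the steps in order, up to max_cycles times, stopping at the first
 * step that returns a non-zero status. */
struct soak_result run_soak(const struct cycle_step *steps, size_t n_steps,
			    unsigned long max_cycles)
{
	struct soak_result res = { 0, NULL, 0 };

	for (unsigned long c = 0; c < max_cycles; c++) {
		for (size_t i = 0; i < n_steps; i++) {
			int status = steps[i].run();

			if (status != 0) {
				res.failed_step = steps[i].name;
				res.status = status;
				return res;
			}
		}
		res.cycles_completed++;
	}
	return res;
}

/* Demonstration stubs only: every step succeeds except "if_up", which is
 * made to fail on its fourth call to imitate an intermittent fault. */
static int demo_ok(void)
{
	return 0;
}

static int demo_if_up_calls;

static int demo_flaky_if_up(void)
{
	return (++demo_if_up_calls == 4) ? -1 : 0;
}

static const struct cycle_step demo_steps[] = {
	{ "connect", demo_ok },
	{ "http_get", demo_ok },
	{ "disconnect", demo_ok },
	{ "if_down", demo_ok },
	{ "wait", demo_ok },
	{ "if_up", demo_flaky_if_up },
};
```

Logging cycles_completed together with failed_step and status after each run gives a concrete number to compare between SDK versions and access point configurations.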
