nRF7002dk nrf5340 bus fault after shutting down interface

Hello,

I am working on a low-power sensor logging project using the nRF7002 DK. So far, I have updated the main loop in the wifi/sta sample. In place of:

k_sleep(K_FOREVER);

I have added:

status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, iface, NULL, 0);
status = net_if_down(iface); 
k_sleep(K_SECONDS(2)); // will be longer in production, shortened for testing
status = net_if_up(iface);
k_sleep(K_SECONDS(2)); // allow the interface time to come up

I also have code verifying the statuses are returned as 0, but haven't posted that here to simplify my post.

I find that after a few connections, there is a kernel panic when trying to bring the interface back up.

[00:04:56.560,913] <inf> sta: State: SCANNING
[00:04:56.861,053] <inf> sta: ==================
[00:04:56.861,083] <inf> sta: State: SCANNING
[00:04:57.150,177] <err> os: ***** BUS FAULT *****
[00:04:57.150,177] <err> os: Precise data bus error
[00:04:57.150,207] <err> os: BFAR Address: 0x40000b08
[00:04:57.150,207] <err> os: r0/a1: 0x20000200 r1/a2: 0x20000580 r2/a3: 0x20000289
[00:04:57.150,207] <err> os: r3/a4: 0x20000588 r12/ip: 0x2003cca0 r14/lr: 0x200006a8
[00:04:57.150,238] <err> os: xpsr: 0x01000000
[00:04:57.150,238] <err> os: Faulting instruction address (r15/pc): 0x0003e97a
[00:04:57.150,268] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:04:57.150,299] <err> os: Current thread: 0x20003470 (unknown)

The kernel panics stop if I add a 1 second delay between the net_mgmt and net_if_up calls. 

It appears that I cannot shut down the interface immediately after disconnecting, or that something is not cleaned up, which causes issues when bringing the interface back up. In production code, I would like to replace the delay with a check that the disconnect has completed, in case it takes longer than 1 second. How can I determine when it is safe to shut down the interface?
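
One way to replace the fixed delay is to wait, with a timeout, for confirmation that the disconnect has finished before calling net_if_down(). The sketch below shows only that sequencing and is not taken from the sample or the driver. The struct wifi_ops hooks and wifi_disconnect_then_down() are hypothetical names; on the target they would wrap net_mgmt(NET_REQUEST_WIFI_DISCONNECT, ...), a flag set from a NET_EVENT_WIFI_DISCONNECT_RESULT callback, k_msleep() and net_if_down(). Whether that event is a sufficient signal that the driver has finished cleaning up is an assumption that this thread does not confirm. The fakes at the end exist only so the logic can be exercised off-target.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical hooks (not part of Zephyr or the sample). On target they
 * would be thin wrappers around calls already used in this thread:
 *   request_disconnect -> net_mgmt(NET_REQUEST_WIFI_DISCONNECT, iface, NULL, 0)
 *   disconnect_done    -> reads a flag set when NET_EVENT_WIFI_DISCONNECT_RESULT arrives
 *   sleep_ms           -> k_msleep(ms)
 *   iface_down         -> net_if_down(iface)
 */
struct wifi_ops {
	int (*request_disconnect)(void);
	bool (*disconnect_done)(void);
	void (*sleep_ms)(int32_t ms);
	int (*iface_down)(void);
};

/* Request a disconnect, wait (bounded) for it to finish, then take the interface down. */
int wifi_disconnect_then_down(const struct wifi_ops *ops, int32_t timeout_ms, int32_t poll_ms)
{
	int ret = ops->request_disconnect();

	if (ret != 0) {
		return ret; /* request rejected: leave the interface untouched */
	}

	for (int32_t waited = 0; !ops->disconnect_done(); waited += poll_ms) {
		if (waited >= timeout_ms) {
			return -ETIMEDOUT; /* never confirmed: do not take the interface down */
		}
		ops->sleep_ms(poll_ms);
	}

	return ops->iface_down();
}

/* Host-side fakes (assumptions for illustration only). */
static int polls_until_done; /* polls that report "not done" before completion */
static int down_calls;       /* number of times the interface was taken down */

static int fake_request(void) { return 0; }
static bool fake_done(void) { return polls_until_done-- <= 0; }
static void fake_sleep(int32_t ms) { (void)ms; }
static int fake_down(void) { down_calls++; return 0; }

static const struct wifi_ops fake_ops = {
	fake_request, fake_done, fake_sleep, fake_down
};
```

On Zephyr, a semaphore given from the event callback and taken with a timeout would avoid the polling loop; the polling form is used here only to keep the sketch free of kernel dependencies.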

  • Hi,

    I added your modifications to the wifi/sta sample from NCS v2.4.2 and tested the modified sample on an nRF7002 DK board, but I could not reproduce your issue. In my case, the sample behaved as expected.

    If you experience some congestion in your environment, you could try to increase the value of CONFIG_NET_MGMT_EVENT_QUEUE_TIMEOUT in the prj.conf file.
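
    A sketch of that change in prj.conf. The value below is only an illustration, not one recommended in this thread; the option is understood to be a timeout in milliseconds:

    ```
    # Illustrative value (assumption); tune for your environment
    CONFIG_NET_MGMT_EVENT_QUEUE_TIMEOUT=5000
    ```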

    Best regards,
    Dejan

  • Hello,

    Thank you for looking into this so quickly. I apologize, I forgot to mention that I am using the main branch. I have not tested this issue specifically in the NCS v2.4.2 release.

    Idle power consumption is extremely high in the NCS v2.4.2 release, so it is not suitable for our application.

  • Hello,

    I have already provided my code.

    Here is our project with Bluetooth commented out: https://dl.defelsko.com/downloads/nordic_test_no_ble.zip 

    The only differences between that and what I am running are that my code obviously has an SSID and password, and in http_get.c I've specified the address of an internal test server which hosts a PHP endpoint which just returns the posted data.

  • Hi,

    bcornell said:
    I have already provided my code.

    Yes, you are right. We will look into it.

    bcornell said:
    The only differences between that and what I am running are that my code obviously has an SSID and password, and in http_get.c I've specified the address of an internal test server which hosts a PHP endpoint which just returns the posted data.

    Thank you for this information.

    Best regards,
    Dejan

  • Hi,

    Could you try the possible fix given below?

    ---
     overlay-debug.conf | 14 ++++++++++++++
     prj.conf           |  4 ++++
     src/main.c         |  7 +++++++
     3 files changed, 25 insertions(+)
     create mode 100644 overlay-debug.conf
    
    diff --git a/overlay-debug.conf b/overlay-debug.conf
    new file mode 100644
    index 0000000..dd83de0
    --- /dev/null
    +++ b/overlay-debug.conf
    @@ -0,0 +1,14 @@
    +CONFIG_SHELL=y
    +CONFIG_SHELL_BACKEND_SERIAL=y
    +CONFIG_SHELL_STACK_SIZE=4096
    +CONFIG_NET_SHELL=y
    +CONFIG_SHELL_GETOPT=y
    +CONFIG_SHELL_CMDS_RESIZE=n
    +CONFIG_NRF700X_UTIL=y
    +CONFIG_NET_L2_WIFI_SHELL=y
    +CONFIG_NET_STATISTICS=y
    +CONFIG_NET_STATISTICS_WIFI=y
    +CONFIG_NET_STATISTICS_USER_API=y
    +CONFIG_SYS_HEAP_RUNTIME_STATS=y
    +# Enable for debugging connection issues
    +# CONFIG_WPA_SUPP_LOG_LEVEL_DBG=y
    diff --git a/prj.conf b/prj.conf
    index 9b91578..32695f2 100644
    --- a/prj.conf
    +++ b/prj.conf
    @@ -107,3 +107,7 @@ CONFIG_NET_HTTP_LOG_LEVEL_DBG=y
     CONFIG_THREAD_NAME=y
     
     CONFIG_RESET_ON_FATAL_ERROR=n
    +
    +
    +# debugging
    +CONFIG_SHELL_STACK_SIZE=4096
    diff --git a/src/main.c b/src/main.c
    index d9bcfe7..a3a76aa 100644
    --- a/src/main.c
    +++ b/src/main.c
    @@ -416,6 +416,12 @@ int wifi_poweron(struct net_if *iface)
     
     //~ #define BT_LE_ADV_CONN_DEF BT_LE_ADV_PARAM(BT_LE_ADV_OPT_CONNECTABLE, 0x0640, 0x0680, NULL)
     
    +void dump_rpu_stats(void)
    +{
    +	shell_execute_cmd(shell_backend_uart_get_ptr(), "wifi_util tx_stats 0");
    +	shell_execute_cmd(shell_backend_uart_get_ptr(), "wifi_util rpu_stats all");
    +}
    +
     int main(void)
     {
     	//volatile unsigned int *myPointer = (volatile unsigned int *) 0x5002B500;
    @@ -505,6 +511,7 @@ int main(void)
     			while (!have_ip) {
     				attempt++;		
     				printf("%d\n",attempt);
    +				dump_rpu_stats();
     				if (attempt==200) {
     					printf("No ip.\n");
     					err=-99;
    -- 
    

    Please make sure that you build with overlay-debug.conf by passing the extra argument -DOVERLAY_CONFIG=overlay-debug.conf. If the issue is still present, please provide full logs and ELF files.

    Best regards,
    Dejan

  • Hello,

    I have tested the changes you provided; they made no difference to the behavior. They all appear to be related to adding more debug output.

    I tested with both access points and have uploaded my elf files and logs here: dl.defelsko.com/.../nordic_logs_nov1.zip

  • Hi,

    Thank you for testing and for providing the required files.
    We will look into it. I will get back to you with new information as soon as possible.

    Best regards,
    Dejan

  • Hi,

    Could you provide a sniffer trace of the Wi-Fi 4 communication?

    In the Wi-Fi 4 case, what is the criterion for OK/FAIL? Are OK and FAIL status codes from httpbin.org?

    Best regards,
    Dejan

  • In the Wi-Fi 4 case, what is the criterion for OK/FAIL? Are OK and FAIL status codes from httpbin.org?

    We are not using httpbin.org as I mentioned previously:

    I've specified the address of an internal test server which hosts a PHP endpoint which just returns the posted data.

    I added httpbin.org to the code provided so that you could see how the HTTP POST code functions. You will likely need a PHP endpoint on your own server to test this; I found that httpbin.org appears to have DoS protections that prevent this test from working continuously.

    I do not know what the criteria for OK/FAIL are. That is why I posted in the DevZone asking for help. I assume they come from somewhere in your driver code, or are they messages passed through from the modem?

    Has any progress been made on the initial Wi-Fi 6 issue where an MPU fault occurs when you disconnect and bring the interface down while waiting to obtain an IP address? Based on the logs, it appears that once I give up waiting for an IP and try to disconnect and bring the interface down, the MPU fault occurs. I assume that is related to background processing not being cleaned up?

  • Hi,

    bcornell said:
    I do not know what the criteria for OK/FAIL are. That is why I posted in the DevZone asking for help. I assume they come from somewhere in your driver code, or are they messages passed through from the modem?

    Based on your comment, is it correct that OK/FAIL is not coming from your internal server? 

    bcornell said:
    Has any progress been made on the initial Wi-Fi 6 issue where an MPU fault occurs when you disconnect and bring the interface down while waiting to obtain an IP address? Based on the logs, it appears that once I give up waiting for an IP and try to disconnect and bring the interface down, the MPU fault occurs. I assume that is related to background processing not being cleaned up?

    This is still a work in progress.

    dejans said:
    Could you provide a sniffer trace of the Wi-Fi 4 communication?

    For further debugging, we would need a sniffer trace of the Wi-Fi 4 communication. Could you provide it?

    Best regards,
    Dejan

  • Based on your comment, is it correct that OK/FAIL is not coming from your internal server? 

    That is correct. 

    For further debugging, we would need a sniffer trace of the Wi-Fi 4 communication. Could you provide it?

    I will work on gathering that today and will provide the sniffer trace and console logs once they are available.

  • For further debugging, we would need a sniffer trace of the Wi-Fi 4 communication. Could you provide it?

    Here ( dl.defelsko.com/.../nordic_logs_nov2.zip ) is the Wireshark capture and log output from a test I ran today. The board main loop stopped about 5 minutes before I noticed and stopped the Wireshark capture. The code I ran is the same as yesterday so the ELF file is available in that download if needed.

    The only device on the "dlink" network during testing was the Nordic development board. 
