nRF7002dk nrf5340 bus fault after shutting down interface

Hello,

I am working on a low power sensor logging project using the nRF7002dk. So far, I have updated the main loop in the wifi/sta sample and in place of the:

k_sleep(K_FOREVER);

I have added:

status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, iface, NULL, 0);
status = net_if_down(iface);
k_sleep(K_SECONDS(2)); // will be longer in production, shortened for testing
status = net_if_up(iface);
k_sleep(K_SECONDS(2)); // allow the interface time to come up

I also have code verifying the statuses are returned as 0, but haven't posted that here to simplify my post.

I find that after a few connections, there is a kernel panic when trying to bring the interface back up.

[00:04:56.560,913] <inf> sta: State: SCANNING
[00:04:56.861,053] <inf> sta: ==================
[00:04:56.861,083] <inf> sta: State: SCANNING
[00:04:57.150,177] <err> os: ***** BUS FAULT *****
[00:04:57.150,177] <err> os: Precise data bus error
[00:04:57.150,207] <err> os: BFAR Address: 0x40000b08
[00:04:57.150,207] <err> os: r0/a1: 0x20000200 r1/a2: 0x20000580 r2/a3: 0x20000289
[00:04:57.150,207] <err> os: r3/a4: 0x20000588 r12/ip: 0x2003cca0 r14/lr: 0x200006a8
[00:04:57.150,238] <err> os: xpsr: 0x01000000
[00:04:57.150,238] <err> os: Faulting instruction address (r15/pc): 0x0003e97a
[00:04:57.150,268] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:04:57.150,299] <err> os: Current thread: 0x20003470 (unknown)

The kernel panics stop if I add a 1 second delay between the net_mgmt and net_if_down calls.

It appears that I cannot shut down the interface immediately after disconnecting, or that something is not cleaned up, which then causes issues when bringing the interface back up. In production code, I would like to replace the delay with a check that the disconnect has completed, in case it takes longer than 1 second. How can I determine when it is safe to shut down the interface?
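
One possible way to replace the fixed delay, sketched here under assumptions and not confirmed anywhere in this thread: block on a semaphore that is given from a NET_EVENT_WIFI_DISCONNECT_RESULT handler, and call net_if_down() only once it has fired. The helper names disconnect_wait_init() and wifi_disconnect_and_down() and the semaphore are hypothetical, and the handler signature assumes a Zephyr version in which mgmt_event is a uint32_t. Whether this event really marks the point at which the driver can safely be brought down is an open question.

```c
#include <errno.h>
#include <zephyr/kernel.h>
#include <zephyr/net/net_if.h>
#include <zephyr/net/net_mgmt.h>
#include <zephyr/net/wifi_mgmt.h>

/* Hypothetical semaphore: given by the handler when the disconnect result arrives. */
static K_SEM_DEFINE(disconnect_done, 0, 1);
static struct net_mgmt_event_callback disconnect_cb;

static void on_wifi_event(struct net_mgmt_event_callback *cb,
			  uint32_t mgmt_event, struct net_if *iface)
{
	ARG_UNUSED(cb);
	ARG_UNUSED(iface);

	if (mgmt_event == NET_EVENT_WIFI_DISCONNECT_RESULT) {
		k_sem_give(&disconnect_done);
	}
}

/* Hypothetical helper: call once at startup, before the first disconnect. */
void disconnect_wait_init(void)
{
	net_mgmt_init_event_callback(&disconnect_cb, on_wifi_event,
				     NET_EVENT_WIFI_DISCONNECT_RESULT);
	net_mgmt_add_event_callback(&disconnect_cb);
}

/* Hypothetical helper: request a disconnect, wait for its result event,
 * then take the interface down. Returns 0 or a negative errno value.
 */
int wifi_disconnect_and_down(struct net_if *iface, k_timeout_t timeout)
{
	int status;

	/* Clear any stale result before issuing a new request. */
	k_sem_reset(&disconnect_done);

	status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, iface, NULL, 0);
	if (status != 0) {
		return status;
	}

	/* The event may fire before or after net_mgmt() returns. */
	if (k_sem_take(&disconnect_done, timeout) != 0) {
		return -ETIMEDOUT;
	}

	return net_if_down(iface);
}
```

In the loop this would stand in for the first two calls, for example wifi_disconnect_and_down(iface, K_SECONDS(10)), where the 10 second timeout is an arbitrary choice.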

0 dejans over 1 year ago

Hi,

I added your modifications to the wifi sta sample from NCS v2.4.2 and tested the modified sample on an nRF7002 DK board, but I could not reproduce your issue. In my case, the sample behaved as expected.

If you experience some congestion in your environment, you could try to increase the value of CONFIG_NET_MGMT_EVENT_QUEUE_TIMEOUT in the prj.conf file.
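
As a sketch, this is a single line in prj.conf. The value below is only illustrative and is understood here to be a timeout in milliseconds; the Kconfig help text for your SDK version gives the allowed range and the default.

```
# prj.conf: illustrative value, assumed to be in milliseconds
CONFIG_NET_MGMT_EVENT_QUEUE_TIMEOUT=5000
```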

Best regards,
Dejan

0 bcornell over 1 year ago in reply to dejans

Hello,

Thank you for looking into this so quickly. I apologize, I forgot to mention that I am using the main branch. I have not tested this issue specifically in the NCS v2.4.2 release.

Idle power consumption is extremely high in the NCS v2.4.2 release, so it is not suitable for our application.

0 dejans over 1 year ago in reply to dejans

Hi,

Could you please have a look at this pull request and let me know if this commit fixes your issue?

Best regards,
Dejan

0 bcornell over 1 year ago in reply to dejans

Hello,

I rebased, ensuring that the commit changes were included, and can confirm that the commit fixes the issue I was seeing. I was able to connect and disconnect without crashes for around an hour.

However, I tested this fix in my main project which I've uploaded here: https://dl.defelsko.com/downloads/nordic_test.zip

After about an hour I got an error:

[2023-10-19 10:19:11] 200
[2023-10-19 10:19:11] No ip.
[2023-10-19 10:19:11] OK
[2023-10-19 10:19:11] [00:48:05.573,913] <inf> sta: Disconnection request done (0)
[2023-10-19 10:19:11] [00:48:05.580,200] <inf> sta: ==================
[2023-10-19 10:19:11] [00:48:05.580,230] <inf> sta: State: DISCONNECTED
[2023-10-19 10:19:11] [00:48:05.580,444] <inf> sta: Disconnect requested
[2023-10-19 10:19:11] disconnect:0
[2023-10-19 10:19:11] [00:48:05.598,815] <err> wifi_nrf: nrf_wifi_hal_buf_map_tx: Called for already mapped TX buffer
[2023-10-19 10:19:11]
[2023-10-19 10:19:11] [00:48:05.598,846] <err> os: ***** MPU FAULT *****
[2023-10-19 10:19:11] [00:48:05.598,846] <err> os: Data Access Violation
[2023-10-19 10:19:11] [00:48:05.598,846] <err> os: MMFAR Address: 0xd0
[2023-10-19 10:19:11] [00:48:05.598,876] <err> os: r0/a1: 0x00009879 r1/a2: 0x00000000 r2/a3: 0x00000006
[2023-10-19 10:19:11] [00:48:05.598,876] <err> os: r3/a4: 0x00000001 r12/ip: 0x2000270b r14/lr: 0x00035393
[2023-10-19 10:19:11] [00:48:05.598,907] <err> os: xpsr: 0x01000000
[2023-10-19 10:19:11] [00:48:05.598,907] <err> os: Faulting instruction address (r15/pc): 0x000353bc
[2023-10-19 10:19:11] [00:48:05.598,937] <err> os: >>> ZEPHYR FATAL ERROR 19: Unknown error on CPU 0
[2023-10-19 10:19:11] [00:48:05.598,968] <err> os: Current thread: 0x20003540 (unknown)
[2023-10-19 10:19:11] [00:48:05.693,786] <err> coredump: #CD:BEGIN#
[2023-10-19 10:19:11] [00:48:05.698,913] <err> coredump: #CD:5a4501000300050013000000
[2023-10-19 10:19:11] [00:48:05.705,566] <err> coredump: #CD:4102004400
[2023-10-19 10:19:11] [00:48:05.711,029] <err> coredump: #CD:799800000000000006000000010000000b27002093530300bc53030000000001
[2023-10-19 10:19:11] [00:48:05.721,191] <err> coredump: #CD:a018012000000000000000000000000000000000000000000000000000000000
[2023-10-19 10:19:11] [00:48:05.731,353] <err> coredump: #CD:00000000
[2023-10-19 10:19:11] [00:48:05.736,633] <err> coredump: #CD:4d01004035002008360020
[2023-10-19 10:19:11] [00:48:05.743,133] <err> coredump: #CD:e0380020705e0020000000000080000000000000000000000000000000000000
[2023-10-19 10:19:11] [00:48:05.753,295] <err> coredump: #CD:b5a20600000000000000000000000000ffffffff10000000e45d0020e4100020
[2023-10-19 10:19:11] [00:48:05.763,488] <err> coredump: #CD:64788c0084100320c0000000647800b028180120000000009835002098350020
[2023-10-19 10:19:11] [00:48:05.773,651] <err> coredump: #CD:0000000000000000000000000000000000000000000000000000000000000000
[2023-10-19 10:19:11] [00:48:05.783,843] <err> coredump: #CD:0000000000000000000000000000000000000000000000005812012000080000
[2023-10-19 10:19:11] [00:48:05.794,006] <err> coredump: #CD:000000008c1500200000000000000000f8180120000000004035002000000000
[2023-10-19 10:19:11] [00:48:05.804,168] <err> coredump: #CD:0000000000000000
[2023-10-19 10:19:11] [00:48:05.810,150] <err> coredump: #CD:4d010058120120581a0120
[2023-10-19 10:19:11] [00:48:05.816,650] <err> coredump: #CD:f0f0f0f0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
[2023-10-19 10:19:11] [00:48:05.826,812] <err> coredump: #CD:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
[2023-10-19 10:19:11] [00:48:05.837,005] <err> coredump: #CD:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
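
As a quick sanity check on the dump above, the first #CD: line can be decoded by hand. The sketch below assumes the Zephyr coredump file header layout as I understand it (the bytes 'Z' 'E', then a 16-bit version, a 16-bit target code, log2 of the pointer size in bits, a flags byte, and a 32-bit fatal error reason, all little-endian); cd_header and cd_parse_header are hypothetical names, not part of any Zephyr tool.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout of the 12-byte coredump file header (little-endian). */
struct cd_header {
	uint16_t version;       /* header version */
	uint16_t target;        /* target architecture code */
	uint8_t ptr_size_log2;  /* log2 of the pointer size in bits */
	uint8_t flags;
	uint32_t reason;        /* fatal error reason code */
};

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9') {
		return c - '0';
	}
	if (c >= 'a' && c <= 'f') {
		return c - 'a' + 10;
	}
	if (c >= 'A' && c <= 'F') {
		return c - 'A' + 10;
	}
	return -1;
}

/* Parse the hex payload that follows "#CD:" on the first dump line.
 * Returns 0 on success, -1 if the text is too short, is not hex,
 * or does not start with the 'Z' 'E' marker.
 */
int cd_parse_header(const char *hex, struct cd_header *out)
{
	uint8_t b[12];

	if (strlen(hex) < 2 * sizeof(b)) {
		return -1;
	}
	for (size_t i = 0; i < sizeof(b); i++) {
		int hi = hex_nibble(hex[2 * i]);
		int lo = hex_nibble(hex[2 * i + 1]);

		if (hi < 0 || lo < 0) {
			return -1;
		}
		b[i] = (uint8_t)((hi << 4) | lo);
	}
	if (b[0] != 'Z' || b[1] != 'E') {
		return -1;
	}
	out->version = (uint16_t)(b[2] | (b[3] << 8));
	out->target = (uint16_t)(b[4] | (b[5] << 8));
	out->ptr_size_log2 = b[6];
	out->flags = b[7];
	out->reason = (uint32_t)b[8] | ((uint32_t)b[9] << 8) |
		      ((uint32_t)b[10] << 16) | ((uint32_t)b[11] << 24);
	return 0;
}
```

Fed the payload 5a4501000300050013000000 from the log, this yields reason 19, which matches the "ZEPHYR FATAL ERROR 19" line, and a pointer size of 2^5 = 32 bits. For a real backtrace, the coredump scripts shipped with Zephyr together with the matching ELF file are the better route.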

This project includes code from the wifi/sta, net/sockets/http_get, and bluetooth/peripheral_lbs samples. Most of the Bluetooth code has been removed, but it does advertise, which I have been using when logging is off to check that the code is still running and has not crashed. If you set the name of an access point and its password in the configuration, this code should connect to Wi-Fi, hit the HTTP endpoint with an incrementing value, disconnect and shut down Wi-Fi, pause, then bring the interface back up and reconnect, repeating every few seconds. You'll see that I have commented out blocks of code where I ran the http_get portion in an endless loop to check that it had no memory leaks.

The issue I was seeing has been fixed, but I am still unable to achieve stable operation with this mode of operation.

0 dejans over 1 year ago in reply to bcornell

Hi,

Applications can be developed with wireless coexistence. You could have a look at Wi-Fi Bluetooth LE coexistence and its documentation.

Best regards,
Dejan

0 bcornell over 1 year ago in reply to dejans

Hello,

To rule out any issues related to BLE or to wireless coexistence we have commented out the Bluetooth code in our project. It failed with exactly the same error message as provided above. We do not think wireless coexistence is related to the crashes because it still crashes even with Bluetooth disabled.

Here is our project with Bluetooth commented out: https://dl.defelsko.com/downloads/nordic_test_no_ble.zip

0 dejans over 1 year ago in reply to bcornell

Hi,

I have tested your project for a little more than 1 hour and could not reproduce your issue.

To help reproduce the issue, it would be useful if you could test multiple times with the same board and also with another board, if you have one available. You should also consider connecting to another access point to check whether the issue appears there as well.

Best regards,
Dejan

0 bcornell over 1 year ago in reply to dejans

Hello,

The error we have been getting when testing with our main development board and new WiFi 6 access point occurs after 2 or 3 hours and we see the message I posted previously. We have tested this 5 or 6 times with the same result every time.

Today, as suggested, we tested on a different development board and with a different (older WiFi 4) access point.

After 3 hours, it stopped operating with this message:

[2023-10-25 14:29:12] addrinfo @0x2006c038: ai_family=1, ai_socktype=1, ai_protocol=6, sa_family=1, sin_port=5000
[2023-10-25 14:29:12] sock = 9
[2023-10-25 14:29:12] OK
[2023-10-25 14:29:12] FAIL

We restarted the development board and after 3 hours got the same message.

Neither board has run for 4 continuous hours without crashing, but they always run longer than an hour and a half.

Switching routers/development boards does not fix the problem.

0 dejans over 1 year ago in reply to bcornell

Hi,

Thank you for this update and additional information.

Which access points did you use (manufacturer and version) and which band did you use for your testing?

Best regards,
Dejan

0 bcornell over 1 year ago in reply to dejans

We have an old D-Link DIR-601B1, which was tested at 2.4 GHz, and a mesh network using multiple Cisco Business 150AX Access Points, which support both 2.4 GHz and 5 GHz. The development board connects to that network using the 5 GHz band.

0 dejans over 1 year ago in reply to bcornell

Hi,

Could you please provide a full crash log in both cases (old and new access point)?
In addition, could you provide elf files from your project's zephyr folder in both cases?

Best regards,
Dejan

0 bcornell over 1 year ago in reply to dejans

Hello,

Today, I tested both cases on the newer board and logged each time.

I noticed that in each instance of the first error case (WiFi 6 MPU Fault) it happens immediately after failing to obtain an IP address. I do not remember seeing that when testing previously on the other board.

The WiFi 4 case behaved exactly as I saw last week. It worked great for a period of time before stopping without any indication of what the problem is.

I have uploaded my logs and ELF files here: dl.defelsko.com/.../nordic_logs.zip