nRF7002dk nRF5340 bus fault after shutting down interface

Hello,

I am working on a low-power sensor logging project using the nRF7002dk. So far, I have updated the main loop in the wifi/sta sample. In place of the:

k_sleep(K_FOREVER);

I have added:

status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, iface, NULL, 0); // request disconnection from the AP
status = net_if_down(iface); // take the Wi-Fi interface down
k_sleep(K_SECONDS(2)); // will be longer in production, shortened for testing
status = net_if_up(iface);
k_sleep(K_SECONDS(2)); // allow the interface time to come up

I also have code verifying that each returned status is 0, but I have left it out here to keep the post simple.

I find that after a few connections, there is a kernel panic when trying to bring the interface back up.

[00:04:56.560,913] <inf> sta: State: SCANNING
[00:04:56.861,053] <inf> sta: ==================
[00:04:56.861,083] <inf> sta: State: SCANNING
[00:04:57.150,177] <err> os: ***** BUS FAULT *****
[00:04:57.150,177] <err> os: Precise data bus error
[00:04:57.150,207] <err> os: BFAR Address: 0x40000b08
[00:04:57.150,207] <err> os: r0/a1: 0x20000200 r1/a2: 0x20000580 r2/a3: 0x20000289
[00:04:57.150,207] <err> os: r3/a4: 0x20000588 r12/ip: 0x2003cca0 r14/lr: 0x200006a8
[00:04:57.150,238] <err> os: xpsr: 0x01000000
[00:04:57.150,238] <err> os: Faulting instruction address (r15/pc): 0x0003e97a
[00:04:57.150,268] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:04:57.150,299] <err> os: Current thread: 0x20003470 (unknown)

The kernel panics stop if I add a 1 second delay between the net_mgmt disconnect and net_if_down calls.

It appears that either I cannot shut down the interface immediately after disconnecting, or something is not cleaned up in time, which then causes problems when bringing the interface back up. Obviously, in production code I would like to replace the delay with a check that the disconnect has actually completed, in case it takes longer than 1 second. How can I determine when it is safe to shut down the interface?
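One idea I have (just a sketch, not something I have verified on hardware yet; the function and semaphore names are placeholders) is to register a net_mgmt event callback for NET_EVENT_WIFI_DISCONNECT_RESULT and only call net_if_down() once that event has fired, instead of using a fixed delay:

#include <zephyr/kernel.h>
#include <zephyr/net/net_if.h>
#include <zephyr/net/net_mgmt.h>
#include <zephyr/net/wifi_mgmt.h>

K_SEM_DEFINE(disconnected_sem, 0, 1);
static struct net_mgmt_event_callback wifi_disc_cb;

static void wifi_disc_handler(struct net_mgmt_event_callback *cb,
                              uint32_t mgmt_event, struct net_if *iface)
{
    if (mgmt_event == NET_EVENT_WIFI_DISCONNECT_RESULT) {
        k_sem_give(&disconnected_sem); // disconnect has completed
    }
}

// called once during init
static void wifi_disc_wait_init(void)
{
    net_mgmt_init_event_callback(&wifi_disc_cb, wifi_disc_handler,
                                 NET_EVENT_WIFI_DISCONNECT_RESULT);
    net_mgmt_add_event_callback(&wifi_disc_cb);
}

// in the main loop, in place of the fixed delay
static int wifi_power_down(struct net_if *iface)
{
    int status = net_mgmt(NET_REQUEST_WIFI_DISCONNECT, iface, NULL, 0);

    if (status == 0) {
        // wait for the disconnect event before taking the interface down
        k_sem_take(&disconnected_sem, K_SECONDS(10));
    }
    return net_if_down(iface);
}

Is waiting for NET_EVENT_WIFI_DISCONNECT_RESULT the right condition here, or is there something else I should be checking before taking the interface down?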

  • Hi,

    I have added your modifications to the wifi/sta sample from NCS v2.4.2. I tested this modified sample on an nRF7002 DK board, but I could not reproduce your issue. In my case, the sample behaved as expected.

    If you experience some congestion in your environment, you could try to increase the value of CONFIG_NET_MGMT_EVENT_QUEUE_TIMEOUT in the prj.conf file.
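    For example, something like the following in prj.conf (the value shown is only an illustration, in milliseconds; the appropriate value depends on your environment):

    CONFIG_NET_MGMT_EVENT_QUEUE_TIMEOUT=5000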

    Best regards,
    Dejan

  • Hello,

    Thank you for looking into this so quickly. I apologize; I forgot to mention that I am using the main branch. I have not tested this issue specifically in the NCS v2.4.2 release.

    Idle power consumption is extremely high in the NCS v2.4.2 release, so it is not suitable for our application.

  • Could you please test with NCS v2.4.2? Is there any difference in result compared to main branch?

    Today, I tested in v2.4.2 and found, as you did, that the code I posted above works fine in the stable branch.

    Could you elaborate on this? How do you measure current? What is the difference between idle current values when you use NCS v2.4.2 vs main branch?

    I have another post where I have been trying to get help with power consumption issues here:
    https://devzone.nordicsemi.com/f/nordic-q-a/100338/nrf7002dk-nrf5340-power-consumption
    To summarize my problem: I have found that simply adding the Wi-Fi options to the prj.conf of a Bluetooth sample for the nRF7002dk results in higher power consumption measured at P22. The stable versions of NCS (including v2.4.2) draw over 1 mA when Wi-Fi is disconnected, the interface is down, and the main loop is sleeping forever. The main branch has significantly lower power consumption in the same scenario (~160 uA). This is a battery-powered device and Wi-Fi will be off most of the time, so 1 mA of current draw is unacceptable when Wi-Fi is not in use.
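    For reference, by "the Wi-Fi options" I mean roughly the following subset (illustrative only; the exact option names differ between NCS versions):

    CONFIG_WIFI=y
    CONFIG_WIFI_NRF700X=y
    CONFIG_WPA_SUPP=y
    CONFIG_NET_L2_WIFI_MGMT=y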
     

  • Hi,

    Thank you for additional information.

    Could you try using NCS v2.4.99-dev1 and v2.4.99-dev2? What are the results?

    Best regards,
    Dejan

  • Could you try using NCS v2.4.99-dev1 and v2.4.99-dev2? What are the results?

    I tested both versions as described in my first post. In v2.4.99-dev1 everything works fine for 5-10 minutes before crashing. v2.4.99-dev2 behaves much more like the main branch and crashes in less than 5 minutes.

    I did not monitor the power consumption during these tests.

  • Hi,

    Could you provide information about your measurement setup?

    Where and how did you measure current?

    Best regards,
    Dejan

  • Hello,

    I have an nRF7002dk, and on it we have shorted SB17 to allow us to measure the power consumed by the nRF5340 at P22. I have removed the jumper from P22 and am using a Keithley DMM6500 to measure the current there.

    I have done similar tests using Fanstel WT02V40V modules, powering the nRF5340 at 1.8 V and measuring that current with the Keithley meter, and got results consistent with those measured on the nRF7002dk.

    The power consumption issues are separate from the issue I am having here. The main branch as well as the development snapshot versions you suggested appear to have a regression that makes them unstable when shutting the interface down and then bringing it back up repeatedly.

    Have you been able to reproduce this issue and can you provide me with any estimate as to when it might be fixed?

    We would like to resolve these issues as quickly as possible; this project has been at a standstill for 4 months. Is it possible to arrange a support phone call?

  • Hi,

    I have reproduced your initial issue. I will ask internally and get back to you when I get new information.

    Best regards,
    Dejan

  • Hi,

    Could you please have a look at this pull request and let me know if this commit fixes your issue?

    Best regards,
    Dejan

  • Hello,

    I rebased, ensuring that the commit's changes were included, and can confirm that the commit does fix the issue I was seeing. I was able to connect and disconnect without crashes for around an hour.

    However, I tested this fix in my main project, which I've uploaded here: https://dl.defelsko.com/downloads/nordic_test.zip

    After about an hour I got an error:

    [2023-10-19 10:19:11] 200
    [2023-10-19 10:19:11] No ip.
    [2023-10-19 10:19:11] OK
    [2023-10-19 10:19:11] [00:48:05.573,913] <inf> sta: Disconnection request done (0)
    [2023-10-19 10:19:11] [00:48:05.580,200] <inf> sta: ==================
    [2023-10-19 10:19:11] [00:48:05.580,230] <inf> sta: State: DISCONNECTED
    [2023-10-19 10:19:11] [00:48:05.580,444] <inf> sta: Disconnect requested
    [2023-10-19 10:19:11] disconnect:0
    [2023-10-19 10:19:11] [00:48:05.598,815] <err> wifi_nrf: nrf_wifi_hal_buf_map_tx: Called for already mapped TX buffer
    [2023-10-19 10:19:11]
    [2023-10-19 10:19:11] [00:48:05.598,846] <err> os: ***** MPU FAULT *****
    [2023-10-19 10:19:11] [00:48:05.598,846] <err> os: Data Access Violation
    [2023-10-19 10:19:11] [00:48:05.598,846] <err> os: MMFAR Address: 0xd0
    [2023-10-19 10:19:11] [00:48:05.598,876] <err> os: r0/a1: 0x00009879 r1/a2: 0x00000000 r2/a3: 0x00000006
    [2023-10-19 10:19:11] [00:48:05.598,876] <err> os: r3/a4: 0x00000001 r12/ip: 0x2000270b r14/lr: 0x00035393
    [2023-10-19 10:19:11] [00:48:05.598,907] <err> os: xpsr: 0x01000000
    [2023-10-19 10:19:11] [00:48:05.598,907] <err> os: Faulting instruction address (r15/pc): 0x000353bc
    [2023-10-19 10:19:11] [00:48:05.598,937] <err> os: >>> ZEPHYR FATAL ERROR 19: Unknown error on CPU 0
    [2023-10-19 10:19:11] [00:48:05.598,968] <err> os: Current thread: 0x20003540 (unknown)
    [2023-10-19 10:19:11] [00:48:05.693,786] <err> coredump: #CD:BEGIN#
    [2023-10-19 10:19:11] [00:48:05.698,913] <err> coredump: #CD:5a4501000300050013000000
    [2023-10-19 10:19:11] [00:48:05.705,566] <err> coredump: #CD:4102004400
    [2023-10-19 10:19:11] [00:48:05.711,029] <err> coredump: #CD:799800000000000006000000010000000b27002093530300bc53030000000001
    [2023-10-19 10:19:11] [00:48:05.721,191] <err> coredump: #CD:a018012000000000000000000000000000000000000000000000000000000000
    [2023-10-19 10:19:11] [00:48:05.731,353] <err> coredump: #CD:00000000
    [2023-10-19 10:19:11] [00:48:05.736,633] <err> coredump: #CD:4d01004035002008360020
    [2023-10-19 10:19:11] [00:48:05.743,133] <err> coredump: #CD:e0380020705e0020000000000080000000000000000000000000000000000000
    [2023-10-19 10:19:11] [00:48:05.753,295] <err> coredump: #CD:b5a20600000000000000000000000000ffffffff10000000e45d0020e4100020
    [2023-10-19 10:19:11] [00:48:05.763,488] <err> coredump: #CD:64788c0084100320c0000000647800b028180120000000009835002098350020
    [2023-10-19 10:19:11] [00:48:05.773,651] <err> coredump: #CD:0000000000000000000000000000000000000000000000000000000000000000
    [2023-10-19 10:19:11] [00:48:05.783,843] <err> coredump: #CD:0000000000000000000000000000000000000000000000005812012000080000
    [2023-10-19 10:19:11] [00:48:05.794,006] <err> coredump: #CD:000000008c1500200000000000000000f8180120000000004035002000000000
    [2023-10-19 10:19:11] [00:48:05.804,168] <err> coredump: #CD:0000000000000000
    [2023-10-19 10:19:11] [00:48:05.810,150] <err> coredump: #CD:4d010058120120581a0120
    [2023-10-19 10:19:11] [00:48:05.816,650] <err> coredump: #CD:f0f0f0f0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    [2023-10-19 10:19:11] [00:48:05.826,812] <err> coredump: #CD:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    [2023-10-19 10:19:11] [00:48:05.837,005] <err> coredump: #CD:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

    This project includes code from the wifi/sta, net/sockets/http_get, and bluetooth/peripheral_lbs samples. Most of the Bluetooth code has been removed, but it does advertise, which I have been using when logging is off to confirm that the code is still running and has not crashed. If you set the name of an access point and a password in the configuration, this code should connect to Wi-Fi, hit the HTTP endpoint with an incrementing value, disconnect and shut down Wi-Fi, pause, and then bring the interface back up and reconnect every few seconds (a rough outline of this loop is sketched below). You'll see I have commented out sections of code where I ran the http_get portion in an endless loop to ensure it didn't have any memory leaks.
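    For reference, the duty cycle in that project looks roughly like the sketch below (for illustration only; the real code is in the zip above, and the SSID/PSK values and function name here are placeholders):

    #include <zephyr/kernel.h>
    #include <zephyr/net/net_if.h>
    #include <zephyr/net/net_mgmt.h>
    #include <zephyr/net/wifi_mgmt.h>

    static void logging_loop(struct net_if *iface)
    {
        struct wifi_connect_req_params params = {
            .ssid = (const uint8_t *)"example-ssid",      // placeholder
            .ssid_length = 12,
            .psk = (const uint8_t *)"example-password",   // placeholder
            .psk_length = 16,
            .security = WIFI_SECURITY_TYPE_PSK,
            .channel = WIFI_CHANNEL_ANY,
        };

        for (;;) {
            net_if_up(iface);
            net_mgmt(NET_REQUEST_WIFI_CONNECT, iface, &params, sizeof(params));
            // ...wait for NET_EVENT_WIFI_CONNECT_RESULT and an IP address,
            // then perform the http_get-style request with the counter value...
            net_mgmt(NET_REQUEST_WIFI_DISCONNECT, iface, NULL, 0);
            // ...wait for NET_EVENT_WIFI_DISCONNECT_RESULT...
            net_if_down(iface);
            k_sleep(K_SECONDS(2)); // pause with Wi-Fi off (longer in production)
        }
    }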

    The issue I was seeing has been fixed, but I am still unable to achieve stable operation when cycling the interface like this.

  • Hi,

    Applications can be developed to use wireless coexistence. You could have a look at the Wi-Fi and Bluetooth LE coexistence documentation.

    Best regards,
    Dejan

  • Hello,

    To rule out any issues related to BLE or wireless coexistence, we have commented out the Bluetooth code in our project. It failed with exactly the same error message as the one provided above. We do not think wireless coexistence is related to the crashes, because it still crashes even with Bluetooth disabled.

    Here is our project with Bluetooth commented out: https://dl.defelsko.com/downloads/nordic_test_no_ble.zip 
