Usage Fault following bt_le_per_adv_sync_delete

Hello.

I'm looking for some advice debugging what I think is a race condition around bt_le_per_adv_sync_delete.

After calling bt_le_per_adv_sync_delete, we're sometimes seeing an Illegal load of EXC_RETURN into PC fault.

[00:05:18.404,205] <err> os: ***** USAGE FAULT *****
[00:05:18.404,205] <err> os:   Illegal load of EXC_RETURN into PC
[00:05:18.404,235] <err> os: r0/a1:  0x00000000  r1/a2:  0x0006ca19  r2/a3:  0x00000091
[00:05:18.404,235] <err> os: r3/a4:  0x00000003 r12/ip:  0x00000002 r14/lr:  0x2000764c
[00:05:18.404,266] <err> os:  xpsr:  0x0003ee00
[00:05:18.404,266] <err> os: s[ 0]:  0x2000fbc8  s[ 1]:  0x2000fbc8  s[ 2]:  0x20018068  s[ 3]:  0x000a39e9
[00:05:18.404,296] <err> os: s[ 4]:  0x000a39d7  s[ 5]:  0x0008be7b  s[ 6]:  0xffffffff  s[ 7]:  0xffffffff
[00:05:18.404,296] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x0008bde9  s[11]:  0x00000000
[00:05:18.404,327] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x0009b273
[00:05:18.404,327] <err> os: fpscr:  0x00000000
[00:05:18.404,357] <err> os: r4/v1:  0x00000000  r5/v2:  0x00000000  r6/v3:  0xffffffff
[00:05:18.404,357] <err> os: r7/v4:  0xffffffff  r8/v5:  0x00000000  r9/v6:  0x00000000
[00:05:18.404,388] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x20022518
[00:05:18.404,388] <err> os: EXC_RETURN: 0xfffffffd
[00:05:18.404,388] <err> os: Faulting instruction address (r15/pc): 0x20019e24
[00:08:00.736,022] <err> os: >>> ZEPHYR FATAL ERROR 34: Unknown error on CPU 0
[00:08:00.736,053] <err> os: Current thread: 0x2000fbc8 (MPSL Work)
[00:08:01.752,288] <err> fatal_error: Resetting system

It's difficult to trigger, but when I have achieved it, traces are showing that it's happening somewhere around `pending_cancels` on the MPSL workqueue. Unfortunately I don't seem to have debug symbols to be able to look further into the fault.

Parents
  • Having done a bit more investigation, I think the fault may be triggered by the following sequence of events:

    1. Establish a PAwR Sync, and set a Subevent
    2. Call bt_le_per_adv_set_response_data
    3. On the advertiser side, restart advertising
    4. Observe crash

    I suspect this could be caused by cleanup logic from sync termination running before the `bt_le_per_adv_set_response_data` call has completed?

  • It does not appear to be a stack overflow, since you have already profiled the stack usage and the MPU stack guard did not trigger an exception. However, the xPSR value at least indicates that the stack frame is corrupted, so could be a buffer overrun, etc). I also see that LR and PC is holding RAM addresses. I'm not sure if that is correct or not. I would suggest looking up these addresses in the zephyr.map file to check if they fall into any of the text or data sections.

    You mentioned that the issue is difficult to trigger. Is that still the case? If you are able to reproduce it consistently, you could try placing a breakpoint at address 0x20019e24 in RAM and see where execution branches from when it hits this address. Would also be interesting to know if you see the same without CONFIG_FPU. 

  • I'm also getting this trace (forgive the screenshot):

    Are you still seeing the same crashlog that your posted initially? Every function call on this stack frame appears to be at a flash address, but the exception frame had both the PC and LR at a RAM address.

    Also, would it be possible without too much effort to provide me a minimal sample that will allow us to reproduce the same crash on our side?

  • The crashlog is varying as I'm experimenting (as we'd expect somewhat with undefined behaviour), but the consistent point is that the faulting thread is MPSL.

    The latest is faulting in flash at 0x0007363a, which resolves to `z_impl_k_sem_give`.

    The semaphore it's attempting to give is within `net_buf_data_hci_cmd_pool`.

    I'm wondering if something is overrunning an HCI buffer, maybe, or there's some cross-thread confliction with HCI access.

    I'll pull a bit further, but if I don't find anything I'll try and adapt the `periodic_sync_rsp` sample to see if I can reproduce with that.

  • Okay. I think I've found what was causing the fault - the `data` netbuf I was passing to `bt_le_per_adv_set_response_data` was being reset before `bt_le_per_adv_set_response_data` had returned.

    Is there a semaphore being stored in the netbuf's user_data?

  • I'm glad to hear that you seem to have found the problem. Unfortunately, I'm not sure I understand what you mean when you ask if a semaphore is being stored in the user_data or how that relates to the usage fault exception. In any case, I asked internally for input on this, and one of the developers commented this from the SDCs point of view:

    Setting subevent data is kind of special for the controller, since it is timing dependent.When the Host set subevent response data, the controller must send the data on air before we give command complete.

    Figure 3.13: Using a response slot in PAwR:

    There is special handling to make sure we reject any other HCI command until this has occurred:

    bluetooth: Fix assert when handling illegal response data cmd · nrfconnect/sdk-nrf@d257b1d

  • Thanks for this  - that's massively helpful to know.

    I still can't fully trace what's going on here, but somewhere we seemed to be deadlocking between the sysworkq and the workq that we were calling `bt_le_per_adv_set_response_data` from.

     

    I've shuffled around a few things to make sure that we're doing all `be_le_per_adv_*` operations from the same workq, and we now seem to be happy.

    Thank you for your help!

Reply
  • Thanks for this  - that's massively helpful to know.

    I still can't fully trace what's going on here, but somewhere we seemed to be deadlocking between the sysworkq and the workq that we were calling `bt_le_per_adv_set_response_data` from.

     

    I've shuffled around a few things to make sure that we're doing all `be_le_per_adv_*` operations from the same workq, and we now seem to be happy.

    Thank you for your help!

Children
No Data
Related