Usage Fault following bt_le_per_adv_sync_delete

Hello.

I'm looking for some advice debugging what I think is a race condition around bt_le_per_adv_sync_delete.

After calling bt_le_per_adv_sync_delete, we're sometimes seeing an Illegal load of EXC_RETURN into PC fault.

[00:05:18.404,205] <err> os: ***** USAGE FAULT *****
[00:05:18.404,205] <err> os:   Illegal load of EXC_RETURN into PC
[00:05:18.404,235] <err> os: r0/a1:  0x00000000  r1/a2:  0x0006ca19  r2/a3:  0x00000091
[00:05:18.404,235] <err> os: r3/a4:  0x00000003 r12/ip:  0x00000002 r14/lr:  0x2000764c
[00:05:18.404,266] <err> os:  xpsr:  0x0003ee00
[00:05:18.404,266] <err> os: s[ 0]:  0x2000fbc8  s[ 1]:  0x2000fbc8  s[ 2]:  0x20018068  s[ 3]:  0x000a39e9
[00:05:18.404,296] <err> os: s[ 4]:  0x000a39d7  s[ 5]:  0x0008be7b  s[ 6]:  0xffffffff  s[ 7]:  0xffffffff
[00:05:18.404,296] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x0008bde9  s[11]:  0x00000000
[00:05:18.404,327] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x0009b273
[00:05:18.404,327] <err> os: fpscr:  0x00000000
[00:05:18.404,357] <err> os: r4/v1:  0x00000000  r5/v2:  0x00000000  r6/v3:  0xffffffff
[00:05:18.404,357] <err> os: r7/v4:  0xffffffff  r8/v5:  0x00000000  r9/v6:  0x00000000
[00:05:18.404,388] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x20022518
[00:05:18.404,388] <err> os: EXC_RETURN: 0xfffffffd
[00:05:18.404,388] <err> os: Faulting instruction address (r15/pc): 0x20019e24
[00:08:00.736,022] <err> os: >>> ZEPHYR FATAL ERROR 34: Unknown error on CPU 0
[00:08:00.736,053] <err> os: Current thread: 0x2000fbc8 (MPSL Work)
[00:08:01.752,288] <err> fatal_error: Resetting system

It's difficult to trigger, but when I have achieved it, traces are showing that it's happening somewhere around `pending_cancels` on the MPSL workqueue. Unfortunately I don't seem to have debug symbols to be able to look further into the fault.

Parents Reply Children
  • I'm also getting this trace (forgive the screenshot):

    Are you still seeing the same crashlog that your posted initially? Every function call on this stack frame appears to be at a flash address, but the exception frame had both the PC and LR at a RAM address.

    Also, would it be possible without too much effort to provide me a minimal sample that will allow us to reproduce the same crash on our side?

  • The crashlog is varying as I'm experimenting (as we'd expect somewhat with undefined behaviour), but the consistent point is that the faulting thread is MPSL.

    The latest is faulting in flash at 0x0007363a, which resolves to `z_impl_k_sem_give`.

    The semaphore it's attempting to give is within `net_buf_data_hci_cmd_pool`.

    I'm wondering if something is overrunning an HCI buffer, maybe, or there's some cross-thread confliction with HCI access.

    I'll pull a bit further, but if I don't find anything I'll try and adapt the `periodic_sync_rsp` sample to see if I can reproduce with that.

  • Okay. I think I've found what was causing the fault - the `data` netbuf I was passing to `bt_le_per_adv_set_response_data` was being reset before `bt_le_per_adv_set_response_data` had returned.

    Is there a semaphore being stored in the netbuf's user_data?

  • I'm glad to hear that you seem to have found the problem. Unfortunately, I'm not sure I understand what you mean when you ask if a semaphore is being stored in the user_data or how that relates to the usage fault exception. In any case, I asked internally for input on this, and one of the developers commented this from the SDCs point of view:

    Setting subevent data is kind of special for the controller, since it is timing dependent.When the Host set subevent response data, the controller must send the data on air before we give command complete.

    Figure 3.13: Using a response slot in PAwR:

    There is special handling to make sure we reject any other HCI command until this has occurred:

    bluetooth: Fix assert when handling illegal response data cmd · nrfconnect/sdk-nrf@d257b1d

  • Thanks for this  - that's massively helpful to know.

    I still can't fully trace what's going on here, but somewhere we seemed to be deadlocking between the sysworkq and the workq that we were calling `bt_le_per_adv_set_response_data` from.

     

    I've shuffled around a few things to make sure that we're doing all `be_le_per_adv_*` operations from the same workq, and we now seem to be happy.

    Thank you for your help!

Related