Usage Fault following bt_le_per_adv_sync_delete

Hello.

I'm looking for some advice debugging what I think is a race condition around bt_le_per_adv_sync_delete.

After calling bt_le_per_adv_sync_delete, we're sometimes seeing an Illegal load of EXC_RETURN into PC fault.

[00:05:18.404,205] <err> os: ***** USAGE FAULT *****
[00:05:18.404,205] <err> os:   Illegal load of EXC_RETURN into PC
[00:05:18.404,235] <err> os: r0/a1:  0x00000000  r1/a2:  0x0006ca19  r2/a3:  0x00000091
[00:05:18.404,235] <err> os: r3/a4:  0x00000003 r12/ip:  0x00000002 r14/lr:  0x2000764c
[00:05:18.404,266] <err> os:  xpsr:  0x0003ee00
[00:05:18.404,266] <err> os: s[ 0]:  0x2000fbc8  s[ 1]:  0x2000fbc8  s[ 2]:  0x20018068  s[ 3]:  0x000a39e9
[00:05:18.404,296] <err> os: s[ 4]:  0x000a39d7  s[ 5]:  0x0008be7b  s[ 6]:  0xffffffff  s[ 7]:  0xffffffff
[00:05:18.404,296] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x0008bde9  s[11]:  0x00000000
[00:05:18.404,327] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x0009b273
[00:05:18.404,327] <err> os: fpscr:  0x00000000
[00:05:18.404,357] <err> os: r4/v1:  0x00000000  r5/v2:  0x00000000  r6/v3:  0xffffffff
[00:05:18.404,357] <err> os: r7/v4:  0xffffffff  r8/v5:  0x00000000  r9/v6:  0x00000000
[00:05:18.404,388] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x20022518
[00:05:18.404,388] <err> os: EXC_RETURN: 0xfffffffd
[00:05:18.404,388] <err> os: Faulting instruction address (r15/pc): 0x20019e24
[00:08:00.736,022] <err> os: >>> ZEPHYR FATAL ERROR 34: Unknown error on CPU 0
[00:08:00.736,053] <err> os: Current thread: 0x2000fbc8 (MPSL Work)
[00:08:01.752,288] <err> fatal_error: Resetting system

It's difficult to trigger, but when I have achieved it, traces are showing that it's happening somewhere around `pending_cancels` on the MPSL workqueue. Unfortunately I don't seem to have debug symbols to be able to look further into the fault.

Parents
  • Having done a bit more investigation, I think the fault may be triggered by the following sequence of events:

    1. Establish a PAwR Sync, and set a Subevent
    2. Call bt_le_per_adv_set_response_data
    3. On the advertiser side, restart advertising
    4. Observe crash

    I suspect this could be caused by cleanup logic from sync termination running before the `bt_le_per_adv_set_response_data` call has completed?

  • It does not appear to be a stack overflow, since you have already profiled the stack usage and the MPU stack guard did not trigger an exception. However, the xPSR value at least indicates that the stack frame is corrupted, so could be a buffer overrun, etc). I also see that LR and PC is holding RAM addresses. I'm not sure if that is correct or not. I would suggest looking up these addresses in the zephyr.map file to check if they fall into any of the text or data sections.

    You mentioned that the issue is difficult to trigger. Is that still the case? If you are able to reproduce it consistently, you could try placing a breakpoint at address 0x20019e24 in RAM and see where execution branches from when it hits this address. Would also be interesting to know if you see the same without CONFIG_FPU. 

  • I've tried with `CONFIG_FPU=n` - this still faults unfortunately.

  • So far, everything is pointing back to a failure when resolving the semaphore that `bt_le_per_adv_set_response_data` (deep in `bt_hci_cmd_send_sync` / `process_pending_cmd`) is awaiting.

    When the `hci_cmd_done` stack is trying to resolve the semaphore, we see the fault.

    It's difficult to be sure, though. Clearly sticking a breakpoint in either of those chains isn't practical, given the number of bluetooth operations we're going through to get here.

  • My current conclusion is that the softdevice isn't correctly responding to `BT_HCI_OP_LE_SET_PER_ADV_RESPONSE_DATA` in some cases, or is doing so with a very large delay.

  • I'm also getting this trace (forgive the screenshot):

    Are you still seeing the same crashlog that your posted initially? Every function call on this stack frame appears to be at a flash address, but the exception frame had both the PC and LR at a RAM address.

    Also, would it be possible without too much effort to provide me a minimal sample that will allow us to reproduce the same crash on our side?

  • The crashlog is varying as I'm experimenting (as we'd expect somewhat with undefined behaviour), but the consistent point is that the faulting thread is MPSL.

    The latest is faulting in flash at 0x0007363a, which resolves to `z_impl_k_sem_give`.

    The semaphore it's attempting to give is within `net_buf_data_hci_cmd_pool`.

    I'm wondering if something is overrunning an HCI buffer, maybe, or there's some cross-thread confliction with HCI access.

    I'll pull a bit further, but if I don't find anything I'll try and adapt the `periodic_sync_rsp` sample to see if I can reproduce with that.

Reply
  • The crashlog is varying as I'm experimenting (as we'd expect somewhat with undefined behaviour), but the consistent point is that the faulting thread is MPSL.

    The latest is faulting in flash at 0x0007363a, which resolves to `z_impl_k_sem_give`.

    The semaphore it's attempting to give is within `net_buf_data_hci_cmd_pool`.

    I'm wondering if something is overrunning an HCI buffer, maybe, or there's some cross-thread confliction with HCI access.

    I'll pull a bit further, but if I don't find anything I'll try and adapt the `periodic_sync_rsp` sample to see if I can reproduce with that.

Children
Related