Usage Fault following bt_le_per_adv_sync_delete

Hello.

I'm looking for some advice debugging what I think is a race condition around bt_le_per_adv_sync_delete.

After calling bt_le_per_adv_sync_delete, we're sometimes seeing an Illegal load of EXC_RETURN into PC fault.

[00:05:18.404,205] <err> os: ***** USAGE FAULT *****
[00:05:18.404,205] <err> os:   Illegal load of EXC_RETURN into PC
[00:05:18.404,235] <err> os: r0/a1:  0x00000000  r1/a2:  0x0006ca19  r2/a3:  0x00000091
[00:05:18.404,235] <err> os: r3/a4:  0x00000003 r12/ip:  0x00000002 r14/lr:  0x2000764c
[00:05:18.404,266] <err> os:  xpsr:  0x0003ee00
[00:05:18.404,266] <err> os: s[ 0]:  0x2000fbc8  s[ 1]:  0x2000fbc8  s[ 2]:  0x20018068  s[ 3]:  0x000a39e9
[00:05:18.404,296] <err> os: s[ 4]:  0x000a39d7  s[ 5]:  0x0008be7b  s[ 6]:  0xffffffff  s[ 7]:  0xffffffff
[00:05:18.404,296] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x0008bde9  s[11]:  0x00000000
[00:05:18.404,327] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x0009b273
[00:05:18.404,327] <err> os: fpscr:  0x00000000
[00:05:18.404,357] <err> os: r4/v1:  0x00000000  r5/v2:  0x00000000  r6/v3:  0xffffffff
[00:05:18.404,357] <err> os: r7/v4:  0xffffffff  r8/v5:  0x00000000  r9/v6:  0x00000000
[00:05:18.404,388] <err> os: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x20022518
[00:05:18.404,388] <err> os: EXC_RETURN: 0xfffffffd
[00:05:18.404,388] <err> os: Faulting instruction address (r15/pc): 0x20019e24
[00:08:00.736,022] <err> os: >>> ZEPHYR FATAL ERROR 34: Unknown error on CPU 0
[00:08:00.736,053] <err> os: Current thread: 0x2000fbc8 (MPSL Work)
[00:08:01.752,288] <err> fatal_error: Resetting system

It's difficult to trigger, but when I have achieved it, traces are showing that it's happening somewhere around `pending_cancels` on the MPSL workqueue. Unfortunately I don't seem to have debug symbols to be able to look further into the fault.

Parents
  • Having done a bit more investigation, I think the fault may be triggered by the following sequence of events:

    1. Establish a PAwR Sync, and set a Subevent
    2. Call bt_le_per_adv_set_response_data
    3. On the advertiser side, restart advertising
    4. Observe crash

    I suspect this could be caused by cleanup logic from sync termination running before the `bt_le_per_adv_set_response_data` call has completed?

  • It does not appear to be a stack overflow, since you have already profiled the stack usage and the MPU stack guard did not trigger an exception. However, the xPSR value at least indicates that the stack frame is corrupted, so could be a buffer overrun, etc). I also see that LR and PC is holding RAM addresses. I'm not sure if that is correct or not. I would suggest looking up these addresses in the zephyr.map file to check if they fall into any of the text or data sections.

    You mentioned that the issue is difficult to trigger. Is that still the case? If you are able to reproduce it consistently, you could try placing a breakpoint at address 0x20019e24 in RAM and see where execution branches from when it hits this address. Would also be interesting to know if you see the same without CONFIG_FPU. 

  • Thanks Vidar.

    I've managed to find a reliable reproduction using the steps above (although there could be something about my setup that is the specific cause).

    I've tried putting a breakpoint in at 0x20019e24 - it is not triggered.

    (From GDB: `break *0x20019e24`)

Reply Children
Related