SoftDevice Controller ASSERT: 53, 296

Hi!

Can you please help me identify what is causing this issue and/or provide more info about this:

[00:04:23.038,330] \033[1;31m<err> bt_sdc_hci_driver: SoftDevice Controller ASSERT: 53, 296\033[0m
[00:04:57.431,335] \033[1;31m<err> os: ***** HARD FAULT *****\033[0m
[00:04:57.431,335] \033[1;31m<err> os:   Fault escalation (see below)\033[0m
[00:04:57.431,365] \033[1;31m<err> os: ARCH_EXCEPT with reason 3
\033[0m
[00:04:57.431,365] \033[1;31m<err> os: r0/a1:  0x00000003  r1/a2:  0x00000000  r2/a3:  0x0001981b\033[0m
[00:04:57.431,396] \033[1;31m<err> os: r3/a4:  0x00000000 r12/ip:  0x20000ab0 r14/lr:  0xffffffff\033[0m
[00:04:57.431,396] \033[1;31m<err> os:  xpsr:  0x41000011\033[0m
[00:04:57.431,396] \033[1;31m<err> os: r4/v1:  0x20015c70  r5/v2:  0x0003370d  r6/v3:  0x0000000a\033[0m
[00:04:57.431,427] \033[1;31m<err> os: r7/v4:  0x20015c70  r8/v5:  0x20001944  r9/v6:  0x2000e204\033[0m
[00:04:57.431,427] \033[1;31m<err> os: r10/v7: 0x200008bc  r11/v8: 0x00000000    psp:  0x20015f40\033[0m
[00:04:57.431,457] \033[1;31m<err> os: EXC_RETURN: 0xfffffff1\033[0m
[00:04:57.431,457] \033[1;31m<err> os: Faulting instruction address (r15/pc): 0x00033744\033[0m
[00:04:57.431,488] \033[1;31m<err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0\033[0m
[00:04:57.431,518] \033[1;31m<err> os: Fault during interrupt handling
\033[0m
[00:04:57.431,549] \033[1;31m<err> os: Current thread: 0x2000ae30 (unknown)\033[0m
[00:05:09.972,747] \033[1;31m<err> fatal_error: Resetting system\033[0m

When this happens it is in the process of uploading a new fw thru mcumgr image group, upload command.

I haven't been able to fully verify this but I think this affected by updating to SDK 2.6.0 (from 2.4.0).
That code part is unchanged before and after the SDK update except that "zcbor_new_decode_state" had two new parameter which we have set to NULL and 0 (after comparing how some examples had changed between sdk versions).

BR,
Mårten

Parents
  • Hi Mårten,

    The assert is triggered by an "overstay" event in the SoftDevice controller scheduling related to BLE central operation. So something has prevetned that code from running on time. That something can be higher priority threads or interrupts. I cannot say anything more specific based on the assert, but perhaps disconnect any central links when initiating DFU?

Reply
  • Hi Mårten,

    The assert is triggered by an "overstay" event in the SoftDevice controller scheduling related to BLE central operation. So something has prevetned that code from running on time. That something can be higher priority threads or interrupts. I cannot say anything more specific based on the assert, but perhaps disconnect any central links when initiating DFU?

Children
  • Hi!

    Thanks for the reply.

    Maybe wasn't that clear but the assert is happening on the central that is sending the image to a peripheral.
    There is only one connection doing an update and possibly one more connection simultaneously (to different peripheral device) sending a smaller amount of data.

    I have however managed to reproduce the issue with just a single peripheral.

    Is there some "bluetooth" thread where the priority can be raised?

    I guess this is deeper down and therefore not connected to what is done in the callback passed to bt_dfu_smp_command?

    BR,
    Mårten

  • Hi Mårten,

    I see. I would prefrably try to find out what is (probably) taking time in a fairly high priority, rather than up the priority of the stack. That said, there are some configs you can experiment with:

    • CONFIG_BT_CTLR_LLL_PRIO
    • CONFIG_BT_CTLR_ULL_HIGH_PRIO
    • CONFIG_BT_CTLR_ULL_LOW_PRIO

    Br,

    Einar

  • Hi!

    Additional question regarding this:
    How closely is this tied to various callbacks ?

    Could doing to much processing in a scan_le_cb (recv) callback introduce such a problem?

    Or is this seperated and only a more systemwide depletion of resources would make the scan callback affect the this?

    Also I haven't been able to test with the configs mentioned as they all depend on BT_LL_SW_SPLIT being set, which we don't have in our setup and when I try to set it the device stop working (doesn't seem to "see" any advertisements done by other devices). I haven't had the opportunity to test further with this.

    What I know we havenät done any changes regarding interrupts and haven't added any custom driver or similar.


    BR,

    Mårten

  • Hi Mårten,

    It is difficult to say what caused the issue, as the assert only tells us that the SoftDevice Controller was not able to handle a radio event on time. But in principle enything (with same or higher priority) could have prevented it from doing what it should.

    Maos said:
    Could doing to much processing in a scan_le_cb (recv) callback introduce such a problem?

    Yes, that is possible.

    Maos said:
    Or is this seperated and only a more systemwide depletion of resources would make the scan callback affect the this?

    That is also possible (though low priority threads will not cause this).

    Maos said:
    Also I haven't been able to test with the configs mentioned as they all depend on BT_LL_SW_SPLIT being set

    Yes, that is correct. I was too quick when I dug up these configs. They are not relevant for the SoftDevice Controller. In fact, there are no configs for setting these priorities (you can see all the SoftDevice Controller speicfic configs in nrf/subsys/bluetooth/controller/Kconfig). 

    Have you experimented and noticed if there are any configs that affect if you see this issue or not (related to SMP or bluetooth in general) that may perhaps point at something worth looking deeper into?

  • Hi!

    I have tried various settings to get a lead what could be causing this without much luck.

    Regarding threads this is the status of them at the time the exception happens:

    And this is the callstack:


    I don't know if they provide any new lead on what is causing this.

    I have tested this with the v2,4 and v2.6 stack and I can't reproduce it on the v2.4 stack (only code change during the test is three locations where we call zcbor_new_decode_state).

    I have doing various adjustments to our thread priorites and also tested to enable CONFIG_BT_RECV_WORKQ_BT.

    I tried to see if there where any interesting configs related to irq priorites but I couldn't find a working combination of settings (Mostly NFCT_IRQ_PRIORITY and ZERO_LATENCY_LEVELS).

    (The device receive and send data on NFC but I don't need to blipp it to make this issue happen).

    But none of these changes seem to solve the issue.


    BR,

    Mårten

Related