strange bluetooth low energy behavior and assertions in zephyr 3.2 when using debug mode for compilation

Hi,

we are working on a device based on a nrf52840 using zephyr 3.2, which uses bluetooth low energy for advertising and for providing some services with some characteristics.

When building our application in debug mode, which has a lot of logging output, we are quite often encountering the following kind of assertions:

ASSERTION FAIL [!radio_is_ready()] @ WEST_TOPDIR/zephyr/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c:573
[00:01:13.253,601] <err> os: r0/a1:  0x00000003  r1/a2:  0x2002bddc  r2/a3:  0x00000000
[00:01:13.253,631] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00084743
[00:01:13.253,631] <err> os:  xpsr:  0x61000011
[00:01:13.253,662] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:01:13.253,692] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:01:13.253,692] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:01:13.253,723] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:01:13.253,723] <err> os: fpscr:  0x000abed8
[00:01:13.253,753] <err> os: Faulting instruction address (r15/pc): 0x0008474e
[00:01:13.253,784] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:01:13.253,814] <err> os: Fault during interrupt handling
[00:01:13.253,845] <err> os: Current thread: 0x20008de0 (idle)
[00:01:13.314,453] <err> os: Halting system

It seems that there is some kind of timing-problem going on, because the assertion occurs more often, if more logging output is enabled. It occurs extremely often, when CONFIG_PRINTK_SYNC is enabled.

And we encountered another possibly related problem, which also does only occur in debug mode. When this happens, other bluetooth low energy receiving devices are showing up one or more additional advertising devices, with kind of randomly selected addresses, which do not exist. It seems that all the bluetooth advertising air-traffic gets totally mixed up from the one zephyr device in debug mode. We used the nrf-sniffer to analyze more deeply whats going on in this situation, but we found only totally weird stuff.

So we are having the hope that the crazy advertising is also caused by some timing problems ... and when having a solution for the assertion, this might also solve the problem with the crazy advertising.

Does anybody know anything about the assertion?
Are there some general rules about how to setup a project with bluetooth low energy in terms of thread scheduling and stuff like that?

Regard

Volker

  • Hi Volker,

    You mentioned Zephyr 3.2, and the log shows use of the Zephyr Controller. Are you using vanilla Zephyr RTOS? We recommend using the nRF Connect SDK for development.

    For your issue, debug mode builds often cause less efficient code, so the application requires more memory and timing margins.

    One common issue is threads running out of stack, so please try increasing your threads' stack size first.

    If the problem persists, please try to turn off logging or reducing the log level of the modules that you are not debugging.

    Please let me know if any of those helps.

    Regards,

    Hieu

  • Hi Hieu,

    Thank you for your reply.

    Yes, we are using vanilla zephyr since several years. We are within the step to upgrade from zephyr 2.4 to zephyr 3.2 so we might have some minor chance to use this drive to eventually switch over to nRF Connect SDK. Are there some major benefits from doing so?

    We had several stack overflows during the migration process to zephyr 3.2 and now I would assume that we have increased the stack size of all relevant threads. We used CONFIG_THREAD_ANALYZER to get regular output about the actual used stack size, but I am not quite sure if this covers all dynamic of stack usage properly. Do you have some more hints of how to make sure that all stacks are big enough?

    Reducing the amount of logging output is not really an option, as we are just using our application specific logging, which is almost essential. Other built in component logging has been disabled already.

    Do you have any other suggestions of how to track down and potentially fix the root cause for the bluetooth processing delays? Or how to optimize this?

    Regards,

    Volker

  • I found the CONFIG_BT_CTLR_PROFILE_ISR config option, but I am not quite sure about how to interpret the output. Can you give me some advise?

  • Hi Volker,

    Volker said:
    Yes, we are using vanilla zephyr since several years. We are within the step to upgrade from zephyr 2.4 to zephyr 3.2 so we might have some minor chance to use this drive to eventually switch over to nRF Connect SDK. Are there some major benefits from doing so?

    The nRF Connect SDK (NCS) supports the SoftDevice Controller (SDC), which is Nordic's proprietary controller that was designed to work best with Nordic's products. It also includes the (also proprietary) Multiprotocol Service Layer, MPSL, which as its name says, enables multi-protocol support.

    We also perform Bluetooth qualification on the Zephyr Host and SDC combination, so as long as there is no modification to the Bluetooth LE stack, you can refer to our QDID during your product's qualification process. The Zephyr Controller doesn't have this.

    Last but not least, as Nordic own the SDC completely, we can provide you better technical support for issues regarding it.

    Volker said:
    We had several stack overflows during the migration process to zephyr 3.2 and now I would assume that we have increased the stack size of all relevant threads. We used CONFIG_THREAD_ANALYZER to get regular output about the actual used stack size, but I am not quite sure if this covers all dynamic of stack usage properly. Do you have some more hints of how to make sure that all stacks are big enough?

    I had a similar issue back in NCS 2.3.0, which coincidentally (?) is also based on Zephyr 3.2. What I observed is that even with thread analyzer showing that the stack uses are almost always fine, the system still crashes, and a further increase in stack size did help.

    I guess that a fairly large spike in stack use caused the overflow, and thus the thread analyzer didn't have a chance to show that the impending issue.

    Volker said:
    Reducing the amount of logging output is not really an option, as we are just using our application specific logging, which is almost essential. Other built in component logging has been disabled already.

    I see. But then you should at least disable CONFIG_PRINTK_SYNC. If you are worried about losing logs, it's better to increase the buffer size instead.

    If you are using printk() for logging, the config is CONFIG_PRINTK_BUFFER_SIZE.

    If you are using the logging module, the config is CONFIG_LOG_BUFFER_SIZE.

    Is this possible? If so, please try it and see if the situation improve.

    Regards,

    Hieu

  • Hi Hieu,

    thank you for your detailed reply.

    I found some youtube comparison of zephyr-vanialla and ncs. There was mentioned some difference in the licensing model. Is it right that using ncs is ok, as long as it runs on some nrf hardware (cpu)?

    As soon as I realized, that CONFIG_PRINTK_SYNC increased the problem, I deactivated it immediately.

    In the meantime I was able to make some progress ... and found out, that there was some overlap between application thread priorities and bluetooth related thread priorities. When giving all the application threads lower priorities than the bluetooth threads, the problems seems to disappear.

    This was kind of surprising, as the same application was running on zephyr 2.4 without having these problems.

    Regards,

    Volker

Related