NCS Update to v2.5.2: MPSL Hard Fault on Advertisement

Hello,

I'm not a big fan of updating the NRF Connect SDK, because experience has shown that every time so many things have been changed that I am always busy for at least a whole day until my application is compatible again. Accordingly, we always wait some time before we carry out such an update so that as many bugs as possible have been removed.

This time, however, there seem to have been major changes in the SDK that destroyed some of the core functionalitites.

We use the nrf52840.
We have now updated the NCS v2.4.0 to v2.5.2.
For building we use the official Docker image nordicplayground/nrfconnect-sdk:v2.5-branch

With our application, we ALWAYS receive the following hard fault after approx. 34 minutes:

[00:33:56.096,374] <err> mpsl_init: MPSL ASSERT: 109, 300
[00:33:56.096,374] <err> os: ***** HARD FAULT *****
[00:33:56.096,405] <err> os:   Fault escalation (see below)
[00:33:56.096,405] <err> os: ARCH_EXCEPT with reason 3
[00:33:56.096,435] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000060  r2/a3:  0x0000005f
[00:33:56.096,466] <err> os: r3/a4:  0x200043e0 r12/ip:  0x00000000 r14/lr:  0x00053c4b
[00:33:56.096,466] <err> os:  xpsr:  0x41000011
[00:33:56.096,496] <err> os: s[ 0]:  0x0000000a  s[ 1]:  0x0007675d  s[ 2]:  0x00000001  s[ 3]:  0x0003c06d
[00:33:56.096,496] <err> os: s[ 4]:  0x00393031  s[ 5]:  0x00000020  s[ 6]:  0x00000000  s[ 7]:  0x0003b25d
[00:33:56.096,527] <err> os: s[ 8]:  0x00000020  s[ 9]:  0x0003951b  s[10]:  0x20003000  s[11]:  0x00000000
[00:33:56.096,557] <err> os: s[12]:  0x03fa43d2  s[13]:  0x00000000  s[14]:  0x00000001  s[15]:  0x00000000
[00:33:56.096,557] <err> os: fpscr:  0x20003003
[00:33:56.096,588] <err> os: Faulting instruction address (r15/pc): 0x00076798
[00:33:56.096,618] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:33:56.096,618] <err> os: Fault during interrupt handling
[00:33:56.096,649] <err> os: Current thread: 0x200043e0 (unknown)
[00:33:56.985,443] <err> os: Halting system

It can be seen that this has to do with an interrupt from the MPSL library. The address itself only resolves to:
/workdir/nrf/subsys/mpsl/init/mpsl_init.c:191
which is the handler itself.

I was able to locate the error so far that it has to do with the BLE advertisement.

Namely in the following function:

void update_advertisement_data(struct bt_le_ext_adv* adv, IAdvertisementManager* adv_manager)
{
    if(adv_manager->is_running())
    {
        auto my_data = adv_manager->get_next();
        if(my_data != nullptr)
        {
            auto err = bt_le_ext_adv_set_data(adv, my_data->ptr, my_data->len, NULL, 0);
            if(err)
            {
                printk("Failed to set advertising data (err %d)\n", err);
            }
        }

        if(atomic_test_bit(adv->flags, BT_ADV_ENABLED))
        {
            return;
        }
        bt_le_ext_adv_start(adv, &ext_adv_start_param);
    }
}

With the following advertisement configuration:

static struct bt_le_adv_param* connectable_adv_param =
    BT_LE_ADV_PARAM(BT_LE_ADV_OPT_CONNECTABLE | BT_LE_ADV_OPT_USE_NAME,
                    BT_GAP_ADV_FAST_INT_MIN_2,
                    BT_GAP_ADV_FAST_INT_MAX_2,
                    NULL);
static struct bt_le_ext_adv_start_param ext_adv_start_param = BT_LE_EXT_ADV_START_PARAM_INIT((0), (3));

Explanation

The advertisement data changes constantly during runtime. For this purpose, the above function is called every second in a main loop, depending on whether the last data packet was sent or not.

As I understand it, it is not enough to simply set the new data, because this is then not automatically advertised.

The BT_ADV_ENABLED check is redundant here, bt_le_ext_adv_start also does the same - this is known.

Notable

If I omit the following code, my application runs normally, but won't advertise anymore.

if(atomic_test_bit(adv->flags, BT_ADV_ENABLED))
{
	return;
}
bt_le_ext_adv_start(adv, &ext_adv_start_param);

As I said, the application always crashes after approx. 34 minutes. That is around 2040 seconds, i.e. almost 2048 = 2^11. This cannot be a coincidence.

And again, up to NCS v2.4 everything works without any problems.

We have observed this behavior on all our boards so far.

Parents
  • Hello,

    We have started to look into this and have some questions:

    • Do you use a 32.768 kHz crystal or the LFRC as low frequency clock source?
    • Is a connection established at any point, or is the device only starting and stopping advertising when this happens?
    • Do you by any chance use the mpsl_clock API, for instance by calling mpsl_clock_hfclk_request() and/or mpsl_clock_hfclk_release()?
    • Is it possible for you to share code that reproduce this that we can test on our end?
  • Do you use a 32.768 kHz crystal or the LFRC as low frequency clock source?

    We use the internal RC as LFCLK.

    Is a connection established at any point, or is the device only starting and stopping advertising when this happens?

    No connections are established during these tests. We only use the nRF Connect app for android to check which advertisement data is sent from the device.
    The advertisement itself works fine until the device crashes after approx. 34 minutes.

    Do you by any chance use the mpsl_clock API, for instance by calling mpsl_clock_hfclk_request() and/or mpsl_clock_hfclk_release()?

    We don't use any functionalities of the MPSL library, but this does not necessarily mean that none of the used interfaces may use them after all. We do not know that.

    Among other things we use:

    • SPI Slave
    • UART
    • RTC
    • TIMER
    • External Flash via QSPI
    • Nordic UART Service (NUS)
    • Bootloader Chain
    • etc
    Is it possible for you to share code that reproduce this that we can test on our end?

    I will try to reproduce the error in a minimal project. You can imagine that this is not so easy, as I only know whether the problem exists after about 34 minutes.

    I am also busy with other tasks, so it may well take until the end of next week.

  • Hi,

    Thank you. We are looking forward to hopefully getting code from you that reproduce it. In the mean time, we have some additional questions:

    • Can you try to change CONFIG_CLOCK_CONTROL_NRF_CALIBRATION_PERIOD to 250 (from the default 4000) and see if the issue happens quicker then?
    • Can you conect with a debugger once the assert has happened, and use "west attach" then "run bt" and upload the resulting back trace here?
    • Is the application touching the POWR_CLOCK peripheral in any way, or modifiing the
      POWER_CLOCK_IRQn in NVIC?
  • Hi,

    Just a short follow-up from me while waiting for your input. We have reproduced the same assert in a simple application that just starts advertising, and then calls NVIC_DisableIRQ(POWER_CLOCK_IRQn). Our theory is that somethign similar to this happens in your application. (Maybe not exactly that Clock interrupt is disabled in NVIC, but something else that makes clock interrupt not get to the MPSL interrupt handler).

Reply
  • Hi,

    Just a short follow-up from me while waiting for your input. We have reproduced the same assert in a simple application that just starts advertising, and then calls NVIC_DisableIRQ(POWER_CLOCK_IRQn). Our theory is that somethign similar to this happens in your application. (Maybe not exactly that Clock interrupt is disabled in NVIC, but something else that makes clock interrupt not get to the MPSL interrupt handler).

Children
No Data
Related