NCS Update to v2.5.2: MPSL Hard Fault on Advertisement

Hello,

I'm not a big fan of updating the NRF Connect SDK, because experience has shown that every time so many things have been changed that I am always busy for at least a whole day until my application is compatible again. Accordingly, we always wait some time before we carry out such an update so that as many bugs as possible have been removed.

This time, however, there seem to have been major changes in the SDK that destroyed some of the core functionalitites.

We use the nrf52840.
We have now updated the NCS v2.4.0 to v2.5.2.
For building we use the official Docker image nordicplayground/nrfconnect-sdk:v2.5-branch

With our application, we ALWAYS receive the following hard fault after approx. 34 minutes:

[00:33:56.096,374] <err> mpsl_init: MPSL ASSERT: 109, 300
[00:33:56.096,374] <err> os: ***** HARD FAULT *****
[00:33:56.096,405] <err> os:   Fault escalation (see below)
[00:33:56.096,405] <err> os: ARCH_EXCEPT with reason 3
[00:33:56.096,435] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000060  r2/a3:  0x0000005f
[00:33:56.096,466] <err> os: r3/a4:  0x200043e0 r12/ip:  0x00000000 r14/lr:  0x00053c4b
[00:33:56.096,466] <err> os:  xpsr:  0x41000011
[00:33:56.096,496] <err> os: s[ 0]:  0x0000000a  s[ 1]:  0x0007675d  s[ 2]:  0x00000001  s[ 3]:  0x0003c06d
[00:33:56.096,496] <err> os: s[ 4]:  0x00393031  s[ 5]:  0x00000020  s[ 6]:  0x00000000  s[ 7]:  0x0003b25d
[00:33:56.096,527] <err> os: s[ 8]:  0x00000020  s[ 9]:  0x0003951b  s[10]:  0x20003000  s[11]:  0x00000000
[00:33:56.096,557] <err> os: s[12]:  0x03fa43d2  s[13]:  0x00000000  s[14]:  0x00000001  s[15]:  0x00000000
[00:33:56.096,557] <err> os: fpscr:  0x20003003
[00:33:56.096,588] <err> os: Faulting instruction address (r15/pc): 0x00076798
[00:33:56.096,618] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:33:56.096,618] <err> os: Fault during interrupt handling
[00:33:56.096,649] <err> os: Current thread: 0x200043e0 (unknown)
[00:33:56.985,443] <err> os: Halting system

It can be seen that this has to do with an interrupt from the MPSL library. The address itself only resolves to:
/workdir/nrf/subsys/mpsl/init/mpsl_init.c:191
which is the handler itself.

I was able to locate the error so far that it has to do with the BLE advertisement.

Namely in the following function:

void update_advertisement_data(struct bt_le_ext_adv* adv, IAdvertisementManager* adv_manager)
{
    if(adv_manager->is_running())
    {
        auto my_data = adv_manager->get_next();
        if(my_data != nullptr)
        {
            auto err = bt_le_ext_adv_set_data(adv, my_data->ptr, my_data->len, NULL, 0);
            if(err)
            {
                printk("Failed to set advertising data (err %d)\n", err);
            }
        }

        if(atomic_test_bit(adv->flags, BT_ADV_ENABLED))
        {
            return;
        }
        bt_le_ext_adv_start(adv, &ext_adv_start_param);
    }
}

With the following advertisement configuration:

static struct bt_le_adv_param* connectable_adv_param =
    BT_LE_ADV_PARAM(BT_LE_ADV_OPT_CONNECTABLE | BT_LE_ADV_OPT_USE_NAME,
                    BT_GAP_ADV_FAST_INT_MIN_2,
                    BT_GAP_ADV_FAST_INT_MAX_2,
                    NULL);
static struct bt_le_ext_adv_start_param ext_adv_start_param = BT_LE_EXT_ADV_START_PARAM_INIT((0), (3));

Explanation

The advertisement data changes constantly during runtime. For this purpose, the above function is called every second in a main loop, depending on whether the last data packet was sent or not.

As I understand it, it is not enough to simply set the new data, because this is then not automatically advertised.

The BT_ADV_ENABLED check is redundant here, bt_le_ext_adv_start also does the same - this is known.

Notable

If I omit the following code, my application runs normally, but won't advertise anymore.

if(atomic_test_bit(adv->flags, BT_ADV_ENABLED))
{
	return;
}
bt_le_ext_adv_start(adv, &ext_adv_start_param);

As I said, the application always crashes after approx. 34 minutes. That is around 2040 seconds, i.e. almost 2048 = 2^11. This cannot be a coincidence.

And again, up to NCS v2.4 everything works without any problems.

We have observed this behavior on all our boards so far.

Parents
  • Hello,

    We have started to look into this and have some questions:

    • Do you use a 32.768 kHz crystal or the LFRC as low frequency clock source?
    • Is a connection established at any point, or is the device only starting and stopping advertising when this happens?
    • Do you by any chance use the mpsl_clock API, for instance by calling mpsl_clock_hfclk_request() and/or mpsl_clock_hfclk_release()?
    • Is it possible for you to share code that reproduce this that we can test on our end?
  • Do you use a 32.768 kHz crystal or the LFRC as low frequency clock source?

    We use the internal RC as LFCLK.

    Is a connection established at any point, or is the device only starting and stopping advertising when this happens?

    No connections are established during these tests. We only use the nRF Connect app for android to check which advertisement data is sent from the device.
    The advertisement itself works fine until the device crashes after approx. 34 minutes.

    Do you by any chance use the mpsl_clock API, for instance by calling mpsl_clock_hfclk_request() and/or mpsl_clock_hfclk_release()?

    We don't use any functionalities of the MPSL library, but this does not necessarily mean that none of the used interfaces may use them after all. We do not know that.

    Among other things we use:

    • SPI Slave
    • UART
    • RTC
    • TIMER
    • External Flash via QSPI
    • Nordic UART Service (NUS)
    • Bootloader Chain
    • etc
    Is it possible for you to share code that reproduce this that we can test on our end?

    I will try to reproduce the error in a minimal project. You can imagine that this is not so easy, as I only know whether the problem exists after about 34 minutes.

    I am also busy with other tasks, so it may well take until the end of next week.

  • Hi,

    Thank you. We are looking forward to hopefully getting code from you that reproduce it. In the mean time, we have some additional questions:

    • Can you try to change CONFIG_CLOCK_CONTROL_NRF_CALIBRATION_PERIOD to 250 (from the default 4000) and see if the issue happens quicker then?
    • Can you conect with a debugger once the assert has happened, and use "west attach" then "run bt" and upload the resulting back trace here?
    • Is the application touching the POWR_CLOCK peripheral in any way, or modifiing the
      POWER_CLOCK_IRQn in NVIC?
Reply
  • Hi,

    Thank you. We are looking forward to hopefully getting code from you that reproduce it. In the mean time, we have some additional questions:

    • Can you try to change CONFIG_CLOCK_CONTROL_NRF_CALIBRATION_PERIOD to 250 (from the default 4000) and see if the issue happens quicker then?
    • Can you conect with a debugger once the assert has happened, and use "west attach" then "run bt" and upload the resulting back trace here?
    • Is the application touching the POWR_CLOCK peripheral in any way, or modifiing the
      POWER_CLOCK_IRQn in NVIC?
Children
  • Hi,

    Just a short follow-up from me while waiting for your input. We have reproduced the same assert in a simple application that just starts advertising, and then calls NVIC_DisableIRQ(POWER_CLOCK_IRQn). Our theory is that somethign similar to this happens in your application. (Maybe not exactly that Clock interrupt is disabled in NVIC, but something else that makes clock interrupt not get to the MPSL interrupt handler).

  • First of all, thanks for the tip with CONFIG_CLOCK_CONTROL_NRF_CALIBRATION_PERIOD.
    This actually reduced the time from approx. 34 minutes to approx. 2 minutes.

    I then started to systematically remove individual systems from our application.

    It's embarrassing, but I found the error in our codebase.

    From NCS v2.4 to v2.5, something also changed with the SPI drivers.
    I have to mention that we are not using Zephyr's drivers, because the overhead and the event notification is not sufficient and fast enough for our time-critical application.
    Therefore, we directly use the nrfx drivers for SPI.

    Originally we had the following code for our initialization (snippet):

    IRQ_DIRECT_CONNECT(SPI0_TWI0_IRQn, 0, nrfx_spis_0_irq_handler, 0);
    irq_enable(SPI0_TWI0_IRQn);
    
    spi_instance = NRFX_SPIS_INSTANCE(0);
    nrfx_spis_config_t spi_config =
    NRFX_SPIS_DEFAULT_CONFIG(spi_pins.sck, spi_pins.mosi, spi_pins.miso, spi_pins.csn);
    spi_config.orc = 0x99u;
    spi_config.def = 0x66u;
    spi_config.mode = NRF_SPIS_MODE_3;
    spi_config.bit_order = NRF_SPIS_BIT_ORDER_MSB_FIRST;
    spi_config.miso_drive = NRF_GPIO_PIN_H0H1;
    
    nrfx_err_t spi_err = nrfx_spis_init(&spi_instance, &spi_config, spi0_event, NULL);
    if(spi_err != NRFX_SUCCESS)
    {
    BT_LOG_ERR("got spi error: %8x\n", spi_err);
    }

    Apparently something has changed in the declaration of the interrupt handler, which is why we have replaced the first line with:

    IRQ_DIRECT_CONNECT(SPI0_TWI0_IRQn, 0, NRFX_SPIS_INST_HANDLER_GET(0), 0);

    However, with NCS v2.5.2 we always received the following spi error during initialization:

    <err> service: got spi error:  bad0005


    Which indicates an invalid state of my SPI instance.
    It seems that the SPI slave has now been pre-initialized by Zephyr or whatever - I don't know.

    Therefore we added a call to nrfx_spis_uninit so that our code looked like this:

    IRQ_DIRECT_CONNECT(SPI0_TWI0_IRQn, 0, NRFX_SPIS_INST_HANDLER_GET(0), 0);
    irq_enable(SPI0_TWI0_IRQn);
    
    nrfx_spis_uninit(&spi_instance);
    
    spi_instance = NRFX_SPIS_INSTANCE(0);
    
    ...
    
    nrfx_err_t spi_err = nrfx_spis_init(&spi_instance, &spi_config, spi0_event, NULL);

    You will notice an error there, just as I did when removing the code parts, that nrfx_spis_uninit is called with an uninitialized instance.

    My guess is that since we are using SPI0 as a slave, it is resetting the state of the correct instance, as the spi error then no longer occurs. Probably only other configurations are somewhat malformed.

    After fixing the sequence, the MPSL Hard Fault no longer occurs.

    It is nevertheless funny that none of our team noticed this error and SPI still simply worked.
    It is also interesting that SPI was only initialized during the tests but never used and yet the error always occurred in combination with the advertisement.

  • Hi,

    Thank you for sharing the details of your investigation and findings. I is good to know you solved it. (And to find that it was not due to a bug in the MPSL).

Related