MPSL assert during MCUmgr image erase

In my project I use MPSL for a combination of BLE, ESB and flash sync.

The BLE/ESB timeslot handling was implemented following this example: https://github.com/too1/ncs-esb-ble-mpsl-demo

During normal operation it is running perfectly fine (so far).

However, when using MCUmgr (MCUmgr web in my case) and executing the "erase" command for an image in the slot, i get an MPSL assert:

rtt:~$ rtt:~$ [00:26:22.662,780] <err> mpsl_init: MPSL ASSERT: 106, 477
rtt:~$ [00:26:22.662,811] <err> os: ***** HARD FAULT *****
rtt:~$ [00:26:22.662,841] <err> os:   Fault escalation (see below)
rtt:~$ [00:26:22.662,872] <err> os: ARCH_EXCEPT with reason 3
 
rtt:~$ [00:26:22.662,902] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000000  r2/a3:  0x00015393
rtt:~$ [00:26:22.662,933] <err> os: r3/a4:  0x00000000 r12/ip:  0x200074c0 r14/lr:  0xffffffff
rtt:~$ [00:26:22.662,963] <err> os:  xpsr:  0x61000000
rtt:~$ [00:26:22.663,024] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00037f59  s[ 2]:  0x00000000  s[ 3]:  0x20002398
rtt:~$ [00:26:22.663,085] <err> os: s[ 4]:  0x2000239a  s[ 5]:  0x00020a9d  s[ 6]:  0x00000000  s[ 7]:  0x00000000
rtt:~$ [00:26:22.663,146] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00037f4f  s[10]:  0x00000000  s[11]:  0x00000000
rtt:~$ [00:26:22.663,208] <err> os: s[12]:  0x00061ff8  s[13]:  0x2001587c  s[14]:  0x00000000  s[15]:  0x2001587c
rtt:~$ [00:26:22.663,238] <err> os: fpscr:  0x00000000
rtt:~$ [00:26:22.663,269] <err> os: Faulting instruction address (r15/pc): 0x00037f86
rtt:~$ [00:26:22.663,299] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
rtt:~$ [00:26:22.663,330] <err> os: Current thread: 0x20008460 (MPSL Work)
rtt:~$ [00:26:23.155,944] <err> os: Halting system

Of course the filename 106 and line number 477 don't really tell me much about what triggered it, and since the MPSL lib is closed source there's no way to find out.

I hope you can provide me with the details of where this assert was triggered and perhaps some pointers to what i could look at, to prevent this from happening

Parents
  • Hi,

    Which SDK version were you using when you got this assert (the line numbers for the same assert may differ between versions)?

    That said, this assert is likely in the handling of the MPSL_TIMESLOT_SIGNAL_CANCELLED event. This statement from the MPSL API documentation is likely relevant: "For signals executed in low priority returning any action other than MPSL_TIMESLOT_SIGNAL_ACTION_NONE will result in an assert." The proper way of requesting an event when processing a low priority signal is to invoke mpsl_timeslot_request().

  • Thanks for your reply!

    This was using NCS version 2.6.0, but i got the same using 2.5.1.

    Here are the relevant parts of my timeslot handler, where I do call mpsl_timeslot_request() upon receiving the MPSL_TIMESLOT_SIGNAL_CANCELLED event;

    extern "C" void RADIO_IRQHandler(void);
    
    mpsl_timeslot_request_t esb_timeslot_request{
        .request_type = MPSL_TIMESLOT_REQ_TYPE_EARLIEST,
        .params{
            .earliest{
                .hfclk = MPSL_TIMESLOT_HFCLK_CFG_NO_GUARANTEE,
                .priority = MPSL_TIMESLOT_PRIORITY_NORMAL,
                .length_us = kTimeslotLengthUs,
                .timeout_us = kTimeslotReqTimeoutUs,
            },
        },
    };
    
    mpsl_timeslot_signal_return_param_t* MyClass::TimeslotEventHandler(uint8_t session_id, uint32_t signal) {
      static bool timeslot_extenstion_failed = false;
    
      auto& inst = MyClass::GetInstance();
      inst.timeslot_action_.callback_action = MPSL_TIMESLOT_SIGNAL_ACTION_NONE;
    
      switch (signal) {
        case MPSL_TIMESLOT_SIGNAL_START: {
          timeslot_extenstion_failed = false;
    
          NVIC_ClearPendingIRQ(RADIO_IRQn);
          NRF_RADIO->POWER = RADIO_POWER_POWER_Disabled << RADIO_POWER_POWER_Pos;
          NRF_RADIO->POWER = RADIO_POWER_POWER_Enabled << RADIO_POWER_POWER_Pos;
          NVIC_ClearPendingIRQ(RADIO_IRQn);
    
          nrf_timer_bit_width_set(NRF_TIMER0, NRF_TIMER_BIT_WIDTH_32);
          nrf_timer_cc_set(NRF_TIMER0, NRF_TIMER_CC_CHANNEL0, kTimerExpiryUsEarly);
          nrf_timer_cc_set(NRF_TIMER0, NRF_TIMER_CC_CHANNEL1, kTimerExpiryUsReq);
    
          nrf_timer_int_enable(NRF_TIMER0, NRF_TIMER_INT_COMPARE0_MASK);
          nrf_timer_int_enable(NRF_TIMER0, NRF_TIMER_INT_COMPARE1_MASK);
    
          inst.SetTimeslotActive(true);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_TIMER0: {
          if (nrf_timer_event_check(NRF_TIMER0, NRF_TIMER_EVENT_COMPARE0)) {
            nrf_timer_int_disable(NRF_TIMER0, NRF_TIMER_INT_COMPARE0_MASK);
            nrf_timer_event_clear(NRF_TIMER0, NRF_TIMER_EVENT_COMPARE0);
    
            inst.timeslot_action_.callback_action = MPSL_TIMESLOT_SIGNAL_ACTION_EXTEND;
            inst.timeslot_action_.params.extend.length_us = kTimeslotLengthUs;
          } else if (nrf_timer_event_check(NRF_TIMER0, NRF_TIMER_EVENT_COMPARE1)) {
            nrf_timer_int_disable(NRF_TIMER0, NRF_TIMER_INT_COMPARE1_MASK);
            nrf_timer_event_clear(NRF_TIMER0, NRF_TIMER_EVENT_COMPARE1);
    
            if (timeslot_extenstion_failed) {
              inst.timeslot_action_.callback_action = MPSL_TIMESLOT_SIGNAL_ACTION_REQUEST;
              inst.timeslot_action_.params.request.p_next = &esb_timeslot_request;
            }
          }
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_EXTEND_SUCCEEDED: {
          // next trigger time is current + timer expiry early
          uint32_t current_cc = nrf_timer_cc_get(NRF_TIMER0, NRF_TIMER_CC_CHANNEL0);
          nrf_timer_bit_width_set(NRF_TIMER0, NRF_TIMER_BIT_WIDTH_32);
          nrf_timer_cc_set(NRF_TIMER0, NRF_TIMER_CC_CHANNEL0, (current_cc + kTimeslotLengthUs));
          nrf_timer_int_enable(NRF_TIMER0, NRF_TIMER_INT_COMPARE0_MASK);
    
          current_cc = nrf_timer_cc_get(NRF_TIMER0, NRF_TIMER_CC_CHANNEL1);
          nrf_timer_bit_width_set(NRF_TIMER0, NRF_TIMER_BIT_WIDTH_32);
          nrf_timer_cc_set(NRF_TIMER0, NRF_TIMER_CC_CHANNEL1, (current_cc + kTimeslotLengthUs));
          nrf_timer_int_enable(NRF_TIMER0, NRF_TIMER_INT_COMPARE1_MASK);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_EXTEND_FAILED: {
          timeslot_extenstion_failed = true;
          inst.SetTimeslotActive(false);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_RADIO: {
          // radio signal received, manually call RADIO IRQ / disable IRQ
          if (inst.timeslot_active_) {
            RADIO_IRQHandler();
          } else {
            NVIC_ClearPendingIRQ(RADIO_IRQn);
            NVIC_DisableIRQ(RADIO_IRQn);
          }
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_CANCELLED:  // [[fallthrough]];
        case MPSL_TIMESLOT_SIGNAL_BLOCKED: {
          inst.SetTimeslotActive(false);
          mpsl_timeslot_request(session_id, &esb_timeslot_request);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_OVERSTAYED:  // [[fallthrough]];
        case MPSL_TIMESLOT_SIGNAL_INVALID_RETURN: {
          inst.timeslot_action_.callback_action = MPSL_TIMESLOT_SIGNAL_ACTION_END;
          inst.SetTimeslotActive(false);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_SESSION_IDLE:  // [[fallthrough]];
        case MPSL_TIMESLOT_SIGNAL_SESSION_CLOSED: {
          inst.SetTimeslotActive(false);
          break;
        }
        default: {
          return nullptr;
        }
      }
    
      return &inst.timeslot_action_;
    }

    With relevant class definition (just parts of it for this demonstration) in the header (note: using c++17)

    class MyClass {
     public:
      MyClass(const MyClass&) = delete;
      MyClass& operator=(const MyClass&) = delete;
      MyClass(MyClass&&) = delete;
      MyClass& operator=(MyClass&&) = delete;
    
      static inline MyClass& GetInstance() {
        static MyClass instance;
        return instance;
      }
    
     private:
      MyClass() = default;
      ~MyClass() = default;
    
      void SetTimeslotActive(bool active);
    
      static mpsl_timeslot_signal_return_param_t* TimeslotEventHandler(uint8_t session_id, uint32_t signal);
    
      bool timeslot_active_{false};
      mpsl_timeslot_signal_return_param_t timeslot_action_{};
    };


    The biggest difference with the example here: https://github.com/too1/ncs-esb-ble-mpsl-demo/blob/master/common/timeslot_handler.c#L183 Is that the request is done from a non-preemptible thead, while I just called mpsl_timeslot_request() directly from the timeslot handler. Could this be the reason for the assert as well?

    I wasn't really sure whether this was necessary, since i would have to create a new thread and message queue for it, which takes up extra stack space. In case it really is required, could i also use the system workqueue to schedule the request? It has a cooperative priority

Reply
  • Thanks for your reply!

    This was using NCS version 2.6.0, but i got the same using 2.5.1.

    Here are the relevant parts of my timeslot handler, where I do call mpsl_timeslot_request() upon receiving the MPSL_TIMESLOT_SIGNAL_CANCELLED event;

    extern "C" void RADIO_IRQHandler(void);
    
    mpsl_timeslot_request_t esb_timeslot_request{
        .request_type = MPSL_TIMESLOT_REQ_TYPE_EARLIEST,
        .params{
            .earliest{
                .hfclk = MPSL_TIMESLOT_HFCLK_CFG_NO_GUARANTEE,
                .priority = MPSL_TIMESLOT_PRIORITY_NORMAL,
                .length_us = kTimeslotLengthUs,
                .timeout_us = kTimeslotReqTimeoutUs,
            },
        },
    };
    
    mpsl_timeslot_signal_return_param_t* MyClass::TimeslotEventHandler(uint8_t session_id, uint32_t signal) {
      static bool timeslot_extenstion_failed = false;
    
      auto& inst = MyClass::GetInstance();
      inst.timeslot_action_.callback_action = MPSL_TIMESLOT_SIGNAL_ACTION_NONE;
    
      switch (signal) {
        case MPSL_TIMESLOT_SIGNAL_START: {
          timeslot_extenstion_failed = false;
    
          NVIC_ClearPendingIRQ(RADIO_IRQn);
          NRF_RADIO->POWER = RADIO_POWER_POWER_Disabled << RADIO_POWER_POWER_Pos;
          NRF_RADIO->POWER = RADIO_POWER_POWER_Enabled << RADIO_POWER_POWER_Pos;
          NVIC_ClearPendingIRQ(RADIO_IRQn);
    
          nrf_timer_bit_width_set(NRF_TIMER0, NRF_TIMER_BIT_WIDTH_32);
          nrf_timer_cc_set(NRF_TIMER0, NRF_TIMER_CC_CHANNEL0, kTimerExpiryUsEarly);
          nrf_timer_cc_set(NRF_TIMER0, NRF_TIMER_CC_CHANNEL1, kTimerExpiryUsReq);
    
          nrf_timer_int_enable(NRF_TIMER0, NRF_TIMER_INT_COMPARE0_MASK);
          nrf_timer_int_enable(NRF_TIMER0, NRF_TIMER_INT_COMPARE1_MASK);
    
          inst.SetTimeslotActive(true);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_TIMER0: {
          if (nrf_timer_event_check(NRF_TIMER0, NRF_TIMER_EVENT_COMPARE0)) {
            nrf_timer_int_disable(NRF_TIMER0, NRF_TIMER_INT_COMPARE0_MASK);
            nrf_timer_event_clear(NRF_TIMER0, NRF_TIMER_EVENT_COMPARE0);
    
            inst.timeslot_action_.callback_action = MPSL_TIMESLOT_SIGNAL_ACTION_EXTEND;
            inst.timeslot_action_.params.extend.length_us = kTimeslotLengthUs;
          } else if (nrf_timer_event_check(NRF_TIMER0, NRF_TIMER_EVENT_COMPARE1)) {
            nrf_timer_int_disable(NRF_TIMER0, NRF_TIMER_INT_COMPARE1_MASK);
            nrf_timer_event_clear(NRF_TIMER0, NRF_TIMER_EVENT_COMPARE1);
    
            if (timeslot_extenstion_failed) {
              inst.timeslot_action_.callback_action = MPSL_TIMESLOT_SIGNAL_ACTION_REQUEST;
              inst.timeslot_action_.params.request.p_next = &esb_timeslot_request;
            }
          }
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_EXTEND_SUCCEEDED: {
          // next trigger time is current + timer expiry early
          uint32_t current_cc = nrf_timer_cc_get(NRF_TIMER0, NRF_TIMER_CC_CHANNEL0);
          nrf_timer_bit_width_set(NRF_TIMER0, NRF_TIMER_BIT_WIDTH_32);
          nrf_timer_cc_set(NRF_TIMER0, NRF_TIMER_CC_CHANNEL0, (current_cc + kTimeslotLengthUs));
          nrf_timer_int_enable(NRF_TIMER0, NRF_TIMER_INT_COMPARE0_MASK);
    
          current_cc = nrf_timer_cc_get(NRF_TIMER0, NRF_TIMER_CC_CHANNEL1);
          nrf_timer_bit_width_set(NRF_TIMER0, NRF_TIMER_BIT_WIDTH_32);
          nrf_timer_cc_set(NRF_TIMER0, NRF_TIMER_CC_CHANNEL1, (current_cc + kTimeslotLengthUs));
          nrf_timer_int_enable(NRF_TIMER0, NRF_TIMER_INT_COMPARE1_MASK);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_EXTEND_FAILED: {
          timeslot_extenstion_failed = true;
          inst.SetTimeslotActive(false);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_RADIO: {
          // radio signal received, manually call RADIO IRQ / disable IRQ
          if (inst.timeslot_active_) {
            RADIO_IRQHandler();
          } else {
            NVIC_ClearPendingIRQ(RADIO_IRQn);
            NVIC_DisableIRQ(RADIO_IRQn);
          }
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_CANCELLED:  // [[fallthrough]];
        case MPSL_TIMESLOT_SIGNAL_BLOCKED: {
          inst.SetTimeslotActive(false);
          mpsl_timeslot_request(session_id, &esb_timeslot_request);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_OVERSTAYED:  // [[fallthrough]];
        case MPSL_TIMESLOT_SIGNAL_INVALID_RETURN: {
          inst.timeslot_action_.callback_action = MPSL_TIMESLOT_SIGNAL_ACTION_END;
          inst.SetTimeslotActive(false);
          break;
        }
        case MPSL_TIMESLOT_SIGNAL_SESSION_IDLE:  // [[fallthrough]];
        case MPSL_TIMESLOT_SIGNAL_SESSION_CLOSED: {
          inst.SetTimeslotActive(false);
          break;
        }
        default: {
          return nullptr;
        }
      }
    
      return &inst.timeslot_action_;
    }

    With relevant class definition (just parts of it for this demonstration) in the header (note: using c++17)

    class MyClass {
     public:
      MyClass(const MyClass&) = delete;
      MyClass& operator=(const MyClass&) = delete;
      MyClass(MyClass&&) = delete;
      MyClass& operator=(MyClass&&) = delete;
    
      static inline MyClass& GetInstance() {
        static MyClass instance;
        return instance;
      }
    
     private:
      MyClass() = default;
      ~MyClass() = default;
    
      void SetTimeslotActive(bool active);
    
      static mpsl_timeslot_signal_return_param_t* TimeslotEventHandler(uint8_t session_id, uint32_t signal);
    
      bool timeslot_active_{false};
      mpsl_timeslot_signal_return_param_t timeslot_action_{};
    };


    The biggest difference with the example here: https://github.com/too1/ncs-esb-ble-mpsl-demo/blob/master/common/timeslot_handler.c#L183 Is that the request is done from a non-preemptible thead, while I just called mpsl_timeslot_request() directly from the timeslot handler. Could this be the reason for the assert as well?

    I wasn't really sure whether this was necessary, since i would have to create a new thread and message queue for it, which takes up extra stack space. In case it really is required, could i also use the system workqueue to schedule the request? It has a cooperative priority

Children
  • I see. Then I can confirm that the assert is because you of an action other than NONE in the handling of a MPSL_TIMESLOT_SIGNAL_CANCELLED event. Defering the call to mpsl_timeslot_request() with the help of the system workqueue as you suggest should resolve this.

  • I have changed it to schedule calls to mpsl_timeslot_session_open() and mpsl_timeslot_request() from the system workqueue (which is a non-preemptible thread) and it has solved the MPSL assert issue. I tested it by uploading and then erasing an image using mcumgr-web about 20 times, and no crashes happened. Without help of the system workqueue, the crash instantly happens.

    I will mark this as resolved, thanks!

    One more thing though (which may or may not be related to this), one time when doing the "erase" command on slot 1 from mcumgr-web, and mcumgr-web was trying to reconnect to the device, i got the following hardfault:

    [00:05:57.548,004] <inf> mcuboot_util: Image index: 0, Swap type: none
    [00:06:07.358,276] <inf> mcuboot_util: Image index: 0, Swap type: none
    [00:06:12.778,442] <err> bt_conn: not connected!
    [00:06:12.778,808] <inf> bas: BAS Notifications disabled
    [00:06:12.780,700] <inf> network: BLE Disconnect!
    [00:06:12.780,731] <inf> thread_event_reg: BroadcastMessage = kBleConnectionEvent
    [00:06:15.850,006] <err> os: ***** BUS FAULT *****
    [00:06:15.850,036] <err> os:   Imprecise data bus error
    [00:06:15.850,097] <err> os: r0/a1:  0x0114043e  r1/a2:  0x200081eb  r2/a3:  0x00000006
    [00:06:15.850,128] <err> os: r3/a4:  0x0114043f r12/ip:  0x00000003 r14/lr:  0x00011c57
    [00:06:15.850,158] <err> os:  xpsr:  0x21000000
    [00:06:15.850,219] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x20015af4  s[ 3]:  0x20007d28
    [00:06:15.850,280] <err> os: s[ 4]:  0x00000848  s[ 5]:  0x000000ff  s[ 6]:  0x00000000  s[ 7]:  0x0001a91f
    [00:06:15.850,341] <err> os: s[ 8]:  0x01005ae8  s[ 9]:  0x20001d01  s[10]:  0x20007d28  s[11]:  0x001b0001
    [00:06:15.850,402] <err> os: s[12]:  0x001b0848  s[13]:  0x00010848  s[14]:  0x00000001  s[15]:  0x200017fc
    [00:06:15.850,402] <err> os: fpscr:  0x20001d9c
    [00:06:15.850,463] <err> os: Faulting instruction address (r15/pc): 0x00018a80
    [00:06:15.850,494] <err> os: >>> ZEPHYR FATAL ERROR 26: Unknown error on CPU 0
    [00:06:16.288,940] <err> os: Halting system

    Where doing addr2line on 0x00011c57 returned:

    0x00011c57
    libsoftdevice_controller_s112_debug_hardfp__obfuscated.elf:?

    To me it seemed like a Bluetooth issue, and then especially related to the host device, since it had trouble connecting to the board in the first place, and also the following in the logs seems suspicious:

    bt_conn: not connected!

    Switched to a laptop with a proper functioning Bluetooth chip, and i did not see this error again.

    Just wanted to mention it just in case, if it ever happens to a user it would be nice to handle it properly

Related