This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

SDK for Meshv v4.1.0 :: access_model_publish() triggers an assert on bearer_event_in_correct_irq_priority() .... but same code is in prod on SDK for Meshv v3.2.0

Sequence of events:

1. A GPIOTE button interrupt in LPN schedules a handler routine.

2. Handler routine (called by app_scheduler()) makes a call to access_model_publish() (to send an unacknowledged message).

3. Somewhere in the call stack for access_model_publish(), the assert in mesh/core/src/timer_scheduler.c:214 is triggered.

It seems we are failing the check in bearer_event_in_correct_irq_priority(), but this call is made from a function called by app_scheduler() (aka: outside of any interrupt context).

To try to debug the issue further, I printed the interrupt values:

volatile IRQn_Type active_irq = hal_irq_active_get();
volatile uint32_t prio = NVIC_GetPriority(active_irq);

/* log, then wait 5ms for buffer to flush before triggering assert */
LOG_FLUSH("active irq %d with NVIC priority %d\n", active_irq, prio);
/* log output:
 * "active irq -16 with NVIC priority 0"
 */

/* assert triggered here */
access_model_publish(/* ... */);

This code builds as-is with no warnings (using -Wall and -Wextra) on both SDK for Meshv v3.2.0 and v4.1.0.

This code runs with no issues (and is in production) on v3.2.0, but on v4.1.0 throws the above error.

Full build spec for v3.2.0:

-- Configuring CMake for nRF5 SDK for Bluetooth Mesh 3.2.0
-- SDK: nRF5_SDK_15.3.0_59ac345
-- Platform: nrf52832_xxAA
-- Arch: cortex-m4f
-- SoftDevice: s132_6.1.1
-- The C compiler identification is GNU 8.3.1

Full build spec for v4.1.0:

-- Configuring CMake for nRF5 SDK for Bluetooth Mesh 4.1.0
-- SDK: nRF5_SDK_16.0.0_98a08e2
-- Platform: nrf52832_xxAA
-- Arch: cortex-m4f
-- SoftDevice: s132_7.0.1
-- Board: pca10040
-- The C compiler identification is GNU 8.3.1

Parents
  • Hi Sirio, 

    Could you add a breakpoint inside bearer_event_in_correct_irq_priority() or put the log there to print out what caused the function return 0. 

    Do you have a minimal example with the LPN example that reproduce the issue so we can test here ?

  • Thank you for taking a look at this.

    I added some log prints, so bearer_event_in_correct_irq_priority() looks like this:

    bool bearer_event_in_correct_irq_priority(void)
    {
        volatile IRQn_Type active_irq = hal_irq_active_get();
        BBN_LOG_FLUSH("active_irq %d\n", hal_irq_active_get());
        BBN_LOG_FLUSH("hal_irq_is_enabled(EVENT_IRQn) %d\n", hal_irq_is_enabled(EVENT_IRQn));
        BBN_LOG_FLUSH("m_irq_priority %d\n", m_irq_priority);
        if (active_irq == HAL_IRQn_NONE)
        {
            return (m_irq_priority == NRF_MESH_IRQ_PRIORITY_THREAD || !hal_irq_is_enabled(EVENT_IRQn));
        }
        else
        {
            volatile uint32_t prio = NVIC_GetPriority(active_irq);
            BBN_LOG_FLUSH("prio %d\n", prio);
            return (prio == m_irq_priority || (prio > m_irq_priority && !hal_irq_is_enabled(EVENT_IRQn)));
        }
    }

    During provisioning (with proxy enabled), I am seeing:

    <t:    1398790>, bearer_event.c,  419, active_irq 22
    <t:    1398958>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:    1399126>, bearer_event.c,  421, m_irq_priority 6
    <t:    1399294>, bearer_event.c,  429, prio 6

    When establishing friendship (proxy now off), I am seeing:

    <t:   16130599>, bearer_event.c,  419, active_irq 18
    <t:   16130773>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:   16130941>, bearer_event.c,  421, m_irq_priority 6
    <t:   16131109>, bearer_event.c,  429, prio 6

    When calling access_model_publish() (to send an unacknowledged message to the mesh), when the assert is triggered, I am seeing:

    <t:   16131294>, bearer_event.c,  419, active_irq -16
    <t:   16131462>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:   16131630>, bearer_event.c,  421, m_irq_priority 6

    Unfortunately the amount of time needed to try and reproduce this product's functionality in the LPN example makes that a no-go.

  • Hi Sirio, 

    I have been checking with our mesh developer about your issue. I can quote the discussion here: 

    1. Is this statement correct: One should always develop and test without -DNDEBUG, and only enable it at the release phase ? The DEBUG assert should be considered as the same normal assert ?

    > Yes. They seem to be super confident developers if they don't need any debug during development Smile 

     2.When there is an application that requires both action inside interrupt handler and thread context, it should always put the task from the interrupt handler into app_scheduler and execute all task inside main context ?

    > Yes. â€‹Once the app scheduler is used for using mesh stack APIs in a thread priority, the API calls must be routed through app_scheduler, or should be done in main() thread (it is one and the same, since app scheduler is triggered from main loop).

    3. Can the Flash API be called at lower context ?For example we configure the mesh stack as NRF_MESH_IRQ_PRIORITY_LOWEST but we execute flash task inside main context ?I can see we doing that in our example when initialize.

    > No. This is not possible; it will trigger the same assertion. If you look closely, flash APIs are called before any dynamic activity begins (i.e. before bearer handler is started). The APIs called during initialization don't really trigger any action inside the core modules for flash immediately. They simply queue-up the actions until mesh stack's own scheduler (bearer handler) is started. When the bearer handler is running, calling flash related APIs "may" trigger immediate flash actions if the queue is empty. This will cause interrupt context priority assertions if these (i.e. time-critical) API calls are not made from the correct priority. Therefore, as a rule of Thumb, once mesh stack is operated from thread priority all APIs should be called from that priority. 

    So, from this discussion, my thought regarding your issue is: 

    - Your code didn't have any issue with SDK v3.2, was because you used -DNDEBUG and the assertions were masked out. But there's still potential error that didn't get caught and in many case ended up in a reset that went unnoticed. 

    - Even though it will be quite painful but I think you should move all of the mesh APIs inside interrupt handler to be called from app_scheduler instead. If this solution is not acceptable, then another option is initialize the mesh stack in NRF_MESH_IRQ_PRIORITY_LOWEST  and call the flash API inside a software interrupt, or use a hardware that is not used to switch context. As you can find here, the mesh stack uses QDEC hardware interrupt for its own software activity. You can think of using something similar. So instead of execute the flash activity right inside the interrupt handler, you scheduler it in app_scheduler. And when you know that it's the period when there isn't much activity happens, you can trigger the dummy interrupt and then inside that interrupt handler you do flash activity. 

  • Ok, thank you very much for the detailed explanation and I understand.

    Looking at mesh examples I don't see any which implement NRF_MESH_IRQ_PRIORITY_THREAD.

    Is there any way we can get an example of how Nordic recommends we use NRF_MESH_IRQ_PRIORITY_THREAD for reference?

  • Hi Sirio, 

    I checked with the team, but don't have any example that use NRF_MESH_IRQ_PRIORITY_THREAD for now, unfortunately. 

    I can try to make an example but I can't promise on the timeline. 

  • Ok, thank you.

    Any help you can give on this is much appreciated.

  • The only answer found so far was to use SWI3_EGU3_IRQn to trigger an interrupt handler and call the mesh stack from within the handler.

Reply Children
No Data
Related