This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

SDK for Meshv v4.1.0 :: access_model_publish() triggers an assert on bearer_event_in_correct_irq_priority() .... but same code is in prod on SDK for Meshv v3.2.0

Sequence of events:

1. A GPIOTE button interrupt in LPN schedules a handler routine.

2. Handler routine (called by app_scheduler()) makes a call to access_model_publish() (to send an unacknowledged message).

3. Somewhere in the call stack for access_model_publish(), the assert in mesh/core/src/timer_scheduler.c:214 is triggered.

It seems we are failing the check in bearer_event_in_correct_irq_priority(), but this call is made from a function called by app_scheduler() (aka: outside of any interrupt context).

To try to debug the issue further, I printed the interrupt values:

volatile IRQn_Type active_irq = hal_irq_active_get();
volatile uint32_t prio = NVIC_GetPriority(active_irq);

/* log, then wait 5ms for buffer to flush before triggering assert */
LOG_FLUSH("active irq %d with NVIC priority %d\n", active_irq, prio);
/* log output:
 * "active irq -16 with NVIC priority 0"
 */

/* assert triggered here */
access_model_publish(/* ... */);

This code builds as-is with no warnings (using -Wall and -Wextra) on both SDK for Meshv v3.2.0 and v4.1.0.

This code runs with no issues (and is in production) on v3.2.0, but on v4.1.0 throws the above error.

Full build spec for v3.2.0:

-- Configuring CMake for nRF5 SDK for Bluetooth Mesh 3.2.0
-- SDK: nRF5_SDK_15.3.0_59ac345
-- Platform: nrf52832_xxAA
-- Arch: cortex-m4f
-- SoftDevice: s132_6.1.1
-- The C compiler identification is GNU 8.3.1

Full build spec for v4.1.0:

-- Configuring CMake for nRF5 SDK for Bluetooth Mesh 4.1.0
-- SDK: nRF5_SDK_16.0.0_98a08e2
-- Platform: nrf52832_xxAA
-- Arch: cortex-m4f
-- SoftDevice: s132_7.0.1
-- Board: pca10040
-- The C compiler identification is GNU 8.3.1

Parents
  • Hi Sirio, 

    Could you add a breakpoint inside bearer_event_in_correct_irq_priority() or put the log there to print out what caused the function return 0. 

    Do you have a minimal example with the LPN example that reproduce the issue so we can test here ?

  • Thank you for taking a look at this.

    I added some log prints, so bearer_event_in_correct_irq_priority() looks like this:

    bool bearer_event_in_correct_irq_priority(void)
    {
        volatile IRQn_Type active_irq = hal_irq_active_get();
        BBN_LOG_FLUSH("active_irq %d\n", hal_irq_active_get());
        BBN_LOG_FLUSH("hal_irq_is_enabled(EVENT_IRQn) %d\n", hal_irq_is_enabled(EVENT_IRQn));
        BBN_LOG_FLUSH("m_irq_priority %d\n", m_irq_priority);
        if (active_irq == HAL_IRQn_NONE)
        {
            return (m_irq_priority == NRF_MESH_IRQ_PRIORITY_THREAD || !hal_irq_is_enabled(EVENT_IRQn));
        }
        else
        {
            volatile uint32_t prio = NVIC_GetPriority(active_irq);
            BBN_LOG_FLUSH("prio %d\n", prio);
            return (prio == m_irq_priority || (prio > m_irq_priority && !hal_irq_is_enabled(EVENT_IRQn)));
        }
    }

    During provisioning (with proxy enabled), I am seeing:

    <t:    1398790>, bearer_event.c,  419, active_irq 22
    <t:    1398958>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:    1399126>, bearer_event.c,  421, m_irq_priority 6
    <t:    1399294>, bearer_event.c,  429, prio 6

    When establishing friendship (proxy now off), I am seeing:

    <t:   16130599>, bearer_event.c,  419, active_irq 18
    <t:   16130773>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:   16130941>, bearer_event.c,  421, m_irq_priority 6
    <t:   16131109>, bearer_event.c,  429, prio 6

    When calling access_model_publish() (to send an unacknowledged message to the mesh), when the assert is triggered, I am seeing:

    <t:   16131294>, bearer_event.c,  419, active_irq -16
    <t:   16131462>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:   16131630>, bearer_event.c,  421, m_irq_priority 6

    Unfortunately the amount of time needed to try and reproduce this product's functionality in the LPN example makes that a no-go.

  • My apologies about bbn_utilities.h. I cannot share that file, but I have replaced calls to it with calls to analogous functions in the mesh SDK (has no influence on the current bug).

    Please see attached 'lpn_irq_issue2.zip'.

    Thanks for going over this, I'm going to break your comments down into numbered sections:

    1. You are correct, I am showing a working app with NRF_MESH_IRQ_PRIORITY_LOWEST.

    The problem is exactly in trying to convert this to NRF_MESH_IRQ_PRIORITY_THREAD. If you attempt the conversion on this project you will realize:

    - there is a problem with app_flash_stabilize().

    - *everything* needs to be taken out of interrupt context and into scheduler context, which adds a fair amount of complexity.

    2. You are correct that this example calls a mesh function inside state_on_timer_handler(). However, we are *also* calling mesh from a scheduler context, in bstate_toggle().

    This correctly simulates the current situation: most times it is OK to call a mesh function from an interrupt handler, but sometimes we can't and need to call from a scheduler context.

    There are times we need to run code which takes too long to be executed in an interrupt handler.

    Taking too long can be bad in interrupt handlers because then other hardware interrupts are made to wait. Saving to flash is an example of something that takes a long time, I am sure you can think of other examples.

    3. You will notice this example runs with no problems at all if compiled with -DNDEBUG.

    So we return to the important questions:

    - Why is this a DEBUG assert and not a general assert? Is there *actually* a problem when client functions are called outside an interrupt context?

    - If possible interruptions are a problem, can we just temporarily disable the interrupt handler when we call a client function and then re-enable as soon as it returns?

    - If we can't just disable the handler, what actual difference in the chip operating modes would cause a problem? It appears that every code path through access_model_publish() results in allocation and a memory copy ... so by the time these functions return there is no longer any dependency on the caller's stack (all data has been copied away).

    - If there really is legitimate issue here and we must migrate to NRF_MESH_IRQ_PRIORITY_THREAD, how would we do that with this example project so that it works and does not hang?

    lpn_irq_issue2.zip

  • Hi Sirio, 

    I have been checking with our mesh developer about your issue. I can quote the discussion here: 

    1. Is this statement correct: One should always develop and test without -DNDEBUG, and only enable it at the release phase ? The DEBUG assert should be considered as the same normal assert ?

    > Yes. They seem to be super confident developers if they don't need any debug during development Smile 

     2.When there is an application that requires both action inside interrupt handler and thread context, it should always put the task from the interrupt handler into app_scheduler and execute all task inside main context ?

    > Yes. â€‹Once the app scheduler is used for using mesh stack APIs in a thread priority, the API calls must be routed through app_scheduler, or should be done in main() thread (it is one and the same, since app scheduler is triggered from main loop).

    3. Can the Flash API be called at lower context ?For example we configure the mesh stack as NRF_MESH_IRQ_PRIORITY_LOWEST but we execute flash task inside main context ?I can see we doing that in our example when initialize.

    > No. This is not possible; it will trigger the same assertion. If you look closely, flash APIs are called before any dynamic activity begins (i.e. before bearer handler is started). The APIs called during initialization don't really trigger any action inside the core modules for flash immediately. They simply queue-up the actions until mesh stack's own scheduler (bearer handler) is started. When the bearer handler is running, calling flash related APIs "may" trigger immediate flash actions if the queue is empty. This will cause interrupt context priority assertions if these (i.e. time-critical) API calls are not made from the correct priority. Therefore, as a rule of Thumb, once mesh stack is operated from thread priority all APIs should be called from that priority. 

    So, from this discussion, my thought regarding your issue is: 

    - Your code didn't have any issue with SDK v3.2, was because you used -DNDEBUG and the assertions were masked out. But there's still potential error that didn't get caught and in many case ended up in a reset that went unnoticed. 

    - Even though it will be quite painful but I think you should move all of the mesh APIs inside interrupt handler to be called from app_scheduler instead. If this solution is not acceptable, then another option is initialize the mesh stack in NRF_MESH_IRQ_PRIORITY_LOWEST  and call the flash API inside a software interrupt, or use a hardware that is not used to switch context. As you can find here, the mesh stack uses QDEC hardware interrupt for its own software activity. You can think of using something similar. So instead of execute the flash activity right inside the interrupt handler, you scheduler it in app_scheduler. And when you know that it's the period when there isn't much activity happens, you can trigger the dummy interrupt and then inside that interrupt handler you do flash activity. 

  • Ok, thank you very much for the detailed explanation and I understand.

    Looking at mesh examples I don't see any which implement NRF_MESH_IRQ_PRIORITY_THREAD.

    Is there any way we can get an example of how Nordic recommends we use NRF_MESH_IRQ_PRIORITY_THREAD for reference?

  • Hi Sirio, 

    I checked with the team, but don't have any example that use NRF_MESH_IRQ_PRIORITY_THREAD for now, unfortunately. 

    I can try to make an example but I can't promise on the timeline. 

  • Ok, thank you.

    Any help you can give on this is much appreciated.

Reply Children
Related