This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

SDK for Meshv v4.1.0 :: access_model_publish() triggers an assert on bearer_event_in_correct_irq_priority() .... but same code is in prod on SDK for Meshv v3.2.0

Sequence of events:

1. A GPIOTE button interrupt in LPN schedules a handler routine.

2. Handler routine (called by app_scheduler()) makes a call to access_model_publish() (to send an unacknowledged message).

3. Somewhere in the call stack for access_model_publish(), the assert in mesh/core/src/timer_scheduler.c:214 is triggered.

It seems we are failing the check in bearer_event_in_correct_irq_priority(), but this call is made from a function called by app_scheduler() (aka: outside of any interrupt context).

To try to debug the issue further, I printed the interrupt values:

volatile IRQn_Type active_irq = hal_irq_active_get();
volatile uint32_t prio = NVIC_GetPriority(active_irq);

/* log, then wait 5ms for buffer to flush before triggering assert */
LOG_FLUSH("active irq %d with NVIC priority %d\n", active_irq, prio);
/* log output:
 * "active irq -16 with NVIC priority 0"
 */

/* assert triggered here */
access_model_publish(/* ... */);

This code builds as-is with no warnings (using -Wall and -Wextra) on both SDK for Meshv v3.2.0 and v4.1.0.

This code runs with no issues (and is in production) on v3.2.0, but on v4.1.0 throws the above error.

Full build spec for v3.2.0:

-- Configuring CMake for nRF5 SDK for Bluetooth Mesh 3.2.0
-- SDK: nRF5_SDK_15.3.0_59ac345
-- Platform: nrf52832_xxAA
-- Arch: cortex-m4f
-- SoftDevice: s132_6.1.1
-- The C compiler identification is GNU 8.3.1

Full build spec for v4.1.0:

-- Configuring CMake for nRF5 SDK for Bluetooth Mesh 4.1.0
-- SDK: nRF5_SDK_16.0.0_98a08e2
-- Platform: nrf52832_xxAA
-- Arch: cortex-m4f
-- SoftDevice: s132_7.0.1
-- Board: pca10040
-- The C compiler identification is GNU 8.3.1

Parents
  • Hi Sirio, 

    Could you add a breakpoint inside bearer_event_in_correct_irq_priority() or put the log there to print out what caused the function return 0. 

    Do you have a minimal example with the LPN example that reproduce the issue so we can test here ?

  • Thank you for taking a look at this.

    I added some log prints, so bearer_event_in_correct_irq_priority() looks like this:

    bool bearer_event_in_correct_irq_priority(void)
    {
        volatile IRQn_Type active_irq = hal_irq_active_get();
        BBN_LOG_FLUSH("active_irq %d\n", hal_irq_active_get());
        BBN_LOG_FLUSH("hal_irq_is_enabled(EVENT_IRQn) %d\n", hal_irq_is_enabled(EVENT_IRQn));
        BBN_LOG_FLUSH("m_irq_priority %d\n", m_irq_priority);
        if (active_irq == HAL_IRQn_NONE)
        {
            return (m_irq_priority == NRF_MESH_IRQ_PRIORITY_THREAD || !hal_irq_is_enabled(EVENT_IRQn));
        }
        else
        {
            volatile uint32_t prio = NVIC_GetPriority(active_irq);
            BBN_LOG_FLUSH("prio %d\n", prio);
            return (prio == m_irq_priority || (prio > m_irq_priority && !hal_irq_is_enabled(EVENT_IRQn)));
        }
    }

    During provisioning (with proxy enabled), I am seeing:

    <t:    1398790>, bearer_event.c,  419, active_irq 22
    <t:    1398958>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:    1399126>, bearer_event.c,  421, m_irq_priority 6
    <t:    1399294>, bearer_event.c,  429, prio 6

    When establishing friendship (proxy now off), I am seeing:

    <t:   16130599>, bearer_event.c,  419, active_irq 18
    <t:   16130773>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:   16130941>, bearer_event.c,  421, m_irq_priority 6
    <t:   16131109>, bearer_event.c,  429, prio 6

    When calling access_model_publish() (to send an unacknowledged message to the mesh), when the assert is triggered, I am seeing:

    <t:   16131294>, bearer_event.c,  419, active_irq -16
    <t:   16131462>, bearer_event.c,  420, hal_irq_is_enabled(EVENT_IRQn) 1
    <t:   16131630>, bearer_event.c,  421, m_irq_priority 6

    Unfortunately the amount of time needed to try and reproduce this product's functionality in the LPN example makes that a no-go.

  • This is the compilation for v4.1.

    -- Configuring CMake for nRF5 SDK for Bluetooth Mesh 4.1.0
    -- SDK: nRF5_SDK_16.0.0_98a08e2
    -- Platform: nrf52832_xxAA
    -- Arch: cortex-m4f
    -- SoftDevice: s132_7.0.1
    -- Board: pca10040
    -- The C compiler identification is GNU 8.3.1
    
    lpn_irq_issue/src/main.c: In function 'main':
    lpn_irq_issue/src/main.c:666:5: warning: implicit declaration of function 'APP_SCHED_INIT'; did you mean 'APP_USBD_VID'? [-Wimplicit-function-declaration]
         APP_SCHED_INIT(APP_SCHED_EVENT_SIZE, APP_SCHED_QUEUE_SIZE);
         ^~~~~~~~~~~~~~
         APP_USBD_VID
    lpn_irq_issue/src/main.c:677:9: warning: implicit declaration of function 'app_sched_execute'; did you mean 'app_scheduler_init'? [-Wimplicit-function-declaration]
             app_sched_execute();
             ^~~~~~~~~~~~~~~~~
             app_scheduler_init
    
    nRF5_SDK_16.0.0_98a08e2/components/softdevice/common/nrf_sdh.c: In function 'SWI2_EGU2_IRQHandler':
    nRF5_SDK_16.0.0_98a08e2/components/softdevice/common/nrf_sdh.c:390:27: warning: implicit declaration of function 'app_sched_event_put'; did you mean 'appsh_events_poll'? [-Wimplicit-function-declaration]
         ret_code_t ret_code = app_sched_event_put(NULL, 0, appsh_events_poll);
                               ^~~~~~~~~~~~~~~~~~~
                               appsh_events_poll
    
    nrf5SDKforMeshv410src/external/app_timer/app_timer_mesh.c: In function 'timeout_handler_exec':
    nrf5SDKforMeshv410src/external/app_timer/app_timer_mesh.c:393:25: warning: implicit declaration of function 'app_sched_event_put'; did you mean 'app_scheduler_init'? [-Wimplicit-function-declaration]
         uint32_t err_code = app_sched_event_put(&timer_event, sizeof(timer_event), timeout_handler_scheduled_exec);
                             ^~~~~~~~~~~~~~~~~~~
                             app_scheduler_init
    
    FAILED: lpn_irq_issue_nrf52832_xxAA_s132_7.0.1.elf lpn_irq_issue_nrf52832_xxAA_s132_7.0.1.hex
    arm-none-eabi/bin/ld: lpn_irq_issue/CMakeFiles/lpn_irq_issue_nrf52832_xxAA_s132_7.0.1.dir/src/main.c.obj: in function `main':
    lpn_irq_issue/src/main.c:666: undefined reference to `APP_SCHED_INIT'
    
    

    However, if you examine diff #3 given in my previous post, you will see that I followed the documentation exactly and that:

    1. lpn_irq_issue/include/app_config.h contains:

    #define APP_SCHEDULER_ENABLED 1
    #define NRF_SDH_DISPATCH_MODEL 1
    #define APP_TIMER_CONFIG_USE_SCHEDULER 1
    #define APP_SCHED_EVENT_SIZE APP_TIMER_SCHED_EVENT_DATA_SIZE
    #define APP_SCHED_QUEUE_SIZE 32

    2. lpn_irq_issue/src/main.c contains:

    /* nRF5 SDK */
    #include "nrf_soc.h"
    #include "nrf_pwr_mgmt.h"
    #include "app_scheduler.h"

    3. lpn_irq_issue/CMakeLists.txt contains:

    add_executable(${target}
        "${CMAKE_CURRENT_SOURCE_DIR}/src/main.c"
    
        "${CMAKE_SOURCE_DIR}/mesh/stack/src/mesh_stack.c"
        "${CMAKE_SOURCE_DIR}/examples/common/src/mesh_app_utils.c"
        "${CMAKE_SOURCE_DIR}/examples/common/src/mesh_provisionee.c"
        "${CMAKE_SOURCE_DIR}/examples/common/src/rtt_input.c"
    
        "${CMAKE_SOURCE_DIR}/examples/common/src/simple_hal.c"
    
        "${SDK_ROOT}/components/libraries/scheduler/app_scheduler.c"

    and also contains:

    target_include_directories(${target} PUBLIC
        "${CMAKE_CURRENT_SOURCE_DIR}/include"
        "${CMAKE_SOURCE_DIR}/examples/common/include"
        "${CMAKE_SOURCE_DIR}/external/rtt/include"
        "${SDK_ROOT}/components/libraries/scheduler"
    

    So there may be an error in the SDK or the documentation?

    For future reference, if a bug report contains an exact Diff from a stock Nordic example that recreates the issue, it would be helpful to see you:

    - take the stock Nordic example

    - apply the diff

    - build the example

    If you prefer a different diff/patch format or a different bug report format that lets you more easily reproduce and run the build on your system, please let me know and I will provide that.

  • I don't think the compilation error has anything related to the change you made. It's complaining about the  missing definition  of app_sched_event_put() and APP_SCHED_INIT. Have you made sure you include path to app_scheduler.h in the project ? and include the file in main.c? 

    app_scheduler.c also needed in the project/Makefile. 

  • Thank you.

    Please review points 1-3 in my previous reply. From those you can see that:

    - "app_scheduler.h" in included in main.c

    - "app_scheduler.c" is added to the target executable in CMakeLists.txt

    - "${SDK_ROOT}/components/libraries/scheduler" is added to the include dirs in CMakeLists.txt

    - APP_SCHEDULER_ENABLED 1 is defined in app_config.h

    Diff #3 in my reply before the last includes all of these changes.

    Were you able to take the stock example and apply Diffs 1, 2 and 3 to them to reproduce the issue?

    I am sure it's probably a trivial problem, but from following the documentation exactly (see above) I'm not able to make it build.

    Would be great if you could tell me your outcome from applying the diffs and attempting the build on your side :)

  • Hi Sirio, 

    I have applied the patch and compiled without any issue. Please make sure you used the correct SDK_ROOT. 

    Attached is my lpn project. 

    My tip is to generate the SES project when you call cmake ninja (-DGENERATE_SES_PROJECTS=ON). It would be easier to debug

    lpn.zip

  • Thank you very much, your example helped me to find the problem:

    In SDK v3.2.0 (I am building against both so that I can verify migration of existing projects), there are 2 DIFFERENT files both named app_scheduler.h and both defining APP_SCHEDULER_H__

    - nRF5_SDK_16.0.0_98a08e2/components/libraries/scheduler/app_scheduler.h

    - nrf5SDKforMeshv320src/examples/common/include/app_scheduler.h

    ... this created the issue where a slight difference in include priority will result in the wrong file being included.

    I see you have already fixed this in the v4.1.0 SDK by removing examples/common/include/app_scheduler.h; very good, thank you.

    I will now try to migrate our code to run with this NRF_MESH_IRQ_PRIORITY_THREAD model.

Reply
  • Thank you very much, your example helped me to find the problem:

    In SDK v3.2.0 (I am building against both so that I can verify migration of existing projects), there are 2 DIFFERENT files both named app_scheduler.h and both defining APP_SCHEDULER_H__

    - nRF5_SDK_16.0.0_98a08e2/components/libraries/scheduler/app_scheduler.h

    - nrf5SDKforMeshv320src/examples/common/include/app_scheduler.h

    ... this created the issue where a slight difference in include priority will result in the wrong file being included.

    I see you have already fixed this in the v4.1.0 SDK by removing examples/common/include/app_scheduler.h; very good, thank you.

    I will now try to migrate our code to run with this NRF_MESH_IRQ_PRIORITY_THREAD model.

Children
  • Migrating my example to this NRF_MESH_IRQ_PRIORITY_THREAD model as recommended, I am having problems handling flash correctly.

    Specifically, I seem to loop forever waiting for flash stability.

    This code sample illustrates the problem:

        /* NOTE: error checking removed for readability */
        if (!m_device_provisioned)
        {
            static const uint8_t static_auth_data[NRF_MESH_KEY_SIZE] = STATIC_AUTH_DATA;
            mesh_provisionee_start_params_t prov_start_params = {
                .p_static_data    = static_auth_data,
                .prov_complete_cb = provisioning_complete_cb,
                .prov_device_identification_start_cb = device_identification_start_cb,
                .prov_device_identification_stop_cb = NULL,
                .prov_abort_cb = provisioning_aborted_cb,
                .p_device_uri = EX_URI_LPN
            };
            mesh_provisionee_prov_start(&prov_start_params);
        }
    
        mesh_app_uuid_print(nrf_mesh_configure_device_uuid_get());
    
        mesh_stack_start();
    
        app_flash_data_load(HANDLE, &variable, sizeof(variable));
        /* NOTE: several other app_flash_data_load() calls here */
    
        /* sometimes _not_ stable even though nothing pending */
        while (app_flash_pending || !flash_manager_is_stable())
        {
    #if NRF_SDH_DISPATCH_MODEL == NRF_SDH_DISPATCH_MODEL_INTERRUPT
            /* this works OK if NRF_SDH_DISPATCH_MODEL_INTERRUPT */
            ;
    #elif NRF_SDH_DISPATCH_MODEL == NRF_SDH_DISPATCH_MODEL_APPSH
            /* this hangs forever if NRF_SDH_DISPATCH_MODEL_APPSH */
            nrf_mesh_process();
            app_sched_execute();
    #endif
        }

  • Also note that the affected files are the SAME in both SDK 320 and 410:

    - mesh/core/src/bearer_event.c

    - mesh/core/src/timer_scheduler.c

    This means the *same* IRQ checking code passes in SDK320 and fails in SDK410.

    Does SDK320 put the caller in an interrupt context if calling from the app (non-interrupt)?

    As an additional note, it is still unclear to me why it would be an error to call the mesh stack from a LOWER (non-interrupt) priority (which is interrupted by the mesh stack if it needs to execute anything).

    This behavior was allowed in SDK320 and never caused any problems.

    Forcing the caller to be in an interrupt context to be able to call the mesh is a problem, for example:

    - when a GPIO or timer interrupt requires longer processing times and would miss the next GPIO/timer if waiting too long.

    - when handling a mesh message requires background processing like saving to flash, we want to acknowledge the message in the interrupt quickly (to avoid timing out) and schedule writing to flash in a non-interrupt context.

    - when logic requires saving to flash before calling a mesh function ... we do not want to save to flash from an interrupt context but now we can't call mesh stack outside of an interrupt context.

    It would make sense that the app (from a lower priority) should be able to call mesh stack functions, and that these can then trigger interrupts (which suspend the app) as needed.

    So could you explain a little more about why the architecture is this way?

  • I discovered that the discrepancy between SDK320 and SDK410: 320 was building with -DNDEBUG and 410 was not.

    If I build without -DNDEBUG then *both* SDKs assert on this IRQ issue.

    So, the next step would be:

    1. Understanding if it's possible to migrate to NRF_MESH_IRQ_PRIORITY_THREAD even though we calls to flash are a problem (as above).

    or

    2. Understanding a little more about this restriction (why is this a NRF_MESH_ASSERT_DEBUG and not NRF_MESH_ASSERT, is it safe to ignore in practice? also see questions above).

    Thank you for your help sorting this out.

  • Hi Sirio, 

    I'm checking with our mesh team about the NRF_MESH_ASSERT_DEBUG () and will let you know when I have any information. 


     I can find in both SDK v3.2 and SDK v4.1have this in gccarmemb.cmake: 

    set(CMAKE_C_FLAGS_DEBUG "-Og -g3" CACHE STRING "")
    set(CMAKE_C_FLAGS_MINSIZEREL "-Os -g " CACHE STRING "")
    set(CMAKE_C_FLAGS_RELWITHDEBINFO "-O3 -g " CACHE STRING "")
    set(CMAKE_C_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "")

    So when building with RELEASE flag, -DNDEBUG will be included (in both SDK v3.2 and SDK v4.1). Could you point me to where you find the difference ? 

    It's a little bit strange that in SES I don't see any place where NDEBUG is defined, i would need to check with the team. 

    Regarding your question on why a function at lower priority shouldn't call a mesh function that supposed to run on higher priority is that we don't want a mesh activity (called by this function) to be pre-empted by a mesh event (which run on higher interrupt level) this will cause unexpected behavior. 

  • Thank you very much.

    You are correct, the SDKs do not differ, but we had made a change to SDK320 remove '-g' and add '-DNDEBUG' ... this is because the "Rel" in "MinSizeRel" implies "Release" which should not have debug prints or debug symbols.

    It was not realized at the time that this impacted NRF_MESH_ASSERT_DEBUG().

    Confirmed that generated SES projects are missing NDEBUG in c_preprocessor_definitions, even if the CMake build specifies it.

    OK, I'm tracking with that logic ... so is there a function which can be called to temporarily elevate the interrupt level when calling from main()?

    The alternative NRF_MESH_ASSERT_DEBUG sounds good in theory but seems to have some issues with handling flash (as above) ... so if possible might be better to sort out this interrupt issue and stay with NRF_MESH_IRQ_PRIORITY_LOWEST?

Related