This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Recursive calls to "app_sched_execute" in nRF Bootloader flow

Hello, I have the following observation regarding the potential problem with DFU bootloader implementation :

I have noticed that the Bootloader DFU implementation (nRF5_SDK_13.1.0) allows for the recursive calls to the "app_sched_execute" which cause multiple "execution" of the same entry from the "application" queue.

The following is the detail description of the "recursive calls" scenario:

1: The normal flow of the boot-loader DFU assumes that once the bootloading has been initialized by calling uint32_t nrf_dfu_init() function (located in "nrf_dfu.c") then the bootloader "spins" in the "wait_for_event()" function while processing the events fetched from the application queue as follows:

static void wait_for_event()
  {
      // Transport is waiting for event?
      while(true)
      {
        // Can't be emptied like this because of lack of static variables
#ifdef BLE_STACK_SUPPORT_REQD
        (void)sd_app_evt_wait();
#else
        __WFI();
#endif
        app_sched_execute();
      }
  }  

2: The " app_sched_execute()" call above fetches and processes the events from the application queue as follows:

    p_event_data = &m_queue_event_data[event_index * m_queue_event_size];
    event_data_size = m_queue_event_headers[event_index].event_data_size;
    event_handler   = m_queue_event_headers[event_index].handler;

    event_handler(p_event_data, event_data_size);

    // Event processed, now it is safe to move the queue start index,
    // so the queue entry occupied by this event can be used to store
    // a next one.
    m_queue_start_index = next_index(m_queue_start_index);

Note above that the "event" is actually "poped" from the queue (by incrementing index) AFTER the "event_handler" function has been called. The reason for this is that the "event_handler" uses the pointer at the event data "p_event_data" which point at the actual queue entry so the "queue slot" (which stores that data) must be protected from being overwritten until the current event is fully processed.

3: That implementation makes sense only under the assumption that the "event_handler" does not call " app_sched_execute()" recursively (because if it does, then the the second - recursive- call to " app_sched_execute()" will process the same queue entry since the queue index has not been incremented yet as it is observed above

4: It turns out however, that the DFU bootloader flow does implement the "recursive" calls to " app_sched_execute()" that will cause processing of the same queue entry multiple times.

Here is the detail description of the "call tree" which causes the "recursivity" of the calls to the "app_sched_execute()"

  • Assume that the particular event handler for the application queue entry being processed currently calls to "nrf_dfu_req_handler_on_req" as a part of the normal DFU bootloading flow

  • assume further, the above function calls in turn "nrf_dfu_data_req(p_context, p_req, p_res);" in the case of processing NRF_DFU_OBJECT_OP_EXECUTE opcode

  • assume further that in the above case, the nrf_dfu_wait(); is called as a part of the below loop


 while(m_flash_operations_pending)
  {
   nrf_dfu_wait();
  }  

  • The call to nrf_dfu_wait() above calls the app_sched_execute() as follows :

 void nrf_dfu_wait()
 {
    #ifdef BLE_STACK_SUPPORT_REQD
    (void)sd_app_evt_wait();
    #else
    __WFI();
    #endif
    app_sched_execute();
 }

which creates a "recursion" with respect to the calls to "app_sched_execute()" and as as a result introduces the bug described above by processing the same queue entry more than once.

My question with respect to the above scenario :

  • is there anything in the DFU booltloader code handling which prevents the above scenario from happening ?

  • is there a known fix for the above problem ?

Thanks

  • You are correct that the current usage of the Application Scheduler could lead to the the behavior you're describing. This has already been discovered internally and this will be fixed in the upcoming release of the SDK.

    Unfortunately, I do not have a patch that can be applied to SDK v13.x.x, so my advice would be to wait until the next release which is expected very soon.

    Best regards

    Bjørn

  • Hello, I have noticed that in the latest SDK release nRF5_SDK_14.0.0_3bcc1f7, the nrf_dfu_wait() function has been eliminated. Does it mean that "recursivity" problem (as described in the context of this thread) has been eliminated in that SDK version ?

  • I wanted to see if Nordic could possibly provide for the "sub-set" of the changes between SDK13 and SDK14 which actually implement the fix for the "recursivity" problem of the DFU bootloader in SDK13. The reason for the question is that we have a "production-level" DFU bootloader based on SDK13 which we would like to "patch up" to fix that problem. Porting the whole bootloader from SDK 14 is not desirable in our case given our time frame

  • I am trying to understand how exactly the problem of "recursivity" was solved in the Bootloader's implementation in SDK14. I see that, for instance, in the source code file "nrf_dfu_settings.c" the call to "wait_for_queue()" from "nrf_dfu_settings_write()" function has been removed. (That call in SDK13 was one of the reasons why the "recursive" calls to the "app_sched_execute" could have happen in SDK13).

    However, it appears that without the above call to "wait_for_queue()", the calls to nrf_dfu_flash_erase() and nrf_dfu_flash_store() in SDK14 are performed without checking if there is an available room in the command queue "nrf_fstorage_sd_queue_t m_queue" in the "nrf_fstorage" module. Is is possible that the calls to nrf_dfu_flash_erase() and nrf_dfu_flash_store() might fail b/c of no room in that queue is available ? How is that prevented in SDK14 ?

Related