nrf_802154_request_swi.c doesn't properly ensure SWI requests are completed before returning

It seems to be an assumption in the swi request functions that after req_exit, the swi request should be completed. However, in certain cases, this invariant doesn't hold. I have not narrowed down what's causing this behavior, but I have made a repro on a devkit with a minimally modified nordic provided sample. This behavior can completely break the driver. For instance, it can result in nrf_802154_request_sleep returning false before the SWI ever finished the operation. The SWI never gets the chance to update p_result prior to it being copied to the return value.

Repro:

1. Make new workspace directory i.e. mkdir workspace.

2. Install whatever tools you need (west, sdk, pip install -r, etc.).

3. mkdir .west.

4. Add the attached config file into .west.

5. mkdir west-manifest.

6. Add attached west.yml into west-manifest folder.

7. west update.

8. Run west build -b nrf54l15dk/nrf54l15/cpuapp/ns ./nrf/samples/openthread/cli -- -DCONFIG_SPEED_OPTIMIZATIONS=y from workspace folder.

9. Run app. You'll notice nothing happens. Openthread loops trying to put the radio to sleep and constantly failing since sleep returns false before the SWI gets a chance to act.

10. Apply attached patch to nrfxlib.

11. Run the same build command and run again. Now you'll see the devkit continually print out "This should never happen!" since the `touched` variable added to the nrf_802154_req_data_t isn't updated by the SWI after req_exit() returns.

One of my initial thoughts was the compiler might be optimizing and assuming that p_slot can never change, but that doesn't seem to be the case when looking at the disassembly. 

[manifest]
path = west-dir
file = west.yml

[zephyr]
base = zephyr
manifest:
  remotes:
    - name: nrfconnect
      url-base: https://github.com/nrfconnect

  projects:
    - name: nrf
      remote: nrfconnect
      repo-path: sdk-nrf
      revision: 002e57174cdd7d6ec7dbb51745a484b34b7ce229
      import:
        name-allowlist:
          - zephyr
          - nrfxlib
          - cmsis_6
          - hal_nordic
          - libmetal
          - open-amp
          - segger
          - openthread
          - mbedtls
          - mcuboot
          - trusted-firmware-m
          - oberon-psa-crypto
          - nrf-802154

diff --git a/nrf_802154/driver/src/nrf_802154_request_swi.c b/nrf_802154/driver/src/nrf_802154_request_swi.c
index 1e884e85f..156bc5b77 100644
--- a/nrf_802154/driver/src/nrf_802154_request_swi.c
+++ b/nrf_802154/driver/src/nrf_802154_request_swi.c
@@ -101,6 +101,7 @@ typedef enum
 typedef struct
 {
     nrf_802154_req_type_t type; ///< Type of the request.
+    bool touched;
 
     union
     {
@@ -333,6 +334,7 @@ static bool active_vector_priority_is_high(void)
  * @param[in]   term_lvl  Termination level of this request. Selects procedures to abort.
  * @param[out]  p_result  Result of entering the sleep state.
  */
+void printk(const char* fmt, ...);
 static void swi_sleep(nrf_802154_term_t term_lvl, bool * p_result)
 {
     nrf_802154_req_data_t * p_slot = req_enter();
@@ -340,8 +342,16 @@ static void swi_sleep(nrf_802154_term_t term_lvl, bool * p_result)
     p_slot->type                = REQ_TYPE_SLEEP;
     p_slot->data.sleep.term_lvl = term_lvl;
     p_slot->data.sleep.p_result = p_result;
+    p_slot->touched = false;
 
     req_exit();
+    if (!p_slot->touched)
+    {
+        while(true)
+        {
+            printk("This should never happen!\n");
+        }
+    }
 }
 
 /**
@@ -856,6 +866,7 @@ static void irq_handler_req_event(void)
             case REQ_TYPE_SLEEP:
                 *(p_slot->data.sleep.p_result) =
                     nrf_802154_core_sleep(p_slot->data.sleep.term_lvl);
+                p_slot->touched = true;
                 break;
 
             case REQ_TYPE_RECEIVE:

  • Hi, 

    Please refer to Installing the nRF Connect SDK to install NCS. 

    Which NCS version would you want to use?

    Regards,
    Amanda H.

  • Reproduced on 3.1.1 of NCS as well using the same openthread/cli sample. Just change to optimize for speed and you get the behavior described above. Applying the patch I sent to nrfxlib in the NCS sdk results in the exact same behavior as well i.e. cd ncs/v3.1.1/nrfxlib; git apply <my_patch>. 

  • Hi, 

    Our FW is not verified against CONFIG_SPEED_OPTIMIZATIONS. Do you require it? Is there any specific use case for this?

    Do you see the same issue when building without "-- -DCONFIG_SPEED_OPTIMIZATIONS=y"?

    -Amanda H.

  • I don't need it. My current work around is turning it off. Regardless, SPEED_OPTIMIZATIONS breaking the firmware seems like a bug. It seems possible that turning it off is just hiding some other bug instead of SPEED_OPTIMIZATIONS being the cause of the bug.

  • (updated)

    michael.feist.etc said:
    I don't need it. My current work around is turning it off.

    If you don't need it, you can turn it off. 

    The nRF 802.15.4 Radio Driver is deeply verified with the default CONFIG_SIZE_OPTIMIZATIONS=y and the compiler provided with the given nRF Connect SDK release. The nRF 802.15.4 Radio Driver is not verified at all with other optimizations,  CONFIG_SPEED_OPTIMIZATIONS=y being just one of them. We can't verify against all Kconfig options from Zephyr or NCS because the number of configurations to test grows exponentially.

    The team has reproduced the issue, and the root cause is that the interrupt does not execute immediately after interrupts are re-enabled in req_exit. As a quick fix, I could suggest to change the req_exit function in the nrf_802154_request_swi.c so that there are __DMB() and __ISB() instructions as below:

    static void req_exit(void)
    {
        nrf_802154_queue_push_commit(&m_requests_queue);
    
        nrf_egu_task_trigger(NRF_802154_EGU_INSTANCE, REQ_TASK);
        __DSB();
    
        nrf_802154_mcu_critical_exit(m_mcu_cs);
    
        __DSB();
        __ISB();
    }
     

    This causes the reproduction with the p_slot->touched , as you provided in the nrfxlib-repro.patch.txt , does not occur anymore.

    However, please take into account that we currently do not test the driver with CONFIG_SPEED_OPTIMIZATIONS=y. Even with this fix, there might also be other issues.

Related