
BLE DFU switching failure

Hey,

We've been having issues with a subset of a recently produced batch of devices: an unacceptably high failure rate when switching from firmware to DFU (the switch is triggered via BLE). This has worked until now, and I'm very puzzled as to how this could be happening and what steps to take to remedy it. As it affects ~10% of a batch of 300 devices, it is kind of a big deal. The bootloader is almost stock nRF, and the few modifications we made are not in this path.

The in-firmware switching code was a standard DFU register assignment; it temporarily looks like this, to make sure the register is actually set properly:

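    // Clear every bit of GPREGRET (register 0) before setting the DFU flag.
    err_code = sd_power_gpregret_clr(0, 0xffffffff);
    APP_ERROR_CHECK(err_code);
    err_code = sd_power_gpregret_set(0, BOOTLOADER_DFU_START);
    APP_ERROR_CHECK(err_code);
    // Temporary: spin until the flag reads back (exits as soon as any flag bit is set).
    uint32_t reg_val = 0x00;
    while (!(reg_val & BOOTLOADER_DFU_START)) {
        sd_power_gpregret_get(0, &reg_val);
    }
    // Notify the application, then shut down and reset into the bootloader.
    m_dfu.evt_handler(BLE_DFU_EVT_BOOTLOADER_ENTER);
    nrf_pwr_mgmt_shutdown(NRF_PWR_MGMT_SHUTDOWN_GOTO_DFU);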
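
One detail about the readback loop above: `reg_val & BOOTLOADER_DFU_START` becomes non-zero as soon as any single bit of the flag is set, so the loop does not prove that the full value was stored. Below is a minimal host-side sketch of the difference between that loose test and a strict comparison. `DFU_START_FLAG`, its value 0xB1, and both helper names are assumptions made for illustration, not SDK definitions; the real value of BOOTLOADER_DFU_START should be checked in the SDK headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed multi-bit flag value, for illustration only. */
#define DFU_START_FLAG 0xB1u

/* Loose test, same shape as the loop condition above:
 * true if ANY bit of the flag is present in the register value. */
static bool flag_any_bit_set(uint32_t reg)
{
    return (reg & DFU_START_FLAG) != 0u;
}

/* Strict test: true only if ALL bits of the flag are present. */
static bool flag_all_bits_set(uint32_t reg)
{
    return (reg & DFU_START_FLAG) == DFU_START_FLAG;
}
```

On the target, the strict form of the loop condition would be `(reg_val & BOOTLOADER_DFU_START) != BOOTLOADER_DFU_START`.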

Following this, the device properly switches over to bootloader (we can see this), attempts to read the DFU register (seen by its entry into dfu_enter_check), and... hard-faults coming out of it and resets, thus going back into app mode since the register is RAM-backed.

What could be causing this? And, more importantly, is there another way to remedy this?

  • Hi,

    Could you clarify whether the defective devices crashed in the bootloader rather than in the application?
    Could you get the log from the bootloader when the hardfault happens?
    Was the issue consistent? If you reflash the board, do you still see the issue? If it's consistent on some devices, I would suggest flashing the bootloader with debug settings so that you can debug the bootloader and check why it crashes.

    Something else you may want to test is to not flash the application, let only the softdevice + bootloader run, and check whether you still get the crash.

  • Hey,

    I ran a comprehensive set of tests and, unfortunately, nothing seems to fix it, although some things have been changed to make the switch to DFU happen at least sometimes (which is both good and bad, as it causes our hardware testing protocols to pass).

    Changes done:

    • The application and the testing firmware were not closing BLE connections or disabling the BLE stack before switching to the bootloader
    • It was possible to put the application into an unrecoverable state because of an additional component attached over UART and the shutdown logic around it

    Could you clarify whether the defective devices crashed in the bootloader rather than in the application?

    The device crashes in the bootloader. No crash happens in the application, and BLE works perfectly there; we use a BLE write-only service descriptor to trigger the switch to DFU.

    On the very first run (after testing FW), the device switches to bootloader, turns on DFU and crashes ~5s into it (long enough for the test rig to notice it advertising). After subsequent runs, even if the device is powered off and back on, the device crashes when the softdevice is being initialized in the bootloader.

    We have noticed an oddity with our test rig and the I2C bus, but it might be unrelated.

    This firmware works on ~60-65% of devices, which makes the issue even weirder. Devices that fail, fail all the time with a size-optimized firmware. Devices that work, work all the time.

    Could you get the log from the bootloader when the hardfault happens?

    This is where things get interesting; instrumenting the bootloader causes it to work. It almost feels like a timing issue, but I'm not entirely sure.

    As a result, the best I can do is reserve some space in flash to store the crash information.
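
    A rough sketch of what such a crash record could look like, assuming the fault handler has access to the Cortex-M exception stack frame (R0-R3, R12, LR, PC, xPSR, in that order). Everything here (`crash_record_t`, `CRASH_MAGIC`, the helper names) is hypothetical and not part of the SDK, and the actual flash write is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define CRASH_MAGIC 0xC0DEFA17u  /* hypothetical "record present" marker */

/* Hypothetical layout of one crash record kept in a reserved flash area. */
typedef struct {
    uint32_t magic;     /* CRASH_MAGIC when a record has been written */
    uint32_t lr;        /* stacked link register */
    uint32_t pc;        /* stacked program counter (faulting address) */
    uint32_t xpsr;      /* stacked program status register */
    uint32_t cfsr;      /* configurable fault status, passed in by caller */
    uint32_t hfsr;      /* hard fault status, passed in by caller */
    uint32_t checksum;  /* XOR of all fields above */
} crash_record_t;

static uint32_t crash_checksum(const crash_record_t *r)
{
    return r->magic ^ r->lr ^ r->pc ^ r->xpsr ^ r->cfsr ^ r->hfsr;
}

/* Fill a record from the 8-word exception frame: R0-R3, R12, LR, PC, xPSR. */
void crash_record_fill(crash_record_t *r, const uint32_t frame[8],
                       uint32_t cfsr, uint32_t hfsr)
{
    r->magic    = CRASH_MAGIC;
    r->lr       = frame[5];
    r->pc       = frame[6];
    r->xpsr     = frame[7];
    r->cfsr     = cfsr;
    r->hfsr     = hfsr;
    r->checksum = crash_checksum(r);
}

/* True only for a complete record; an erased area (all 0xFF) is rejected. */
bool crash_record_valid(const crash_record_t *r)
{
    return r->magic == CRASH_MAGIC && r->checksum == crash_checksum(r);
}
```

    On the target, the fault status values would typically be read from SCB->CFSR and SCB->HFSR, and the filled record would be written to the reserved flash area and read back after the reset. Since this adds very little code to the normal boot path, it may disturb the timing less than full logging does, but that is an assumption to verify.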

    Something else you may want to test is to not flash the application, let only the softdevice + bootloader run, and check whether you still get the crash.

    I have done a comprehensive amount of testing on what does and does not trigger this. For reference, the firmware uses the following:

    • softdevice s132, version 6.0.0
    • SDK 15.0

    Upgrading the softdevice+SDK has made the failure non-deterministic, which is both a problem and an indicator that something deeper may be up.

    Building the app+bootloader with full debug capabilities causes it to work.

    Building the app+bootloader, flashing it via openocd and debugging it via gdb causes it to work.

    Removing the app entirely and forcing the bootloader to immediately switch to DFU causes it to work.

    Replacing the app with the buttonless DFU does not work.

    In terms of resources at my disposal, I can provide you with access to the firmware source code along with pointers for the critical paths used (so you do not have to trawl through 75kLOC of unnecessary code).

  • Hi Sebrenauld, 

    Have you found exactly where it crashed in the bootloader? If you can't run in debug mode, you can add logging or LED toggling to see exactly where the crash happened.

    What you described, that some devices have the issue and some don't, suggests that not all of the errata workarounds may have been implemented. We have multiple errata workarounds implemented in system_nrf52.c in our SDK; please verify that you have them in your firmware, both in the bootloader and in the application, especially errata 16 and errata 108.

    How do you trigger the switch to the bootloader? With nrf_pwr_mgmt_shutdown(NRF_PWR_MGMT_SHUTDOWN_GOTO_DFU); or with a direct soft reset?

    If you dump the flash of a defective unit and flash that onto a working unit, does that work? And the opposite?

    Have you tested with only bootloader + softdevice? You can simply modify the application firmware area (erase one page, for example); this will force the bootloader to stay in DFU mode without jumping to the application. Then we can check whether the bootloader can boot up and enter DFU mode or not (without the switch from the application).

    If you make a very simple application that only triggers DFU entry when you press a button, can you reproduce the issue? If you can make a minimal firmware that we can use to reproduce the issue here, it will be very useful.

     
