BLE Recovery Image

I am trying to make a recovery image for my app that allows me to press a button on power-up to recover a possibly bad update over BLE.

If an update is sent to a device that has a critical error that prevents the SMP SVR from running or BLE devices from connecting, I would like to be able to boot into a recovery mode that allows a new update to fix the device.

With the NRF5 SDK, I could do this easily since the DFU image ran separate from the app.

I do not have access to a uart so I cannot use the MCUboot recovery.

My issue is like this case, but I do not see a solution here:

devzone.nordicsemi.com/.../make-mcuboot-select-recovery-app 

I was thinking I could use the CONFIG_BOOT_UPGRADE_ONLY setting to keep MCU Boot from using Slot1.  Then use slot1 for the recovery image?

I could either:

  1. Swap golden recovery image:

    Keep a working, tested copy in Slot1 and copy this to Slot0 when recovery mode is activated.
  2. Make a minimal build of the SMP SVR BLE for the recovery image.

    This would be nice as I would have more space for the app.

+-------------------+
| MCU Boot          |
+-------------------+
| App (slot0)       |
+-------------------+
| Recovery (slot1)  |
+-------------------+

Questions:

  1. I believe I need to set up the recovery image as a child image, correct? Are there any examples to look at on how to do this?

    1. Perhaps I can just specify a hex file for the recovery image if doing approach number 1?
    2. Can I make a custom recovery child image with a minimal SMP SVR BLE build that can update slot0?

  2. Any recommendations on either of these approaches or a different approach?

  3. My biggest concern is how to setup the partitions for this.
  • Hi Luke

    nRF5 SDK vs nRF Connect SDK

    The partitioning for the nRF5 bootloader vs MCUboot looks like this:

    The reason you could do Bluetooth Low Energy (BLE) recovery in the nRF5 SDK was that the SoftDevice was not a part of the application.
    That way, the bootloader could interact with the SoftDevice.

    In the nRF Connect SDK, we do not supply a solution for BLE Recovery.

    However, what you want is not BLE recovery specifically. You want to do DFU over BLE (also known as FOTA) and make sure that your device does not break.
    As you know, this is a very common use case for DFU on our devices.

    The Default method

    The default MCUboot solution for doing FOTA is to have BLE functionality in your application.
    Then when you update the application, it will save the previous application:


    (ref my unofficial explanation here)

    The bootloader will enter test mode on the first DFU, then you test properly and if something fails you can revert to the working app.
    If it works, you can confirm the application.
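
    As a rough illustration of the test/confirm/revert flow described above, here is a toy Python model. It is not MCUboot code; the `Device` class and all of its fields are made up for this sketch, and it only mimics the slot bookkeeping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Toy model of two equally sized slots (hypothetical, not real MCUboot)."""
    primary: str                      # image currently in slot 0
    secondary: Optional[str] = None   # image currently in slot 1
    pending: bool = False             # slot 1 is marked "test"
    confirmed: bool = True            # image in slot 0 has been confirmed

    def upload(self, image: str) -> None:
        """FOTA: store a new image in slot 1 and mark it for a test boot."""
        self.secondary = image
        self.pending = True

    def reboot(self) -> str:
        """Return the image that runs after the bootloader has decided."""
        if self.pending:
            # Test swap: new image moves to slot 0, old image is kept in slot 1.
            self.primary, self.secondary = self.secondary, self.primary
            self.pending = False
            self.confirmed = False
        elif not self.confirmed:
            # The test image never confirmed itself: swap back (revert).
            self.primary, self.secondary = self.secondary, self.primary
            self.confirmed = True
        return self.primary

    def confirm(self) -> None:
        """Called by the running application once it trusts itself."""
        self.confirmed = True


dev = Device(primary="app-v1")
dev.upload("app-v2-broken")
assert dev.reboot() == "app-v2-broken"   # first boot runs the image in test mode
assert dev.reboot() == "app-v1"          # never confirmed, so it is reverted
```

    The point of the model is the last two lines: as long as the new image is not confirmed, a plain reset is enough to get the previous application back.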

    Your suggestion, Recovery partition

    To make sure that we are on the same page, I will sum up your proposed solution.

    Something like this right?

    With a recovery partition in the secondary slot.

    Swap golden recovery image:

    Keep a working, tested copy in Slot1 and copy this to Slot0 when recovery mode is activated.

    In the default solution, we assume that the previous application is equal to the golden image, as it presumably worked.

    For the device to become bricked at this point, you need to upload a faulty image, and then upload another faulty image after.

    The chance of this is low.
    First, you should test the images before you confirm them.
    Then, the chance that you are able to do FOTA with a faulty image is also low.
    Lastly, the chance of doing FOTA with two faulty images in a row is even lower.

    But is not the BLE Recovery image simpler, and therefore less likely to fail if I swap to it?
    This is true, yes. However, if you for example make sure that the FOTA service starts first on your device and nothing else starts for a couple of seconds after a reset, I would argue that you decrease the chance of something else failing.
    Then the user can just reset the device if something deadlocks.

    Recovering from a confirmed image is still impossible without serial recovery.
    However, even if you have a golden recovery partition, you usually stop being in "test mode" after a DFU.
    Now that the application is Confirmed, the bootloader will not swap back.
    The only way to make it swap back is to do a new DFU or tell MCUboot to revert from the application. You can do neither of these if the application bricks.
    In other words: you will have the same issue here no matter if you use a golden recovery image or the previous application.
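
    The reasoning in this section can be written down as a small decision function. This is only a sketch of the argument above; `can_recover` and its parameters are hypothetical names, and it assumes the device can always be reset.

```python
def can_recover(confirmed: bool, app_dfu_works: bool, serial_recovery: bool) -> bool:
    """Return True if a device with a faulty image in the primary slot can be fixed."""
    if serial_recovery:
        # The bootloader can receive a new image without help from the application.
        return True
    if not confirmed:
        # Still in test mode: the bootloader swaps back on the next reset.
        return True
    # Confirmed image: only the application can start a new DFU or ask for a revert.
    return app_dfu_works


# A confirmed image whose DFU is broken cannot be fixed without serial recovery,
# regardless of what is stored in the secondary slot.
assert can_recover(confirmed=True, app_dfu_works=False, serial_recovery=False) is False
```

    Note that the content of the secondary slot is not an input at all, which is why a golden image and the previous application behave the same here.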

    Make a minimal build of the SMP SVR BLE for the recovery image.

    This would be nice as I would have more space for the app.

    It would be nice to have more space for the application.
    Unfortunately MCUboot requires that the slots that are swapped (primary and secondary) are the same size, so this does not work.
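
    To make the size constraint concrete, here is a small Python check for a partition layout. The offsets and sizes are hypothetical values for a 1 MB flash, chosen only for illustration, and `layout_problems` is a helper made up for this sketch, not part of any SDK tool.

```python
from typing import List, NamedTuple


class Partition(NamedTuple):
    name: str
    offset: int
    size: int


def layout_problems(parts: List[Partition], flash_size: int) -> List[str]:
    """Return human-readable problems; an empty list means the layout is OK."""
    problems = []
    ordered = sorted(parts, key=lambda p: p.offset)
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.offset + prev.size > cur.offset:
            problems.append(f"{prev.name} overlaps {cur.name}")
    for p in parts:
        if p.offset + p.size > flash_size:
            problems.append(f"{p.name} does not fit in flash")
    slots = {p.name: p for p in parts}
    # The constraint from the text: swapped slots must have the same size.
    if slots["primary"].size != slots["secondary"].size:
        problems.append("primary and secondary slots differ in size")
    return problems


FLASH_SIZE = 0x100000  # hypothetical 1 MB part

equal_slots = [
    Partition("mcuboot", 0x00000, 0x0C000),
    Partition("primary", 0x0C000, 0x7A000),
    Partition("secondary", 0x86000, 0x7A000),
]
small_recovery = [
    Partition("mcuboot", 0x00000, 0x0C000),
    Partition("primary", 0x0C000, 0xC4000),    # larger application slot
    Partition("secondary", 0xD0000, 0x30000),  # minimal recovery slot
]

assert layout_problems(equal_slots, FLASH_SIZE) == []
assert layout_problems(small_recovery, FLASH_SIZE) == [
    "primary and secondary slots differ in size"
]
```

    The second layout is the one you would like to have with a minimal recovery image, and it is exactly the one that the equal-size requirement rules out.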

    To fix this you would have to change MCUboot (see "Change MCUboot" below).

    Add BLE to MCUboot

    I have to start by saying: I do not recommend this.
    But it is worth mentioning this option anyway.

    If you want a golden recovery partition anyway, it would be more or less the same as adding BLE to the bootloader:

    • Firstly, it would require roughly the same amount of flash space.
    • Secondly, you require the golden recovery partition to be stable as well, so it could just as well be a part of MCUboot in that case.
    • You would write to flash fewer times by not having to swap every time you need to boot.
    • You could actually have recovery, and be able to detect FOTA on each boot, no matter whether the application has crashed.

    Now all of this sounds good, but here are some reasons why you should not have BLE in MCUboot:

    • You can never update the BLE part of MCUboot. If any security vulnerabilities are found, you can not fix them.
    • A bootloader should be as simple as possible, to decrease the chance of failure.
    • You do not want to change MCUboot (see "Change MCUboot" below).

    Change MCUboot

    MCUboot is a large project with over 1800 commits.
    The official code in MCUboot has been verified by others before being added to the project.
    The MCUboot code has been tested by a lot of other people, and they are likely to have found errors, especially fatal ones.

    For any change to MCUboot, you lose some of these assurances.
    Small changes, such as one-line fixes, have a rather low chance of breaking things if tested properly.
    But if you add actual features to MCUboot, such as different-size slots or BLE functionality, I would say that you greatly increase the risk of something going wrong.
    This is why I do not recommend large changes to MCUboot.

    How to make sure the default method works

    Okay, so if you are to use the default method, what can you do to make sure it does not fail?
    Here are some suggestions from me.
    Disclaimer: These are ideas from the top of my head. It is not an exhaustive list, but it shows that there are multiple things you can do to make your application more reliable.

    Test the new application

    When doing DFU, enter test mode by default.
    Then do thorough testing of the application before confirming it.
    I must admit that I have not worked a lot with writing tests for this, so I can not give any specific tips on this.

    Fail to reset

    Make sure that your application resets on errors as often as possible.
    Then make sure that your application has a known reset behavior, preferably one you can do FOTA from.

    What about unexpected deadlocks?

    If the customer can manually reset the device, for example by removing a battery, this should be fine I think.

    Risk Management

    What about cosmic rays, what if the application still bricks?

    I will claim that no application can be 100% fault safe.
    Reliability is really just risk management.

    How much are you willing to invest to make the device reliable?
    And how reliable do you want your device to be?
    What is an acceptable chance for a device to brick?

    So maybe even ask yourself: If the device is cheap, is it fine if some of the devices are returned?

    On the other hand, if the device is expensive/critical, maybe you should fix this in hardware and add an option for serial recovery?

    Remarks

    In short, I recommend the default MCUboot method of swapping to the previous image and testing.

    This has been a lot, but I hope to reuse the answer, because you are neither the first nor the last to ask about this.
    I also hope that it gave you some insights into this.
    Let me know if you have any questions about this!

    Regards,
    Sigurd Hellesvik

  • On the other hand, if the device is expensive/critical, maybe you should fix this in hardware and add an option for serial recovery?

    Not possible, all of our products are sealed for waterproofing, and do not have the space or budget for an expensive serial recovery port.  It would be very hard to convince a product manager to add an expensive waterproof port when we didn't need it on the old SDK.

    First, you should do testing of the images before you confirm them.

    True, except covering all possible issues is nearly impossible in an embedded system.  Especially if multiple developers are working on a system.

    My biggest concern is the SMP SVR Bluetooth setup itself.  The example code for the BLE SMP SVR required the Bluetooth TX buffer count to be increased.  What if a junior dev changes this on a subsequent build?  This could potentially brick subsequent updates for all of our devices. Hopefully, it would be caught in testing, but this seems like a foot gun that I would not want to put in a system.

    It seems like a very risky setup to include the firmware update gateway (SMP) in the application build.  Most embedded devices have some sort of failsafe to enter the bootloader to avoid totally bricking the system (button combos on phones, bios on a computer, etc).  I guess this failsafe is only the serial bootloader for MCUBoot, but that isn't an option for us.

    Even if we only update in Test mode, how do we know the updated app has a working SMP?  You can't really test this in the field after the update, you have to trust it works. It would be totally reliant on HIL testing in-house with no fail-safe backup.  

    So maybe even ask yourself: If the device is cheap, is it fine if some of the devices are returned?

    Most of our products range from fairly pricey to expensive.  I understand there will always be RMA fallout for hardware, but I do not consider it acceptable to possibly brick devices in the field with software.  This has the unlikely but real potential to ruin all products, not just a few.

    Question:

    Wouldn't it be possible to include a build of the SMP update and place it in its own partition?   EG: MCUBoot, Slot0, Slot1, Recovery.  Then I could have MCUBoot jump to the recovery partition on recovery condition (buttons held on power up). 

    The nice thing about this is the SMP Update would be tested initially and then not change with application updates.  It would be fixed in flash and never update or change. 

  • Hi Luke

    Thank you for all the good points above!
    It is very useful both to get more insight into what our customers use, and to get some other views on the matter!

    I think I understand what you want to have here, and as previously mentioned you are not the first to ask about this.
    I have discussed this a bit with some of my colleagues in tech support, and so far we think maybe a golden SMP partition could be done without changing MCUboot.

    We will have a talk with our bootloader developers about this, to see what options we have, and what would be the best approach.

    I will return with some more information tomorrow or Thursday on what we found.

    Regards,
    Sigurd Hellesvik

  • Hi Luke,

    After talking with both my colleagues and some of our developers, here is the solution I think is the best one.
    Keep in mind that this is a design question and there are likely multiple good solutions.

    Neither we nor MCUboot support such a golden recovery partition yet.
    I will suggest it to our developers, and it will be up to them if and when they want to implement such a feature.
    So if you want it soon, you will have to do the heavy lifting yourself.
    I think you already knew this; I just say it to be sure.

    Anyhow, here is my suggestion:

    Swap into golden recovery partition on button hold

    Firstly, just use the same partitioning as for normal MCUboot behavior.
    The images will be swapped if the secondary slot is tagged with "test" or "confirm", as normal.

    I assume that you have a button available on the device.
    If not, there are likely other ways to trigger the golden recovery.

    Then add something like this to MCUboot (pseudo-code):

    if (button_held):
        mark_secondary_slot_as_test()

    Since the secondary slot is now tagged as "test", MCUboot will swap the slots and boot into Golden Recovery.
    When in Golden Recovery, do BLE SMP and write the new application to the current secondary slot.
    On the next reboot, MCUboot swaps again and boots into the new application.
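
    The sequence can be traced with a toy Python model. This is not MCUboot code: `Flash`, `boot` and `recovery_receive` are names made up for this sketch, and it leaves out the revert of unconfirmed images to keep the swap steps easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Flash:
    primary: str                     # image in slot 0
    secondary: str                   # image in slot 1
    secondary_pending: bool = False  # slot 1 is tagged "test"


def boot(flash: Flash, button_held: bool) -> str:
    """Hypothetical button hook followed by the normal swap decision."""
    if button_held:
        flash.secondary_pending = True   # the proposed addition to MCUboot
    if flash.secondary_pending:
        flash.primary, flash.secondary = flash.secondary, flash.primary
        flash.secondary_pending = False
    return flash.primary


def recovery_receive(flash: Flash, new_app: str) -> None:
    """Golden Recovery receives an application over BLE SMP into slot 1."""
    flash.secondary = new_app
    flash.secondary_pending = True


flash = Flash(primary="app-broken", secondary="golden-recovery")
assert boot(flash, button_held=True) == "golden-recovery"
recovery_receive(flash, "app-fixed")
assert boot(flash, button_held=False) == "app-fixed"
assert flash.secondary == "golden-recovery"   # recovery image is back in slot 1
```

    After the second boot the device is back in its normal state, with the new application in the primary slot and Golden Recovery in the secondary slot.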

    Limit application from overwriting golden recovery partition

    When doing this, it is important that the application is not able to overwrite the Golden Recovery partition.
    For a start, just never add DFU/FOTA functionality to the application, and be careful to stay inside the partitions.

    If you want to be even more safe, consider adding some extra protection here. I am not sure what it would be at the moment, but as you can see below it is important that the golden recovery is updatable, so we do not want to make it read-only either.

    Available space

    MCUboot requires that both slots have the same size. So for this, the Golden Recovery Partition would have more available space than it needs.
    For this solution I can not see a way to reduce flash size, except for creating an issue with upstream MCUboot and asking them for more flexible slot sizes.
    Other solutions for Golden Recovery partitions might be able to use less flash, for example if you do something similar to this, but with NSIB+MCUboot instead. I have not looked more into this though.

    Updatable Golden Recovery Partition

    Both colleagues and developers I have spoken to agree that it is a good idea to be able to update such a golden recovery partition.
    Just because something can not change does not mean that it does not contain bugs.
    This is especially true for a large application, such as one with BLE.

    This should be possible with the above solution.
    Instead of writing a new application over the previous application, just write a Golden Recovery partition over the application.

    Now you have 2 Golden Recovery partitions and should be able to write a new golden recovery image to the right place.
    Then use the updated Golden Recovery partition to "update" the application to get back to a normal state.
    This takes a lot of writes, but this operation should not be frequently used.
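
    One possible reading of these steps, traced as slot states in Python. The helper names are made up for this sketch, each state is a `(primary, secondary)` pair, and every `swap` stands for one MCUboot swap on reboot.

```python
from typing import List, Tuple

Slots = Tuple[str, str]  # (primary, secondary)


def swap(slots: Slots) -> Slots:
    """One bootloader swap: the two slots trade contents."""
    return (slots[1], slots[0])


def write_secondary(slots: Slots, image: str) -> Slots:
    """The running image receives a new image over BLE into slot 1."""
    return (slots[0], image)


def update_golden_recovery(slots: Slots, new_recovery: str, app: str) -> List[Slots]:
    """Return every intermediate slot state of the recovery update."""
    states = [slots]
    states.append(swap(states[-1]))                           # button hold: old recovery runs
    states.append(write_secondary(states[-1], new_recovery))  # overwrite the application
    states.append(swap(states[-1]))                           # new recovery runs, old one in slot 1
    states.append(write_secondary(states[-1], app))           # overwrite the old recovery
    states.append(swap(states[-1]))                           # back to the application
    return states


states = update_golden_recovery(("app", "recovery-v1"), "recovery-v2", "app")
assert states[3] == ("recovery-v2", "recovery-v1")   # two recovery images at once
assert states[-1] == ("app", "recovery-v2")          # normal state, updated recovery
```

    In this reading the update costs three swaps and two uploads, which matches the remark that it takes a lot of writes.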

    Changes to MCUboot

    Yes, this requires that you make changes to MCUboot.
    The changes in this suggestion would be small though, so there would be a lower risk of something going wrong.

    Let us know if you have any questions or feedback on this suggestion.

    Regards,
    Sigurd Hellesvik

  • Hello Sigurd, do you know if there has been any progress on Golden Recovery Partition?

    Best Regards, Markus

  • Hi Markus,

    No, we still do not have any official implementation of such a module.

    Regards,
    Sigurd Hellesvik
