MCUboot swap using move bricks the device when power is interrupted during the move

Dear all,

I'm working with the nRF Connect SDK on the nRF52840, and we are experiencing issues with nRF Connect SDK 2.0.0, which uses sdk-mcuboot 1.9.99-ncs1.

We use the swap-using-move algorithm with swap type test to update the firmware on our devices. Our bootloader is never updated (at least not for now), and we have a primary and a secondary partition with only a single image, so no multi-image swaps. If a reboot accidentally happens after the swap has started (this happened at our factory, and I'm currently reproducing it at home), the device is occasionally (about once every 50 times) bricked unrecoverably. When I look at the device's flash, I see that the image header has moved by 0x1000 bytes. From then on, the primary partition's image header cannot be read and the device never recovers (see logs). Shouldn't the image trailer at least indicate that the swap failed, so that the device can recover using the secondary partition?

uart:~$ uart:~$ *** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46

Firmware signature verified.

Firmware version 641

Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x1, image_ok=0x1
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Swap type: test
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Starting swap using move algorithm.
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46

Firmware signature verified.

Firmware version 641

Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
W: Failed reading image headers; Image=0
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: none
E: Unable to find bootable image
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46

Firmware signature verified.

Firmware version 641

Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
W: Failed reading image headers; Image=0
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: none
E: Unable to find bootable image
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46

Firmware signature verified.

Firmware version 641

Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
W: Failed reading image headers; Image=0
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: none
E: Unable to find bootable image
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.

Attempting to boot from address 0x9200.

These are the logs from when it recovers from a power interruption:

uart:~$ [00:00:45.716,552] <inf> ftp: Finish transfer 0
uart:~$ Requesting upgrade

uart:~$ uart:~$ *** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.


Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46


Firmware signature verified.

Firmware version 641


Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x1, image_ok=0x1
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Swap type: test
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Starting swap using move algorithm.
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.


Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46


Firmware signature verified.

Firmware version 641


Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
I: Starting swap using move algorithm.
*** Booting Zephyr OS build v3.0.99-ncs1  ***
Attempting to boot slot 0.


Attempting to boot from address 0x9200.

Verifying signature against key 0.

Hash: 0xb6...46


Firmware signature verified.

Firmware version 641


Booting (0x9200).

*** Booting Zephyr OS build v3.0.99-ncs1  ***
I: Starting bootloader
I: Primary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: primary slot
I: Starting swap using move algorithm.
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Bootloader chainload address offset: 0x23000
I: Jumping to the first image slot
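
For readers following the logs, the numeric fields can be decoded with a small helper. This is only a sketch: the numeric values below (1 = set, 2 = bad, 3 = unset for the flags; 1 = none, 2 = test, 3 = perm, 4 = revert for swap_type) are assumed to match MCUboot's bootutil definitions, and `flag_name`, `swap_type_name` and `describe_slot` are hypothetical names, not MCUboot functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed numeric values of the trailer flags printed in the log. */
enum trailer_flag { FLAG_SET = 0x1, FLAG_BAD = 0x2, FLAG_UNSET = 0x3 };

/* Assumed numeric values of the swap_type field printed in the log. */
enum swap_type { SWAP_NONE = 0x1, SWAP_TEST = 0x2, SWAP_PERM = 0x3, SWAP_REVERT = 0x4 };

static const char *flag_name(unsigned v)
{
    switch (v) {
    case FLAG_SET:   return "set";
    case FLAG_BAD:   return "bad";
    case FLAG_UNSET: return "unset";
    default:         return "unknown";
    }
}

static const char *swap_type_name(unsigned v)
{
    switch (v) {
    case SWAP_NONE:   return "none";
    case SWAP_TEST:   return "test";
    case SWAP_PERM:   return "perm";
    case SWAP_REVERT: return "revert";
    default:          return "unknown";
    }
}

/* Format one decoded log line into buf; returns the snprintf() result. */
static int describe_slot(char *buf, size_t len, const char *slot,
                         unsigned swap_type, unsigned copy_done,
                         unsigned image_ok)
{
    assert(buf != NULL && slot != NULL);
    return snprintf(buf, len, "%s: swap_type=%s copy_done=%s image_ok=%s",
                    slot, swap_type_name(swap_type),
                    flag_name(copy_done), flag_name(image_ok));
}
```

Under these assumptions, the first bootloader line of each log (swap_type=0x2, copy_done=0x1, image_ok=0x1) reads as a pending test swap on a primary image that was previously copied and confirmed, while copy_done=0x3 and image_ok=0x3 mean the flags are still unset.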

I suppose I would need to enable the MCUBOOT_BOOTSTRAP flag to recover from this, but I would expect MCUboot to detect that the header is missing and then look for it in the second sector anyway. Otherwise this issue is unrecoverable. When I set the MCUBOOT_BOOTSTRAP flag, though, the device bootstraps from startup, which is not what I want. I just want it to recover from a failed swap.

Best and thanks in advance,
Imara

  • Some more info about this issue:

    Our reboot is issued whenever the entire device is updated. This causes the power to the nrf52840 to also briefly be cut, causing a reboot of the nrf52840.

    Both partitions are on internal flash and the images are not encrypted

    The bin image is 0x5FB40 in size, the mcuboot_primary is 0x6D000, the mcuboot_primary_app starts 0x200 later and is 0x6CE00. The mcuboot_secondary starts right after and is 0x6D000. I've already tried increasing the primary partition to 0x6E000 so that it'd be a sector larger than the secondary, but that didn't change anything yet. The hex I've added comes from the prior situation where both partitions are equal size (which should also be allowed according to the documentation).
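
The sizes above can be checked with a little sector arithmetic. The sketch below assumes a 0x1000-byte sector (the nRF52840 internal flash page size, which also matches the 0x1000-byte header shift seen in flash); `sectors_for` and `spare_sectors` are hypothetical helper names, and the space taken by the image trailer is deliberately not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE    0x1000u   /* assumed: nRF52840 internal flash page size */
#define IMAGE_SIZE     0x5FB40u  /* size of the bin image, from the post */
#define PRIMARY_SIZE   0x6D000u  /* mcuboot_primary, from the post */
#define SECONDARY_SIZE 0x6D000u  /* mcuboot_secondary, from the post */

/* Number of whole sectors needed to hold `bytes` (rounded up). */
static uint32_t sectors_for(uint32_t bytes)
{
    return (bytes + SECTOR_SIZE - 1u) / SECTOR_SIZE;
}

/* Sectors left over in a slot once the image is stored in it. */
static uint32_t spare_sectors(uint32_t slot_bytes, uint32_t image_bytes)
{
    assert(slot_bytes % SECTOR_SIZE == 0);
    assert(image_bytes <= slot_bytes);
    return slot_bytes / SECTOR_SIZE - sectors_for(image_bytes);
}
```

With these numbers the image occupies 0x60 sectors out of 0x6D in each slot, which leaves 13 spare sectors. Shifting the image up by one sector therefore does not run out of space by itself, which suggests that the partition sizes alone do not explain the failure.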

    Please see the attached hexes of the last sectors (from when the partitions were still equally sized) after reproducing the issue.

    The image trailer is exactly the same except for the copy_done and image_ok flags (I think these are the values in the memory, but I'm not sure). This means that the faulty image trailer indicates that a swap has been completed successfully, but that the image hasn't been confirmed. Is it possible that this image trailer is left over from a previous swap and isn't correctly cleaned up after a swap? In my tests, I only ever reproduce this issue after more than 25 updates, which would mean that many successful swaps happened before the faulty case.

    Another thought, though: if the error occurs during the move rather than the swap, it would result in a state that isn't described in this section of the documentation, since copy_done isn't set yet. Does that mean that when the move (instead of the swap) is interrupted, it will never set the outcome to BOOT_SWAP_TYPE_REVERT and thus never recover?

    failed_swap_hexes.zip
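
The distinction between the move and the swap can be pictured with a toy model. This is a simplified sketch, not MCUboot code: it assumes the move phase copies each used sector of the primary slot one position up, starting from the top, and that the first swap step then erases sector 0 before the new image is written there. All names (`slot_init`, `move_up`, `swap_begin`, `header_sector`) and the 8-sector slot are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_SECTORS 8  /* toy slot size, hypothetical */
#define ERASED       0  /* label for an erased sector */
#define HEADER       1  /* label of the sector that carries the image header */

/* Fill the first `used` sectors with labels 1..used; the rest stay erased. */
static void slot_init(int slot[SLOT_SECTORS], int used)
{
    assert(used >= 1 && used < SLOT_SECTORS);
    for (int i = 0; i < SLOT_SECTORS; i++) {
        slot[i] = (i < used) ? i + 1 : ERASED;
    }
}

/* Move phase: copy sector idx-1 into idx, top down, stopping after at most
 * max_steps copies to imitate a power cut. Returns the copies performed. */
static int move_up(int slot[SLOT_SECTORS], int used, int max_steps)
{
    int steps = 0;
    for (int idx = used; idx >= 1 && steps < max_steps; idx--, steps++) {
        slot[idx] = slot[idx - 1];
    }
    return steps;
}

/* First swap step in this model: sector 0 is erased to receive new data. */
static void swap_begin(int slot[SLOT_SECTORS])
{
    slot[0] = ERASED;
}

/* Lowest sector index that still holds the header, or -1 if none does. */
static int header_sector(const int slot[SLOT_SECTORS])
{
    for (int i = 0; i < SLOT_SECTORS; i++) {
        if (slot[i] == HEADER) {
            return i;
        }
    }
    return -1;
}
```

In this model an interruption in the middle of the move still leaves a header in sector 0, whereas an interruption right after the swap has begun leaves the only copy of the header one sector higher, which is the 0x1000-byte shift observed in flash. A bootloader that then reads the header only at offset 0 would fail in the way the logs show.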

  • Another thought, though: if the error occurs during the move rather than the swap, it would result in a state that isn't described in this section of the documentation, since copy_done isn't set yet. Does that mean that when the move (instead of the swap) is interrupted, it will never set the outcome to BOOT_SWAP_TYPE_REVERT and thus never recover?

    boot_status_is_reset will also report the status as reset while the swap is still busy moving the sectors up. Therefore it is required to read all headers, even though the header of the primary image has moved.

    bool
    boot_status_is_reset(const struct boot_status *bs)
    {
        return (bs->op == BOOT_STATUS_OP_MOVE &&
                bs->idx == BOOT_STATUS_IDX_0 &&
                bs->state == BOOT_STATUS_STATE_0);
    }
    
    #ifdef MCUBOOT_SWAP_USING_MOVE
            /*
             * Must re-read image headers because the boot status might
             * have been updated in the previous function call.
             */
            rc = boot_read_image_headers(state, !boot_status_is_reset(bs), bs);
    #ifdef MCUBOOT_BOOTSTRAP
            /* When bootstrapping it's OK to not have image magic in the primary slot */
            if (rc != 0 && (BOOT_CURR_IMG(state) != BOOT_PRIMARY_SLOT ||
                    boot_check_header_erased(state, BOOT_PRIMARY_SLOT) != 0)) {
    #else
            if (rc != 0) {
    #endif
    
                /* Continue with next image if there is one. */
                BOOT_LOG_WRN("Failed reading image headers; Image=%u",
                        BOOT_CURR_IMG(state));
                BOOT_SWAP_TYPE(state) = BOOT_SWAP_TYPE_NONE;
                return;
            }
    #endif
    

    This then causes the code to return and never reach boot_complete_partial_swap as far as I can tell.

  • Hi Imara,

    Sorry for the wait.

    With the excellent information you have provided, I will try to reproduce your issue.

    A bootloader should be robust, and it should be tough to brick a device with a reset.
    If this turns out to be the case, I will let our developers know, and we will look into how we can fix this.

    I will return with more information on my progress on Thursday at the latest.
    Please let me know if you have found anything else since creating this post!

    Regards,
    Sigurd Hellesvik
