Release build with MCUboot occasionally bricks custom board

I'm developing for a custom board that has an nRF52832 and an FM25W04 external flash. My application uses nRF Connect SDK v2.9.2. I have a debug build that doesn't use a bootloader, and a release build that uses MCUboot as the bootloader. I've set up my partitions so that the bootloader, primary image, and scratch slots are on internal flash and the secondary image slot is on external flash.

The issue I am seeing is that flashing my release build onto one of my custom boards will occasionally brick the device. This is not a true brick, however, because if I flash a debug build onto the bricked board the app will run fine, but then if I reflash the release build the board will brick again. Flashing an identical release build onto an identical custom board will usually not brick the second board.

I am having a difficult time debugging this problem and was hoping to get some advice on how to move forward. I believe this could be due to an issue with how I've configured my external flash chip via devicetree and/or Kconfig, or possibly an issue with how I've set up my partitions to place the secondary image slot on external flash. I'm providing some relevant snippets from files below; please let me know if any additional context is needed.

pinctrl.dtsi:

&pinctrl {
	spi2_default: spi2_default {
		group1 {
			psels = <NRF_PSEL(SPIM_MISO, 0, 19)>;
			bias-pull-up;
		};
		group2 {
			psels = <NRF_PSEL(SPIM_MOSI, 0, 18)>,
					<NRF_PSEL(SPIM_SCK, 0, 17)>;
		};
	};

	spi2_sleep: spi2_sleep {
		group1 {
			psels = <NRF_PSEL(SPIM_MISO, 0, 19)>,
			        <NRF_PSEL(SPIM_MOSI, 0, 18)>,
			        <NRF_PSEL(SPIM_SCK, 0, 17)>;
			low-power-enable;
		};
	};
};

.dts:

/ {
    chosen {
        zephyr,flash = &flash0;
        zephyr,code-partition = &slot0_partition;
        nordic,pm-ext-flash = &fm25w04;
    };
};

&flash0 {
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

        boot_partition: partition@0 {
            label = "mcuboot";
            reg = <0x00000000 0xC000>;
        };
        slot0_partition: partition@C000 {
            label = "image-0";
            reg = <0x0000C000 0x58000>;
        };
        scratch_partition: partition@64000 {
            label = "image-scratch";
            reg = <0x00064000 0x1000>;
        };
    };
};

&spi2 {
    compatible = "nordic,nrf-spi";
    status = "okay";
    pinctrl-0 = <&spi2_default>;
    pinctrl-1 = <&spi2_sleep>;
    pinctrl-names = "default", "sleep";
    cs-gpios = <&gpio0 20 GPIO_ACTIVE_LOW>;

    fm25w04: fm25w04@0 {
        compatible = "jedec,spi-nor";
        reg = <0>;
        spi-max-frequency = <8000000>;
        jedec-id = [a1 28 13];
        size = <4194304>;
        has-dpd;
        t-enter-dpd = <3000>;
        t-exit-dpd = <3000>;
    };
};

&fm25w04 {
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

        slot1_partition: partition@0 {
            label = "image-1";
            reg = <0x00000000 0x58000>;
        };
        config_partition: partition@58000 {
            label = "config";
            reg = <0x00058000 0x6000>;
        };
        record_partition: partition@5E000 {
            label = "record";
            reg = <0x0005E000 0x10000>;
        };
        nvs_storage: partition@6E000 {
            label = "nvs_storage";
            reg = <0x0006E000 0x6000>;
        };
    };
};
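
One note on the `size` property above: in Zephyr's jedec,spi-nor binding it is specified in bits, not bytes, so 4194304 works out to 512 KiB. That happens to match the 0x80000-byte external_flash region in pm_static.yml below (a quick arithmetic check, not part of the build):

```python
# jedec,spi-nor's "size" devicetree property is given in bits.
SIZE_BITS = 4194304

size_bytes = SIZE_BITS // 8
print(hex(size_bytes))  # 0x80000 (512 KiB)
```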

pm_static.yml:

mcuboot:
  address: 0x00000000
  size: 0xC000
mcuboot_primary:
  address: 0x0000C000
  size: 0x58000
  span: [mcuboot_pad, mcuboot_primary_app]
mcuboot_pad:
  address: 0x0000C000
  size: 0x200
mcuboot_primary_app:
  address: 0x0000C200
  size: 0x57E00
  span: [app]
mcuboot_scratch:
  address: 0x00064000
  size: 0x1000
flash_guard:
  address: 0x00065000
  size: 0x1B000

external_flash:
  device: fm25w04
  region: external_flash
  address: 0x00000
  size: 0x80000
  span: [mcuboot_secondary, config, record, nvs_storage]

mcuboot_secondary:
  region: external_flash
  address: 0x00000000
  size: 0x58000
config:
  region: external_flash
  address: 0x00058000
  size: 0x6000
record:
  region: external_flash
  address: 0x0005E000
  size: 0x10000
nvs_storage:
  region: external_flash
  address: 0x0006E000
  size: 0x6000
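
Since a mismatch between the static partition map and the actual flash layout is one suspect here, a small host-side script can sanity-check the map for overlaps and out-of-bounds regions. This is just a sketch: addresses and sizes are copied from the YAML above, and both flash sizes are assumed to be 512 KiB (nRF52832 internal flash, 4 Mbit external part). mcuboot_pad and mcuboot_primary_app are omitted because they are sub-spans of mcuboot_primary.

```python
# Sanity-check the static partition map: no overlaps, no out-of-bounds regions.
# Addresses/sizes copied from pm_static.yml; flash sizes assumed to be
# 512 KiB (0x80000) for both internal and external flash.

INTERNAL = {  # name: (address, size)
    "mcuboot":           (0x00000, 0x0C000),
    "mcuboot_primary":   (0x0C000, 0x58000),
    "mcuboot_scratch":   (0x64000, 0x01000),
    "flash_guard":       (0x65000, 0x1B000),
}
EXTERNAL = {
    "mcuboot_secondary": (0x00000, 0x58000),
    "config":            (0x58000, 0x06000),
    "record":            (0x5E000, 0x10000),
    "nvs_storage":       (0x6E000, 0x06000),
}

def check(regions, flash_size):
    """Return a list of problems (an empty list means the map is consistent)."""
    problems = []
    spans = sorted((addr, addr + size, name) for name, (addr, size) in regions.items())
    prev_end, prev_name = 0, None
    for start, end, name in spans:
        if end > flash_size:
            problems.append(f"{name} runs past end of flash (0x{end:X} > 0x{flash_size:X})")
        if prev_name and start < prev_end:
            problems.append(f"{name} overlaps {prev_name}")
        prev_end, prev_name = end, name
    return problems

print(check(INTERNAL, 0x80000))  # expect []
print(check(EXTERNAL, 0x80000))  # expect []
```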

child_image/mcuboot.overlay:

&spi2 {
    compatible = "nordic,nrf-spi";
    status = "okay";
    pinctrl-0 = <&spi2_default>;
    pinctrl-1 = <&spi2_sleep>;
    pinctrl-names = "default", "sleep";
    cs-gpios = <&gpio0 20 GPIO_ACTIVE_LOW>;

    fm25w04: fm25w04@0 {
        compatible = "jedec,spi-nor";
        reg = <0>;
        spi-max-frequency = <8000000>;
        jedec-id = [a1 28 13];
        size = <4194304>;
        has-dpd;
        t-enter-dpd = <3000>;
        t-exit-dpd = <3000>;
    };
};

/ {
    chosen {
        nordic,pm-ext-flash = &fm25w04;
    };
};

child_image/mcuboot/prj.conf:

CONFIG_GPIO=y
CONFIG_SPI=y
CONFIG_SPI_NOR=y
CONFIG_SPI_NOR_SFDP_RUNTIME=y
CONFIG_SPI_NOR_FLASH_LAYOUT_PAGE_SIZE=4096
CONFIG_NORDIC_QSPI_NOR=n
CONFIG_MULTITHREADING=y
CONFIG_FLASH=y
CONFIG_PM_EXTERNAL_FLASH_MCUBOOT_SECONDARY=y
CONFIG_PM_OVERRIDE_EXTERNAL_DRIVER_CHECK=y
CONFIG_BOOT_MAX_IMG_SECTORS=128
CONFIG_BOOT_SWAP_USING_SCRATCH=y
CONFIG_WATCHDOG=y
CONFIG_BOOT_WATCHDOG_FEED=y
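
One setting worth double-checking in a layout like this is CONFIG_BOOT_MAX_IMG_SECTORS, which must be at least the slot size divided by the erase-sector size. Quick arithmetic using the 0x58000-byte slots and the 4 KiB sector size from the snippets above:

```python
# CONFIG_BOOT_MAX_IMG_SECTORS must cover the larger of the two image slots.
SLOT_SIZE = 0x58000       # primary/secondary slot size from the partition map
SECTOR_SIZE = 4096        # erase sector size (CONFIG_SPI_NOR_FLASH_LAYOUT_PAGE_SIZE)
MAX_IMG_SECTORS = 128     # CONFIG_BOOT_MAX_IMG_SECTORS from prj.conf

sectors_needed = SLOT_SIZE // SECTOR_SIZE
print(sectors_needed)                     # 88
print(sectors_needed <= MAX_IMG_SECTORS)  # True
```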

Parents
  • Hello,

    This is not a true brick, however, because if I flash a debug build onto the bricked board the app will run fine, but then if I reflash the release build the board will brick again. Flashing an identical release build onto an identical custom board will usually not brick the second board.

    In this case:

    1: If you flash the release build and it "bricks", and then you flash the release build once more without flashing anything else in between, is it still bricked? Even if you erase the chip in between? And does it brick immediately, or after a little while?

    2: If you flash the bricked release build to another identical board, and it doesn't brick, does it behave completely like normal?

    3: Before the brick: was a DFU ever performed (since the last time the device was reprogrammed using a debugger)?

    4: What are the symptoms of a bricked device? Do you have any log output?

    Best regards,

    Edvin


Children
  • Hi Edvin,

    Thanks for getting back to me.

    1: If you flash the release build and it "bricks", and then you flash the release build once more without flashing anything else in between, is it still bricked? Even if you erase the chip in between? And does it brick immediately, or after a little while?

    Yes, if I reflash the release build the device is still bricked, even if I erase the chip in between. I usually erase the chip using one of these commands:

    nrfutil device erase
    nrfutil device recover

    The device bricks immediately (the application never starts so I believe the bricked devices are getting stuck in the bootloader).

    2: If you flash the bricked release build to another identical board, and it doesn't brick, does it behave completely like normal?

    Yes, the second board that doesn't brick behaves completely like normal.

    3: Before the brick: was a DFU ever performed (since the last time the device was reprogrammed using a debugger)?

    I have been performing DFUs on our devices with the release build using the Android version of the nRF Connect mobile app. However, I cannot say for sure that all of the bricked boards (currently 3) have had DFUs performed on them.

    4: What are the symptoms of a bricked device? Do you have any log output?

    There are a few symptoms I see:

    1. We have set up our application to use the Zephyr logging API via RTT, but I don't see any logs
    2. Our application plays an LED sequence on startup, but I do not see this occurring
    3. Our application has a vibration motor triggered via a touch sensor, this is unresponsive
    4. An unbricked device can be discovered via the nRF Connect mobile app, but a bricked device will not show up in the app

    Thank you!

    jyv

  • But, if you flash a "debug build" (without a bootloader) to one of the bricked boards, they run as normal, right? So the HW itself is working, right? And apparently also the application?

    I guess there is no way for me to reproduce what you are seeing using a DK? 

    Did you do any changes to the bootloader? 

    How did you upload it? You mentioned the nRF Connect for Mobile App, but when you did, did you select "test", "run", "validate" or something like that? Can you please try uploading it using the nRF Device Manager for mobile?

    Best regards,

    Edvin

  • Yes, if I flash a debug build to a bricked board it will run as normal, which leads me to believe that the hardware itself is fine as well as the application.

    Reproduction is one of the issues I am facing in debugging this problem, as I have not found any clear reproduction steps. Most of the time when I flash a release build it runs fine, but occasionally it bricks.

    I have not made any changes to the bootloader. Besides the Kconfigs (child_image/mcuboot/prj.conf) and Devicetree properties (child_image/mcuboot.overlay) that I included earlier, I also have a child_image/mcuboot.conf:

    CONFIG_BOOT_SIGNATURE_KEY_FILE="/Users/jyv/Projects/my_project/keys/secret.pem"

    I also have a release.conf (which extends my application's prj.conf) that has some bootloader related Kconfigs set:

    CONFIG_ASSERT=n
    
    CONFIG_WATCHDOG=y
    CONFIG_BRICK_GUARD=y
    CONFIG_BRICK_GUARD_TIMEOUT=60000
    
    CONFIG_FLASH_LOG_LEVEL_WRN=y
    
    CONFIG_ARCH_LOG_LEVEL_OFF=y
    CONFIG_KERNEL_LOG_LEVEL_OFF=y
    CONFIG_MPU_LOG_LEVEL_OFF=y
    
    CONFIG_SIZE_OPTIMIZATIONS=y
    CONFIG_LTO=y
    CONFIG_ISR_TABLES_LOCAL_DECLARATION=y
    
    CONFIG_BOOTLOADER_MCUBOOT=y
    CONFIG_MCUBOOT_BUILD_STRATEGY_FROM_SOURCE=y
    CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION="0.0.1"
    CONFIG_PM_EXTERNAL_FLASH_MCUBOOT_SECONDARY=y
    CONFIG_PM_OVERRIDE_EXTERNAL_DRIVER_CHECK=y
    
    CONFIG_ZCBOR=y
    CONFIG_MCUMGR=y
    CONFIG_MCUMGR_TRANSPORT_BT=y
    CONFIG_MCUMGR_TRANSPORT_BT_PERM_RW=y
    
    CONFIG_STREAM_FLASH=y
    CONFIG_IMG_MANAGER=y
    CONFIG_MCUMGR_GRP_IMG=y
    CONFIG_MCUMGR_GRP_OS=y
    
    CONFIG_MCUMGR_MGMT_NOTIFICATION_HOOKS=y
    CONFIG_MCUMGR_GRP_IMG_STATUS_HOOKS=y
    
    CONFIG_BT_CTLR_ASSERT_OVERHEAD_START=n
    
    CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=3072

    When I uploaded an image to the device using the nRF Connect for Mobile app, I used the app_update.bin file generated by my build configuration and selected the "test and confirm" option. This process would usually take about a minute to complete (image upload, new image moved to the primary slot, board reset, and confirmation of the new image version via logs).

    I have not tried performing a DFU via the nRF Device Manager app, I will try that and get back to you.

    Thank you.

  • One thing to check is that CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION ("0.0.1" here) is higher than the previous version. If it is lower, the image will be rejected. However, if you erase the chip (nrfutil device erase), there should be no trace of another application image with a higher sign version.

    Let me know what you find!

    BR,

    Edvin

  • Hi Edvin, sorry for the delay. I had to work on other stuff and am just getting back to this now. I have been making sure that the CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION value is higher than the current value when I perform a DFU. I've also been erasing the secondary image slot on the external flash, but that doesn't fix the "bricking" issue.

    I've been doing some more debugging by trying to look at the bootloader logs and I was able to capture these log messages once on a bricked board:

    [00:00:00.599,395] <err> os: ***** BUS FAULT *****
    [00:00:00.599,700] <err> os:   Imprecise data bus error
    [00:00:00.600,006] <err> os: r0/a1:  0x8bbbf596  r1/a2:  0x1abd4294  r2/a3:  0x00000128
    [00:00:00.605,468] <err> os: r3/a4:  0x00000001 r12/ip:  0x00000040 r14/lr:  0x0000a5db
    [00:00:00.605,895] <err> os:  xpsr:  0x21000000
    [00:00:00.606,170] <err> os: Faulting instruction address (r15/pc): 0x0000a1b6
    [00:00:00.606,567] <err> os: >>> ZEPHYR FATAL ERROR 26: Unknown error on CPU 0
    [00:00:00.606,964] <err> os: Current thread: 0x20000660 (unknown)
    [00:00:00.607,299] <err> os: Halting system

    I then ran addr2line:

    % arm-zephyr-eabi-addr2line -e build-release/mcuboot/zephyr/zephyr.elf 0x0000a1b6
    /opt/nordic/ncs/v2.9.2/modules/crypto/mbedtls/library/bignum.c:936

    So it looks like there was a fault in this function in mbedtls:

    /*
     * Compare signed values
     */
    int mbedtls_mpi_cmp_mpi(const mbedtls_mpi *X, const mbedtls_mpi *Y)
    {
        size_t i, j;
    
        for (i = X->n; i > 0; i--) {
            if (X->p[i - 1] != 0) {
                break;
            }
        }
    
        for (j = Y->n; j > 0; j--) {
            if (Y->p[j - 1] != 0) {
                break;
            }
        }
    
        if (i == 0 && j == 0) {
            return 0;
        }
    
        if (i > j) {
            return X->s;
        }
        if (j > i) {
            return -Y->s;
        }
    
        if (X->s > 0 && Y->s < 0) {
            return 1;
        }
        if (Y->s > 0 && X->s < 0) {
            return -1;
        }
    
        for (; i > 0; i--) {
            if (X->p[i - 1] > Y->p[i - 1]) {
                return X->s;
            }
            if (X->p[i - 1] < Y->p[i - 1]) {
                return -X->s;
            }
        }
    
        return 0;
    }
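
    For reference, the function itself is a plain signed big-integer compare. A Python sketch of the same logic (limbs stored least-significant first, sign as ±1, mirroring X->p/X->n/X->s) shows there is nothing exotic in the loop, which suggests the fault lies in dereferencing the X/Y structures it was handed rather than in the arithmetic:

    ```python
    # Python sketch of mbedtls_mpi_cmp_mpi's comparison logic.
    # An MPI is modeled as a sign (+1/-1) plus a list of limbs,
    # least-significant first, mirroring mbedtls's X->s and X->p/X->n.

    def mpi_cmp(xs, xp, ys, yp):
        """Return 1, 0, or -1 like mbedtls_mpi_cmp_mpi; xs/ys are +1 or -1."""
        # Skip most-significant zero limbs.
        i = len(xp)
        while i > 0 and xp[i - 1] == 0:
            i -= 1
        j = len(yp)
        while j > 0 and yp[j - 1] == 0:
            j -= 1

        if i == 0 and j == 0:  # both values are zero
            return 0
        if i > j:              # X has more significant limbs
            return xs
        if j > i:
            return -ys
        if xs > 0 and ys < 0:  # differing signs
            return 1
        if ys > 0 and xs < 0:
            return -1

        # Same sign and length: compare limb by limb from the top.
        while i > 0:
            if xp[i - 1] > yp[i - 1]:
                return xs
            if xp[i - 1] < yp[i - 1]:
                return -xs
            i -= 1
        return 0
    ```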

    I have not been able to determine any more information from this though and am still working on it. 

    Also, another thing I have found is that enabling the compiler stack canary in my child_image/mcuboot/prj.conf:

    CONFIG_ENTROPY_GENERATOR=y
    CONFIG_STACK_CANARIES=y

    seems to consistently make the application boot properly on a bricked board. If I flash a release build onto the same board again, this time with the stack canaries disabled, the board will brick again.

    I am still working on this to try and figure out more info, but in the meantime do you have any ideas on:

    1. Why am I hitting that bus fault in the bootloader?
    2. Why does enabling compiler stack canaries let my application boot normally?