Release build with MCUboot occasionally bricks custom board

I'm developing for a custom board that has an nRF52832 and an FM25W04 external flash. My application uses nRF Connect SDK v2.9.2. I have a debug build that doesn't use a bootloader, and a release build that uses MCUboot as the bootloader. I've set up my partitions so that the bootloader, primary image, and scratch slots are on internal flash and the secondary image slot is on external flash.

The issue I am seeing is that flashing my release build onto one of my custom boards will occasionally brick the device. This is not a true brick, however: if I flash a debug build onto the bricked board, the app runs fine, but if I then reflash the release build, the board bricks again. Flashing an identical release build onto an identical custom board usually does not brick the second board.

I am having a difficult time debugging this problem and was hoping to get some advice on how to move forward. I believe this could be due to an issue with how I've configured my external flash chip via devicetree and/or Kconfig, or possibly an issue with how I've set up my partitions to place the secondary image slot on external flash. I'm providing some relevant snippets from my files below. Please let me know if any additional context is needed.

pinctrl.dtsi:

&pinctrl {
	spi2_default: spi2_default {
		group1 {
			psels = <NRF_PSEL(SPIM_MISO, 0, 19)>;
			bias-pull-up;
		};
		group2 {
			psels = <NRF_PSEL(SPIM_MOSI, 0, 18)>,
					<NRF_PSEL(SPIM_SCK, 0, 17)>;
		};
	};

	spi2_sleep: spi2_sleep {
		group1 {
			psels = <NRF_PSEL(SPIM_MISO, 0, 19)>,
			        <NRF_PSEL(SPIM_MOSI, 0, 18)>,
			        <NRF_PSEL(SPIM_SCK, 0, 17)>;
			low-power-enable;
		};
	};
};

.dts:

/ {
    chosen {
        zephyr,flash = &flash0;
        zephyr,code-partition = &slot0_partition;
        nordic,pm-ext-flash = &fm25w04;
    };
};

&flash0 {
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

        boot_partition: partition@0 {
            label = "mcuboot";
            reg = <0x00000000 0xC000>;
        };
        slot0_partition: partition@C000 {
            label = "image-0";
            reg = <0x0000C000 0x58000>;
        };
        scratch_partition: partition@64000 {
            label = "image-scratch";
            reg = <0x00064000 0x1000>;
        };
    };
};

&spi2 {
    compatible = "nordic,nrf-spi";
    status = "okay";
    pinctrl-0 = <&spi2_default>;
    pinctrl-1 = <&spi2_sleep>;
    pinctrl-names = "default", "sleep";
    cs-gpios = <&gpio0 20 GPIO_ACTIVE_LOW>;

    fm25w04: fm25w04@0 {
        compatible = "jedec,spi-nor";
        reg = <0>;
        spi-max-frequency = <8000000>;
        jedec-id = [a1 28 13];
        size = <4194304>;
        has-dpd;
        t-enter-dpd = <3000>;
        t-exit-dpd = <3000>;
    };
};

&fm25w04 {
    partitions {
        compatible = "fixed-partitions";
        #address-cells = <1>;
        #size-cells = <1>;

        slot1_partition: partition@0 {
            label = "image-1";
            reg = <0x00000000 0x58000>;
        };
        config_partition: partition@58000 {
            label = "config";
            reg = <0x00058000 0x6000>;
        };
        record_partition: partition@5E000 {
            label = "record";
            reg = <0x0005E000 0x10000>;
        };
        nvs_storage: partition@6E000 {
            label = "nvs_storage";
            reg = <0x0006E000 0x6000>;
        };
    };
};

pm_static.yml:

mcuboot:
  address: 0x00000000
  size: 0xC000
mcuboot_primary:
  address: 0x0000C000
  size: 0x58000
  span: [mcuboot_pad, mcuboot_primary_app]
mcuboot_pad:
  address: 0x0000C000
  size: 0x200
mcuboot_primary_app:
  address: 0x0000C200
  size: 0x57E00
  span: [app]
mcuboot_scratch:
  address: 0x00064000
  size: 0x1000
flash_guard:
  address: 0x00065000
  size: 0x1B000

external_flash:
  device: fm25w04
  region: external_flash
  address: 0x00000
  size: 0x80000
  span: [mcuboot_secondary, config, record, nvs_storage]

mcuboot_secondary:
  region: external_flash
  address: 0x00000000
  size: 0x58000
config:
  region: external_flash
  address: 0x00058000
  size: 0x6000
record:
  region: external_flash
  address: 0x0005E000
  size: 0x10000
nvs_storage:
  region: external_flash
  address: 0x0006E000
  size: 0x6000
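
The layout above can be sanity-checked with a little arithmetic. What follows is a minimal host-side C sketch, not part of the firmware build. It assumes 4 KiB erase sectors on both flash devices and the 512 KB nRF52832 variant, and it relies on the devicetree `size` of a jedec,spi-nor node being given in bits. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE        0x1000u          /* assumption: 4 KiB erase sectors */
#define INTERNAL_FLASH_LEN 0x80000u         /* assumption: 512 KB nRF52832 variant */
#define EXTERNAL_FLASH_LEN (4194304u / 8u)  /* devicetree size is in bits */
#define MAX_IMG_SECTORS    128u             /* CONFIG_BOOT_MAX_IMG_SECTORS */

/* One partition, with values copied from pm_static.yml. */
struct part {
    const char *name;
    uint32_t addr;
    uint32_t size;
};

const struct part internal_parts[] = {
    { "mcuboot",         0x00000000u, 0x0C000u },
    { "mcuboot_primary", 0x0000C000u, 0x58000u },
    { "mcuboot_scratch", 0x00064000u, 0x01000u },
    { "flash_guard",     0x00065000u, 0x1B000u },
};

const struct part external_parts[] = {
    { "mcuboot_secondary", 0x00000000u, 0x58000u },
    { "config",            0x00058000u, 0x06000u },
    { "record",            0x0005E000u, 0x10000u },
    { "nvs_storage",       0x0006E000u, 0x06000u },
};

/* True if the partitions start at 0, are sector-aligned, follow each
 * other without gaps or overlaps, and end within `limit` bytes. */
bool layout_ok(const struct part *p, size_t n, uint32_t limit)
{
    uint32_t next = 0;

    for (size_t i = 0; i < n; i++) {
        if (p[i].addr != next || p[i].addr % SECTOR_SIZE != 0 ||
            p[i].size % SECTOR_SIZE != 0) {
            printf("bad partition %s at %#lx\n", p[i].name,
                   (unsigned long)p[i].addr);
            return false;
        }
        next = p[i].addr + p[i].size;
    }
    return next <= limit;
}

/* Number of erase sectors in one image slot. */
uint32_t slot_sectors(uint32_t slot_size)
{
    return slot_size / SECTOR_SIZE;
}

/* Primary and secondary slots must match in size and must not exceed
 * the sector count MCUboot was configured to track. */
bool slots_ok(void)
{
    const struct part *primary = &internal_parts[1];
    const struct part *secondary = &external_parts[0];

    return primary->size == secondary->size &&
           slot_sectors(primary->size) <= MAX_IMG_SECTORS;
}

/* Run every check; aborts via assert() on the first failure. */
void check_all(void)
{
    assert(layout_ok(internal_parts, 4, INTERNAL_FLASH_LEN));
    assert(layout_ok(external_parts, 4, EXTERNAL_FLASH_LEN));
    assert(slots_ok());
    puts("partition layout is consistent");
}
```

Calling `check_all()` from a small `main()` on a PC is enough to run it. With the values above, each slot is 0x58000 / 0x1000 = 88 sectors, which fits under the configured limit of 128, and both regions end at or before 0x80000.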

child_image/mcuboot.overlay:

&spi2 {
    compatible = "nordic,nrf-spi";
    status = "okay";
    pinctrl-0 = <&spi2_default>;
    pinctrl-1 = <&spi2_sleep>;
    pinctrl-names = "default", "sleep";
    cs-gpios = <&gpio0 20 GPIO_ACTIVE_LOW>;

    fm25w04: fm25w04@0 {
        compatible = "jedec,spi-nor";
        reg = <0>;
        spi-max-frequency = <8000000>;
        jedec-id = [a1 28 13];
        size = <4194304>;
        has-dpd;
        t-enter-dpd = <3000>;
        t-exit-dpd = <3000>;
    };
};

/ {
    chosen {
        nordic,pm-ext-flash = &fm25w04;
    };
};

child_image/mcuboot/prj.conf:

CONFIG_GPIO=y
CONFIG_SPI=y
CONFIG_SPI_NOR=y
CONFIG_SPI_NOR_SFDP_RUNTIME=y
CONFIG_SPI_NOR_FLASH_LAYOUT_PAGE_SIZE=4096
CONFIG_NORDIC_QSPI_NOR=n
CONFIG_MULTITHREADING=y
CONFIG_FLASH=y
CONFIG_PM_EXTERNAL_FLASH_MCUBOOT_SECONDARY=y
CONFIG_PM_OVERRIDE_EXTERNAL_DRIVER_CHECK=y
CONFIG_BOOT_MAX_IMG_SECTORS=128
CONFIG_BOOT_SWAP_USING_SCRATCH=y
CONFIG_WATCHDOG=y
CONFIG_BOOT_WATCHDOG_FEED=y

  • Hi Edvin, sorry for the delay. I had to work on other stuff and am just getting back to this now. I have been making sure that the CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION value is higher than the current value when I perform a DFU. I've also been erasing the secondary image slot on the external flash, but that doesn't fix the "bricking" issue.

    I've been doing some more debugging by trying to look at the bootloader logs and I was able to capture these log messages once on a bricked board:

    [00:00:00.599,395] <err> os: ***** BUS FAULT *****
    [00:00:00.599,700] <err> os:   Imprecise data bus error
    [00:00:00.600,006] <err> os: r0/a1:  0x8bbbf596  r1/a2:  0x1abd4294  r2/a3:  0x00000128
    [00:00:00.605,468] <err> os: r3/a4:  0x00000001 r12/ip:  0x00000040 r14/lr:  0x0000a5db
    [00:00:00.605,895] <err> os:  xpsr:  0x21000000
    [00:00:00.606,170] <err> os: Faulting instruction address (r15/pc): 0x0000a1b6
    [00:00:00.606,567] <err> os: >>> ZEPHYR FATAL ERROR 26: Unknown error on CPU 0
    [00:00:00.606,964] <err> os: Current thread: 0x20000660 (unknown)
    [00:00:00.607,299] <err> os: Halting system

    I then ran addr2line:

    % arm-zephyr-eabi-addr2line -e build-release/mcuboot/zephyr/zephyr.elf 0x0000a1b6
    /opt/nordic/ncs/v2.9.2/modules/crypto/mbedtls/library/bignum.c:936

    So it looks like there was a fault in this function in mbedtls:

    /*
     * Compare signed values
     */
    int mbedtls_mpi_cmp_mpi(const mbedtls_mpi *X, const mbedtls_mpi *Y)
    {
        size_t i, j;
    
        for (i = X->n; i > 0; i--) {
            if (X->p[i - 1] != 0) {
                break;
            }
        }
    
        for (j = Y->n; j > 0; j--) {
            if (Y->p[j - 1] != 0) {
                break;
            }
        }
    
        if (i == 0 && j == 0) {
            return 0;
        }
    
        if (i > j) {
            return X->s;
        }
        if (j > i) {
            return -Y->s;
        }
    
        if (X->s > 0 && Y->s < 0) {
            return 1;
        }
        if (Y->s > 0 && X->s < 0) {
            return -1;
        }
    
        for (; i > 0; i--) {
            if (X->p[i - 1] > Y->p[i - 1]) {
                return X->s;
            }
            if (X->p[i - 1] < Y->p[i - 1]) {
                return -X->s;
            }
        }
    
        return 0;
    }

    I have not been able to determine any more information from this though and am still working on it. 

    Another thing I have found is that enabling compiler stack canaries in my child_image/mcuboot/prj.conf:

    CONFIG_ENTROPY_GENERATOR=y
    CONFIG_STACK_CANARIES=y

    seems to consistently make the application boot properly on a bricked board. If I flash a release build onto the same board again, this time with stack canaries disabled, the board bricks again.

    I am still working on this to try and figure out more info, but in the meantime do you have any ideas on:

    1. Why I'm hitting that bus fault in the bootloader?
    2. Why does enabling compiler stack canaries let my application boot normally?
  • Can you please try removing CONFIG_STACK_CANARIES=y and CONFIG_ENTROPY_GENERATOR=y (if that makes the issue happen again), and then adding CONFIG_DEBUG_THREAD_INFO=y to that same file, to see if you can determine which thread is causing this issue? The idea is then to try increasing that thread's stack size. This is typically done through a Kconfig option, but I don't know which one to search for without knowing the name of the thread that caused the issue.
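
    As a rough sketch, child_image/mcuboot/prj.conf could look something like the fragment below for this experiment. CONFIG_THREAD_NAME and CONFIG_MAIN_STACK_SIZE are illustrative additions: the stack option only applies if the faulting thread turns out to be main, and the value shown is a placeholder rather than a recommendation.

    ```
    # Leave the canary options out so that the fault reproduces:
    # CONFIG_ENTROPY_GENERATOR=y
    # CONFIG_STACK_CANARIES=y

    # Expose thread information so the fatal error log can name the thread
    CONFIG_DEBUG_THREAD_INFO=y
    CONFIG_THREAD_NAME=y

    # Assumption: only relevant if the faulting thread is "main".
    # Placeholder value; size it to the RAM actually available.
    CONFIG_MAIN_STACK_SIZE=16384
    ```

    If the log then reports a different thread, the stack size option for that thread is the one to look up instead.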

    Best regards,

    Edvin
