dfu issues after migrating from ncs2.9.2 to ncs3.0.2

hello Nordic

we are working with nrf52840/nrf52832 with ncs (currently on v3.0.2)

i have noticed that when i try to dfu a merged artifact of my version with the new mcuboot of ncs3.0.2 the device does not work after the dfu at all, it seems bricked no matter if i try to reset it after or even try to re flash the same version of merged file with the same mcuboot of ncs3.0.2 it does not respond .. only after flashing a lower version it returns to work (this part is very strange i think)

in previous version with ncs2.9.2 there was no issues with the dfu ..

under app/
sysbuild.conf
SB_CONFIG_BOOTLOADER_MCUBOOT=y
SB_CONFIG_BOOT_SIGNATURE_TYPE_RSA=y
SB_CONFIG_PARTITION_MANAGER=y
SB_CONFIG_PM_OVERRIDE_EXTERNAL_DRIVER_CHECK=y

sysbuild_swap.conf
SB_CONFIG_MCUBOOT_MODE_SWAP_SCRATCH=y

sysbuild_no_swap.conf
SB_CONFIG_MCUBOOT_MODE_SWAP_SCRATCH=n
SB_CONFIG_BOOT_SIGNATURE_TYPE_ECDSA_P256=y

prj.conf under app/sysbuild/my_app/mcuboot/
CONFIG_FLASH=y
CONFIG_FLASH_xxxx_API=x
CONFIG_MAIN_STACK_SIZE=xxxx

1. any ideas what can be the cause and how can this be fixed ?

2. any idea how to debug the mcuboot using rtt ? i don't want to flash a version via cursor or vscode because then dfu will not work for some other signature reason 

best regards

Ziv

Parents
  • Hello Ziv,

    Do I understand correctly that the application does not boot even after completely re-flashing the device with the merged.hex from the latest version? That would suggest that there is an issue with the new project/app and not really related to the DFU process. First step I would recommend is that you try to debug the new project to find out where the program hangs. Does the app run ok if you build without the bootloader?

    Best regards,

    Vidar

  • the app runs fine and it was also dfued to our setups, which runs with older mcuboots on them, with no issues.

    it also runs fine after first flash of the merged hex with mcuboot and app of ncs3.0.2, but after dfu it seems briked and only after few flashes with older version merges the device run again (this was very strange to me since i flash with the flag of --chipearse" i did not expect that first flash will not work)

  • Are you using external flash (SPI/QSPI) for staging FW updates and are you using the same static PM file for both versions of your project?

  • we do dfu via ble 

    we have external flash via spi .. and we use it for saving one image slot and other things off course

    we have a pm_static_board_name.yml for each board

    hut we also have board_name.overlay for each board.. i think i read in the past one of them is overwritten or ignored in zephyr if you have the other (don't really remember) 

  • ziv123 said:
    we have external flash via spi .. and we use it for saving one image slot and other things off course

    The bootloader will not boot the application if it fails to access the secondary slot in external flash so I wonder if the issue might be that the SPI fiash has gotten into a bad state somehow. Does your HW allow you to power cycle the board so that both the SPI flash and nRF get reset?

    ziv123 said:
    we have a pm_static_board_name.yml for each board

    You can check the pm .yml file in the build folder to verify that the same pm file is applied in both builds. Alternatively, run "west build -t partition_manager_report".

    ziv123 said:
    hut we also have board_name.overlay for each board.. i think i read in the past one of them is overwritten or ignored in zephyr if you have the other (don't really remember) 

    Devicetree partitions are ignored when using the partition manager, yes.

  • hi Vidar

    sorry for the delay in response i waited for the HW check and some test that i run

    Does your HW allow you to power cycle the board so that both the SPI flash and nRF get reset?

    the design is only based on nRF52840 reset control, then no it resets only the nRF, not the SPI flash.

    i can also add the following findings:

    • dfu from device with mcuboot v3.0.2 to an app with v3.0.2 fails

    (but after i flash a merged version of v2.9.2 i i did see the app comes up with the new app version 3.0.2 and after power off power on it came up with the v2.9.2 version (this is strange) however i did not manage to relet this scenario so i am not sure how valid this result is)

    • dfu from device with mcuboot v2.9.2 to an app with v3.0.2 pass
    • dfu from device with mcuboot v3.0.2 to an app with v2.9.2 fails

    pm_static between app v2.9.2 and app v3.0.2 did not change so same pm static

    i added logs to mcuboot and flashed this version (i had to flash it with --pinreset for some reason else it crashed at start) and after dfu +nrfjprog --reset i get this logs 

    00> [00:00:00.250,396] <inf> AUGU_GPIO_POWER: augu_power_resource_init: boost_on_node
    00> [00:00:00.250,793] <inf> AUGU_GPIO_POWER: augu_power_resource_init: hw_id_on_node
    00> [00:00:00.251,220] <inf> AUGU_GPIO_POWER: augu_power_resource_init: sensors_on_node
    00> [00:00:00.251,617] <inf> AUGU_GPIO_POWER: augu_power_resource_init: flash_on_node
    00> [00:00:00.252,044] <inf> AUGU_FLASH_W25N: master bus found: spi@40023000
    00> [00:00:00.252,410] <inf> AUGU_FLASH_W25N: turning flash on.[00:00:03.136,077] <err> os: ***** USAGE FAULT *****
    00> [00:00:03.136,383] <err> os:   Illegal use of the EPSR
    00> [00:00:03.136,688] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
    00> [00:00:03.137,115] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00000000
    00> [00:00:03.137,542] <err> os:  xpsr:  0x00000000
    00> [00:00:03.137,847] <err> os: Faulting instruction address (r15/pc): 0x00000000
    00> [00:00:03.138,244] <err> os: >>> ZEPHYR FATAL ERROR 35: Unknown error on CPU 0
    00> [00:00:03.138,641] <err> os: Current thread: 0x20000b20 (unknown)
    00> [00:00:03.139,007] [1;31m<err> os: Halting system

    another strange things that after dfu fails with mcuboot v3.0.2 i hae to flash the device twice with an older version for it to start working again, after first flash of the merged hex with --chiperase trying to reset or power off and power on the device just don't do it .. i wonder why it gets sttuck like this and why the second flashing of the same merged he as the first flash fixed it.

    hope to read you soon

    best regards

    Ziv

Reply
  • hi Vidar

    sorry for the delay in response i waited for the HW check and some test that i run

    Does your HW allow you to power cycle the board so that both the SPI flash and nRF get reset?

    the design is only based on nRF52840 reset control, then no it resets only the nRF, not the SPI flash.

    i can also add the following findings:

    • dfu from device with mcuboot v3.0.2 to an app with v3.0.2 fails

    (but after i flash a merged version of v2.9.2 i i did see the app comes up with the new app version 3.0.2 and after power off power on it came up with the v2.9.2 version (this is strange) however i did not manage to relet this scenario so i am not sure how valid this result is)

    • dfu from device with mcuboot v2.9.2 to an app with v3.0.2 pass
    • dfu from device with mcuboot v3.0.2 to an app with v2.9.2 fails

    pm_static between app v2.9.2 and app v3.0.2 did not change so same pm static

    i added logs to mcuboot and flashed this version (i had to flash it with --pinreset for some reason else it crashed at start) and after dfu +nrfjprog --reset i get this logs 

    00> [00:00:00.250,396] <inf> AUGU_GPIO_POWER: augu_power_resource_init: boost_on_node
    00> [00:00:00.250,793] <inf> AUGU_GPIO_POWER: augu_power_resource_init: hw_id_on_node
    00> [00:00:00.251,220] <inf> AUGU_GPIO_POWER: augu_power_resource_init: sensors_on_node
    00> [00:00:00.251,617] <inf> AUGU_GPIO_POWER: augu_power_resource_init: flash_on_node
    00> [00:00:00.252,044] <inf> AUGU_FLASH_W25N: master bus found: spi@40023000
    00> [00:00:00.252,410] <inf> AUGU_FLASH_W25N: turning flash on.[00:00:03.136,077] <err> os: ***** USAGE FAULT *****
    00> [00:00:03.136,383] <err> os:   Illegal use of the EPSR
    00> [00:00:03.136,688] <err> os: r0/a1:  0x00000000  r1/a2:  0x00000000  r2/a3:  0x00000000
    00> [00:00:03.137,115] <err> os: r3/a4:  0x00000000 r12/ip:  0x00000000 r14/lr:  0x00000000
    00> [00:00:03.137,542] <err> os:  xpsr:  0x00000000
    00> [00:00:03.137,847] <err> os: Faulting instruction address (r15/pc): 0x00000000
    00> [00:00:03.138,244] <err> os: >>> ZEPHYR FATAL ERROR 35: Unknown error on CPU 0
    00> [00:00:03.138,641] <err> os: Current thread: 0x20000b20 (unknown)
    00> [00:00:03.139,007] [1;31m<err> os: Halting system

    another strange things that after dfu fails with mcuboot v3.0.2 i hae to flash the device twice with an older version for it to start working again, after first flash of the merged hex with --chiperase trying to reset or power off and power on the device just don't do it .. i wonder why it gets sttuck like this and why the second flashing of the same merged he as the first flash fixed it.

    hope to read you soon

    best regards

    Ziv

Children
  • Hi Ziv,

    I guess the usage fault is likely what prevents the bootloader from booting the app, and not that the bootloader is failing to access the flash, but it is strange that the exception stack frame is all zeroes. Do you have any calls memset() or memcpy() in AUGU_FLASH_W25N or AUGU_GPIO_POWER that could potentially be zeroing the stack? 

    Best regards,

    Vidar

  • any calls memset() or memcpy()

    no such calls in those modules.

    i tried with another version with logs and added stack sentinal  got this after reset after dfu stopped and i re flashed the merged hex:

    00> [1779891635] <inf> mcuboot_util: Image index: 0, Swap type: none
    00> [1779891635] <inf> mcuboot_util: Image index: 0, Swap type: none
    00> [1779891635] <inf> mcuboot_util: Image index: 0, Swap type: none
    00> [1779891635] <inf> SMP_SERVICE: dfu has started. configuring connection watchdog to 10 minutes
    00> [1779891635] <inf> COMM_MNG: ble_connection_config updated!
    00> [1779891635] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7980000, size 655360, blocks_to_erase=5
    00> [1779891635] <inf> mcumgr_img_grp: Erased 0xa0000 bytes of image slot
    00> [1779891635] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a40000, size 131072, blocks_to_erase=1
    00> [1779891635] <inf> mcumgr_img_grp: Erased 0x20000 bytes of image slot trailer
    00> [1779891636] <inf> COMM_MNG: le_param_updated: LE conn param updated: interval 0x0006 latency 0 timeout 400
    00> [1779891636] <inf> COMM_MNG: RSSI is -47
    00> [00:00:00.250,427] <inf> AUGU_GPIO_POWER: augu_power_resource_init: boost_on_node
    00> [00:00:00.250,823] <inf> AUGU_GPIO_POWER: augu_power_resource_init: hw_id_on_node
    00> [00:00:00.251,220] <inf> AUGU_GPIO_POWER: augu_power_resource_init: sensors_on_node
    00> [00:00:00.251,617] <inf> AUGU_GPIO_POWER: augu_power_resource_init: flash_on_node
    00> [00:00:00.252,014] <inf> AUGU_FLASH_W25N: master bus found: spi@40023000
    00> [00:00:00.252,380] <inf> AUGU_FLASH_W25N: turning flash on...
    00> [00:00:00.252,746] <inf> AUGU_FLASH_W25N: flash_w25n_spi_init done
    00> [00:00:00.253,051] <inf> AUGU_FLASH_W25N: spi init ok
    00> [00:00:00.253,356] <inf> AUGU_FLASH_W25N: power on ok, sleeping 50ms
    00> [00:00:00.303,771] <inf> AUGU_FLASH_W25N: reading jedec id (1)
    00> [00:00:00.304,138] <inf> AUGU_FLASH_W25N: reading jedec id (2)
    00> [00:00:00.304,534] <inf> AUGU_FLASH_W25N: flash jedec device id 0xefba21
    00> [00:00:00.306,915] <inf> AUGU_FLASH_W25N: bad block management lut markdown found
    00> [00:00:00.307,281] <inf> AUGU_FLASH_W25N: flash_w25n init done
    00> *** Booting My Application v2.1.0-dev-ae1ee57f3906 ***
    00> *** Using nRF Connect SDK v3.0.2-89ba1294ac9b ***
    00> *** Using Zephyr OS v4.0.99-f791c49f492c ***
    00> [00:00:00.308,227] <inf> mcuboot: Starting bootloader
    00> [00:00:00.312,896] <inf> mcuboot: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
    00> [00:00:00.313,385] <inf> mcuboot: Scratch: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
    00> [00:00:00.313,812] <inf> mcuboot: Boot source: primary slot
    00> [00:00:00.319,122] <inf> mcuboot: Image index: 0, Swap type: test
    00> [00:00:03.057,678] <err> os: r0/a1:  0x00000002  r1/a2:  0xf0f0f0f0  r2/a3:  0x200073c8
    00> [00:00:03.058,105] <err> os: r3/a4:  0x00000002 r12/ip:  0x00000001 r14/lr:  0x00008149
    00> [00:00:03.058,532] <err> os:  xpsr:  0x01000000
    00> [00:00:03.058,807] <err> os: Faulting instruction address (r15/pc): 0x00007f7e
    00> [00:00:03.059,204] <err> os: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
    00> [00:00:03.059,600] <err> os: Current thread: 0x200019a0 (unknown)
    00> [00:00:03.059,936] <err> os: Halting system

    don't know if it helps to understand more

    i also tried increasing the stack for mcuboot CONFIG_MAIN_STACK_SIZE=10240 (it was set to 2k before)

    and i got this after dfu:

     [1779893114] <inf> mcuboot_util: Image index: 0, Swap type: none
    00> [1779893114] <inf> mcuboot_util: Image index: 0, Swap type: none
    00> [1779893114] <inf> mcuboot_util: Image index: 0, Swap type: none
    00> [1779893114] <inf> SMP_SERVICE: dfu has started. configuring connection watchdog to 10 minutes
    00> [1779893114] <inf> COMM_MNG: ble_connection_config updated!
    00> [1779893114] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7980000, size 655360, blocks_to_erase=5
    00> [1779893114] <inf> mcumgr_img_grp: Erased 0xa0000 bytes of image slot
    00> [1779893114] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a40000, size 131072, blocks_to_erase=1
    00> [1779893114] <inf> mcumgr_img_grp: Erased 0x20000 bytes of image slot trailer
    00> [1779893115] <inf> COMM_MNG: le_param_updated: LE conn param updated: interval 0x0006 latency 0 timeout 400
    00> [1779893115] <inf> COMM_MNG: RSSI is -45
    00> [00:00:00.250,396] <inf> AUGU_GPIO_POWER: augu_power_resource_init: boost_on_node
    00> [00:00:00.250,793] <inf> AUGU_GPIO_POWER: augu_power_resource_init: hw_id_on_node
    00> [00:00:00.251,190] <inf> AUGU_GPIO_POWER: augu_power_resource_init: sensors_on_node
    00> [00:00:00.251,617] <inf> AUGU_GPIO_POWER: augu_power_resource_init: flash_on_node
    00> [00:00:00.252,014] <inf> AUGU_FLASH_W25N: master bus found: spi@40023000
    00> [00:00:00.252,380] <inf> AUGU_FLASH_W25N: turning flash on...
    00> [00:00:00.252,716] <inf> AUGU_FLASH_W25N: flash_w25n_spi_init done
    00> [00:00:00.253,051] <inf> AUGU_FLASH_W25N: spi init ok
    00> [00:00:00.253,356] <inf> AUGU_FLASH_W25N: power on ok, sleeping 50ms
    00> [00:00:00.303,771] <inf> AUGU_FLASH_W25N: reading jedec id (1)
    00> [00:00:00.304,138] <inf> AUGU_FLASH_W25N: reading jedec id (2)
    00> [00:00:00.304,534] <inf> AUGU_FLASH_W25N: flash jedec device id 0xefba21
    00> [00:00:00.306,945] <inf> AUGU_FLASH_W25N: bad block management lut markdown found
    00> [00:00:00.307,342] <inf> AUGU_FLASH_W25N: flash_w25n init done
    00> *** Booting My Application v2.1.0-dev-ae1ee57f3906 ***
    00> *** Using nRF Connect SDK v3.0.2-89ba1294ac9b ***
    00> *** Using Zephyr OS v4.0.99-f791c49f492c ***
    00> [00:00:00.308,288] <inf> mcuboot: Starting bootloader
    00> [00:00:00.313,262] <inf> mcuboot: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
    00> [00:00:00.313,751] <inf> mcuboot: Scratch: magic=bad, swap_type=0x1, copy_done=0x2, image_ok=0x2
    00> [00:00:00.314,208] <inf> mcuboot: Boot source: primary slot
    00> [00:00:00.319,488] <inf> mcuboot: Image index: 0, Swap type: test
    00> [00:00:03.216,064] <inf> mcuboot: Starting swap using scratch algorithm.
    00> [00:00:03.216,491] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a60000, size 131072, blocks_to_erase=1
    00> [00:00:03.309,570] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a60000, size 131072, blocks_to_erase=1
    00> [00:00:03.858,398] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a40000, size 131072, blocks_to_erase=1
    00> [00:00:03.860,290] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a00000, size 131072, blocks_to_erase=1
    00> [00:00:08.480,316] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a60000, size 131072, blocks_to_erase=1
    00> [00:00:09.029,144] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x79e0000, size 131072, blocks_to_erase=1
    00> [00:00:13.648,193] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a60000, size 131072, blocks_to_erase=1
    00> [00:00:14.197,021] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x79c0000, size 131072, blocks_to_erase=1
    00> [00:00:18.817,993] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a60000, size 131072, blocks_to_erase=1
    00> [00:00:19.366,821] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x79a0000, size 131072, blocks_to_erase=1
    00> [00:00:23.986,083] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7a60000, size 131072, blocks_to_erase=1
    00> [00:00:24.534,912] <inf> AUGU_FLASH_W25N: flash erase to addr: 0x7980000, size 131072, blocks_to_erase=1
    00> [00:00:30.364,501] <inf> mcuboot: Bootloader chainload address offset: 0xf000
    00> [00:00:30.364,898] <inf> mcuboot: Image version: v2.2.3
    00> [00:00:30.365,203] <inf> mcuboot: Jumping to the first image slot
    00> [00:00:30.365,539] <err> os: SPSEL in thread mode does not indicate PSP
    00> [00:00:30.365,539] <err> os: ***** HARD FAULT *****
    00> [00:00:30.365,539] <err> os:   Fault escalation (see below)
    00> [00:00:30.365,539] <err> os: ***** USAGE FAULT *****
    00> [00:00:30.365,539] <err> os:   Illegal use of the EPSR
    00> [00:00:30.365,539] <err> os: r0/a1:  0x20009bc8  r1/a2:  0x00002d75  r2/a3:  0x0000a2ef
    00> [00:00:30.365,539] <err> os: r3/a4:  0x00002d61 r12/ip:  0x00002d61 r14/lr:  0x00002d61
    00> [00:00:30.365,539] <err> os:  xpsr:  0x00000000
    00> [00:00:30.365,539] <err> os: Faulting instruction address (r15/pc): 0x00002d61
    00> [00:00:30.365,539] <err> os: >>> ZEPHYR FATAL ERROR 35: Unknown error on CPU 0
    00> [00:00:30.365,539] <err> os: Current thread: 0x200019a0 (unknown)
    00> [00:00:30.365,539] [1;31m<err> os: Halting system

    i also understand that mcuboot on ncs3.0.2 is a bit bigger and uses some other encryption that maybe require more stack .. maybe that is related somehow ? 

Related