Hardfault on bootloader jump to app in release build, works with debug build

Hello,

We have a head-scratching issue.
For the sake of brevity, please assume in the following that "it works" means "the bootloader manages to jump into the app, which then runs and enumerates on USB".


Development setup:
We have a custom device based on the nRF52820, SDK 17.1.0, no softdevice, gcc toolchain.
Our build system is based on CMake, since it integrates well with the CLion IDE we use. The cmake files are a fork of Polidea's. Here is the fork we use (master branch).
We flash the MBR, the bootloader, and the application together with the bootloader settings page.
The linker files we use are part of our CMake fork above; here is the folder.
The bootloader uses this ldfile for the debug build and this ldfile for the release build.
The application uses this ldfile when flashed together with the debug BL, or this ldfile when flashed together with the release BL.

Both the application and the bootloader use USB CDC ACM as the interface: for a custom protocol (the app) and for the DFU protocol (the bootloader).
The application is based on FreeRTOS. We had to make a minor modification to the port files, since the context-switching code assumes the presence of an FPU, which the 820 does NOT have.
The bootloader is essentially the Secure Bootloader with minor modifications for our custom board and integration into our CMake build.

The issue we observe is that the debug BL works whether the app is built for release OR debug.
The release BL only enumerates if no application is flashed, or an application is flashed but without bl_settings.
If an app with valid settings page is present, the app won't start and the USB won't enumerate.

The lack of USB enumeration suggests that the BL is not simply refusing to boot the app (otherwise the bootloader itself would enumerate as a USB device and loop, waiting for a DFU to start). Rather, either it jumps into the app, which then dies, or writing the app to flash corrupts something.
In fact, attaching the debugger (to the release BL, with app + bootloader settings flashed) is revealing: the bootloader correctly reaches app_start(), sets MSP and jumps to the app. Stepping further finds the CPU looping forever at address 0x14ee, in the application's HardFault handler.
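The jump described above boils down to reading two words from the start of the application's vector table, the initial stack pointer (word 0, at 0x1000 in this layout) and the reset handler address (word 1, at 0x1004), then loading MSP and branching. The host-side sketch below models only that decoding step so it can run off-target; the type and function names are hypothetical, the RAM bounds assume the nRF52820's 32 KB of RAM at 0x20000000, and the test values are the ones reported later in this thread. It also shows why a gdb breakpoint on the reset handler has to clear bit 0: on Cortex-M that bit is the Thumb-state flag, not part of the instruction address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: the two words the jump needs from the app's
 * vector table (assumed to start at 0x1000, right after the MBR). */
typedef struct {
    uint32_t initial_msp;   /* word 0: value loaded into MSP */
    uint32_t reset_handler; /* word 1: branch target, Thumb bit set */
} app_entry_t;

/* Assumed nRF52820 RAM: 32 KB starting at 0x20000000. */
#define RAM_START 0x20000000u
#define RAM_END   0x20008000u /* one past the last RAM byte */

/* Read the first two words of a vector table image. */
app_entry_t decode_vector_table(const uint32_t *vector_table)
{
    app_entry_t entry;
    entry.initial_msp = vector_table[0];
    entry.reset_handler = vector_table[1];
    return entry;
}

/* Plausibility checks: the MSP lies above RAM_START and at most at
 * RAM_END (a full-descending stack may start one past the last byte),
 * it is 8-byte aligned, and the reset handler has the Thumb bit set. */
bool entry_looks_valid(app_entry_t entry)
{
    bool msp_in_ram = entry.initial_msp > RAM_START &&
                      entry.initial_msp <= RAM_END;
    bool msp_aligned = (entry.initial_msp % 8u) == 0u;
    bool thumb_bit = (entry.reset_handler & 1u) != 0u;
    return msp_in_ram && msp_aligned && thumb_bit;
}

/* Code address for a gdb "b *ADDR" breakpoint: clear the Thumb bit. */
uint32_t breakpoint_address(uint32_t handler)
{
    return handler & ~1u;
}
```

On target, the same two words can be inspected in gdb with x/2wx 0x1000 and compared with MSP and PC right after the jump.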

The same debugging steps on the debug build of the BL show that the bootloader correctly reaches app_start(), then jumps into the app, which continues executing normally.

My guess is that something is wrong either in the linker files or in some gcc argument that differs between the Release and Debug builds; however, I can't find anything obviously wrong there.

  • Hello,

    We have had some issues with the app_start() implementation failing when built without code optimization enabled (our bootloader examples are only tested with optimization enabled), but it's apparently the other way around in your case. And by failing, I mean that it jumps to the wrong address.

    Were you able to verify that app_start() jumps to the correct address in your release build (i.e. to the application's reset handler, whose address is stored at 0x1004)?

    Also, would you be able to share the *.map files for both builds here? I'd like to compare them to see if I can spot anything.

    Best regards,

    Vidar 

  • Hello Vidar,

    Thank you for your reply.

    The reset handler address stored at 0x1004 is 0x14D5 for the debug application and 0x14C5 for the release application (bit 0 is the Thumb bit, so the handler code itself starts at 0x14D4 and 0x14C4 respectively).
    I believe the app_start() jump is correct: I placed a breakpoint at 0x14C4 (from the gdb CLI, with b *0x14C4) and it was hit after a reset and continue.

    In case it's relevant, these are the gcc flags used for the build (common flags, debug-build flags and release-build flags, respectively):

    CMAKE_C_FLAGS: -MP -MD -mthumb -mabi=aapcs -Wall -g3 -ffunction-sections -fdata-sections -fno-strict-aliasing -fno-builtin --short-enums -mcpu=cortex-m4 -mfloat-abi=soft
    CMAKE_C_FLAGS_DEBUG: -g -O1 -DDEBUG -DDEBUG_NRF -DDEBUG_NRF_USER
    CMAKE_C_FLAGS_RELEASE: -O3 -DNDEBUG -O3

    Please find attached the map files for both Debug and Release builds of the bl+app below:
    app-release.map

    bl-release.map

    7522.app-debug.map

    5140.bl-debug.map

  • Hello,

    Thanks for confirming that the bootloader correctly loads the reset handler address from the reset vector. I forgot to ask whether you could also check the main stack pointer (MSP) when entering the reset handler. It should be the same as the initial stack pointer value stored at address 0x1000.

    If the bootloader sets both PC and MSP correctly, I would start to suspect that it does not properly clean up the peripheral configuration before booting the application.

    Do you know when the exception is raised: in the reset handler, or some time after reaching main()? A stack trace ("bt" command in gdb) from the fault handler may help show where it happened.

    I did compare the *.map files you uploaded, but unfortunately I could not spot any obvious errors. My first suspicion was that the linker had discarded one of the section variables used to hold the event callbacks, leaving the app with an uninitialized function pointer (I have seen this happen in CMake builds where the SDK is linked in as a precompiled library). But if that were the case, you should have gotten a hardfault regardless of how the bootloader was built.
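When bt stops at the fault handler with no usable frames, the faulting location can usually still be recovered by hand. On exception entry, a Cortex-M core without an FPU (such as the nRF52820) pushes an eight-word frame (R0, R1, R2, R3, R12, LR, PC, xPSR) onto the stack that was active, and bit 2 of the EXC_RETURN value placed in LR tells whether that was MSP or PSP; this matters here because FreeRTOS tasks run on PSP. The sketch below is a host-side model of that decoding; the names are hypothetical, and on target the eight words would be read from the stacked SP in gdb rather than from an array. The stacked PC can then be resolved against the application's .map file or .elf (for example with arm-none-eabi-addr2line).

```c
#include <stdbool.h>
#include <stdint.h>

/* Basic exception frame pushed by a Cortex-M core without an FPU,
 * lowest address first. Hypothetical type name. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;   /* LR of the interrupted code */
    uint32_t pc;   /* stacked PC: usually at or near the faulting
                      instruction (imprecise bus faults can point later) */
    uint32_t xpsr; /* xPSR of the interrupted code */
} exception_frame_t;

/* Bit 2 of EXC_RETURN (LR on handler entry): 0 = frame is on MSP,
 * 1 = frame is on PSP (e.g. a FreeRTOS task was running). */
bool frame_on_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0u;
}

/* Interpret eight words read at the stacked SP as a frame. */
exception_frame_t decode_frame(const uint32_t words[8])
{
    exception_frame_t f;
    f.r0 = words[0];
    f.r1 = words[1];
    f.r2 = words[2];
    f.r3 = words[3];
    f.r12 = words[4];
    f.lr = words[5];
    f.pc = words[6];
    f.xpsr = words[7];
    return f;
}
```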

  • Hello Vidar,

    The MSP seems correct as well: it is set to 0x20008000 (the top of the nRF52820's 32 KB RAM) when jump_to_addr() is called.

    Unfortunately, bt in gdb only gives me:

    (gdb) bt
    #0  0x000014ee in ?? ()

    The mystery thickens: I have noticed that it is not the release build as such that fails; the issue follows the optimisation level.
    If the bootloader is built with -O2 or -O3, I get the hardfault after the jump. If I build it with -O1 or -Og, even in the release build, the application starts correctly.

    For reference, this is how our CMake sets the flags:

    add_definitions(-DBOARD_${NRF_BOARD} -DNRF52820_XXAA -DFLOAT_ABI_SOFT)
    set(CPU_FLAGS "-mcpu=cortex-m4 -mfloat-abi=soft")
    
    set(COMMON_FLAGS "-MP -MD -mthumb -mabi=aapcs -Wall -g3 -ffunction-sections -fdata-sections -fno-strict-aliasing -fno-builtin --short-enums ${CPU_FLAGS}")
    
    # compiler/assembler/linker flags
    set(CMAKE_C_FLAGS "${COMMON_FLAGS}")
    set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} -Og -DDEBUG -DDEBUG_NRF -DDEBUG_NRF_USER")
    set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -O2")
    set(CMAKE_CXX_FLAGS "${COMMON_FLAGS}")
    set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -Og -DDEBUG -DDEBUG_NRF -DDEBUG_NRF_USER")
    set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O2")
    set(CMAKE_ASM_FLAGS "-MP -MD -std=c99 -x assembler-with-cpp")
    set(CMAKE_EXE_LINKER_FLAGS "-mthumb -mabi=aapcs -std=gnu++98 -std=c99 -L ${NRF5_SDK_PATH}/modules/nrfx/mdk -T${NRF5_LINKER_SCRIPT} ${CPU_FLAGS} -Wl,--gc-sections --specs=nano.specs -lc -lnosys -lm")

    If I set -O1 in CMAKE_C_FLAGS_RELEASE, the app starts with the Release build. If I set -O2 in CMAKE_C_FLAGS_DEBUG, the app does not start with the Debug build. So the issue depends on the optimisation flags, not on the build type.
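Since the failure follows the bootloader's optimisation level, one way to narrow it down further is to keep the global -O2 and pin individual suspect functions (for example those involved in app_start()) to a lower level with GCC's optimize function attribute, rebuilding until the app starts again. This is a GCC-specific debugging aid (the GCC manual recommends it for debugging rather than production code), and the function below is a hypothetical stand-in, not SDK code.

```c
#include <stdint.h>

/* Hypothetical stand-in for a bootloader function suspected of being
 * sensitive to the optimisation level. The GCC-specific optimize
 * attribute compiles only this function at -O1, while the rest of the
 * file keeps the command-line level (-O2/-O3). noinline keeps it as a
 * separate function so its effect is easier to isolate. */
__attribute__((optimize("O1"), noinline))
uint32_t read_reset_vector(const volatile uint32_t *vector_table)
{
    /* Word 1 of the vector table holds the reset handler address. */
    return vector_table[1];
}
```

If pinning one function to -O1 makes the jump succeed, comparing that function's disassembly at both levels (e.g. with arm-none-eabi-objdump -d) would be a natural next step.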

  • Hello,

    It's good to know that it's related to the optimization level and not the build configuration, at least. I'm a bit surprised that the stack trace did not reveal more, though. As a next step, I would suggest enabling the HardFault handling library in your app to see if you can get more information from the debug log it prints. You can also read this information through GDB if you load the symbols from your application *.elf.

    For the hardfault library you need to include the following source files:

     -  /components/libraries/hardfault/hardfault_implementation.c

     -  /components/libraries/hardfault/nrf52/handler/hardfault_handler_gcc.c

    And set HARDFAULT_HANDLER_ENABLED to '1' in your sdk_config header.
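In an SDK-style sdk_config.h the option is normally wrapped in an #ifndef guard, so the change would look roughly like the excerpt below (the surrounding layout is assumed). The library reports the fault through the SDK logger, so logging presumably has to be enabled in the app as well for the output to appear.

```c
/* sdk_config.h excerpt (assumed layout): enable the nRF5 SDK
 * HardFault handling library in the application. */
#ifndef HARDFAULT_HANDLER_ENABLED
#define HARDFAULT_HANDLER_ENABLED 1
#endif
```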
