
Older DFU application valid issue

Hello,

There have been some questions on the forum regarding the DFU and its zero value for the application CRC. However, none of them seem to address this question, so I thought I would float it out here and see if anyone else has encountered this or even has a solution.

The product I am writing about is running on an older SDK (5.2.0!), but it's been working and we haven't needed to upgrade it. Recently we had a problem come up with the application section of the FW on the nRF51822. After creating a fix for the issue, we had 400 units in inventory that needed to be upgraded prior to shipping. Rather than tear them apart to get at the programming connector, we decided to OTA bootload the new application FW into the devices. After it was all said and done, 4 of the 400 units were "bricked".

The folks in production gave me one of the units to see if I could determine what happened. Note that we are using a dual bank DFU based on the SDK mentioned above. The only modifications I made to the example bootloader were to remove the LED and button handling and replace them with a flag in the GPREGRET register to enter bootload mode from the application.

What I found was that the application bank was erased (all 0xFF) and there was a valid binary image in the second bank. However, in the application settings page at the end of memory, the application was marked as valid with a CRC of 0 (just like it would look when the whole image is programmed in the factory), and the second bank was marked as erased.

After examining the bootloader init code, I determined that every time we reset, the bootloader would take control, see the app valid flag and zero CRC, and jump to the erased application, resulting in a hard fault. The device was indeed bricked.
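To make that boot decision concrete, here is a minimal sketch of the logic as I understand it. The type and function names (`boot_settings_t`, `would_boot_app`) are my own illustrative stand-ins, not the SDK's, and the rule that a stored CRC of 0 skips the check is an assumption based on the behaviour described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the bootloader settings; not the SDK's names. */
typedef enum
{
    BANK_STATE_VALID_APP,
    BANK_STATE_ERASED
} bank_state_t;

typedef struct
{
    bank_state_t bank_0;     /* status flag stored in the settings page */
    uint16_t     bank_0_crc; /* assumed: 0 means "no CRC recorded, skip the check" */
} boot_settings_t;

/* Returns true when the bootloader would jump to the application. */
bool would_boot_app(const boot_settings_t *settings, uint16_t computed_crc)
{
    if (settings->bank_0 != BANK_STATE_VALID_APP)
    {
        return false; /* stay in DFU mode */
    }
    if (settings->bank_0_crc == 0)
    {
        return true;  /* check skipped: the image contents are never inspected */
    }
    return settings->bank_0_crc == computed_crc;
}
```

With the flag set to valid and the stored CRC at 0, the function returns true no matter what is actually in bank 0, which matches the state I found on the bricked unit.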

I did some more digging and I think I found the problem. In the activate function of dfu_dual_bank.c there is the following code that erases the application and writes the new binary in its place:

    // Stop the DFU Timer because the peer activity need not be monitored any longer.
    err_code = app_timer_stop(m_dfu_timer_id);
    APP_ERROR_CHECK(err_code);

    // Erase BANK 0.
    err_code = pstorage_raw_clear(&m_storage_handle_app, m_image_size);
    APP_ERROR_CHECK(err_code);

If the second pstorage operation fails and reports a result other than NRF_SUCCESS, the assert in the callback handler will reset the device and leave it in this state. If the code first updated the application status flag to erased, that would prevent this problem from happening.

  1. Is this a valid hypothesis?
  2. Does anyone have a cute way of putting the application CRC into the hex file as part of building a complete image (Soft Device, application, bootloader, application valid all in the same hex file)?

Right now I do get the application valid flag into the complete hex file, but I don't have a mechanism to add the CRC to it. If I can add the CRC of the application, it would also prevent this problem.
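For reference, this is the kind of host-side CRC routine I have in mind for a build step. It is only a sketch: it implements the standard CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF), which I am assuming is what the bootloader uses for its image check, and the function name `crc16_ccitt` is my own. That assumption should be verified against the bootloader source before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF,
 * no bit reflection, no final XOR. Assumed (not confirmed) to match the
 * CRC the bootloader computes over the application image. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;

    for (size_t i = 0; i < len; i++)
    {
        /* Fold the next byte into the top of the register. */
        crc ^= (uint16_t)((uint16_t)data[i] << 8);

        /* Shift out eight bits, applying the polynomial on each carry. */
        for (int bit = 0; bit < 8; bit++)
        {
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

The idea would be to run this over the raw application binary during the build and patch the result into the settings page data, next to the application valid flag.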

  • Hello,

    1. bootloader_app_is_valid() checks whether the first word DFU_BANK_0_REGION_START is erased or not. So it shouldn't have started the application if BANK 0 was completely erased at least.

    Do you have that check in your bootloader_app_is_valid() function?

    // There exists an application in CODE region 1.
    if (DFU_BANK_0_REGION_START == EMPTY_FLASH_MASK)
    {
        return false;
    }
    

    That said, without a CRC check you can't be 100% sure whether the image was partially erased/stored. A reset/timeout during the swap could have caused the device to get bricked as you described. I think it might be safer to update the boot settings before erasing, as you suggested:

    update_status.status_code = DFU_BANK_0_ERASED;
    
    bootloader_dfu_update_process(update_status);     
    
    // Erase BANK 0.
    err_code = pstorage_raw_clear(&m_storage_handle_app, m_image_size);
    APP_ERROR_CHECK(err_code);
    

    Worst case is then that it will fall back into DFU mode rather than attempting to boot a non-existent application and cause a hardfault exception.
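A small host-side simulation of why the ordering matters may help here. Everything in it is hypothetical (the `sim_` names and the two-field device model are illustrative only, not SDK code); it just models "flag in the settings page" against "is there a runnable image in bank 0".

```c
#include <stdbool.h>

typedef enum { SIM_BANK_VALID_APP, SIM_BANK_ERASED } sim_bank_state_t;

typedef struct
{
    sim_bank_state_t bank_0;      /* flag kept in the settings page */
    bool             app_present; /* true when bank 0 holds a runnable image */
} sim_device_t;

/* Simulates the bank swap. mark_first selects the proposed ordering
 * (settings updated before the erase). Returns false when the erase
 * "fails" and the device resets in the middle of the update. */
bool sim_activate(sim_device_t *dev, bool mark_first, bool erase_succeeds)
{
    if (mark_first)
    {
        dev->bank_0 = SIM_BANK_ERASED;
    }

    dev->app_present = false;     /* erase started: bank 0 is gone or partial */
    if (!erase_succeeds)
    {
        return false;             /* error handler resets the device here */
    }

    dev->app_present = true;      /* new image copied from bank 1 */
    dev->bank_0 = SIM_BANK_VALID_APP;
    return true;
}

/* A device is bricked when the flag says "valid" but no image is there. */
bool sim_is_bricked(const sim_device_t *dev)
{
    return dev->bank_0 == SIM_BANK_VALID_APP && !dev->app_present;
}
```

With `mark_first` false and a failed erase, the model ends up with a valid flag and no application, which is the bricked state. With `mark_first` true, the same failure leaves the flag at erased, so the bootloader would stay in DFU mode.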

    2. You could inject the CRC for bank 0 into the application valid .hex file you have, use nrfjprog to write it manually after loading the merged .hex file, or initialize m_boot_settings with the CRC and app valid flag in your bootloader code (.bank_0 = BANK_VALID_APP and .bank_0_crc = CRC of app.bin).
  • Hi Vidar,

    Thanks for your response. Regarding the check for the empty flash mask, that is there but I think there is a problem with it. Doesn't the DFU_BANK_0_REGION_START need to be a pointer? Because there is no way 0x14000 is going to equal 0xFFFFFFFF. I'm still recovering from a nasty cold so forgive me if I got that wrong.

    I was thinking the same thing on your second statement.

    One last question: Can't the events that the stack sends up when flash operations complete return a result that is not NRF_SUCCESS? I believe I've seen that in the past and in this case it would trigger a reset in the middle of the bootload operation.

  • Hi John,

    You're right, the empty flash check is clearly a mistake; I didn't notice that. You can typecast it to a pointer and then dereference it to make it work.
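A minimal sketch of the corrected check, assuming an erased 32-bit flash word reads back as 0xFFFFFFFF. The helper name `flash_word_is_erased` and the `EMPTY_FLASH_WORD` macro are illustrative, not SDK identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

#define EMPTY_FLASH_WORD 0xFFFFFFFFu /* assumed value of an erased flash word */

/* Reads the word stored at p_word instead of comparing the address itself. */
bool flash_word_is_erased(const volatile uint32_t *p_word)
{
    return *p_word == EMPTY_FLASH_WORD;
}
```

On the target, the call would look something like `flash_word_is_erased((const volatile uint32_t *)DFU_BANK_0_REGION_START)`, so the comparison is made against the contents of bank 0 rather than against the constant 0x14000.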

    The SD will report NRF_EVT_FLASH_OPERATION_ERROR if it fails to complete a flash operation. In that case the bootloader will reset itself (error handler) instead of retrying, as a flash operation is not expected to ever time out during an update. I know some bug fixes have been made to the pstorage module since SDK 5.2.0, so the problem may be related to the pstorage module itself rather than a timeout in the SoftDevice.

  • Thanks Vidar! That confirms what I thought was happening.
