OTA DFU Stuck in pending

I have adapted and incorporated the NCS OTA DFU sample into my project. Initially, FOTA updates worked. Now, after pushing an update, the image is stuck pending in slot=1 (see mcumgr output below). It does not swap to slot=0 after a reset. Additionally, I can't erase or overwrite this image in slot=1. I get various errors such as "Mcu Mgr Error: BAD_STATE (6)" or "Mcu Mgr Error: NO_MEMORY (2)".

This device is already deployed so programming pins are inaccessible; remote management is the only way I can possibly recover this.

Lastly, and I'm not sure if this matters, I wanted to use LittleFS but it was conflicting with NVS, so I made this change in the nrf repo:
devzone.nordicsemi.com/.../308671

Is there any way I can complete this update, or cancel it and start a new one?

sudo mcumgr --conntype ble <connection string> image list

Images:
 image=0 slot=0
    version: 0.0.0
    bootable: true
    flags: active confirmed
    hash: 8cc09b4293faaac3c0d613aaadbeab0aefc4fd87aa85ecc3ad29ac4edbfc3e41
 image=0 slot=1
    version: 0.0.0
    bootable: true
    flags: pending
    hash: 20e3bbee692b24aeb6bdb9cb5cf3fbb8ccc9929ce5cf58cebfc5f2ddf3c771cc
Split status: N/A (0)

BT40F (nRF5340)
git describe in sdk-nrf: v1.8.0-97-gb57588840
git describe in sdk-zephyr: v2.7.0-ncs1-25-g2dca349769

  • Hi Qualry,

    Could you try running mcumgr image test with the hash of the second image (instead of image list)?

    By doing that, mcumgr marks the image in the second slot to be executed on the next reset (changing it from pending to confirmed).

    If you get an error when doing that, could you send the log?

    Have you tried to reproduce the issue on a local device that you can attach a debugger to?

  • > Could you try running mcumgr image test with the hash of the second image (instead of image list)?

    I am unable to do that. When I initially attempted an OTA update I uploaded the image, tested the image in the slot, then reset the device. Since then the image has remained in slot=1 with the pending status. In the nRF Device Manager app the option to test the image is grayed out. Also, `mcumgr <connection string> image test hash` doesn't give an error but also doesn't seem to do anything.

    > Have you tried to reproduce the issue on a local device that you can attach a debugger on ?

    No. I am trying, but have not been able to.

  • Hi Qualry,

    Without debug information it is hard for us to know what the reason could be.

    I assume that if you do a FOTA update with the same image on the same application on your test device, you see no error?

    And if you try to do another FOTA update on the deployed device, you receive the BAD_STATE and NO_MEMORY errors? (Please send us the log when you receive these errors.)

    Have you tried to use the phone to do the FOTA update?

    Do you store any other data in the flash?

    Could you send the partitions.yml file of both the old and the new image build?

  • I ran into something similar recently when bootloading using external flash. I had mistakenly made mcuboot_primary a different size from mcuboot_secondary (set using pm_static.yml). I found that MCUboot would not try to test the new/pending image. Instead, it would display an error, leave the pending flag on the secondary image as-is, and jump back to the application.

    It seems like this is a bug in mcuboot because a device can no longer accept new firmware updates.  The pending flag is set on the secondary image so this image cannot be deleted.  Resetting doesn't clear the pending flag because of the size mismatch between the primary and secondary images.  Performing an image test again doesn't do anything because the pending flag is already set.

    It would be better if mcuboot always cleared the pending flag (even if there is an error).  Alternatively, it would be good to be able to force a delete of the secondary image, even if the pending flag is set.  This would allow recovering from this "bricked" scenario.

    It is probably possible to recover from this using the serial recovery feature, but this isn't always feasible on a sealed device without exposed reset and UART pins.
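
The size mismatch described above could be caught before a release with a small build-time check. The following is a minimal sketch and is not part of MCUboot or the nRF Connect SDK: the dictionary stands in for values read from a build's partitions.yml, the function name is hypothetical, and the addresses and sizes are made up for illustration.

```python
def slot_sizes_match(partitions):
    """Return True if the MCUboot primary and secondary slots have equal size."""
    primary = partitions["mcuboot_primary"]["size"]
    secondary = partitions["mcuboot_secondary"]["size"]
    return primary == secondary


# Hypothetical layout for illustration only; real values come from partitions.yml.
example_partitions = {
    "mcuboot_primary": {"address": 0x10000, "size": 0x70000},
    "mcuboot_secondary": {"address": 0x80000, "size": 0x74000},
}

if not slot_sizes_match(example_partitions):
    print("mcuboot_primary and mcuboot_secondary differ in size")
```

Running a check like this against every release candidate would flag the mismatch before the image reaches a device that cannot be reached with a debugger.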

  • > Could you send the partitions.yml file of both the old and new image build.

    This is likely the issue (attachments: older partitions.yml, newer partitions.yml).

    I have recreated this situation on an nRF5340 DK. Here's the log (attachment: log output).

    In my search I've seen other people in the same predicament. It would be nice if this was implemented:

    https://github.com/apache/mynewt-mcumgr/issues/157

  • @TJ Stone: Thanks for the information. It's very useful. I do agree that there should be a way to fall back if the image couldn't be swapped.

    @Qualry: Could you explain what you got with the nRF5340 DK? From what I can see in the log, the permanent swap was successful?

    From the partition files it seems that you added littlefs to the application and the size of the slots was reduced, but as far as I know this shouldn't be a problem?
    TJ mentioned the issue where the secondary slot is actually bigger than the primary slot, making it impossible to swap, but I guess that's not the case on your device?

  • I re-ran the steps that cause this issue and saved the image list output after each step along the way.

    First I upload the image (8cc09b) that introduces littlefs. When I test the image and restart the device, the image is swapped into the first slot, but the original image (c49DD8) doesn't show up in the second slot. Additionally, attempting to confirm 8cc09b doesn't actually do anything. It simply reverts back to the second slot on reset. At this point (step #6) I confirm the image without testing it first. After this, the uploaded image remains in slot 1 after resetting.

  • Hi Qualry,

    Something must go wrong when littlefs is introduced. If you test DFU with MCUboot without littlefs added, do you see the same problem?

    Step 7 is a little unclear to me. It seems that after step 7 the image (8cc09b) is switched to slot 0?

    In step 8, (20e3b) is uploaded but is not able to switch to slot 0? Is there any difference in the partition map between (20e3b) and (8cc09b)?

    Update:

    I have checked with our team and it seems that the littlefs partition is indeed causing the issue. The reason is that MCUboot stores metadata (the image trailer) at the end of the image flash area. In this case the littlefs area occupies 0xf8000 to 0xfe000, which covers the end address of slot 1 in the old layout (0xfc000).
    My suggestion is to limit the littlefs area to addresses 0xf8000 to 0xfb000, or to move it down in the flash if the size cannot be changed.

    In any case, you may need to update MCUboot, because the partition map in MCUboot is fixed, and this may cause trouble when you continue to do DFU in the future.
    So it's very important that you match the partition map of the new image to that of the old image.
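
The overlap described in the update can be checked with simple range arithmetic. This is a minimal sketch using the addresses quoted in the thread. The trailer size is an assumption (one 4 KiB page at the end of the slot); the real trailer size depends on the MCUboot configuration, and the function name is hypothetical.

```python
def ranges_overlap(start_a, end_a, start_b, end_b):
    """Return True if half-open ranges [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


# Addresses quoted in the thread.
LITTLEFS_START = 0xF8000
LITTLEFS_END = 0xFE000
OLD_SLOT1_END = 0xFC000

# Assumption: the image trailer lives in the last 4 KiB of the slot.
TRAILER_SIZE = 0x1000
trailer_start = OLD_SLOT1_END - TRAILER_SIZE

# littlefs as built: reaches past the end of the old slot 1.
overlaps_as_built = ranges_overlap(
    LITTLEFS_START, LITTLEFS_END, trailer_start, OLD_SLOT1_END
)

# littlefs limited to 0xf8000..0xfb000, as suggested in the reply.
overlaps_if_limited = ranges_overlap(
    LITTLEFS_START, 0xFB000, trailer_start, OLD_SLOT1_END
)

print(overlaps_as_built, overlaps_if_limited)
```

Under the stated trailer-size assumption, the original littlefs range intersects the trailer region of the old slot 1, while the reduced range ending at 0xfb000 does not.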

  • > It's a little bit unclear to me at step 07. Seems like after step 7 the image (8cc09b) is switched to slot 0 ?

    Yes. However, typically when you confirm an image and reset the device, the images in slot 0 and slot 1 switch places. What is happening here is that (8cc09b) swaps into slot 0 and (c49DD8) is nowhere to be seen.

    > Is there any difference in the partition map between (20e3b) and (8cc09b)?

    Yes. The two partitions.yml files are attached to this comment: https://devzone.nordicsemi.com/f/nordic-q-a/86332/ota-dfu-stuck-in-pending/361077#361077


    > So it's very important that you match the partition map on the new image to the old image. 

    This looks like the solution moving forward. Unfortunately, it looks like my current devices are bricked.

  • Hi Qualry, 

    qualry said:
    Yes. However typically when you confirm an image and reset the device the images in slot 0 and slot 1 switch places. Yet what is happening here is that (8cc09b) swaps into slot 0 and (c49DD8) is nowhere to be seen.

    I think this correctly reflects the issue that the littlefs partition covers the slot 1 image trailer. This is why you don't see the old image any more.
    But I haven't seen the "DFU stuck in pending" issue in your test. Could you clarify?

    I believe there should be a workaround if you place the littlefs partition correctly, so that it doesn't interfere with the image trailer. However, in the long run this workaround may not be the best option.

  • > But I haven't seen the issue DFU Stuck in pending in your test. Could you clarify ?

    See how in step 9 I do an image reset and (20e3bbe) goes into pending? After a device reset, (20e3bbe) remains in slot 1 with the pending flag instead of swapping into slot 0. It stays like this indefinitely.

    > I believe there should be a workaround if you place the littlefs location correctly,

    The solution that I'm going with moving forward is to simply not allow the partition layout to change between OTA upgrades by using a pm_static.yml file.
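
A fixed pm_static.yml can be backed up by a release check that compares the deployed layout with the candidate build. The following is a minimal sketch, assuming both layouts have already been loaded into dictionaries of (address, size) pairs. The function name, partition names, and values are hypothetical and are not taken from the attached partitions.yml files.

```python
def layout_changes(old, new):
    """Return sorted names of partitions that were added, removed, moved, or resized."""
    changed = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed.append(name)
    return changed


# Hypothetical (address, size) pairs for illustration only.
deployed = {
    "mcuboot_primary": (0x10000, 0x76000),
    "mcuboot_secondary": (0x86000, 0x76000),
}
candidate = {
    "mcuboot_primary": (0x10000, 0x74000),
    "mcuboot_secondary": (0x84000, 0x74000),
    "littlefs_storage": (0xF8000, 0x6000),
}

changes = layout_changes(deployed, candidate)
if changes:
    print("Partition layout changed:", ", ".join(changes))
```

An empty result means the candidate image uses the same partition map as the firmware already in the field, which is the condition the thread identifies as necessary for a safe OTA update.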
