FOTA Doom Loop After Failed Download

Toolchain v2.7.0, SDK v2.7.0, VS Code, Windows

Kicked off an application FOTA update, and somehow it failed:

[00:02:06.161,315] <err> download_client: Error in recv(), errno 11
[00:02:06.168,243] <err> fota_download: Download client error
[00:02:06.174,652] <inf> dfu_target_mcuboot: MCUBoot image upgrade aborted.
[00:02:06.182,067] <inf> dfu_target_mcuboot: MCUBoot image upgrade aborted.

No problem, since it'll retry on next check-in, but:

[00:02:41.507,446] <inf> fota_download: Refuse fragment, restart with offset
[00:02:42.507,537] <inf> fota_download: Downloading from offset: 0x2b4ca
[00:02:43.202,728] <err> download_client: Unexpected HTTP response: 416 requested range not satisfiable
[00:02:43.212,768] <err> fota_download: Download client error
[00:02:43.219,207] <inf> dfu_target_mcuboot: MCUBoot image upgrade aborted.
[00:02:43.226,593] <inf> dfu_target_mcuboot: MCUBoot image upgrade aborted.

That offset is the size of the image, so the server is always going to reject it.

But instead of resetting its state and trying a full download next time, the client just repeats the range request.

And of course, that fails over and over again.
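
To illustrate the recovery I would expect, here is a minimal sketch in plain C. It is not the SDK's code: the helper name `next_download_offset` and its parameters are hypothetical and are not part of fota_download or dfu_target. It only shows the decision itself: if the previous ranged request came back with HTTP 416, discard the stored offset and start again from zero.

```c
#include <stddef.h>

#define HTTP_RANGE_NOT_SATISFIABLE 416

/*
 * Hypothetical helper (not an SDK API): choose the offset for the next
 * download attempt from the stored resume offset and the HTTP status
 * of the previous attempt.
 */
static size_t next_download_offset(size_t stored_offset, int last_http_status)
{
    /* The server rejected the requested range, so the stored offset is
     * unusable. Fall back to a full download instead of repeating it. */
    if (last_http_status == HTTP_RANGE_NOT_SATISFIABLE) {
        return 0;
    }

    /* Any other outcome: resuming from the stored offset is still valid. */
    return stored_offset;
}
```

With the values from the log above, `next_download_offset(0x2b4ca, 416)` returns 0, so the next attempt would be a full download rather than another rejected range request.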

A reboot fixes it, but unrecoverable FOTA failures are bad news...

  • Hi Mike,
    I'm waiting for the R&D team to take a look. Their initial response is that they suspect the dfu_target library is causing the problem, since that is where the offset comes from.

  • Hi Mike,

    I will continue to help with this ticket.

    I just want you to know that we have not forgotten this ticket. The developers are not done looking into it yet, but we are on it.

    One comment on two things you said earlier:

    "A reboot fixes it, but unrecoverable FOTA failures are bad news..."

    and "It’s the fact that it gets stuck in a loop trying to recover."

    I definitely agree that the device should not enter a loop that it cannot escape. However, software can in general introduce deadlocks, and it is hard to guarantee 100% that you can never get into one of those.

    However, it is a lot easier to guarantee a stable state on a reboot. This is why we have the Watchdog hardware peripheral. If set up properly, the Watchdog should reboot the device if something puts it into a deadlock. Is this a solution you have tried or considered?

  • A watchdog is great, and I have it enabled, but it would not be triggered by firmware that keeps re-attempting a download, because it cannot distinguish that failure mode from, say, a mere lack of cellular signal. The watchdog can only catch low-level failures or those detected by the logic that kicks it. In this case, I have added specific code to spot repeated download failures and reboot the device, but that required a priori knowledge of the problem.
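
A rough sketch of that kind of guard, in plain C, looks something like the following. This is not my actual firmware: the tracker struct, the function name, and the threshold of 3 are all assumptions made up for illustration. The caller would perform the reboot itself when the function returns true.

```c
#include <stdbool.h>

/* Assumed threshold: consecutive failed attempts tolerated before reboot. */
#define MAX_CONSECUTIVE_FOTA_FAILURES 3

/* Hypothetical bookkeeping for FOTA download outcomes. */
struct fota_failure_tracker {
    unsigned int consecutive_failures;
};

/*
 * Record the outcome of one download attempt.
 * Returns true when the caller should reboot the device.
 */
static bool fota_attempt_finished(struct fota_failure_tracker *tracker,
                                  bool success)
{
    if (success) {
        /* Any successful download clears the failure streak. */
        tracker->consecutive_failures = 0;
        return false;
    }

    tracker->consecutive_failures++;
    return tracker->consecutive_failures >= MAX_CONSECUTIVE_FOTA_FAILURES;
}
```

Note that a bare counter like this still cannot tell a stuck offset from a long stretch without signal, so in practice it makes sense to count only attempts where the server actually responded with an error.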

  • That makes sense to me, thanks for the elaboration!
