This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

nrfutil Mesh DFU flash erase timeout - perhaps

Hi.

I am trying to get mesh DFU working and I keep getting "Failed to upgrade target. Error is: Crashed on start packet" from nrfutil.

When debugging the target I see:
<t:     882920>, nrf_mesh_dfu.c,  344, Erase complete (0x8D000)
<t:     882923>, nrf_mesh_dfu.c,  348, Flash idle.
After getting the ".. Crashed on start packet" error.

This happens consistantly (10/10 tries)  when my device page contains a public key.
When using a device page without a public key it sometimes starts. When it does start (maybe 1 in 3-4), I get "Erase complete (0x8D000)" a split second before it usually gives up.

Hence I think there is a timing issue with the current implementation.

I am using S140 v7.0.1 with mesh SDK 4.0.0 and the dfu example running on two nRF52840 DK's.

What can be done to fix the issue?

P.S. why is the nrfutil for mesh stuck on Python 2.7?

Parents
  • Hi Tardis, 

    Even without the public key, have you managed to get any DFU session succeeded ? 

    Have you tried to do DFU with a very small image, for example the Blinky application as in the documentation ? 

    I don't think it's an erase timeout as you can find the flash was idle. 

    I haven't seen the " Crashed on start packet" before. If you still have issue updating the blinky application, please send us your files (bootloader ,dfu application, DFU .zip file of the blinky, device page ). Could you try again on another PC ? 

    We havent got much update of the nRFutil lately so it's still the same as a few years ago and stuck at Python 2.7 .We will do an update in the future. 

  • Yes. But it requires several attempts to get it going.

    I tried sending blinky_nrf52840_xxAA_s140_7.0.1.hex from mesh\bin\blinky and it seemed to go through in the first try. But it is not what I'm trying to do.
    Blinky is not representative of a real mesh application imho.

    I compile the dfu example (debug build) to dfu_nrf52840_xxAA_s140_7.0.1.hex
    So I can have a debug session running and see what is going on during the transfer.

    Then I generate a zip file using
    nrfutil dfu genpkg --application dfu_nrf52840_xxAA_s140_7.0.1.hex --company-id 0x00000059 --application-id 1 --application-version 2 --sd-req 0x00CA --mesh dfu_nrf52840_xxAA_s140_7.0.1.hex.zip

    device_page is generated from:
    {
        "bootloader_config": {
            "bootloader_id": 1,
            "bootloader_version": 1,
            "company_id": 89,
            "application_id": 1,
            "application_version": 1
         }
    }
    Using:
    python device_page_generator.py -d nrf52840_xxAA -sd "s140_7.0.1"


    Two identical DK boards are loaded using:
    nrfjprog -s %1 --chiperase --program s140_nrf52_7.0.1_softdevice.hex --verify s140_nrf52_7.0.1_softdevice.hex
    nrfjprog -s %1 --program mesh_bootloader_serial_gccarmemb_nrf52840_xxAA.hex --verify mesh_bootloader_serial_gccarmemb_nrf52840_xxAA.hex
    nrfjprog -s %1 --program dfu_nrf52840_xxAA_s140_7.0.1.hex --verify dfu_nrf52840_xxAA_s140_7.0.1.hex
    nrfjprog -s %1 --program device_page_nrf52840_xxAA_s140_7.0.1.hex --verify device_page_nrf52840_xxAA_s140_7.0.1.hex

    Using "bootloader_verify.py" I get (on both boards):
    Reading UICR..                  OK.
    Reading Device page..           OK.
    Resetting device..              OK (In application)
    Checking serial connection..    OK.

    I then start debugging one board:
    <t:          1>, main.c,  194, ----- Bluetooth Mesh DFU Example -----
    <t:      13354>, serial_handler_prov.c,  267, Generating encryption keypair...
    <t:      18151>, main.c,  200, Initialization complete!
    <t:      22830>, main.c,  221, Enabling serial interface...
    <t:      22833>, mesh_app_utils.c,   65, Device UUID (raw): xxxxxxxxxxxxxxxxxxxxxxxxx
    <t:      22836>, mesh_app_utils.c,   66, Device UUID : xxxxxxxxxxxxxxxxxxxxxxxxxx
    <t:      22847>, nrf_mesh_dfu.c,  538,     RADIO TX! SLOT 0, count 255, interval: periodic, handle: FFFE
    <t:      22852>, nrf_mesh_dfu.c,  576,     SERIAL TX!
    <t:      22859>, main.c,  229, DFU example started!

    Start the DFU using:
    nrfutil --verbose dfu serial -pkg dfu_nrf52840_xxAA_s140_7.0.1.hex.zip -p COM7 -b 115200 -fc --mesh

    From nrfutil I get:
    Upgrading target on COM7 with DFU package C:\xxxxx\dfu_nrf52840_xxAA_s140_7.0.1.hex.zip. Flow control is enabled.
    Flushing com-port...
    Opened com-port
    Starting DFU upgrade of type 4, SoftDevice size: 0, bootloader size: 0, application size: 122616
    Sending DFU start packet, afterwards we wait for the flash on target to be initialized before continuing.
    PC -> target: 0502aabbccdd
    target -> PC: 0582aabbccdd
    Got echo response
    Sending DFU init packet
    PC -> target: 1378fdff040f96df2d9759000000010002000000
    target -> PC: 16a6045900000001000200000059000000010001000000
    target -> PC: 03847800
    PC -> target: 1378fdff040f96df2d9759000000010002000000
    target -> PC: 03847800
    PC -> target: 1478fcff000096df2d97ffffffffbe77000000000c
    PC -> target: 1478fcff000096df2d97ffffffffbe77000000000c
    PC -> target: 1478fcff000096df2d97ffffffffbe77000000000c
    PC -> target: 1478fcff000096df2d97ffffffffbe77000000000c
    PC -> target: 1478fcff000096df2d97ffffffffbe77000000000c


    Failed to upgrade target. Error is: Crashed on start packet

    Possible causes:
    - bootloader, SoftDevice or application on target does not match the requirements in the DFU package.
    - baud rate or flow control is not the same as in the target bootloader.
    - target is not in DFU mode. If using the SDK examples, press Button 4 and RESET and release both to enter DFU mode.
    Closing serial port...

    And in the Debug Terminal I get:
    <t:    3799440>, nrf_mesh_dfu.c,  405,     New firmware!
    <t:    3799443>, main.c,  109, Requesting DFU transfer with bank at 0x0008D000
    <t:    3799446>, nrf_mesh_dfu.c,  538,     RADIO TX! SLOT 0, count 255, interval: periodic, handle: FFFD
    <t:    3799450>, nrf_mesh_dfu.c,  549, Killing a TX slot prematurely (repeats done: 12).
    <t:    3799454>, nrf_mesh_dfu.c,  576,     SERIAL TX!
    <t:    3803757>, nrf_mesh_dfu.c,  538,     RADIO TX! SLOT 0, count 255, interval: periodic, handle: FFFD
    <t:    3803761>, nrf_mesh_dfu.c,  549, Killing a TX slot prematurely (repeats done: 0).
    <t:    3803765>, nrf_mesh_dfu.c,  576,     SERIAL TX!
    <t:    3947034>, nrf_mesh_dfu.c,  445,     DFU start
    <t:    3947037>, nrf_mesh_dfu.c,  538,     RADIO TX! SLOT 5, count 6, interval: exponential, handle: FFFC
    <t:    4031158>, nrf_mesh_dfu.c,  344, Erase complete (0x8D000)
    <t:    4031161>, nrf_mesh_dfu.c,  348, Flash idle.
    <t:    6830620>, nrf_mesh_dfu.c,  312, ABORT Timeout fired @720453979
    <t:    6830623>, nrf_mesh_dfu.c,  538,     RADIO TX! SLOT 0, count 255, interval: periodic, handle: FFFE
    <t:    6830628>, nrf_mesh_dfu.c,  576,     SERIAL TX!

    The "Failed to upgrade target. Error is: Crashed on start packet" message is received after <t:    3947037> but before <t:    4031158>.
    Which is what leads me to think it's a timeout issue waiting for the flash to be erased.

    I don't have a different PC on hand but what would you think could be the issue?

  • Hi Tardis, 

    It's most likely the UART timeout was too short. Could you please change DEFAULT_SERIAL_PORT_TIMEOUT to 10 (or larger) inside dfu_transport_mesh.py in the nRFutil for mesh source code. 
    https://github.com/NordicSemiconductor/pc-nrfutil/blob/mesh_dfu/nordicsemi/dfu/dfu_transport_mesh.py#L147

    And re-compile nrfutil.exe (or install in python) and test again ? 

  • I've tried setting it to 10.0 - didn't help.

    Then I set it to 100.0 and it started on the 5th try.

    Still not usable and it doesn't seem to have any effect.
    Please get back to me when you have verified a working fix.

  • The problem is that it worked fine here. And this is the first time we got this report on the forum (the tool created 3 years ago) so I'm not sure why. 

    Could you please provide all hex files you used for your DFU so we can try here ? This include the bootloader, softdevice, device page and dfu app. 


    I also attached the files that I used, could you please test. You just need to run the script2.bat to flash all the files. 


    The script for nrfutil is: 


    nrfutil dfu serial -pkg dfu_test100kID2.zip -p COM7 -b 115200 -fc --mesh
    

    Please make sure you re-compile nrfutil after the modification. 

  • Ok I think I've made som progress.

    I was using the same device page hex file with both boards.
    Using a different one on the "DFU sender" it now starts every time.

    But now I get a new error

    <t:    1870209>, nrf_mesh_dfu.c,  538, 	RADIO TX! SLOT 6, count 3, interval: exponential, handle: FFFC
    <t:    1870222>, nrf_mesh_dfu.c,  339, Write complete (0x2003FE90)
    <t:    1870225>, nrf_mesh_dfu.c,  348, Flash idle.
    <t:    1886673>, nrf_mesh_dfu.c,  538, 	RADIO TX! SLOT 7, count 3, interval: exponential, handle: FFFC
    <t:    1886685>, nrf_mesh_dfu.c,  339, Write complete (0x2003FE90)
    <t:    1886689>, nrf_mesh_dfu.c,  348, Flash idle.
    <t:    1903653>, nrf_mesh_dfu.c,  538, 	RADIO TX! SLOT 1, count 3, interval: exponential, handle: FFFC
    <t:    1903666>, nrf_mesh_dfu.c,  339, Write complete (0x2003FE90)
    <t:    1903669>, nrf_mesh_dfu.c,  348, Flash idle.
    <t:    1922053>, nrf_mesh_dfu.c,  398, 	Abort event. Reason: 0x3

    And then it stops receiving before reaching 2%. This happens 5/5 tries.
    It still seems related to a timeout since I can get it to run past that point by pausing a specific point.
    If I press the pause button at this point (line 18)

    C:\Nordic\lontech_config\DFU Test>nrfutil --verbose dfu serial -pkg .\dfu_nrf52840_xxAA_s140_7.0.1_LED2.hex.zip -p COM7 -i 250 -b 115200 -fc --mesh
    Upgrading target on COM7 with DFU package C:\Nordic\lontech_config\DFU Test\dfu_nrf52840_xxAA_s140_7.0.1_LED2.hex.zip. Flow control is enabled.
    Flushing com-port...
    Opened com-port
    Starting DFU upgrade of type 4, SoftDevice size: 0, bootloader size: 0, application size: 122648
    Sending DFU start packet, afterwards we wait for the flash on target to be initialized before continuing.
    PC -> target: 0502aabbccdd
    target -> PC: 0582aabbccdd
    Got echo response
    Sending DFU init packet
    PC -> target: 1378fdff040f214dea9259000000010002000000
    target -> PC: 16a6045900000001000200000059000000e803e8030000
    target -> PC: 03847800
    PC -> target: 1378fdff040f214dea9259000000010002000000
    target -> PC: 03847800
    PC -> target: 1478fcff0000214dea92ffffffffc677000040000c
    target -> PC: 0da2020459000000010002000000
    target -> PC: 03847800

    And then wait for the target "Erase complete" message. Then let it resume.
    Then the the FW goes through.

    I have attached all the files I've used for this test
    DFU Test.zip

  • Hi Tardis,
    I will check the file you provided. 
    But could you try to test using my files here. I forgot to attach it. 

    TestFiles.zip

Reply Children
  • Hi Tardis, 

    Could you try again with ACK_WAIT_TIME set to 5.0 in dfu_transport_mesh.py  and recompile nrfutil


    I had a look at the UART trace and it's pointing out to the long delay when the chip has to erase multiple pages in flash (for the swap bank).

    In my case I'm testing with an image with size of 160kB meaning it's 40 pages. The time it take to erase a page can be up to 89ms. This result a CPU busy time of up to 3.56 seconds. The ACK_WAIT_TIME by default is 0.5 second and the nrfutil will try to send the message again.

    Because the CPU is busy when doing DFU, it seems that the UART packets are hang due to HWFC and it only continue when the CPU finished with flash erase. And after that the nRFUtil is confused and will stop. 
    You can see the screenshot here: 

    The small pulse in the middle is when the nrfutil tried to resend the packet but couldn't finish due to buffer overflow and HWFC stoped it. The time from A1 to A2 is 3.38s matched with the erasing time for 40 pages. 

    I will check with the Mesh team to get a confirmation. Before we can have the official fix from the team, the workaround here is to increase the timeout to 5 seconds so that the CPU have enough time to erase the flash. 5 seconds is enough for an image size of 224kB. The draw back of this is when a packet is dropped on UART (which don't usually happens) it would take 5 seconds before the nrfutil retransmit the packet. 

  • I am not a python guru.
    But wouldn't a combination of tweaking ACK_WAIT_TIME and MAX_RETRIES result in a better solution?

    E.g.
    ACK_WAIT_TIME = 1.0
    MAX_RETRIES = 5
    Would result in a total timeout of 5 sec. but not have to wait the 5 sec. per timeout as you propose?

    But I agree with your finding that it is a nrfutil Mesh DFU flash erase timeout problem.

    I have tried setting ACK_WAIT_TIME as you described and I've also tried setting both ACK_WAIT_TIME and MAX_RETRIES as I described.

    Now it starts every time but it ends with

    <t:    1062241>, nrf_mesh_dfu.c,  339, Write complete (0x2003FE90)
    <t:    1062245>, nrf_mesh_dfu.c,  348, Flash idle.
    <t:    1071939>, nrf_mesh_dfu.c,  538, 	RADIO TX! SLOT 1, count 3, interval: exponential, handle: FFFC
    <t:    1071952>, nrf_mesh_dfu.c,  339, Write complete (0x2003FE90)
    <t:    1071955>, nrf_mesh_dfu.c,  348, Flash idle.
    <t:    1080737>, nrf_mesh_dfu.c,  538, 	RADIO TX! SLOT 2, count 3, interval: exponential, handle: FFFC
    <t:    1080750>, nrf_mesh_dfu.c,  339, Write complete (0x2003FE90)
    <t:    1080754>, nrf_mesh_dfu.c,  348, Flash idle.
    <t:    1089672>, nrf_mesh_dfu.c,  398, 	Abort event. Reason: 0x3

    before reaching 2%.

    After toying with it a few minutes I tried to change SEND_START_DFU_WAIT_TIME to a value greater than the expected flash erase time. And setting it to 4.0 seems to have helped a great deal.
    It is at 5% right now and it started the first time.

  • Hi Tardis,

    I don't think setting ACK_WAIT_TIME to 1 and MAX_RETRIES to 5 would have the same effect. The problem is that if retry happens when the bootloader is erasing  it will cause a DFU fail. 

    If you are updating the serial node with different application ID to the .zip image, you may run to another issue that we have found and I provided a fix here: 

    https://devzone.nordicsemi.com/f/nordic-q-a/56151/mesh-dfu-relay-problem

    I'm sorry that the DFU for mesh has many issues. We are planning to have a large bug fix for this in the next few weeks. Hopefully all the issues will be solved in the official branch. 
     

  • Thanks Hung Bui.

    Too bad. Would have been nice to be able to remotely update our next test installation.
    I'll will put it aside for now and await your update.

Related