mesh dfu timeout

Hello

I'm exploring the mesh DFU transfer and getting some strange results.

Here is the setup

1 distributor, 1 target

On the distributor I start the transfer with the "mesh models dfd start 0 0" command.

It completes in about 40 minutes and the image is applied, but I get a timeout message on the distributor side.

However, on the target I see a message from MCUboot that the new image was applied successfully, and it is actually working just fine.

Logs from the distributor and the target are below.

Another question

The "progress" in the JSON below is 50, which I think is because the function that calculates the progress divides by 2:

The function cmd_dfd_receivers_get() in dfd.c has this line:
uint8_t progress = bt_mesh_dfu_cli_progress(&dfd_srv->dfu) / 2;
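
If I read the code right (just my guess, not verified), the Receivers List message only has room for the progress in 2 % steps, so the shell prints the raw 0..50 value rather than a percentage. A tiny sketch of the conversion I assume is happening:

#include <stdint.h>

/* Sketch only: convert the receivers-get "progress" value (0..50, i.e. 2 % steps,
 * if my reading of the divide-by-2 above is right) back to a percentage. */
static uint8_t receivers_progress_to_percent(uint8_t wire_value)
{
    return (uint8_t)(wire_value * 2); /* 50 -> 100 % */
}
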
However, if I use the "mesh models dfu cli progress" command, it returns:
DFU progress: 100 %
Bug?
Thank you
Andy
--------------------------------
Logs

Distributor

rtt:~$ [00:40:51.061,126] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
rtt:~$ [00:40:51.061,157] <dbg> bt_mesh_dfu_cli: dfu_complete:
rtt:~$ [00:40:51.061,187] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 0, phase: 1, apply: 1
rtt:~$ Distribution phase changed to Applying Update
rtt:~$ [00:40:51.061,218] <dbg> bt_mesh_dfu_cli: apply:
rtt:~$ rtt:~$ [00:40:51.061,248] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 1 targets
rtt:~$ rtt:~$ [00:40:51.271,606] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
rtt:~$ [00:40:51.271,636] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x01a2 receiving transfer
rtt:~$ rtt:~$ [00:41:00.282,928] <dbg> bt_mesh_blob_cli: retry_timeout: 5
rtt:~$ rtt:~$ [00:41:09.441,864] <dbg> bt_mesh_blob_cli: retry_timeout: 4
rtt:~$ rtt:~$ [00:41:18.604,400] <dbg> bt_mesh_blob_cli: retry_timeout: 3
rtt:~$ rtt:~$ [00:41:27.766,143] <dbg> bt_mesh_blob_cli: retry_timeout: 2
rtt:~$ rtt:~$ [00:41:36.924,011] <dbg> bt_mesh_blob_cli: retry_timeout: 1
rtt:~$ [00:41:36.924,041] <dbg> bt_mesh_blob_cli: retry_timeout: Transfer timed out.
rtt:~$ [00:41:36.924,041] <dbg> bt_mesh_blob_cli: drop_remaining_targets:
rtt:~$ [00:41:36.924,072] <wrn> bt_mesh_blob_cli: Dropping 0x01a2: 9
rtt:~$ [00:41:36.924,072] <err> bt_mesh_dfu_cli: Target 0x01a2 failed: 3
rtt:~$ [00:41:36.924,102] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
rtt:~$ [00:41:36.924,102] <dbg> bt_mesh_dfu_cli: dfu_failed: 3
rtt:~$ [00:41:36.924,133] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 3, phase: 3, apply: 1
rtt:~$ Distribution phase changed to Failed
rtt:~$ rtt:~$ mesh models dfd receivers-get 0 1
{
"target_cnt": 1,
"targets": {
"0": { "blob_addr": 418, "phase": 6, "status": 3, "blob_status": 9, "progress": 50, "img_idx": 0 }
}
}

-----------------------------------------

Target

*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
Current image version: 2.1.0+8

~$ Firmware upload started

~$ Firmware upload succeeded
Applying the new firmware
*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
I: Starting bootloader
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x1, image_ok=0x3
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Image index: 0, Swap type: test
I: Starting swap using move algorithm.
I: Bootloader chainload address offset: 0x8000
I: Jumping to the first image slot
*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
Current image version: 2.1.0+9

  • Can you provide the full log?

    See the first log in the ticket. It is from a distribution to 8 nodes with the "apply" option. As soon as the transfer to one of the nodes completed, the image was applied. That caused a timeout and the whole batch stopped.

    I tried timeout 10; I will try something bigger and let you know.

    Here is the status 10 example:

    mesh models dfd receivers-get 0 20
    {
    "target_cnt": 8,
    "targets": {
    "0": { "blob_addr": 407, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "1": { "blob_addr": 406, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "2": { "blob_addr": 992, "phase": 10, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "3": { "blob_addr": 409, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "4": { "blob_addr": 418, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "5": { "blob_addr": 995, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "6": { "blob_addr": 214, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "7": { "blob_addr": 419, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 }
    }
    }

    Thank you

  • A new problem

    I did a firmware distribution to 1 device and the transfer succeeded, so I know the process works fine.

    Then I tried to distribute to 8 devices; see the node list above. Node 992 is offline.
    Its phase is listed as 10, which does not exist in the phase enumeration, but that's the least of my worries.

    After I sent the "mesh models dfd start 0 0 0 false 7 120" command, I ran "mesh models dfd receivers-get 0 20" periodically, but the "progress" field remained at 0 for over 5 minutes, at which point I canceled the distribution with the "mesh models dfd cancel" command.

    After that, all 8 of my target nodes fail to boot.

    Messages on the target node:

    *** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
     Recovering the firmware upload
    *** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
    I: Starting bootloader
    I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
    I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
    I: Boot source: none
    I: Image index: 0, Swap type: none
    I: Bootloader chainload address offset: 0x8000
    I: Jumping to the first image slot
    *** Booting nRF Connect SDK v3.5.99-ncs1-1 ***

    The messages repeat.

    In the image recovery handler of the target sample there is this comment:

        /* TODO: Need to recover the effect. */
    Why is this handler called after the distribution was cancelled?
    How do I fix this?
    I am curious: did you test batch distribution? With how many targets? Did you test with some nodes offline? Did you test with more targets than the default 8?
    It makes me nervous to deploy this feature if a simple cancel bricks all the devices. I hope this is something I am doing wrong and that there is a fix.
    Thanks

  • Phase 10 is the DFU phase, not the DFD (distribution) phase; it is the "unknown" phase, which means that no response was ever received from that node:

    - https://github.com/zephyrproject-rtos/zephyr/blob/39917ae5e6423fe74edce399b927ddff38670817/include/zephyr/bluetooth/mesh/dfu.h#L73-L74

    - https://github.com/zephyrproject-rtos/zephyr/blob/39917ae5e6423fe74edce399b927ddff38670817/include/zephyr/bluetooth/mesh/dfu_cli.h#L63-L64

    Progress 0 is because the node with address 992 didn't respond to the Firmware Update Start message, and per the DFU procedure the distributor must get a response from all nodes before it starts sending the BLOB.
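
    For reference, assuming the enum ordering in the linked dfu.h is unchanged, the 10 in the receivers-get output corresponds to:

    #include <zephyr/bluetooth/mesh/dfu.h>

    /* Sketch only: "phase": 10 in the receivers-get JSON maps to the last
     * enumerator, i.e. no status was ever received from that node. */
    enum bt_mesh_dfu_phase phase = BT_MESH_DFU_PHASE_UNKNOWN; /* value 10 */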

    timeout_base 120 means that the actual timeout for each DFU phase is 10 * 1000 * (120 + 2) + 7 * 100 = 1,220,700 ms, i.e. roughly 20 minutes.

    The distributor will repeat the request CONFIG_BT_MESH_BLOB_CLI_BLOCK_RETRIES times, which is 5 by default, so it will retry roughly every 4 minutes. The reason I asked to increase timeout_base is that the timeout was most likely shorter than the time the devices need to flash the new firmware; the timeout has to be adjusted to the setup.
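
    As a sanity check of that arithmetic, here is a standalone sketch (my own helper, not the actual Zephyr code; it just encodes the formula quoted above):

    #include <stdint.h>

    /* Timeout per DFU phase, per the formula above:
     * 10 s * (timeout_base + 2) + 100 ms * TTL. */
    static uint32_t dfu_phase_timeout_ms(uint16_t timeout_base, uint8_t ttl)
    {
        return 10U * 1000U * (timeout_base + 2U) + 100U * ttl;
    }

    /* dfu_phase_timeout_ms(120, 7) == 1220700 ms, i.e. ~20 minutes; spread over
     * the 5 default block retries, that is roughly one retry every 4 minutes,
     * matching the numbers above. */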

    Also, since this is a unicast transfer, every chunk transmission requires a confirmation from the node, which may delay the transfer even more. I recommend using a multicast transfer for more than 2 nodes. For this, they need to subscribe all DFU and BLOB models to the same group address and set the `group` field to that address in the `mesh models dfd start` command.
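
    For example (a sketch only; I am assuming the same argument order as the "mesh models dfd start 0 0 0 false 7 120" invocation earlier in this thread, and 0xC000 is just an example group address that every target's DFU and BLOB models would have to be subscribed to):

    mesh models dfd start 0 0 0xc000 false 7 120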

    The comment `/* TODO: Need to recover the effect. */` means that the image effect (the `img_effect` variable), which is set in the `dfu_meta_check` function, is not stored anywhere. This code is called after a reboot, when the DFU model recovers an abandoned DFU transfer, and the effect is needed to make the decision in the apply phase (see the `dfu_apply` function in the same sample).
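
    A minimal sketch of one way to persist it, assuming the Zephyr settings subsystem is enabled (CONFIG_SETTINGS=y); the "dfu/effect" key and the helper names below are my own illustration, only `img_effect` and `dfu_meta_check` come from the sample:

    #include <errno.h>
    #include <zephyr/settings/settings.h>
    #include <zephyr/bluetooth/mesh/dfu.h>

    static enum bt_mesh_dfu_effect img_effect;

    /* Restore the stored effect at boot so a recovered DFU transfer can still
     * make the right decision in the apply phase. */
    static int dfu_effect_set(const char *name, size_t len,
                              settings_read_cb read_cb, void *cb_arg)
    {
        if (len != sizeof(img_effect)) {
            return -EINVAL;
        }

        return read_cb(cb_arg, &img_effect, sizeof(img_effect)) < 0 ? -EIO : 0;
    }

    SETTINGS_STATIC_HANDLER_DEFINE(dfu_effect, "dfu/effect", NULL,
                                   dfu_effect_set, NULL, NULL);

    /* Call this from dfu_meta_check() once img_effect has been decided. */
    static void dfu_effect_store(void)
    {
        settings_save_one("dfu/effect", &img_effect, sizeof(img_effect));
    }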

    If the target devices start to reboot cyclically, I suggest checking whether the `CONFIG_RESET_ON_FATAL_ERROR` Kconfig option is enabled in `build/target/zephyr/.config` and, if so, disabling it explicitly in prj.conf (`CONFIG_RESET_ON_FATAL_ERROR=n`). Then run the devices again.

    I would still like to see the complete logs from the distributor and the target nodes, like the first RTT log that you attached.

  • Regarding the multicast transfer

    For this, they need to subscribe all DFU and BLOB models to the same group address and set the `group` field to that address in the `mesh models dfd start` command.

    I assume you are referring to the target nodes?

    Progress 0 is because the node with address 992 didn't respond to the Firmware Update Start message, and per the DFU procedure the distributor must get a response from all nodes before it starts sending the BLOB.

    This is unfortunate. On a live mesh network with many nodes, there will always be some node that does not respond, for a variety of reasons. If the whole process is blocked by a single node, the feature is not very usable in a real-life situation.

    I will provide the logs you requested; it may take a while.

    Thanks a lot for your help
