mesh dfu timeout

Hello

I'm exploring the mesh DFU transfer and getting some strange results

Here is the setup

1 distributor, 1 target

On the distributor I start the transfer with the mesh models dfd start 0 0 command

In about 40 minutes it completes and the image is applied, but I get a timeout message on the distributor side

However, on the target I see a message from MCUboot that the new image was applied successfully - and it is actually working just fine

Logs from the distributor and the target are below

Another question

The progress in the JSON below is "50" which I think is because the function that calculates progress divides by 2:

The function cmd_dfd_receivers_get() in dfd.c has this line:
uint8_t progress = bt_mesh_dfu_cli_progress(&dfd_srv->dfu) / 2;
However, if I use the mesh models dfu cli progress command, it returns:
DFU progress: 100 %
Bug?
Thank you
Andy
--------------------------------
Logs

Distributor

rtt:~$ [00:40:51.061,126] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
rtt:~$ [00:40:51.061,157] <dbg> bt_mesh_dfu_cli: dfu_complete:
rtt:~$ [00:40:51.061,187] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 0, phase: 1, apply: 1
rtt:~$ Distribution phase changed to Applying Update
rtt:~$ [00:40:51.061,218] <dbg> bt_mesh_dfu_cli: apply:
rtt:~$ rtt:~$ [00:40:51.061,248] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 1 targets
rtt:~$ rtt:~$ [00:40:51.271,606] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
rtt:~$ [00:40:51.271,636] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x01a2 receiving transfer
rtt:~$ rtt:~$ [00:41:00.282,928] <dbg> bt_mesh_blob_cli: retry_timeout: 5
rtt:~$ rtt:~$ [00:41:09.441,864] <dbg> bt_mesh_blob_cli: retry_timeout: 4
rtt:~$ rtt:~$ [00:41:18.604,400] <dbg> bt_mesh_blob_cli: retry_timeout: 3
rtt:~$ rtt:~$ [00:41:27.766,143] <dbg> bt_mesh_blob_cli: retry_timeout: 2
rtt:~$ rtt:~$ [00:41:36.924,011] <dbg> bt_mesh_blob_cli: retry_timeout: 1
rtt:~$ [00:41:36.924,041] <dbg> bt_mesh_blob_cli: retry_timeout: Transfer timed out.
rtt:~$ [00:41:36.924,041] <dbg> bt_mesh_blob_cli: drop_remaining_targets:
rtt:~$ [00:41:36.924,072] <wrn> bt_mesh_blob_cli: Dropping 0x01a2: 9
rtt:~$ [00:41:36.924,072] <err> bt_mesh_dfu_cli: Target 0x01a2 failed: 3
rtt:~$ [00:41:36.924,102] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
rtt:~$ [00:41:36.924,102] <dbg> bt_mesh_dfu_cli: dfu_failed: 3
rtt:~$ [00:41:36.924,133] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 3, phase: 3, apply: 1
rtt:~$ Distribution phase changed to Failed
rtt:~$ rtt:~$ mesh models dfd receivers-get 0 1
{
"target_cnt": 1,
"targets": {
"0": { "blob_addr": 418, "phase": 6, "status": 3, "blob_status": 9, "progress": 50, "img_idx": 0 }
}
}

-----------------------------------------

Target

*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
Current image version: 2.1.0+8

~$ Firmware upload started

~$ Firmware upload succeeded
Applying the new firmware
*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
I: Starting bootloader
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x1, image_ok=0x3
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Image index: 0, Swap type: test
I: Starting swap using move algorithm.
I: Bootloader chainload address offset: 0x8000
I: Jumping to the first image slot
*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
Current image version: 2.1.0+9

  • Hi, 

    Are you using the unmodified samples or your own?

    Please check if you call bt_mesh_dfu_srv_applied():

    https://github.com/nrfconnect/sdk-nrf/blob/dece2a7f6c36674e2714b8dad7294c911548bb62/samples/bluetooth/mesh/dfu/common/src/dfu_target.c#L268-L278

    https://github.com/nrfconnect/sdk-nrf/blob/dece2a7f6c36674e2714b8dad7294c911548bb62/samples/bluetooth/mesh/dfu/target/src/main.c#L132

    In the new image, are you changing the node's composition data (like the number of models and elements)? I will check if we can reproduce this. Are you changing any timeouts manually?

    "50" is not a bug. The output in the JSON is formatted in the spec units, where transfer progress is reported in 2 percent steps. In other words, the value in the JSON is not a percentage but the spec-defined format (see the small sketch at the end of this reply).

    -Amanda H.
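
    A minimal sketch of that relationship (assuming the receivers-get JSON field carries the raw spec value, i.e. progress in 2 % steps; the helper name is illustrative, not part of the SDK):

    #include <stdint.h>

    /* Convert the value reported by "mesh models dfd receivers-get" back to a
     * percentage. A JSON value of 50 corresponds to the 100 % printed by
     * "mesh models dfu cli progress". */
    static inline uint8_t dfd_progress_to_percent(uint8_t spec_units)
    {
        return (uint8_t)(spec_units * 2U);
    }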

  • I am using the unmodified samples you provided. Yes, bt_mesh_dfu_srv_applied() is called on the target side.

    No composition changes in the image, just the version number.

    Tried with and without a custom timeout, no difference. Image is transferred, applied and runs on the target just fine, but on the distributor side there is an error after the "apply" step. See logs above

    Also, what is phase 10? I saw it several times. It's not defined in the headers. An uninitialized variable?

    Another quirk I'd like to resolve is why it fails to proceed with the batch transfer if one of the target nodes is offline.

    Thanks for clarifying the "50" progress thing

  • Phase 10 is the DFU phase, not the DFD phase. It is the "unknown" phase, which means that no response was ever received from that node:

    - https://github.com/zephyrproject-rtos/zephyr/blob/39917ae5e6423fe74edce399b927ddff38670817/include/zephyr/bluetooth/mesh/dfu.h#L73-L74

    - https://github.com/zephyrproject-rtos/zephyr/blob/39917ae5e6423fe74edce399b927ddff38670817/include/zephyr/bluetooth/mesh/dfu_cli.h#L63-L64

    Progress 0 is because the node with address 992 didn't respond to the Firmware Update Start message; per the DFU procedure, the distributor must get a response from all nodes before it starts sending the BLOB.

    timeout_base 120 means that the actual timeout for each DFU phase is 10 * 1000 * (120 + 2) + 7 * 100 ms, i.e. roughly 20 minutes (with TTL 7).

    The distributor will repeat the request CONFIG_BT_MESH_BLOB_CLI_BLOCK_RETRIES times, which is 5 by default, so it will repeat every 4 minutes. The reason I asked to increase timeout_base is that the timeout was most likely shorter than the time the devices need to flash the new firmware. You need to adjust the timeout based on your setup.
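
    A sketch of that arithmetic, assuming the timeout formula quoted above (10 s * (timeout_base + 2) + 100 ms * TTL):

    #include <stdint.h>

    /* BLOB client timeout as described above. With timeout_base = 120 and
     * TTL = 7 this gives 1,220,700 ms, i.e. roughly 20 minutes; spread over
     * CONFIG_BT_MESH_BLOB_CLI_BLOCK_RETRIES (5 by default) retries, the
     * request is repeated roughly every 4 minutes. */
    static uint32_t blob_cli_timeout_ms(uint16_t timeout_base, uint8_t ttl)
    {
        return 10U * 1000U * (timeout_base + 2U) + 100U * ttl;
    }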

     

    Also, since it is a unicast transfer, every chunk transmission will require confirmation from the node, which may delay the transfer even more. I recommend using a multicast transfer when there are more than 2 nodes. For this, they need to subscribe all DFU and blob models to the same address and set `group` field to that address in `mesh models dfd start` command.

     

    The comment `/* TODO: Need to recover the effect. */` means that the image effect (the `img_effect` variable), which is set in the `dfu_meta_check` function, is not stored anywhere. This code is called after a reboot, when the DFU model recovers an abandoned DFU transfer, and the effect is needed to take a decision in the apply phase (see the `dfu_apply` function in the same sample).
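
    A purely illustrative sketch of one way to address that TODO (the settings key and function name are hypothetical, and a matching settings handler would be needed to load the value back after the reboot):

    #include <zephyr/settings/settings.h>
    #include <zephyr/bluetooth/mesh/dfu.h>

    /* Set in dfu_meta_check(); persist it so it can be restored before
     * dfu_apply() runs when the recovered transfer reaches the apply phase. */
    static enum bt_mesh_dfu_effect img_effect;

    static int dfu_effect_store(void)
    {
        return settings_save_one("dfu_sample/effect",
                                 &img_effect, sizeof(img_effect));
    }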

     

    If the target devices start to reboot cyclically, I suggest checking whether the `CONFIG_RESET_ON_FATAL_ERROR` Kconfig option is enabled in `build/target/zephyr/.config` and disabling it explicitly in prj.conf (`CONFIG_RESET_ON_FATAL_ERROR=n`). Then run the devices again.

     

    I would still like to see the complete logs from the distributor and target nodes, like the first RTT log that you attached.

  • Regarding the multicast transfer:

    For this, they need to subscribe all DFU and blob models to the same address and set `group` field to that address in `mesh models dfd start` command.

    I assume you are referring to the target nodes?

    Progress 0 is because the node with address 992 didn't respond to the Firmware Update Start message; per the DFU procedure, the distributor must get a response from all nodes before it starts sending the BLOB.

    This is unfortunate. On a live mesh network with many nodes there will always be a node that does not respond, for a variety of reasons. If the whole process is blocked by just one node, that makes it not very usable in a real-life situation.

    I will provide the logs you requested. It may take a while.

    Thanks a lot for your help

  • Here is the log from the distributor right before the image finished transferring

    Questions

    1. What do the warnings about "no free slots for segmented messages" mean? Is it something that needs to be addressed? Should the number of slots be increased?

    2. Timeout base was set to 10. Still got an error applying on the distributor side. On the target side the image was applied and is running 

    Thoughts?

    Thank you

    rtt:~$ [05:11:41.929,473] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 8 targets
    rtt:~$ rtt:~$ [05:11:42.036,682] <dbg> bt_mesh_blob_cli: handle_block_status: status: 0 block: 98 encoding: 1
    rtt:~$ [05:11:42.036,712] <dbg> bt_mesh_blob_cli: rx_block_status: 0x0197: block: 98 status: 0
    rtt:~$ [05:11:42.036,712] <dbg> bt_mesh_blob_cli: rx_block_status: Target 0x0197 received all chunks
    rtt:~$ [05:11:42.036,743] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0197, pending: 8
    rtt:~$ [05:11:42.059,356] <dbg> bt_mesh_blob_cli: handle_block_status: status: 0 block: 98 encoding: 1
    rtt:~$ [05:11:42.059,387] <dbg> bt_mesh_blob_cli: rx_block_status: 0x0196: block: 98 status: 0
    rtt:~$ [05:11:42.059,387] <dbg> bt_mesh_blob_cli: rx_block_status: Target 0x0196 received all chunks
    rtt:~$ [05:11:42.059,417] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0196, pending: 7
    rtt:~$ [05:11:42.150,360] <dbg> bt_mesh_blob_cli: handle_block_status: status: 0 block: 98 encoding: 1
    rtt:~$ [05:11:42.150,390] <dbg> bt_mesh_blob_cli: rx_block_status: 0x03e5: block: 98 status: 0
    rtt:~$ [05:11:42.150,421] <dbg> bt_mesh_blob_cli: rx_block_status: Target 0x03e5 received all chunks
    rtt:~$ [05:11:42.150,421] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e5, pending: 6
    rtt:~$ [05:11:42.162,811] <dbg> bt_mesh_blob_cli: handle_block_status: status: 0 block: 98 encoding: 1
    rtt:~$ rtt:~$ [05:11:42.162,811] <dbg> bt_mesh_blob_cli: rx_block_status: 0x0199: block: 98 status: 0
    rtt:~$ [05:11:42.162,841] <dbg> bt_mesh_blob_cli: rx_block_status: Target 0x0199 received all chunks
    rtt:~$ [05:11:42.162,841] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0199, pending: 5
    rtt:~$ [05:11:42.222,076] <dbg> bt_mesh_blob_cli: handle_block_status: status: 0 block: 98 encoding: 1
    rtt:~$ [05:11:42.222,106] <dbg> bt_mesh_blob_cli: rx_block_status: 0x01a2: block: 98 status: 0
    rtt:~$ [05:11:42.222,106] <dbg> bt_mesh_blob_cli: rx_block_status: Target 0x01a2 received all chunks
    rtt:~$ [05:11:42.222,137] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a2, pending: 4
    rtt:~$ [05:11:42.265,563] <dbg> bt_mesh_blob_cli: handle_block_status: status: 0 block: 98 encoding: 1
    rtt:~$ [05:11:42.265,594] <dbg> bt_mesh_blob_cli: rx_block_status: 0x03e3: block: 98 status: 0
    rtt:~$ [05:11:42.265,594] <dbg> bt_mesh_blob_cli: rx_block_status: Target 0x03e3 received all chunks
    rtt:~$ [05:11:42.265,625] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e3, pending: 3
    rtt:~$ [05:11:42.356,536] <dbg> bt_mesh_blob_cli: handle_block_status: status: 0 block: 98 encoding: 1
    rtt:~$ rtt:~$ [05:11:42.356,567] <dbg> bt_mesh_blob_cli: rx_block_status: 0x00d6: block: 98 status: 0
    rtt:~$ [05:11:42.356,597] <dbg> bt_mesh_blob_cli: rx_block_status: Target 0x00d6 received all chunks
    rtt:~$ [05:11:42.356,597] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x00d6, pending: 2
    rtt:~$ [05:11:42.400,756] <dbg> bt_mesh_blob_cli: handle_block_status: status: 0 block: 98 encoding: 1
    rtt:~$ [05:11:42.400,787] <dbg> bt_mesh_blob_cli: rx_block_status: 0x01a0: block: 98 status: 0
    rtt:~$ [05:11:42.400,787] <dbg> bt_mesh_blob_cli: rx_block_status: Target 0x01a0 received all chunks
    rtt:~$ [05:11:42.400,817] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a0, pending: 1
    rtt:~$ [05:11:42.400,848] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
    rtt:~$ [05:11:42.400,848] <dbg> bt_mesh_blob_cli: block_check_end:
    rtt:~$ [05:11:42.400,878] <dbg> bt_mesh_blob_cli: block_check_end: No more missing chunks for block 98
    rtt:~$ [05:11:42.400,970] <dbg> bt_mesh_blob_cli: confirm_transfer:
    rtt:~$ [05:11:42.401,000] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 8 targets
    rtt:~$ rtt:~$ [05:11:42.569,824] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.639,770] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.643,768] <dbg> bt_mesh_blob_cli: handle_xfer_status: status: 0 push phase: 4 f7f9a7fa30859b6a
    rtt:~$ [05:11:42.643,798] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0197, pending: 8
    rtt:~$ [05:11:42.716,674] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.761,138] <dbg> bt_mesh_blob_cli: handle_xfer_status: status: 0 push phase: 4 f7f9a7fa30859b6a
    rtt:~$ [05:11:42.761,138] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0196, pending: 7
    rtt:~$ [05:11:42.801,971] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.835,601] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.839,111] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.876,678] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.897,613] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.927,093] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.970,275] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.971,496] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:42.989,379] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ rtt:~$ [05:11:43.002,807] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ rtt:~$ [05:11:43.040,893] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:43.082,519] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:43.103,149] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:43.114,624] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:43.160,095] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ [05:11:43.280,578] <dbg> bt_mesh_blob_cli: handle_xfer_status: status: 0 push phase: 4 f7f9a7fa30859b6a
    rtt:~$ [05:11:43.280,609] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e5, pending: 6
    rtt:~$ [05:11:43.350,433] <dbg> bt_mesh_blob_cli: handle_xfer_status: status: 0 push phase: 4 f7f9a7fa30859b6a
    rtt:~$ [05:11:43.350,433] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0199, pending: 5
    rtt:~$ [05:11:43.616,241] <dbg> bt_mesh_blob_cli: handle_xfer_status: status: 0 push phase: 4 f7f9a7fa30859b6a
    rtt:~$ [05:11:43.616,271] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a2, pending: 4
    rtt:~$ [05:11:43.660,980] <wrn> bt_mesh_transport: No free slots for new incoming segmented messages
    rtt:~$ rtt:~$ [05:11:43.712,341] <dbg> bt_mesh_blob_cli: handle_xfer_status: status: 0 push phase: 4 f7f9a7fa30859b6a
    rtt:~$ [05:11:43.712,371] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e3, pending: 3
    rtt:~$ rtt:~$ [05:11:43.828,063] <dbg> bt_mesh_blob_cli: handle_xfer_status: status: 0 push phase: 4 f7f9a7fa30859b6a
    rtt:~$ [05:11:43.828,094] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x00d6, pending: 2
    rtt:~$ [05:11:44.346,954] <dbg> bt_mesh_blob_cli: handle_xfer_status: status: 0 push phase: 4 f7f9a7fa30859b6a
    rtt:~$ [05:11:44.346,954] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a0, pending: 1
    rtt:~$ [05:11:44.346,984] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
    rtt:~$ [05:11:44.347,015] <dbg> bt_mesh_blob_cli: end: 1
    rtt:~$ [05:11:44.347,015] <dbg> bt_mesh_dfu_cli: refresh:
    rtt:~$ [05:11:44.347,045] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 8 targets
    rtt:~$ [05:11:44.537,017] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 4, cur state: 2
    rtt:~$ [05:11:44.537,048] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x0197 receiving transfer
    rtt:~$ [05:11:44.537,048] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0197, pending: 8
    rtt:~$ [05:11:44.576,599] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 4, cur state: 2
    rtt:~$ rtt:~$ [05:11:44.576,629] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x0196 receiving transfer
    rtt:~$ [05:11:44.576,629] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0196, pending: 7
    rtt:~$ [05:11:44.639,038] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 4, cur state: 2
    rtt:~$ [05:11:44.639,038] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x03e5 receiving transfer
    rtt:~$ [05:11:44.639,068] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e5, pending: 6
    rtt:~$ rtt:~$ [05:11:44.805,175] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 4, cur state: 2
    rtt:~$ [05:11:44.805,206] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x0199 receiving transfer
    rtt:~$ [05:11:44.805,206] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0199, pending: 5
    rtt:~$ [05:11:44.937,713] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 4, cur state: 2
    rtt:~$ [05:11:44.937,713] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x01a2 receiving transfer
    rtt:~$ [05:11:44.937,744] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a2, pending: 4
    rtt:~$ [05:11:45.050,140] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 4, cur state: 2
    rtt:~$ [05:11:45.050,140] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x03e3 receiving transfer
    rtt:~$ [05:11:45.050,170] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e3, pending: 3
    rtt:~$ [05:11:45.217,559] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 4, cur state: 2
    rtt:~$ [05:11:45.217,590] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x00d6 receiving transfer
    rtt:~$ [05:11:45.217,620] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x00d6, pending: 2
    rtt:~$ [05:11:45.292,846] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 4, cur state: 2
    rtt:~$ rtt:~$ [05:11:45.292,877] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x01a0 receiving transfer
    rtt:~$ [05:11:45.292,877] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a0, pending: 1
    rtt:~$ [05:11:45.292,907] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
    rtt:~$ [05:11:45.292,938] <dbg> bt_mesh_dfu_cli: dfu_complete:
    rtt:~$ [05:11:45.292,938] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 0, phase: 1, apply: 0
    rtt:~$ Distribution phase changed to Transfer Success

    rtt:~$ rtt:~$ mesh models dfd apply
    { "status": 0, "phase": 3, "group": 0, "app_idx": 0, "ttl": 7, "timeout_base": 20, "xfer_mode": 1, "apply": 0, "slot_idx": 0 }
    rtt:~$ [05:44:02.960,601] <dbg> bt_mesh_dfu_cli: apply:
    rtt:~$ [05:44:02.960,632] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 8 targets
    rtt:~$ Distribution phase changed to Applying Update
    rtt:~$ [05:44:03.086,822] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
    rtt:~$ [05:44:03.086,822] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x0197 receiving transfer
    rtt:~$ [05:44:03.149,230] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
    rtt:~$ [05:44:03.149,261] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x0196 receiving transfer
    rtt:~$ [05:44:03.241,210] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
    rtt:~$ [05:44:03.241,241] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x03e5 receiving transfer
    rtt:~$ [05:44:03.322,570] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
    rtt:~$ [05:44:03.322,601] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x0199 receiving transfer
    rtt:~$ rtt:~$ [05:44:03.558,776] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
    rtt:~$ [05:44:03.558,776] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x01a2 receiving transfer
    rtt:~$ [05:44:03.630,737] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
    rtt:~$ [05:44:03.630,737] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x03e3 receiving transfer
    rtt:~$ [05:44:03.715,850] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
    rtt:~$ [05:44:03.715,850] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x00d6 receiving transfer
    rtt:~$ [05:44:03.818,969] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
    rtt:~$ [05:44:03.819,000] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x01a0 receiving transfer
    rtt:~$ rtt:~$ [05:44:47.864,501] <dbg> bt_mesh_blob_cli: retry_timeout: 5
    rtt:~$ [05:44:47.931,579] <dbg> bt_mesh_dfu_cli: handle_status: 2 phase: 0, cur state: 4
    rtt:~$ [05:44:47.931,579] <dbg> bt_mesh_dfu_cli: handle_status: Response received with Idle phase
    rtt:~$ [05:44:47.931,610] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0197, pending: 8
    rtt:~$ [05:44:47.977,020] <dbg> bt_mesh_dfu_cli: handle_status: 2 phase: 0, cur state: 4
    rtt:~$ [05:44:47.977,050] <dbg> bt_mesh_dfu_cli: handle_status: Response received with Idle phase
    rtt:~$ [05:44:47.977,050] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0196, pending: 7
    rtt:~$ [05:44:48.038,604] <dbg> bt_mesh_dfu_cli: handle_status: 2 phase: 0, cur state: 4
    rtt:~$ [05:44:48.038,604] <dbg> bt_mesh_dfu_cli: handle_status: Response received with Idle phase
    rtt:~$ [05:44:48.038,635] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e5, pending: 6
    rtt:~$ [05:44:48.120,971] <dbg> bt_mesh_dfu_cli: handle_status: 2 phase: 0, cur state: 4
    rtt:~$ [05:44:48.120,971] <dbg> bt_mesh_dfu_cli: handle_status: Response received with Idle phase
    rtt:~$ [05:44:48.121,002] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0199, pending: 5
    rtt:~$ rtt:~$ [05:44:48.143,829] <dbg> bt_mesh_dfu_cli: handle_status: 2 phase: 0, cur state: 4
    rtt:~$ [05:44:48.143,829] <dbg> bt_mesh_dfu_cli: handle_status: Response received with Idle phase
    rtt:~$ [05:44:48.143,859] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a2, pending: 4
    rtt:~$ rtt:~$ [05:44:48.211,273] <dbg> bt_mesh_dfu_cli: handle_status: 2 phase: 0, cur state: 4
    rtt:~$ [05:44:48.211,303] <dbg> bt_mesh_dfu_cli: handle_status: Response received with Idle phase
    rtt:~$ [05:44:48.211,303] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e3, pending: 3
    rtt:~$ [05:44:48.278,472] <dbg> bt_mesh_dfu_cli: handle_status: 2 phase: 0, cur state: 4
    rtt:~$ [05:44:48.278,503] <dbg> bt_mesh_dfu_cli: handle_status: Response received with Idle phase
    rtt:~$ [05:44:48.278,503] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x00d6, pending: 2
    rtt:~$ [05:44:48.312,652] <dbg> bt_mesh_dfu_cli: handle_status: 2 phase: 0, cur state: 4
    rtt:~$ [05:44:48.312,683] <dbg> bt_mesh_dfu_cli: handle_status: Response received with Idle phase
    rtt:~$ [05:44:48.312,683] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a0, pending: 1
    rtt:~$ [05:44:48.321,777] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
    rtt:~$ [05:44:48.321,807] <dbg> bt_mesh_dfu_cli: dfu_applied:
    rtt:~$ [05:44:48.321,807] <dbg> bt_mesh_dfu_cli: confirm:
    rtt:~$ [05:44:48.321,899] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 8 targets
    rtt:~$ rtt:~$ [05:44:48.453,460] <dbg> bt_mesh_dfu_cli: handle_info_status: Image list from 0x0197 from index 0
    rtt:~$ [05:44:48.453,521] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000
    rtt:~$ [05:44:48.453,582] <wrn> bt_mesh_dfu_cli: Target 0x0197 failed to apply image: 020100000000000a
    rtt:~$ [05:44:48.453,582] <err> bt_mesh_dfu_cli: Target 0x0197 failed: 3
    rtt:~$ [05:44:48.453,613] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0197, pending: 8
    rtt:~$ [05:44:48.502,838] <dbg> bt_mesh_dfu_cli: handle_info_status: Image list from 0x0196 from index 0
    rtt:~$ [05:44:48.502,868] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000
    rtt:~$ [05:44:48.502,929] <wrn> bt_mesh_dfu_cli: Target 0x0196 failed to apply image: 020100000000000a
    rtt:~$ [05:44:48.502,929] <err> bt_mesh_dfu_cli: Target 0x0196 failed: 3
    rtt:~$ [05:44:48.502,960] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0196, pending: 7
    rtt:~$ [05:44:48.576,080] <dbg> bt_mesh_dfu_cli: handle_info_status: Image list from 0x03e5 from index 0
    rtt:~$ rtt:~$ [05:44:48.576,171] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000
    rtt:~$ [05:44:48.576,232] <wrn> bt_mesh_dfu_cli: Target 0x03e5 failed to apply image: 020100000000000a
    rtt:~$ [05:44:48.576,232] <err> bt_mesh_dfu_cli: Target 0x03e5 failed: 3
    rtt:~$ [05:44:48.576,263] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e5, pending: 6
    rtt:~$ rtt:~$ [05:44:48.685,485] <dbg> bt_mesh_dfu_cli: handle_info_status: Image list from 0x0199 from index 0
    rtt:~$ [05:44:48.685,516] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000
    rtt:~$ [05:44:48.685,577] <wrn> bt_mesh_dfu_cli: Target 0x0199 failed to apply image: 020100000000000a
    rtt:~$ [05:44:48.685,607] <err> bt_mesh_dfu_cli: Target 0x0199 failed: 3
    rtt:~$ [05:44:48.685,607] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x0199, pending: 5
    rtt:~$ [05:44:48.860,198] <dbg> bt_mesh_dfu_cli: handle_info_status: Image list from 0x01a2 from index 0
    rtt:~$ [05:44:48.860,260] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000
    rtt:~$ [05:44:48.860,290] <wrn> bt_mesh_dfu_cli: Target 0x01a2 failed to apply image: 020100000000000a
    rtt:~$ [05:44:48.860,321] <err> bt_mesh_dfu_cli: Target 0x01a2 failed: 3
    rtt:~$ [05:44:48.860,321] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a2, pending: 4
    rtt:~$ rtt:~$ [05:44:48.992,858] <dbg> bt_mesh_dfu_cli: handle_info_status: Image list from 0x03e3 from index 0
    rtt:~$ [05:44:48.992,889] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000
    rtt:~$ [05:44:48.992,950] <wrn> bt_mesh_dfu_cli: Target 0x03e3 failed to apply image: 020100000000000a
    rtt:~$ [05:44:48.992,980] <err> bt_mesh_dfu_cli: Target 0x03e3 failed: 3
    rtt:~$ [05:44:48.992,980] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x03e3, pending: 3
    rtt:~$ [05:44:49.118,438] <dbg> bt_mesh_dfu_cli: handle_info_status: Image list from 0x00d6 from index 0
    rtt:~$ [05:44:49.118,499] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000
    rtt:~$ [05:44:49.118,560] <wrn> bt_mesh_dfu_cli: Target 0x00d6 failed to apply image: 020100000000000a
    rtt:~$ [05:44:49.118,560] <err> bt_mesh_dfu_cli: Target 0x00d6 failed: 3
    rtt:~$ [05:44:49.118,591] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x00d6, pending: 2
    rtt:~$ rtt:~$ [05:44:49.213,592] <dbg> bt_mesh_dfu_cli: handle_info_status: Image list from 0x01a0 from index 0
    rtt:~$ [05:44:49.213,653] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000
    rtt:~$ [05:44:49.213,714] <wrn> bt_mesh_dfu_cli: Target 0x01a0 failed to apply image: 020100000000000a
    rtt:~$ [05:44:49.213,714] <err> bt_mesh_dfu_cli: Target 0x01a0 failed: 3
    rtt:~$ [05:44:49.213,745] <dbg> bt_mesh_blob_cli: blob_cli_broadcast_rsp: 0x01a0, pending: 1
    rtt:~$ [05:44:49.213,775] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
    rtt:~$ [05:44:49.213,775] <dbg> bt_mesh_dfu_cli: dfu_failed: 3
    rtt:~$ [05:44:49.213,806] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 3, phase: 3, apply: 0
    rtt:~$ Distribution phase changed to Failed

  • Hi,

    AndyM said:
    1. What do the warnings about "no free slots for segmented messages" mean? Is it something that needs to be addressed? Should the number of slots be increased?

    You should increase the value of CONFIG_BT_MESH_SEG_BUFS to 80 or 100. 

    AndyM said:

    2. Timeout base was set to 10. Still got an error applying on the distributor side. On the target side the image was applied and is running 

    Thoughts?

    All targets replied with fwid 020100000a000000:

    rtt:~$ [05:44:48.685,516] <dbg> bt_mesh_dfu_cli: handle_info_status: Image 0
    fwid: 020100000a000000

    But fwid on the distributor is 020100000000000a:

    rtt:~$ [05:44:48.685,577] <wrn> bt_mesh_dfu_cli: Target 0x0199 failed to apply image: 020100000000000a

    This value should be used when adding a new slot on the distributor. Please check what fwid you set through `mesh models dfu slot add`. It should be equal to the MCUboot image version encoded as struct mcuboot_img_sem_ver.
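
    A sketch of that encoding, based on the struct mcuboot_img_sem_ver layout from zephyr/dfu/mcuboot.h (the function name is illustrative). For image version 2.1.0+10 it yields the bytes 02 01 00 00 0a 00 00 00 on a little-endian target, matching the fwid the targets reported:

    #include <stdint.h>
    #include <string.h>
    #include <zephyr/dfu/mcuboot.h>

    /* Encode the MCUboot semantic version into the 8-byte FWID that should be
     * passed to "mesh models dfu slot add". */
    static size_t fwid_from_sem_ver(uint8_t fwid[8])
    {
        struct mcuboot_img_sem_ver ver = {
            .major = 2,
            .minor = 1,
            .revision = 0,
            .build_num = 10,
        };

        memcpy(fwid, &ver, sizeof(ver));
        return sizeof(ver);
    }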

    -Amanda H.

  • Great, thank you. I'll implement your suggestions and let you know

    Really appreciate your help

  • The failure to apply on the distributor side was due to the wrong firmware ID provided to the "slot add" command. Once that was corrected, the apply completed successfully on the distributor side.

    Tried using a group address for the transfer - the results are unexpected.

    It took 2 hours and 40 minutes to transfer to 1 device. Without using a group address it takes 40 minutes.

    I will try with multiple devices and let you know.

    Without group address it took about 5 hours to transfer to 8 devices

  • Tried 8 devices

    On the target side I subscribed
    BT_MESH_MODEL_ID_BLOB_SRV (0x1400) and
    BT_MESH_MODEL_ID_DFU_SRV (0x1402) to the group address 0xC008.
    Then I ran the following command and got:
    mesh models dfd start 0 0 0xC008 false 7 20
    { "status": 0, "phase": 1, "group": 49160, "app_idx": 0, "ttl": 7, "timeout_base": 20, "xfer_mode": 1, "apply": 0, "slot_idx": 0 }
    mdist:~$ Starting the firmware distribution.
    Slot:
    Size: 403239 bytes
    FWID: 020100000c000000
    Metadata:
    [00:02:10.500,823] <dbg> bt_mesh_dfd_srv: bt_mesh_dfd_srv_start: Distribution Start: slot: 0, appidx: 0, tb: 20, addr: C008, ttl: 7, apply: 0
    Distribution phase changed to Transfer Active
    [00:02:10.500,976] <dbg> bt_mesh_dfu_cli: initiate:
    [00:02:10.501,007] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 3 targets
    [00:02:10.667,205] <wrn> bt_mesh_transport: Replay: src 0x0199 dst 0x0356 seq 0x000102
    [00:02:10.749,511] <wrn> bt_mesh_transport: Replay: src 0x0196 dst 0x0356 seq 0x000083
    [00:02:10.750,885] <wrn> bt_mesh_transport: Replay: src 0x0199 dst 0x0356 seq 0x000103
    [00:02:10.787,658] <wrn> bt_mesh_transport: Replay: src 0x0197 dst 0x0356 seq 0x000226
    [00:02:10.810,913] <wrn> bt_mesh_transport: Replay: src 0x01a2 dst 0x0356 seq 0x000082
    [00:02:10.855,224] <wrn> bt_mesh_transport: Replay: src 0x01a0 dst 0x0356 seq 0x0002ac
    [00:02:10.862,030] <wrn> bt_mesh_transport: Replay: src 0x0197 dst 0x0356 seq 0x000227
    [00:02:10.890,319] <wrn> bt_mesh_transport: Replay: src 0x01a2 dst 0x0356 seq 0x000083
    [00:02:10.913,360] <wrn> bt_mesh_transport: Replay: src 0x01a0 dst 0x0356 seq 0x0002ad
    [00:02:10.938,659] <wrn> bt_mesh_transport: Replay: src 0x03e3 dst 0x0356 seq 0x000082
    [00:02:10.954,711] <wrn> bt_mesh_transport: Replay: src 0x03e3 dst 0x0356 seq 0x000083
    [00:02:11.038,574] <wrn> bt_mesh_transport: Replay: src 0x00d6 dst 0x0356 seq 0x000082
    [00:02:11.095,733] <wrn> bt_mesh_transport: Replay: src 0x00d6 dst 0x0356 seq 0x000083
    [00:02:11.128,143] <wrn> bt_mesh_transport: Replay: src 0x03e5 dst 0x0356 seq 0x0001a0
    [00:02:11.205,993] <wrn> bt_mesh_transport: Replay: src 0x03e5 dst 0x0356 seq 0x0001a1
    [

    mdist:~$ mesh models dfd receivers-get 0 20
    {
    "target_cnt": 3,
    "targets": {
    "0": { "blob_addr": 406, "phase": 10, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "1": { "blob_addr": 409, "phase": 10, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "2": { "blob_addr": 995, "phase": 10, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 }
    }
    }
    The replay warnings continue until eventually I get:
    [00:05:55.908,813] <dbg> bt_mesh_blob_cli: retry_timeout: Transfer timed out.
    [00:05:55.908,813] <dbg> bt_mesh_blob_cli: drop_remaining_targets:
    [00:05:55.908,843] <wrn> bt_mesh_blob_cli: Dropping 0x0196: 9
    [00:05:55.908,843] <err> bt_mesh_dfu_cli: Target 0x0196 failed: 3
    [00:05:55.908,843] <wrn> bt_mesh_blob_cli: Dropping 0x0199: 9
    [00:05:55.908,874] <err> bt_mesh_dfu_cli: Target 0x0199 failed: 3
    [00:05:55.908,874] <wrn> bt_mesh_blob_cli: Dropping 0x03e3: 9
    [00:05:55.908,874] <err> bt_mesh_dfu_cli: Target 0x03e3 failed: 3
    [00:05:55.908,905] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
    [00:05:55.908,935] <dbg> bt_mesh_dfu_cli: transfer:
    [00:05:55.908,966] <dbg> bt_mesh_dfu_cli: dfu_failed: 3
    [00:05:55.908,966] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 3, phase: 1, apply: 0
    Distribution phase changed to Failed
    Another quirk: I have 8 devices and only added 3 addresses to the distributor list, yet I get the dfd_start() callback called on ALL 8 devices. Why?
    What else do I need to do to make it work?
    Thanks
  • Hi, 

    Unicast transfer time will scale up almost linearly with the number of devices, with a pretty steep slope (because the distributor sends chunks to each individual device one by one in a round-robin manner). Group transfer time will also scale up linearly with the number of devices in a group, but with a much shallower slope.

    AndyM said:
    get the dfd_start() callback called on ALL 8 devices.

    What does this mean? Can you clarify this a bit?

    AndyM said:

    Tried using a group address for the transfer - the results are unexpected.

    It took 2 hours and 40 minutes to transfer to 1 device. Without using a group address it takes 40 minutes.

    I wouldn't be surprised. This happens because during a unicast transfer each chunk is sent as a unicast SAR (segmented) message. In a unicast SAR transfer, the next Chunk Transfer message cannot be sent until all segments of the previous one have been acknowledged by the receiver. This causes the slowness. A slow transfer is beneficial in a large network, because you don't want to disrupt normal network operation while a DFU transfer is going on by pushing too much traffic into it.

    For group SAR there are no segment acknowledgment messages; the sender of the SAR message simply repeats all segments three times (CONFIG_BT_MESH_SAR_TX_MULTICAST_RETRANS_COUNT=0x02, i.e. two retransmissions after the initial one). Once all are sent, the next Chunk Transfer message goes out. This is what makes the group transfer fast.

    Our recommendation is to keep the DFU transfer as slow as possible to reduce the disruption of normal network operation for end-users of the mesh network. If this is not a concern, then disregard the recommendation. In older NCS revisions we could not control the maximum allowed chunk size for the BLOB server, but now it can be done via the "CONFIG_BT_MESH_RX_BLOB_CHUNK_SIZE" option. If you are using an older SDK, then for your experiments I suggest either reducing BT_MESH_RX_SDU_MAX (which will in turn reduce the maximum size of incoming messages supported by the node for all mesh messages), or just editing blob_srv.c and setting CHUNK_SIZE_MAX to smaller or bigger values (say, 30, 40 or 60) to experiment with the effect of the Chunk Transfer message size on the transfer time. We haven't fully characterized this parameter (i.e. the size of the Chunk Transfer message); however, my judgment is that a Chunk Transfer message of 8-12 segments should be a good balance.

    AndyM said:
    [00:02:10.667,205] <wrn> bt_mesh_transport: Replay: src 0x0199 dst 0x0356 seq 0x000102

    The replay warnings indicate that these nodes are somehow sending messages with old sequence numbers. Are you reflashing and re-configuring the "target" devices (using the iOS mobile app's 'provision and reconfigure' feature) between your experiments? If you do that, your target devices' sequence numbers will reset, and all communication from them to the Distributor will be filtered out by the Distributor's replay protection mechanism. To avoid falling into this trap, every time you want to retry the DFU on the target devices, I suggest you reflash the targets without doing a flash erase (i.e. do "west flash" instead of "west flash --erase"). As a bonus, you won't have to re-provision and reconfigure them from the mobile app.

    -Amanda H.

  • Thank you very much for your insights, that's very helpful

    I still have 2 unresolved issues:

    1. A transfer to a group produced all those replay messages and eventually failed. Yes, I was doing a full erase and reprovision of the targets between my experiments, but not of the distributor. I will start from scratch, reflash and reprovision all devices, and report back. However, your comments raise a question: what if I need to provision a new device into an existing network? Will a group DFU fail for that device?

    2. Why was the "dfu_start()" callback called on all 8 devices when only 3 were in the distributor list? I made sure to clear the list, then add 3 devices before starting the DFU.

  • If you get a timeout error, you most likely need to tune the timeout_base value. You can try enabling logs on the target nodes, check the time between the last message received by the BLOB model and this log message: https://github.com/zephyrproject-rtos/zephyr/blob/23fb3b77a2b2cf53eff1697888f443236069b30c/subsys/bluetooth/mesh/blob_srv.c#L289, and estimate the timeout_base value from that.
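
    A rough sketch of that estimation, assuming the client timeout formula quoted earlier in the thread (10 s * (timeout_base + 2) + 100 ms * TTL); the function is illustrative only:

    #include <stdint.h>

    /* Pick a timeout_base large enough to cover the longest observed gap,
     * in milliseconds, between BLOB messages on a target, plus ~10 s margin. */
    static uint16_t estimate_timeout_base(uint32_t longest_gap_ms, uint8_t ttl)
    {
        uint32_t budget_ms = longest_gap_ms + 10000U;
        uint32_t base = budget_ms > 100U * ttl ?
                        (budget_ms - 100U * ttl) / 10000U : 0U; /* timeout_base + 2 */

        return base > 2U ? (uint16_t)(base - 2U) : 1U;
    }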

    To explain: for more than 2 devices it is better to use a group address. With unicast, the transport layer tries to guarantee delivery of each segmented message to a specific target node, which increases the probability of the chunks reaching that node, but it also significantly slows down the DFU transfer when there are more than 2 devices, because the other devices sit and wait until the transport layer confirms delivery to that node. In that case, a short timeout_base value causes timeouts between the distributor and the target nodes.

    Also, because the timeout_base value is used throughout the whole procedure, including the apply phase, the apply phase may take much longer for the distributor to confirm that all target nodes have applied the firmware: the distributor will sleep for that time before it checks the firmware status on the target nodes again.

    AndyM said:
    2. Why was the "dfu_start()" callback called on all 8 devices when only 3 were in the distributor list? I made sure to clear the list, then add 3 devices before starting the DFU.

    If you are using a group address, all devices subscribed to that address will receive the message. When doing DFU with a group address, you should subscribe only those nodes that will participate in the transfer to that group address. After the DFU completes, unsubscribe the nodes from that address.

    Regarding the replay warnings, you need to try increasing CONFIG_BT_MESH_MSG_CACHE_SIZE to a much bigger value.

    AndyM said:
    1. A transfer to a group produced all those replay messages and eventually failed. Yes, I was doing a full erase and reprovision of the targets between my experiments, but not of the distributor. I will start from scratch, reflash and reprovision all devices, and report back. However, your comments raise a question: what if I need to provision a new device into an existing network? Will a group DFU fail for that device?

    That won't fail as long as the new device does not use an already-used address.

    If a full erase of the targets is done between experiments, those targets should be provisioned with new addresses; then the replay issues won't occur. Though it is important to note that every new address that communicates with the distributor will occupy one entry in the replay protection list.
