mesh dfu timeout

Hello

I'm exploring the mesh DFU transfer and I'm getting some strange results.

Here is the setup

1 distributor, 1 target

On the distributor I start the transfer with the "mesh models dfd start 0 0" command.

In about 40 minutes it completes and the image is applied, but I get a timeout message on the distributor side.

However, on the target I see a message from MCUboot that the new image was applied successfully, and it is actually working just fine.

Logs from the distributor and the target are below.

Another question

The progress in the JSON below is "50", which I think is because the function that calculates progress divides by 2:

The function cmd_dfd_receivers_get() in dfd.c has this line:
uint8_t progress = bt_mesh_dfu_cli_progress(&dfd_srv->dfu) / 2;
However, if I use the "mesh models dfu cli progress" command, it returns:
DFU progress: 100 %
Bug?
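
Or is the division intentional? Reading the Mesh DFU spec, the Firmware Distribution Receivers List entry seems to carry Transfer Progress in a 6-bit field as a multiple of 2 percent, so 0..50 on the wire would represent 0..100 %. A minimal sketch of that reading (variable names are mine):

/* Assumption: the Receivers List encodes progress as percent / 2 (6 bits, 0..50). */
uint8_t on_air = bt_mesh_dfu_cli_progress(&dfd_srv->dfu) / 2; /* wire value, 0..50 */
uint8_t percent = on_air * 2;                                 /* back to 0..100 % */

If that is right, receivers-get prints the wire encoding while "mesh models dfu cli progress" prints the plain percentage.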
Thank you
Andy
--------------------------------
Logs

Distributor

rtt:~$ [00:40:51.061,126] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
rtt:~$ [00:40:51.061,157] <dbg> bt_mesh_dfu_cli: dfu_complete:
rtt:~$ [00:40:51.061,187] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 0, phase: 1, apply: 1
rtt:~$ Distribution phase changed to Applying Update
rtt:~$ [00:40:51.061,218] <dbg> bt_mesh_dfu_cli: apply:
rtt:~$ rtt:~$ [00:40:51.061,248] <dbg> bt_mesh_blob_cli: blob_cli_broadcast: 1 targets
rtt:~$ rtt:~$ [00:40:51.271,606] <dbg> bt_mesh_dfu_cli: handle_status: 0 phase: 6, cur state: 4
rtt:~$ [00:40:51.271,636] <dbg> bt_mesh_dfu_cli: handle_status: Target 0x01a2 receiving transfer
rtt:~$ rtt:~$ [00:41:00.282,928] <dbg> bt_mesh_blob_cli: retry_timeout: 5
rtt:~$ rtt:~$ [00:41:09.441,864] <dbg> bt_mesh_blob_cli: retry_timeout: 4
rtt:~$ rtt:~$ [00:41:18.604,400] <dbg> bt_mesh_blob_cli: retry_timeout: 3
rtt:~$ rtt:~$ [00:41:27.766,143] <dbg> bt_mesh_blob_cli: retry_timeout: 2
rtt:~$ rtt:~$ [00:41:36.924,011] <dbg> bt_mesh_blob_cli: retry_timeout: 1
rtt:~$ [00:41:36.924,041] <dbg> bt_mesh_blob_cli: retry_timeout: Transfer timed out.
rtt:~$ [00:41:36.924,041] <dbg> bt_mesh_blob_cli: drop_remaining_targets:
rtt:~$ [00:41:36.924,072] <wrn> bt_mesh_blob_cli: Dropping 0x01a2: 9
rtt:~$ [00:41:36.924,072] <err> bt_mesh_dfu_cli: Target 0x01a2 failed: 3
rtt:~$ [00:41:36.924,102] <dbg> bt_mesh_blob_cli: broadcast_complete: continuing
rtt:~$ [00:41:36.924,102] <dbg> bt_mesh_dfu_cli: dfu_failed: 3
rtt:~$ [00:41:36.924,133] <dbg> bt_mesh_dfd_srv: dfu_ended: reason: 3, phase: 3, apply: 1
rtt:~$ Distribution phase changed to Failed
rtt:~$ rtt:~$ mesh models dfd receivers-get 0 1
{
"target_cnt": 1,
"targets": {
"0": { "blob_addr": 418, "phase": 6, "status": 3, "blob_status": 9, "progress": 50, "img_idx": 0 }
}
}

-----------------------------------------

Target

*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
Current image version: 2.1.0+8

~$ Firmware upload started

~$ Firmware upload succeeded
Applying the new firmware
*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
I: Starting bootloader
I: Primary image: magic=unset, swap_type=0x1, copy_done=0x1, image_ok=0x3
I: Secondary image: magic=good, swap_type=0x2, copy_done=0x3, image_ok=0x3
I: Boot source: none
I: Image index: 0, Swap type: test
I: Starting swap using move algorithm.
I: Bootloader chainload address offset: 0x8000
I: Jumping to the first image slot
*** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
Current image version: 2.1.0+9

  • I am using the unmodified samples you provided. Yes, bt_mesh_dfu_srv_applied is called on the target side.

    No composition changes in the image, just the version number.

    I tried with and without a custom timeout, no difference. The image is transferred, applied, and runs on the target just fine, but on the distributor side there is an error after the "apply" step. See the logs above.
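
    For context, the apply path on my target is just the sample's: after MCUboot swaps and the new image boots, it confirms the image and reports to the DFU Server, roughly (a sketch assuming the unmodified sample; the exact code may differ):

        /* Sketch of the target's confirm-on-boot flow.
         * Needs <zephyr/dfu/mcuboot.h> and the sample's dfu_srv instance.
         */
        if (!boot_is_img_confirmed()) {
                boot_write_img_confirmed();        /* MCUboot: mark the new image OK */
                bt_mesh_dfu_srv_applied(&dfu_srv); /* DFU Server: report firmware applied */
        }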

    Also, what is phase 10? Saw that several times. It's not defined in the headers. Uninitialized variable?

    Another quirk I'd like to resolve is why it fails to proceed with the batch transfer if one of the target nodes is offline.

    Thanks for clarifying the "50" progress thing

  • AndyM said:
    Also, what is phase 10? Saw that several times.

    Can't see this in the logs.

    AndyM said:
    why it fails to proceed with the batch transfer if one of the target nodes is offline

    Can you provide the full log?

    I may also suggest increasing the timeout_base value.

    This one (if you use the shell to start the DFU transfer): https://github.com/zephyrproject-rtos/zephyr/blob/zephyr-v3.5.0/doc/connectivity/bluetooth/api/mesh/shell.rst?plain=1#L1411
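
    For example, keeping the rest of the arguments as in your setup, something like this (the 500 is just an arbitrary larger timeout_base, tune to taste):

        mesh models dfd start 0 0 0 false 7 500

    Each unit of timeout_base extends the BLOB transfer timeout, so a larger value gives slow targets more time to respond before they are dropped.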

  • Can you provide the full log?

    See the first log in the ticket. It is from a distribution to 8 nodes with the "apply" option. As soon as the transfer to one of the nodes completed, the image was applied. That caused a timeout and the whole batch stopped.

    I tried a timeout of 10 and will try something bigger and let you know.

    Here is the phase 10 example.

    mesh models dfd receivers-get 0 20
    {
    "target_cnt": 8,
    "targets": {
    "0": { "blob_addr": 407, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "1": { "blob_addr": 406, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "2": { "blob_addr": 992, "phase": 10, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "3": { "blob_addr": 409, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "4": { "blob_addr": 418, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "5": { "blob_addr": 995, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "6": { "blob_addr": 214, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 },
    "7": { "blob_addr": 419, "phase": 2, "status": 0, "blob_status": 0, "progress": 0, "img_idx": 0 }
    }
    }
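
    For reference, this is how I decode the "phase" column (enum copied from dfu.h in my tree; names and values may differ in yours):

        enum bt_mesh_dfu_phase {
                BT_MESH_DFU_PHASE_IDLE,              /* 0 */
                BT_MESH_DFU_PHASE_TRANSFER_ERR,      /* 1 */
                BT_MESH_DFU_PHASE_TRANSFER_ACTIVE,   /* 2 */
                BT_MESH_DFU_PHASE_VERIFY,            /* 3 */
                BT_MESH_DFU_PHASE_VERIFY_OK,         /* 4 */
                BT_MESH_DFU_PHASE_VERIFY_FAIL,       /* 5 */
                BT_MESH_DFU_PHASE_APPLYING,          /* 6 */
                BT_MESH_DFU_PHASE_TRANSFER_CANCELED, /* 7 */
                BT_MESH_DFU_PHASE_APPLY_SUCCESS,     /* 8 */
                BT_MESH_DFU_PHASE_APPLY_FAIL,        /* 9 */
        };
        /* The reported phase of 10 is one past APPLY_FAIL and maps to nothing here. */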

    Thank you

  • A new problem

    I did a firmware distribution to 1 device and the transfer succeeded, so I know the process works fine.

    Then I tried to distribute to 8 devices; see the node list above. Node 992 is offline.
    Its phase is listed as 10, which does not exist in the phase enumeration. But that's the least of my worries.

    After I sent the "mesh models dfd start 0 0 0 false 7 120" command, I ran "mesh models dfd receivers-get 0 20" periodically, but the "progress" field remained at 0 for over 5 minutes, at which point I canceled the distribution with the "mesh models dfd cancel" command.

    After that, all 8 of my target nodes fail to boot.

    Messages on the target node:

    *** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
     Recovering the firmware upload
    *** Booting nRF Connect SDK v3.5.99-ncs1-1 ***
    I: Starting bootloader
    I: Primary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
    I: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3
    I: Boot source: none
    I: Image index: 0, Swap type: none
    I: Bootloader chainload address offset: 0x8000
    I: Jumping to the first image slot
    *** Booting nRF Connect SDK v3.5.99-ncs1-1 ***

    The messages repeat.

    In the image recovery handler of the target sample there is this comment:

        /* TODO: Need to recover the effect. */
    Why is this handler called after the distribution was cancelled? How do I fix this? (A sketch of the workaround I'm considering is below.)

    I am curious, did you test batch distribution? With how many targets? Did you test with some nodes offline? Did you test with more targets than the default 8?

    It makes me nervous to deploy this feature if a simple cancel bricks all devices. I hope this is something I am doing wrong and there is a fix.
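
    My current guess at a workaround (untested; the callback signature is copied from dfu_srv.h in my tree and may differ in yours) is to reject recovery in that handler, so the node abandons the half-finished transfer instead of boot-looping:

        /* Sketch: refuse to recover a transfer after an unexpected reboot.
         * Returning an error from the recover callback should make the
         * DFU Server give up on the interrupted transfer.
         */
        static int dfu_recover(struct bt_mesh_dfu_srv *srv,
                               const struct bt_mesh_dfu_img *img,
                               const struct bt_mesh_blob_io **io)
        {
                return -ENOTSUP;
        }
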
    Thanks
