Bug in nRF Mesh Bootloader?

I've been developing a solution that uses Mesh DFU together with the Mesh Bootloader. However, I frequently see the DFU abort prematurely because of missing packets (see logs below).

<t: 1642549>, nrf_mesh_dfu.c, 456, SEGMENT RX: received seg 385, seg_count 8498
<t: 1642553>, nrf_mesh_dfu.c, 531, RADIO TX! SLOT 4, count 3, interval: exponential, handle: FFFC
<t: 1642569>, nrf_mesh_dfu.c, 324, Write complete (0x2000FE98)
<t: 1642572>, nrf_mesh_dfu.c, 333, Flash idle.
<t: 1728779>, nrf_mesh_dfu.c, 383, Abort event. Reason: 0x3

However, my logs show that every missing segment was eventually received by the bootloader. In this particular instance segment 322 was missing, but it arrived later, right after segment 372, as shown in the logs below.

<t: 454303>, nrf_mesh_dfu.c, 456, SEGMENT RX: received seg 372, seg_count 8498
<t: 454307>, nrf_mesh_dfu.c, 531, RADIO TX! SLOT 4, count 3, interval: exponential, handle: FFFC
<t: 454323>, nrf_mesh_dfu.c, 324, Write complete (0x2000FE98)
<t: 454327>, nrf_mesh_dfu.c, 333, Flash idle.
<t: 525436>, nrf_mesh_dfu.c, 456, SEGMENT RX: received seg 322, seg_count 8498
<t: 525440>, nrf_mesh_dfu.c, 531, RADIO TX! SLOT 5, count 3, interval: exponential, handle: FFFA
<t: 525457>, nrf_mesh_dfu.c, 324, Write complete (0x2000FE98)
<t: 525460>, nrf_mesh_dfu.c, 333, Flash idle.

Oddly enough, it aborts 64 segments (the size of the missing_segments bitmask) after the previously missing segment (#322), as if the late arrival of that segment was never recorded. So I looked through the dfu_transfer_data function in dfu_transfer_mesh.c and noticed that around line 199, where it shifts 1 by the segment offset, the literal 1 is not forced to an unsigned long long (uint64_t).

In this error scenario the shift amount is 50 (the previously received segment was 372, so 372 - 322 = 50). An unsuffixed literal 1 has type int, which is 32 bits on this target, so shifting it left by 50 pushes the set bit out of the int (formally, shifting an int by 32 or more bits is undefined behavior in C) and the result is 0. That value is then cast up to a 64-bit unsigned int and applied as a mask that has no effect. This would explain why the bootloader believes segment 322 is still missing and then errors out upon receiving segment 386.

As far as I can tell, this affects both v3.1.0 and v3.2.0 of the nRF5 SDK for Mesh.

Can someone confirm this bug? Thanks!
