
Mesh DFU problem

We have a Mesh app that should be updated via DFU.
The problem is that if a device restarts in the middle of a DFU (for whatever reason), or is powered up later (i.e. it was not running during the DFU), it never receives the update.
Here is an app log (device started after the DFU):

<t:         23>, ble_softdevice_support.c,  162, sd_ble_enable: app_ram_base should be adjusted to 0x20002DA0
<t:        489>, main.c,   68, Initializing and adding models
<t:        496>, main.c,  111, rom_base   26201
<t:        498>, main.c,  112, rom_end    424D4
<t:        500>, main.c,  113, rom_length 1C2D3
<t:        502>, main.c,  114, bank_addr   43000
<t:        509>, bizlogic.c,  213, Bizlogic init
<t:        511>, gap_listener.c,   85, GAP scanner started.
<t:        514>, gap_advertiser.c,   74, GAP advertiser init
<t:       5389>, nrf_mesh_dfu.c,  529, 	RADIO TX! SLOT 0, count 255, interval: periodic, handle: FFFE
<t:       5398>, main.c,  141, Started.
<t:       5510>, nrf_mesh_dfu.c,  391, 	New firmware!
<t:       5512>, dfu.c,   48, NRF_MESH_EVT_DFU_FIRMWARE_OUTDATED_NO_AUTH
<t:       5515>, nrf_mesh_dfu.c,  529, 	RADIO TX! SLOT 0, count 255, interval: periodic, handle: FFFD
<t:       5519>, nrf_mesh_dfu.c,  535, Killing a TX slot prematurely (repeats done: 0).
<t:       8167>, nrf_mesh_dfu.c,  529, 	RADIO TX! SLOT 0, count 255, interval: periodic, handle: FFFD
<t:       8171>, nrf_mesh_dfu.c,  535, Killing a TX slot prematurely (repeats done: 0).

The device obviously knows that it should be updated (NRF_MESH_EVT_DFU_FIRMWARE_OUTDATED_NO_AUTH), but it fails to transfer the firmware (Killing a TX slot prematurely).
How can this be fixed?

Also, when is the update transmission over the Mesh stopped?
If we have a Client with ID = 1 and a Server with ID = 2, and the Server devices have now been updated, they still broadcast this new firmware to all devices. If you then want to update the Client device, it is already receiving the update from the Server devices before you can start, and this update is transmitted all the time.
How can firmware re-transmission over the Mesh be stopped, so that you can update the Client without the Server firmware being relayed? If you send the init packet via serial to the Client, you get *84 78 87*.
And how can the Server devices then be updated to a newer version after a day or so, if some device is still broadcasting the older firmware?

[Mesh SDK 3.1, nRF SDK 15.2, SD 6.1, nRF52840]

  • Hello,

    The device obviously knows that it should be updated (NRF_MESH_EVT_DFU_FIRMWARE_OUTDATED_NO_AUTH), but it fails to transfer the firmware (Killing a TX slot prematurely).

    Did you also restart the transmission via serial to the first device, or does the device join in the middle of the same transmission during which it was reset?

    Also, when is the update transmission over the Mesh stopped?

    Then the DFU will eventually time out. Check out TIMER_START_TIMEOUT_US and TIMER_DATA_TIMEOUT_US in nrf_mesh_dfu.c, on lines 74 and 75.

     

    If we have Client with ID = 1 and Server with ID = 2, and if Server devices are now updated, they still broadcast this new firmware to all devices

     Yes.

     

    and if you want to update Client device, before you can start it is already receiving update from Server devices and this update is transmitted all the time.

     I don't understand what you are asking here.

      

     

    How can firmware re-transmission over the Mesh be stopped, so that you can update the Client without the Server firmware being relayed? If you send the init packet via serial to the Client, you get *84 78 87*.
    And how can the Server devices then be updated to a newer version after a day or so, if some device is still broadcasting the older firmware?

     They will not retransmit for that long. You can't disable retransmits, but again, check the timeout variables: they decide how long a node stays on the same update before it times out.

    BR,

    Edvin

  • It fails in all cases

    Examples
    1. -Run DFU for Server on 4 devices - 1 Client(serial) and 3 Servers
        -In the middle of DFU update, disconnect 1 server
        -Reconnect it, and only "NRF_MESH_EVT_DFU_FIRMWARE_OUTDATED_NO_AUTH" happens. It goes as far as adding an event to the timer, but nothing else happens.
        -After some time, the other 2 servers get updated. The one that was disconnected stays outdated and never gets the update.

    2. -Run DFU for Server on 4 devices - 1 Client(serial) and 3 Servers
        -End DFU successfully
        -Connect a new Server, which has an older version, to the network
        -Same thing as in example 1: it shows New firmware, but nothing happens and it never updates

    So if a device loses connection or restarts, or you connect a new one, it will never update to the current mesh firmware.

  • DFU on Mesh is time consuming. This is because Mesh is a low power network with low throughput. When you say it isn't working with 3+ jumps, are the nodes far apart? Do you experience a lot of loss on a regular basis between these nodes? Have you tried turning up the relay count on the nodes?

    Look in nrf_mesh_config_core.h. What is your MESH_FEATURE_RELAY_ENABLED, and what is your CORE_TX_REPEAT_RELAY_DEFAULT? If they are both 1, can you try to increase CORE_TX_REPEAT_RELAY_DEFAULT to  2 or 3?

    How many nodes are in the network that you are trying to perform the DFU on?

    BR,

    Edvin

  • Hi!

    MESH_FEATURE_RELAY_ENABLED is set to 1

    I tried setting CORE_TX_REPEAT_RELAY_DEFAULT to 2 or 3 but there is no difference.


    1. A week ago I tried
    -compiling Release and creating .zip file
    -Updating with 500ms
    -I forgot to change the server app version from 1 -> 4, so the device updated thinking it was version 4 (I set 4 when making the .zip file), but then it flashed back to 1. The client devices are still sending the server update all the time and put other servers into New firmware. So it doesn't look like the abort timeout is working here: I left the devices untouched for at least 4 days, and when you reboot a device you always get "New firmware".
    Shouldn't the client devices time out after 10 minutes? Why do I still see an update in progress after 4 days?

    2. One week ago it worked, but when I tried again today it didn't, and after some time I get a timeout as the reason, as if packets were lost in transit.


    2. We have our own dfu.c file 

    bool dfu_mode = true;
    static nrf_mesh_evt_handler_t dfu_event_handler;
    
    bool fw_updated_event_is_for_me(const nrf_mesh_evt_dfu_t *p_evt) {
        switch (p_evt->fw_outdated.transfer.dfu_type) {
        case NRF_MESH_DFU_TYPE_APPLICATION:
            return (p_evt->fw_outdated.current.application.app_id == p_evt->fw_outdated.transfer.id.application.app_id &&
                    p_evt->fw_outdated.current.application.company_id == p_evt->fw_outdated.transfer.id.application.company_id &&
                    p_evt->fw_outdated.current.application.app_version < p_evt->fw_outdated.transfer.id.application.app_version);
    
        case NRF_MESH_DFU_TYPE_BOOTLOADER:
            return (p_evt->fw_outdated.current.bootloader.bl_id == p_evt->fw_outdated.transfer.id.bootloader.bl_id &&
                    p_evt->fw_outdated.current.bootloader.bl_version < p_evt->fw_outdated.transfer.id.bootloader.bl_version);
    
        case NRF_MESH_DFU_TYPE_SOFTDEVICE:
            return false;
    
        default:
            return false;
        }
    }
    
    void dfu_event_cb(const nrf_mesh_evt_t *p_evt) {
        switch (p_evt->type) {
        case NRF_MESH_EVT_DFU_FIRMWARE_OUTDATED:
        case NRF_MESH_EVT_DFU_FIRMWARE_OUTDATED_NO_AUTH:
            __LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "NRF_MESH_EVT_DFU_FIRMWARE_OUTDATED_NO_AUTH\n");
            if (fw_updated_event_is_for_me(&p_evt->params.dfu))
                ERROR_CHECK(nrf_mesh_dfu_request(p_evt->params.dfu.fw_outdated.transfer.dfu_type, &p_evt->params.dfu.fw_outdated.transfer.id, (uint32_t *)bank_addr));
            else
                ERROR_CHECK(nrf_mesh_dfu_relay(p_evt->params.dfu.fw_outdated.transfer.dfu_type, &p_evt->params.dfu.fw_outdated.transfer.id));
            break;
    
        case NRF_MESH_EVT_DFU_START:
            __LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "NRF_MESH_EVT_DFU_START\n");
            dfu_mode = true;
            break;
    
        case NRF_MESH_EVT_DFU_END:
            __LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "NRF_MESH_EVT_DFU_END\n");
            dfu_mode = false;
            send_dfu(p_evt->type);
            break;
    
        case NRF_MESH_EVT_DFU_BANK_AVAILABLE:
            __LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "NRF_MESH_EVT_DFU_BANK_AVAILABLE\n");
            ERROR_CHECK(nrf_mesh_dfu_bank_flash(p_evt->params.dfu.bank.transfer.dfu_type));
            break;
        default:
            //__LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "Unhandled Mesh Event: %d \n", p_evt->type);
            break;
        }
    }
    
    void dfu_init() {
        dfu_mode = false;
        rom_length = (uint32_t)rom_end - rom_base;
        bank_addr = (uint32_t)(rom_end & FLASH_PAGE_MASK) + FLASH_PAGE_SIZE;
        __LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "rom_base   %X\n", rom_base);
        __LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "rom_end    %X\n", rom_end);
        __LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "rom_length %X\n", rom_length);
        __LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "bank_addr  %X\n", bank_addr);
        dfu_event_handler.evt_cb = dfu_event_cb;
        nrf_mesh_evt_handler_add(&dfu_event_handler);
    }


    Should there be some other event handler here added or is that it?
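
The bank_addr expression in dfu_init() can be checked on a host machine. This is only a sketch: it assumes 4 KB flash pages (the nRF52840 page size) and assumes FLASH_PAGE_MASK clears the in-page offset bits, since the actual definitions are not shown in the post. The helper names next_page_after and image_length are hypothetical.

```c
#include <stdint.h>

/* Assumed values: 4 KB flash pages; the mask clears the offset within a page. */
#define FLASH_PAGE_SIZE 0x1000u
#define FLASH_PAGE_MASK (~(FLASH_PAGE_SIZE - 1u))

/* Hypothetical helper mirroring the bank_addr expression in dfu_init():
 * round rom_end down to its page start, then move one page up. */
uint32_t next_page_after(uint32_t rom_end)
{
    return (rom_end & FLASH_PAGE_MASK) + FLASH_PAGE_SIZE;
}

/* Hypothetical helper mirroring the rom_length expression in dfu_init(). */
uint32_t image_length(uint32_t rom_base, uint32_t rom_end)
{
    return rom_end - rom_base;
}
```

With the values from the first log (rom_base 0x26201, rom_end 0x424D4), these helpers give a length of 0x1C2D3 and a bank address of 0x43000, which matches what the device printed. Note that an already page-aligned rom_end still moves up by a full page under this expression.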

    3. When trying DFU there are so many ways it can go wrong
    -It can just stop at *RADIO TX! SLOT 1*, where it says "count 99", as if it was trying to send something but couldn't, and then it doesn't continue
    -It can Abort randomly with timeout
    -It can say NRF_MESH_DFU_END_ERROR_PACKET_LOSS
    -or NRF_MESH_DFU_END_ERROR_BANK_IN_BOOTLOADER_AREA

    And if I want to stop ALL sending via mesh while doing a DFU, I cannot know any of this until the DFU is complete (after 1h), and only then do I find out that 3/4 of the devices got some error and don't work.


    4. What if 
    Client 1 and Server 1 currently have firmware 1 and they are broadcasting it
    Then I run DFU on Client 2 with firmware 2 for Servers.
    4.1 Will Server 1 and others be updated properly?
    4.2 Can this somehow screw DFU if some old DFU is still in the air?
    4.3 After Server update, should I delete DFU on Client devices, so they don't broadcast it anymore?

    5. Is there anything else I can try? 

    In the Python script I changed the random TID to a static one, so I can send the same firmware with the same ID if something went wrong, but I don't see this change having any influence on these problems.

    Thank you in advance!

  • Hello,

     

    Tomi said:
    Shouldn't client devices timeout after 10minutes?

     + 

    Tomi said:
    We have our own dfu.c file 

    It is difficult to say what's going on without knowing the extent of your changes. 

     

    Tomi said:
    Should there be some other event handler here added or is that it?

     You still have the dfu_evt_handler() in nrf_mesh_dfu.c, right?

    In this function, do you still use TIMER_START_TIMEOUT_US and TIMER_DATA_TIMEOUT_US? And you never see "Timeout fired @..." anywhere in the log, or hit a breakpoint set there (in timer_timeout in nrf_mesh_dfu.c)?

    I know it isn't straightforward, but by default, it should work like this:

    One device initiates the DFU (the one connected to the computer via serial), and starts transmitting the DFU packets. When the first DFU packet is received, each device will get the BL_EVT_TYPE_DFU_START, and start a timer with timeout TIMER_START_TIMEOUT_US. After that, each incoming DFU packet will trigger the BL_EVT_TYPE_DFU_DATA_SEGMENT, and start/restart the timer with TIMER_DATA_TIMEOUT_US. If either of these timeouts expires, the timer will abort the DFU session.
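
The timer behaviour described above can be modelled on a host machine. This is a rough sketch, not the SDK implementation: the type dfu_session_t and the three functions are hypothetical names, and both timeouts are assumed to be 10 minutes, which is the figure mentioned in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to be 10 minutes each; the real values are defined in nrf_mesh_dfu.c. */
#define TIMER_START_TIMEOUT_US (10ULL * 60ULL * 1000000ULL)
#define TIMER_DATA_TIMEOUT_US  (10ULL * 60ULL * 1000000ULL)

/* Hypothetical model of the abort deadline of one DFU session. */
typedef struct
{
    bool     active;
    uint64_t deadline_us;
} dfu_session_t;

/* First DFU packet seen: open the session and arm the start timeout. */
void session_on_start(dfu_session_t *s, uint64_t now_us)
{
    s->active = true;
    s->deadline_us = now_us + TIMER_START_TIMEOUT_US;
}

/* Each data segment pushes the deadline forward by the data timeout. */
void session_on_data(dfu_session_t *s, uint64_t now_us)
{
    if (s->active)
    {
        s->deadline_us = now_us + TIMER_DATA_TIMEOUT_US;
    }
}

/* Returns true (and closes the session) once the deadline has passed. */
bool session_check_timeout(dfu_session_t *s, uint64_t now_us)
{
    if (s->active && now_us >= s->deadline_us)
    {
        s->active = false;
        return true;
    }
    return false;
}
```

In this model a session that keeps receiving data segments never times out; it is aborted only after a full timeout period with no DFU packets at all.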

    In addition to starting these timers, each node will check whether the packet is intended for itself or not in the mesh_evt_handler() in main.c. It uses fw_updated_event_is_for_me() to check this. As you can see, it will, based on this, either store the data from the packet, then relay it, or just relay it. 
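
The "store and relay" versus "just relay" decision can be illustrated with a simplified, host-side version of the application branch of fw_updated_event_is_for_me(). The struct app_id_t, the enum, and choose_dfu_action are hypothetical stand-ins for the SDK types, and the company ID used in the usage note is an arbitrary example value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the SDK's application ID (hypothetical type). */
typedef struct
{
    uint32_t company_id;
    uint16_t app_id;
    uint32_t app_version;
} app_id_t;

typedef enum
{
    DFU_ACTION_REQUEST, /* the transfer is for this node: store it and relay it */
    DFU_ACTION_RELAY    /* the transfer is for other nodes: only relay it */
} dfu_action_t;

/* Same comparison as the NRF_MESH_DFU_TYPE_APPLICATION case in the posted dfu.c:
 * same company, same app ID, and a strictly newer version. */
dfu_action_t choose_dfu_action(const app_id_t *current, const app_id_t *offered)
{
    bool for_me = current->company_id == offered->company_id &&
                  current->app_id == offered->app_id &&
                  current->app_version < offered->app_version;

    return for_me ? DFU_ACTION_REQUEST : DFU_ACTION_RELAY;
}
```

Using the IDs from the original question (Client app ID 1, Server app ID 2): a Server on version 1 that is offered Server version 4 requests the transfer, while a Client on version 1, or a Server already on version 4, only relays it.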

    Can you check whether any of the timeouts fire? You can test this more easily by reducing TIMER_DATA_TIMEOUT_US and TIMER_START_TIMEOUT_US to one minute or so, instead of 10 minutes.

  • Hi

    1. Yes, dfu_evt_handler() is still present. nrf_mesh_dfu.c is not changed at all and is the same as in the Mesh SDK, so all these functions are present, and so are the timeouts.
    2. I get Timeout fired @... when for example DFU_END happens and it needs some time to finish DFU.

    3. I think the problem is elsewhere but I don't know exactly where.

    What I have found out now is that
    -Devices updated with, for example, firmware 4 are broadcasting it ALWAYS (even after days). What I mean by that is that when you power up a new device, it will automatically go to dfu_evt_handler() into BL_EVT_TYPE_DFU_NEW_FW, and then it goes into
    NRF_MESH_EVT_DFU_FIRMWARE_OUTDATED_NO_AUTH in the dfu.c file.
    You also get one "nrf_mesh_dfu.c,  529, RADIO TX! SLOT 0, count 255, interval: periodic, handle: FFFE", but that is it. The DFU itself will not start; the devices only get a packet saying that some device has newer firmware.

    -When I got Timeout yesterday, I had 10+ devices connected nearby. 
    Some of those devices transmitted this Firmware 4 packet to others
    Some had custom new firmware flashed from Segger with Firmware 1
    Some had debug and some had release versions (In debug devices were just connected and used for different purposes, I only had release versions for DFU)
    Some were Clients and some were Servers. So the clients all had Firmware version 1 and I was trying to update the Servers from a Client.

    -I connected 1 device with Firmware 1 and ran with Segger
    -It said New Firmware (BL_EVT_TYPE_DFU_NEW_FW) - So it got Firmware 4 broadcasted from others
    -I created APP version 10 (same release as in version 4, I just changed numbers in .c file and when creating .zip with nrfutil)
    -So I ran Firmware 10 update from a Client and I always got "BL_EVT_TYPE_DFU_ABORT" with reason "0x3" in different parts of DFU transfer. Usually at around 2000/6200 packets running with 800ms

    So this always failed yesterday.
    First time, when ALL devices were newly flashed and didn't have any newer firmware this update worked but only once.

    -After all that I disconnected all devices and had only 1 Server and 1 Client (for DFU), and the DFU ran OK!
    -Now I tried adding a few new Server devices, and the DFU also looks OK, even running with 400ms and with 1 server already having this update!
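
For a sense of scale, the transfer time implied by the numbers above can be estimated with simple arithmetic. This sketch assumes exactly one packet is sent per interval and ignores retransmissions and relaying delays; dfu_duration_s is a hypothetical helper, not part of the SDK or the DFU script.

```c
#include <stdint.h>

/* Estimated seconds needed to send packet_count packets,
 * assuming one packet per interval_ms and no retransmissions. */
uint32_t dfu_duration_s(uint32_t packet_count, uint32_t interval_ms)
{
    return (uint32_t)(((uint64_t)packet_count * interval_ms) / 1000u);
}
```

For the roughly 6200 packets mentioned above, this gives 4960 s (about 83 minutes) at 800ms and 2480 s (about 41 minutes) at 400ms. The aborts seen at around packet 2000 with 800ms would fall about 1600 s (roughly 27 minutes) into the transfer.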

    The question is, what is the problem here?
    Are clients or other devices maybe causing problems by broadcasting these New firmware packets while the DFU is happening?
    Is it maybe that only 1 Client can run nearby?
    Are there too many devices nearby?
    Is it maybe that some device had a Debug version on it and did something that made the DFU abort randomly?
    Why does DFU behave strangely in these cases?
    Why are the Client devices then broadcasting New firmware to others, if the Firmware is already updated on the Servers and this Firmware is not for the Clients? Should I also send DFU_END to all clients when the Servers are updated?

    4. When running DFU I see "... INTERVAL: EXPONENTIAL". Can this have any influence on these problems, and can it be changed?

    (In real-world cases I will have a few Clients and many Servers. I will run DFU first for the Clients and, after the Clients are updated, a new DFU for the Servers. This can also be reversed.)

    EDIT
    I found out that after I updated 3 Servers, I had to reset all Server devices for the Client update to work, because something was transmitting from the Servers and the Client update never worked.
    The problem here is that we need to add a custom Mesh Node reboot, because we don't want to unprovision the device with a node reset, only reboot it.

  • Hello,

    Can you check whether you have the correct softdevice in your device page? If the bootloader receives a DFU for an app-id with different SD ID's it will request the SD instead.

    Br,

    Edvin

  • Hi!

    OK, I will check that, because yes, I saw it requesting the SD version, not the APP, when the device didn't work.
    Maybe I corrupted something when I updated the device with Release, then attached it to the debugger and ran ReleaseWithDebugInformation on it.

    Thank you!

  • Hi,

    We have a different bootloader for Client and Server, because the Client has serial support and the Server doesn't.
    We also have to generate 2 different device_page.hex files, with a different BootloaderID and AppId for each, but the softdevice is the same.
    Can any of this be the problem, if the Client receives an update from a Server which has a different configuration?

    Thank you!


  • If the client receives an update with another application ID it shouldn't care about the content of the DFU image, just relay it. But if it receives an update with the same application ID that uses a later softdevice, it will request the softdevice instead. 

    So what are the configurations on the server and client when you are sending to the servers?

    BR,

    Edvin

  • As far as I checked, everything is OK; the softdevice file used is the same for client and server.

    The Client is getting the DFU image from the Server, and it sometimes says 84 78 87 on the Init packet. You then need to power off ALL server devices before you can start the DFU normally on the Client.
    So when you restart the Client it triggers the 0xA6 - DFU_FIRMWARE_OUTDATED_NO_AUTH command.

    When I checked the softdevice case that triggered in dfu.c, it was a Segger debugging problem; it only starts an APPLICATION DFU, so the softdevice case doesn't happen...

    I don't know what else can I do to make this usable... it works or it doesn't and if it doesn't

  • I kinda solved the problem

    -When you update the Server devices the first time it works OK.
    -Then when you want to update the Client, you have to call DFU_ABORT multiple times (I think a restart is also needed before aborting, but I am not 100% sure)
    -And then you run the DFU for the Client or for the new Firmware

    The problem happens because on restart or start the Client is now getting New firmware from the Servers, which messes it up so that it doesn't want to start the new firmware update.

    This new firmware is floating in the air all the time. It is NEVER discarded. You can only power all devices OFF and ON to get rid of it, or maybe do something else I am not sure about.

    And also
    The Python DFU script uses a randomly generated number as the TID.
    If some device is at 10% and you start the same DFU again, this TID should stay the same, so that the device just waits and continues updating once the new DFU run gets to 10%.
    When one DFU is completely done, I change the TID to a different number!
