nrfjprog times out with error flashing Thingy:91 - Did worker process die?

I'm getting timeout errors when trying to program a Thingy:91 with a J-Link Pro PoE on macOS on an ARM Apple M3 Max.

Here's the command and error message:

❯ west flash -i 1196000003
-- west flash: rebuilding
[0/26] Performing build step for 'tfm'
ninja: no work to do.
[1/7] Performing build step for 'mcuboot_subimage'
ninja: no work to do.
[2/7] Performing install step for 'tfm'
-- Install configuration: "MinSizeRel"
[6/6] Completed 'mcuboot_subimage'
-- west flash: using runner nrfjprog
-- runners.nrfjprog: Flashing file: /Users/chris/cgnd/clients/golioth/thingy91-golioth-workspace/app/build/zephyr/merged.hex
Timed out waiting for progress updates.sing non-volatile memory - block 2 of 2                                         
Did worker process die?
[ #################### ]  14.809s | Erase file - Done erasing                                                          
ERROR: An internal error has occurred, please try again.
NOTE: For additional output, try running again with logging enabled (--log).
NOTE: Any generated log error messages will be displayed.
FATAL ERROR: command exited with status 63: nrfjprog --program /Users/chris/cgnd/clients/golioth/thingy91-golioth-workspace/app/build/zephyr/merged.hex --sectorerase --verify -f NRF91 --snr 1196000003

I get the same error when running nrfjprog manually (I've attached the log file):

❯ nrfjprog --program /Users/chris/cgnd/clients/golioth/thingy91-golioth-workspace/app/build/zephyr/merged.hex --sectorerase --verify -f NRF91 --snr 1196000003 --log
Timed out waiting for progress updates.sing non-volatile memory - block 2 of 2                                         
Did worker process die?
[ #################### ]  16.435s | Erase file - Done erasing                                                          
ERROR: An internal error has occurred, please try again.

However, running nrfjprog --recover appears to fix whatever is causing the errors:

❯ nrfjprog --recover
Recovering device. This operation might take 30s.
Erasing user code and UICR flash areas.

❯ nrfjprog --log --program /Users/chris/cgnd/clients/golioth/thingy91-golioth-workspace/app/build/zephyr/merged.hex --sectorerase --verify -f NRF91 --snr 1196000003
[ #################### ]  13.059s | Erase file - Done erasing                                                          
[ #################### ]   2.836s | Program file - Done programming                                                    
[ #################### ]   2.820s | Verify file - Done verifying

Is there a way to get a better understanding of what's going on here? Is there something that my application code is doing that is preventing the J-Link from being able to flash the device reliably? What is nrfjprog --recover doing to "fix" this issue?

3173.log.log

Parents
  • BTW, I only get these timeout errors when trying to flash a board that is running an existing firmware image. Flashing a board with empty flash works.

    Also, if I first open the nRF Connect for Desktop Programmer GUI software and do "Erase all", and THEN run west flash, it works every time:

    ❯ west flash -i 1196000003
    -- west flash: rebuilding
    [0/26] Performing build step for 'tfm'
    ninja: no work to do.
    [1/7] Performing build step for 'mcuboot_subimage'
    ninja: no work to do.
    [2/7] Performing install step for 'tfm'
    -- Install configuration: "MinSizeRel"
    [6/6] Completed 'mcuboot_subimage'
    -- west flash: using runner nrfjprog
    -- runners.nrfjprog: Flashing file: /Users/chris/cgnd/clients/golioth/thingy91-golioth-workspace/app/build/zephyr/merged.hex
    [ #################### ]  13.381s | Erase file - Done erasing                                                          
    [ #################### ]   2.829s | Program file - Done programming                                                    
    [ #################### ]   2.810s | Verify file - Done verifying                                                       
    Applying system reset.
    Run.
    -- runners.nrfjprog: Board with serial number 1196000003 flashed successfully.

  • Hello,

    Are you using APPROTECT on this board?

    Best regards,

    Michal

  • Any further insights? Same symptoms for me on a M1 using a JLink Pro, when flashing a nrf5340.

    The --recover flag also resolves it for me, but that can't be used when starting a Debug session inside VSCode, because it triggers a rebuild, which is another issue. In that case I can just run "nrfjprog -e" before clicking Debug.

    But it's a long and painful sequence of steps for such a core function.

    I suspect it's related to the partitiion manager some how, and perhaps having a custom pm_static.yml is related. The complicated build process that creates and uploads to the network coprocessor first also seems implicated, but it's very hard to tease apart.

  • We weren't able to reproduce it unfortunately.

    Does it happen on a basic sample like hello world or blinky as well?

    The extension also has an option of attaching the debugger to a running target, could that work as a workaround at least?

    Best regards,

    Michal

  • Okay I came up with a MRE:

    1. Create new application based on the Zephyr Bluetooth Beacon sample.
    2. Replace the prj.conf with the attached*. This just adds an assortment of the configs we are working with. I'm sure there's a smaller set, but it's laborious to minimise and this set does the trick.
    3. Create a build configuration for nrf7002dk_nrf5340_cpuapp_ns and build.
    4. Connect to a nrf7002dk.
    5. Click Flash a few times. In my experience it works the first time after an automatic recover and some very long pauses, but then starts to fail after that with "did worker die?" errors.

    Note I was able to eliminate the pm_static.yml file from the MRE. So I'm now wondering if it's simply the size of the image that triggers the issue. I couldn't pinpoint what the prj.conf addition was that trigged the MRE, but things seem to start tripping up once I added a bunch of them.

    * "attachment": https://pastebin.com/ia4ijcb7

  • Sorry, I was away for a while unfortunately.

    I will have to check with one of my colleagues with access to one of the new Macs if they could test that.

    Best regards,

    Michal

Reply Children
  •  what version of nRF Command Line Tools are you using? I just updated to the latest 10.24.2 build for macOS and I'm not getting the timeout errors I was getting earlier. If you are using an older version, can you try updating to the latest macOS build from https://www.nordicsemi.com/Products/Development-tools/nRF-Command-Line-Tools/Download and see if that fixes it?

  • I was only on 10.24.0. I verified the issue was still present using the MRE above, quit VSCode, disconnected the dev kit, installed 10.24.2, confirmed the new version number for nrfjprog, and ran the MRE again. The behaviour was identical, as far as I could tell: a slow but successful flash the first time after an auto-recovery, then fails on the very next time with "did worker process die".

    I think there was one subtle change - the "did worker process die" notification is no longer sticky, auto-dismissing shortly after it appears. The log however, and the fatal error, look the same.

    Worth a shot, but alas, no cigar!

  • FWIW, I recently upgraded to J-Link v7.96h which prompted me to upgrade my J-Link firmware from compiled "Mar 29 2024", to "compiled May 15 2024". The result is the same, but the log might be a bit more informative:

    Flashing build to 176000446
    west flash -d beacon/build --skip-rebuild --dev-id 176000446

    -- west flash: using runner nrfjprog
    -- runners.nrfjprog: Flashing file: beacon/build/zephyr/merged_domains.hex
    -- runners.nrfjprog: beacon/build/zephyr/merged_domains.hex targets both nRF53 coprocessors; splitting it into: beacon/build/zephyr/GENERATED_CP_NETWORK_merged_domains.hex and beacon/build/zephyr/GENERATED_CP_APPLICATION_merged_domains.hex
    [ ###### ] 0.000s | Erase file - Check image
    [ ##### ] 0.000s | Check image validity - Initialize device info
    [ ########## ] 0.000s | Check image validity - Check region 0 settings
    [ ############### ] 0.000s | Check image validity - block 1 of 2
    [ #################### ] 0.037s | Check image validity - Finished
    [ ############# ] 0.000s | Erase file - Erasing
    [ ########## ] 0.000s | Erasing non-volatile memory - block 1 of 1
    [ #################### ] 0.000s | Erasing non-volatile memory - Erase successful
    [ #################### ] 9.519s | Erase file - Done erasing

    [ ###### ] 0.000s | Program file - Checking image
    [ ##### ] 0.000s | Check image validity - Initialize device info
    [ ########## ] 0.000s | Check image validity - Check region 0 settings
    [ ############### ] 3.236s | Check image validity - block 1 of 2
    [ #################### ] 0.006s | Check image validity - Finished
    [ ############# ] 0.000s | Program file - Programming
    [ ########## ] 0.000s | Programming image - block 1 of 1
    [ #################### ] 0.000s | Programming image - Write successful
    [ #################### ] 1.097s | Program file - Done programming

    [ ###### ] 0.000s | Verify file - Check image
    [ ##### ] 0.000s | Check image validity - Initialize device info
    [ ########## ] 0.000s | Check image validity - Check region 0 settings
    [ ############### ] 3.634s | Check image validity - block 1 of 2
    [ #################### ] 0.009s | Check image validity - Finished
    [ ############# ] 0.000s | Verify file - Verifying
    [ ########## ] 0.000s | Verifying image - block 1 of 1
    [ #################### ] 0.000s | Verifying image - Verify successful
    [ #################### ] 1.097s | Verify file - Done verifying

    [ ###### ] 0.000s | Erase file - Check image
    [ ## ] 0.000s | Check image validity - Initialize device info
    [ ##### ] 0.000s | Check image validity - Check region 0 settings
    [ ####### ] 0.000s | Check image validity - block 1 of 6
    [ ########## ] 0.036s | Check image validity - block 2 of 6
    [ ############ ] 0.027s | Check image validity - block 3 of 6
    [ ############### ] 0.017s | Check image validity - block 4 of 6
    [ ################# ] 0.009s | Check image validity - block 5 of 6
    [ #################### ] 0.088s | Check image validity - Finished
    [ ############# ] 0.000s | Erase file - Erasing
    [ ### ] 0.000s | Erasing non-volatile memory - block 1 of 5
    [ ###### ] 0.272s | Erasing non-volatile memory - block 2 of 5
    [ ########## ] 0.501s | Erasing non-volatile memory - block 3 of 5
    [ ############# ] 0.510s | Erasing non-volatile memory - block 4 of 5
    [ ################ ] 0.225s | Erasing non-volatile memory - block 5 of 5
    Timed out waiting for progress updates.
    Did worker process die?
    [ #################### ] 0.000s | Erasing non-volatile memory - Erase successful
    [ #################### ] 21.276s | Erase file - Done erasing

    ERROR: An internal error has occurred, please try again.
    NOTE: For additional output, try running again with logging enabled (--log).
    NOTE: Any generated log error messages will be displayed.
    FATAL ERROR: command exited with status 63: nrfjprog --program beacon/build/zephyr/GENERATED_CP_APPLICATION_merged_domains.hex --sectorerase --verify -f NRF53 --coprocessor CP_APPLICATION --snr 176000446

    * The terminal process terminated with exit code: 63.



    So in this case it seems the flashing the NET image succeeds, but the process gives up on flashing the APP image because it takes too long, even though it eventually succeeds!

  • I had to do something, this is killing me.

    So I found that the script responsible for the problematic commands is here:

    /opt/nordic/ncs/v2.5.2/zephyr/scripts/west_commands/runners/nrfjprog.py

    After some perusing, I ended up commenting out line 68 and inserting this instead:

                elif erase == 'ERASE_PAGES':
                    self.logger.info('Forcing recover mode')
                    cmd.append('--recover')
    #                cmd.append('--sectorerase')

    In other words, I disabled "sectorerase" and replaced it with "recover". Flashing now takes around 10 seconds instead of well over a minute. Crucially, the erase step that used to take 30 seconds or so and timeout, only takes a second or so, and succeeds.

    I can't begin to tell you the change in ability to get things done that occurs when every time you click on "Build" or "Debug", you get a result within 10 seconds instead of bombing out 60+ seconds later.

    So obviously this is a big dirty hack, and I'm far from confident in the result, but I can see big portions of the script are hopeful hacks anyway. At least I can make progress, and start to work on all the other debugging dramas, like no restart function, GBD crashing and RTT not working. Some time in the future I hope to be able to debug my application, rather than the IDE.

  • I think in this case recover does chip erase instead of sector erase, which is indeed faster.

    If you hover over the "Flash" button in the extension, there should be a small button that does "Erase and Flash" to the right of the main "Flash" button. Does it behave the same by chance as your hack?

    Best regards,

    Michal

Related