DFU bootloader gets stuck in SoftDevice at 0x0000C100 - 0x0000C0FC, then jumps to 0x00016348 - 0x0001635E and loops forever in these code segments

We are using an nRF52833 with the secure DFU bootloader and have implemented the DFU master/host on an ESP32 over BLE. We observe that the DFU bootloader quite frequently gets stuck at the above locations in the SoftDevice during OTA.

SDK used: nRF5_SDK_17.0.0_9d13099
SoftDevice used with version: s140_nrf52_7.0.1_softdevice

Here are some more details:

1. The nRF52833 is in central mode in our application code, and DFU (obviously) works in peripheral mode.
2. The DFU host is implemented on an ESP32 that sits on the same board as the nRF52833, so BLE range is not an issue.
3. The nRF52833 always gets stuck somewhere in the last steps, where it erases/writes something in flash, such as the bootloader settings.
4. As said, it gets stuck at the end. After a power cycle we observed that the OTA was successful: the new application code is loaded and the bootloader settings have been updated to make the new application active, but the device got stuck somewhere after that.
5. On notification from the ESP32 over the serial port, the nRF52833 application jumps to the bootloader by writing the GPREGRET register. We use the sd_ wrapper for this because we are using the SoftDevice.
6. The OTA host on the ESP32 works perfectly every time and never gets stuck in any state, so the ESP32 completes the OTA file transfer and restarts itself, but the nRF52833 sometimes gets stuck.
7. From our study of the DFU protocol, after the new application file has been transferred, the nRF52833 does not perform any handshake once the final flash write activities are complete (this is also what we observed in the RTT debug messages from the DFU bootloader).
8. From the debugging we have done so far, we suspect it is related to the flash APIs, because every time it gets stuck we have seen a Peer Manager log saying it has updated something in the peer manager data. We looked for a way to disable the Peer Manager but found none, so we tried sd_nvic_SystemReset, which is supposed to reset the SoftDevice as well before going into DFU bootloader mode, but no luck.

If you can help me find out which SoftDevice APIs are located at the above addresses, we can try to do something in the application code before jumping to the bootloader to avoid getting stuck.
Also, if you can provide the .o of the SoftDevice, which we can disassemble using objdump, that would be extremely helpful, so that we can see in which APIs and instructions it is getting stuck.

Below is the J-Link step log from when it gets stuck:

J-Link>h
PC = 0000C0F8, CycleCnt = 04C5F1A4
R0 = 00000000, R1 = 00000000, R2 = 80000000, R3 = 4001F014
R4 = 20000040, R5 = 40000000, R6 = 00000000, R7 = 20006FC9
R8 = 00000000, R9 = 0007C318, R10= 20010000, R11= 00000000
R12= 00070E35
SP(R13)= 2001FE38, MSP= 2001FE38, PSP= 00000000, R14(LR) = 0000C101
XPSR = 2100000B: APSR = nzCvq, EPSR = 01000000, IPSR = 00B (DebugMonitor)
CFBP = 00000001, CONTROL = 00, FAULTMASK = 00, BASEPRI = 00, PRIMASK = 01

FPS0 = 00F93AFB, FPS1 = DFDE7575, FPS2 = 46847A75, FPS3 = C2B9E2FE
FPS4 = FA78CE71, FPS5 = 3CBA2E2E, FPS6 = 40A70424, FPS7 = AB566C9C
FPS8 = C9DCB219, FPS9 = 490E4A11, FPS10= 4B6332CF, FPS11= 2FBBB72C
FPS12= 963BEA63, FPS13= 22B5A828, FPS14= 2D7FC343, FPS15= 2001FD80
FPS16= 00000000, FPS17= 00000000, FPS18= 00000000, FPS19= 00000000
FPS20= 00000000, FPS21= 00000000, FPS22= 00000000, FPS23= 00000000
FPS24= 00000000, FPS25= 00000000, FPS26= 00000000, FPS27= 00000000
FPS28= 00000000, FPS29= 00000000, FPS30= 00000000, FPS31= 00000000
FPSCR= 00000000
J-Link>s
0000C0F8: 20 B9 CBNZ R0, #+0x08
J-Link>s
0000C0FA: 00 20 MOVS R0, #0
J-Link>s
0000C0FC: 0A F0 24 F9 BL #+0xA248
J-Link>s
00016348: 07 49 LDR R1, [PC, #+0x1C]
J-Link>s
0001634A: 10 31 ADDS R1, #16
J-Link>s
0001634C: 0A 68 LDR R2, [R1]
J-Link>s
0001634E: D2 03 LSLS R2, R2, #15
J-Link>s
00016350: 06 D5 BPL #+0x0C
J-Link>s
00016352: 09 68 LDR R1, [R1]
J-Link>s
00016354: 01 F0 03 01 AND R1, R1, #0x03
J-Link>s
00016358: 81 42 CMP R1, R0
J-Link>s
0001635A: 01 D1 BNE #+0x02
J-Link>s
0001635C: 01 20 MOVS R0, #1
J-Link>s
0001635E: 70 47 BX LR
J-Link>s
0000C100: 00 28 CMP R0, #0
J-Link>s
0000C102: F7 D1 BNE #-0x12
J-Link>s
0000C0F4: D5 F8 0C 01 LDR R0, [R5, #+0x10C]
J-Link>s
0000C0F8: 20 B9 CBNZ R0, #+0x08
J-Link>s
0000C0FA: 00 20 MOVS R0, #0
J-Link>s
0000C0FC: 0A F0 24 F9 BL #+0xA248
J-Link>s
00016348: 07 49 LDR R1, [PC, #+0x1C]
J-Link>s
0001634A: 10 31 ADDS R1, #16
J-Link>s
0001634C: 0A 68 LDR R2, [R1]
J-Link>s
0001634E: D2 03 LSLS R2, R2, #15
J-Link>s
00016350: 06 D5 BPL #+0x0C
J-Link>s
00016352: 09 68 LDR R1, [R1]
J-Link>s
00016354: 01 F0 03 01 AND R1, R1, #0x03
J-Link>s
00016358: 81 42 CMP R1, R0
J-Link>s
0001635A: 01 D1 BNE #+0x02
J-Link>s
0001635C: 01 20 MOVS R0, #1
J-Link>s
0001635E: 70 47 BX LR
J-Link>s
0000C100: 00 28 CMP R0, #0
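
One way to read the single-step log above is as a polling loop, and the C model below is only a sketch of that reading, not SoftDevice source. The function names are made up for illustration. The return value on the two branch-taken paths (BPL and BNE, both targeting code that the log never reaches) is assumed to be 0. With R5 = 40000000 in the register dump, the load at 0000C0F4 reads address 0x4000010C; in the nRF52 register map that offset in the CLOCK block appears to be EVENTS_DONE (calibration finished), which is worth checking against the Product Specification. The helper's bit-16 test plus two-bit mask resembles a CLOCK status register such as LFCLKSTAT, but the address it loads is not visible in the log, so that part is a guess.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: models the helper at 0x00016348..0x0001635E.
 * status   = the 32-bit word the helper loads (its address is not in the log)
 * expected = the value passed in R0 (0 in the log) */
static uint32_t helper_16348(uint32_t status, uint32_t expected)
{
    /* LSLS R2, R2, #15 ; BPL : branch away when bit 16 of status is clear */
    if ((status & (1u << 16)) == 0u) {
        return 0u; /* ASSUMPTION: branch target is not shown in the log */
    }
    /* AND R1, R1, #0x03 ; CMP R1, R0 ; BNE */
    if ((status & 0x3u) != expected) {
        return 0u; /* ASSUMPTION: branch target is not shown in the log */
    }
    return 1u; /* MOVS R0, #1 ; BX LR */
}

/* One pass of the loop at 0x0000C0F4..0x0000C102.
 * Returns true when the code branches back to 0x0000C0F4. */
static bool loop_spins_again(uint32_t event_flag, uint32_t status)
{
    if (event_flag != 0u) {
        return false; /* CBNZ R0 : leaves the loop */
    }
    /* MOVS R0, #0 ; BL helper ; CMP R0, #0 ; BNE back to 0xC0F4 */
    return helper_16348(status, 0u) != 0u;
}

/* Counts passes before the loop exits; returns n if it never exits
 * within the n sampled values of the event flag. */
static size_t passes_until_exit(const uint32_t *event_samples, size_t n,
                                uint32_t status)
{
    assert(event_samples != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!loop_spins_again(event_samples[i], status)) {
            return i;
        }
    }
    return n;
}
```

Under this model the loop can only end when the event flag at R5 + 0x10C becomes non-zero or the helper stops returning 1. In the log the helper returns 1 on every pass while the event flag stays 0, which matches the endless loop.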

  • Thanks for your reply, Vidar. I will wait for input from the SoftDevice team.

    Alankar

  • Hi,

    Is it possible to test DFU with nRF Connect on Android or iOS as well? It would be interesting to know if it leads to the same problem.

    Thanks,

    Vidar

  • A watchdog is always the best way to recover from a stuck situation, as that is what it is meant for. But it would be helpful if you could find a solution for this in the SoftDevice itself. Please keep us posted.

    Also, one more question related to the watchdog: if we enable the watchdog in our application, do we need to modify the bootloader, or will it detect that by itself and feed it? Also, what is the minimum watchdog timeout that needs to be configured in the application so that bootloader operations work normally?

    We have not yet reached the stage of development where we introduce a watchdog in the application, but we will surely give it a shot once you answer the above queries about enabling the watchdog in application code.

    Thank you very much for quick support.

    Alankar

  • Yes, the issue is being investigated internally and I will keep you posted on the progress.

    Also, one more question related to the watchdog: if we enable the watchdog in our application, do we need to modify the bootloader, or will it detect that by itself and feed it?

    The bootloader will detect if the WD is enabled and keep feeding it. No minimum timeout is specified. How short do you want it? I am not sure it makes sense to have it any shorter than 1 second.
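
For reference when picking a timeout: the nRF52 watchdog counts down from the CRV register at 32.768 kHz, and the Product Specification gives the timeout as (CRV + 1) / 32768 seconds. The helpers below are a minimal sketch of that arithmetic only. The function names are made up for illustration and are not SDK APIs, and the code does not touch any hardware.

```c
#include <assert.h>
#include <stdint.h>

#define WDT_CLOCK_HZ 32768u /* WDT runs from the 32.768 kHz low-frequency clock */

/* Hypothetical helper: CRV register value for a timeout in milliseconds,
 * from timeout [s] = (CRV + 1) / 32768. */
static uint32_t wdt_crv_from_ms(uint32_t timeout_ms)
{
    assert(timeout_ms > 0u);
    uint64_t ticks = ((uint64_t)timeout_ms * WDT_CLOCK_HZ) / 1000u;
    assert(ticks >= 1u && ticks - 1u <= UINT32_MAX);
    return (uint32_t)(ticks - 1u);
}

/* Hypothetical helper: timeout in milliseconds for a given CRV value. */
static uint32_t wdt_ms_from_crv(uint32_t crv)
{
    return (uint32_t)((((uint64_t)crv + 1u) * 1000u) / WDT_CLOCK_HZ);
}
```

With these helpers, a 1 second timeout corresponds to CRV = 32767 and a 3 second timeout to CRV = 98303.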

  • Thank you, Vidar, we would really like to be updated on the progress...

    If the bootloader detects it by itself (which I assumed/read, just wanted to confirm), then that's great. I will add the WDT to our application code and try it out sometime next week. We do not need it to be shorter than one second; 3 seconds seems the right value at this point, and even if we change it, I don't think it will be less than 3 seconds. I will try this out and update you on the results.

    Adding one more question to the conversation...

    We are also using an nRF52832 (battery powered) in the same system, which works in peripheral mode (talks to the nRF52833 central) and uses the S132 SoftDevice. OTA update for this nRF52832 is also implemented on the ESP32 in the same way as for the nRF52833, but we have not observed this stuck issue on it yet. Is that because of the S132 or the nRF52832, or is the issue present there as well but just not observed yet?

    If you can guide me on the nRF52832 as well, that will be helpful, because the nRF52832 runs on battery and there is no chance of a power cycle. If it gets stuck, the system will not send any data until the battery drains and we put it on charge again. We will eventually implement the WDT there too, but if you can confirm whether this issue can exist there as well, we will prioritize the WDT additions.

    Appreciate your help...

    Thanks

    Alankar

  • Hi Alankar,

    We have not been able to replicate the issue with s140_nrf52_7.0.1 yet. As mentioned earlier, we had one case where we observed this behavior in the past, but it turns out it only occurred during internal testing where we didn't use the stock SoftDevice configuration. You seem to be the first customer to have actually experienced it.

    (1) Are you able to estimate how often this problem occurs? Is it like 1 in every 10 DFU attempts, or even less than that? (2) Have you tested this on multiple boards? (3) Would it require much effort to test the same on an nRF52833 DK?

    Thanks,

    Vidar...

    Alankar said:
    We will eventually implement WDT in that too, but if you can confirm if this issue can exist there as well, we will take the WDT additions on priority.

    We should be able to answer this when we have confirmed a root cause.

  • Hi Vidar, 

    To answer your questions:

    (1) Are you able to estimate how often this problem occurs?

    Ans: We have around 23 to 25 devices installed in our test lab, and every time we do OTA, some of the devices show this issue. It is not device specific; any device can show it. It also happens that once OTA starts working properly, it works well for 15-20 runs, but if you try again after some time, the issue reappears on some devices. So there is no specific pattern; it is completely random.

    (2) Have you tested this on multiple boards?

    Ans: As said above, we have tested this on around 23-25 devices.

    (3) Would it require much effort to test the same on an nRF52833 DK?

    Ans: I think so. Our product also has an ESP32, which communicates with the nRF52833 over UART for the initial handshake, so testing on a DK board would need both hardware assembly and code porting effort.

    Do you think that this could be a hardware issue?

    For the nRF52832, as we have not seen the issue on it yet, we are not treating WDT development as a first priority, but we have added it to the to-do list. If you can confirm the root cause and your suggestion by then, we will reconsider the priorities.

    Thanks for your help.
