Failure in ble_stack_init during startup

SteveHx 20 days ago

We are trying to bring up a new run of an existing PCB design. No changes in the area of the processor, but a different assembly house. The first call inside ble_stack_init, a call to nrf_sdh_enable_request(), fails and returns an error code of 8. No reference anywhere gives a usable explanation.

Strangely, if started by "copying" the hex file onto the Jlink "drive" (the nRF52832 DevKit), it succeeds.

What should we be looking for?

Top Replies

Turbo J 19 days ago in reply to SteveHx +1

You need to flash prerequisites (like SoftDevice and Bootloader) only once before Build&Run starts working. Edit: Disregard, I confused that with different tooling from another IDE. Below part could…

Vidar Berg 20 days ago

Hello,

It seems like the issue might be with how the firmware is programmed since it appears to work when you use drag&drop programming. How are you programming the device when it fails? Is the same error also returned after a reset? Note that the "SoftDevice enable" function may return NRF_ERROR_INVALID_STATE (8) if the debugger forces execution to start from the application start address instead of address 0x0. This prevents the softdevice's reset handler from running on startup.
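For reference, a minimal Python sketch of a lookup for the generic error codes, assuming the numbering used in the SDK's nrf_error.h (base number 0). The table and the helper name nrf_error_name are illustrative only and are not part of any Nordic tool.

```python
# Generic nRF error codes, assumed to match nrf_error.h in the nRF5 SDK.
NRF_ERRORS = {
    0: "NRF_SUCCESS",
    1: "NRF_ERROR_SVC_HANDLER_MISSING",
    2: "NRF_ERROR_SOFTDEVICE_NOT_ENABLED",
    3: "NRF_ERROR_INTERNAL",
    4: "NRF_ERROR_NO_MEM",
    5: "NRF_ERROR_NOT_FOUND",
    6: "NRF_ERROR_NOT_SUPPORTED",
    7: "NRF_ERROR_INVALID_PARAM",
    8: "NRF_ERROR_INVALID_STATE",
    9: "NRF_ERROR_INVALID_LENGTH",
    10: "NRF_ERROR_INVALID_FLAGS",
    11: "NRF_ERROR_INVALID_DATA",
    12: "NRF_ERROR_DATA_SIZE",
    13: "NRF_ERROR_TIMEOUT",
    14: "NRF_ERROR_NULL",
    15: "NRF_ERROR_FORBIDDEN",
    16: "NRF_ERROR_INVALID_ADDR",
    17: "NRF_ERROR_BUSY",
}


def nrf_error_name(code: int) -> str:
    """Hypothetical helper: map a numeric return value to its symbolic name."""
    return NRF_ERRORS.get(code, f"unknown error code {code}")


print(nrf_error_name(8))
```

Logging the symbolic name next to the raw number makes a return value such as 8 easier to search for in the SDK headers.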

Best regards,

Vidar

SteveHx 20 days ago in reply to Vidar Berg

I was starting it from SES (Segger) just as I have on dozens of previous nRF52832 projects, and indeed successfully on previous runs of this board. So there's something different in the hardware, and I'm asking where to look. What sort of "invalid state" is likely?

Vidar Berg 1 day ago in reply to SteveHx

The debug information is again too inconsistent to start speculating on any root cause. Having the code hang "somewhere" or fail with a logic error is an entirely different symptom from the whole system suddenly resetting with a POR/BOR while executing a busy loop. And the latter seems very unlikely for several reasons, so I still question if this observation can be accurate. I suggest we take a step back and try to focus more on the debugging instead. If you want to verify what was programmed onto the chip, you can read back the flash contents and compare the content.

# Read the entire flash of the chip
nrfutil device read --address 0 --bytes 0x80000 > flash_content.txt

You can also experiment with resetting the device from nrfutil:

nrfutil device reset --reset-kind <RESET_SYSTEM or RESET_PIN>

And use the "cpu-register-read" command if you want to connect to a running device and find out where the program counter (PC) is:

nrfutil device cpu-register-read

Briefly, about the vector table: the first word is always the initial stack pointer, and the second is the reset handler address with the Arm Thumb bit (bit 0) set. The build code differences were answered already.
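To make the vector table layout concrete, here is a minimal Python sketch that decodes the first two little-endian words of a raw image. The function name decode_vector_table and the example values are made up for illustration and are not taken from the actual firmware in this thread.

```python
import struct


def decode_vector_table(image: bytes) -> dict:
    """Decode the first two 32-bit words of a Cortex-M vector table."""
    initial_sp, reset_vector = struct.unpack_from("<II", image, 0)
    return {
        "initial_sp": initial_sp,
        # Clear bit 0 to get the real address of the first instruction.
        "reset_handler": reset_vector & ~1,
        # Bit 0 is the Thumb bit; it is expected to be 1 on Cortex-M.
        "thumb_bit": reset_vector & 1,
    }


# Hypothetical example words, not read from a real device.
example = struct.pack("<II", 0x20010000, 0x000002C1)
info = decode_vector_table(example)
print(hex(info["initial_sp"]), hex(info["reset_handler"]), info["thumb_bit"])
```

A stored reset vector with bit 0 cleared would be a sign that the image is not a valid Cortex-M vector table.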

SteveHx 23 hours ago in reply to Vidar Berg

Vidar Berg said:
The debug information is again too inconsistent to start speculating on any root cause. Having the code hang "somewhere" or fail with a logic error is an entirely different symptom from the whole system suddenly resetting with a POR/BOR while executing a busy loop.

Apparently I was unclear. I have never suspected a reset in a busy loop. I did at one point try a busy loop to shift the timing, to eliminate the possibility of some kind of watchdog reset. The resets have always come from the same point within the softdevice, in sd_softdevice_enable().

Vidar Berg said:
read back the flash contents and compare the content.

A quick look at the captured data shows nothing amiss. I can see the vector table, identical to what my code reported, as well as constant strings that are created within my source code. I also programmed via drag&drop, captured the contents, then programmed via F5 and captured the contents. A file compare utility declared the two files to be identical, so it's apparently not a problem with the programming, but a difference in the way the reset is handled - some register within the chip being set differently?
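The comparison described above can also be scripted. This is a minimal Python sketch under the assumption that both dumps are saved as files. The function names and any file paths passed to them are hypothetical, and the byte-wise comparison works the same on text or binary dumps.

```python
def first_differences(a: bytes, b: bytes, limit: int = 8) -> list:
    """Return up to `limit` byte offsets at which the two dumps differ."""
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(a) != len(b):
        # A length mismatch is reported at the end of the shorter dump.
        diffs.append(min(len(a), len(b)))
    return diffs[:limit]


def compare_files(path_a: str, path_b: str) -> list:
    """Hypothetical wrapper: read two dump files and compare their bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return first_differences(fa.read(), fb.read())


# Small in-memory demonstration with made-up data.
print(first_differences(b"\x00\x01\x02\x03", b"\x00\x01\xff\x03"))
```

An empty list means the two dumps are byte-for-byte identical. That only covers the address range that was read back, so registers outside that range are not compared.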

Vidar Berg said:
You can also experiment with resetting the device from nrfutil

Neither of those resets appears to recover the system. That's consistent with the fact that a power cycle does not recover it.

Vidar Berg said:
use the "cpu-registers_read" command

This shows it to be within my code that reports to external nonvolatile memory, which is consistent with what I see. The external memory is relatively slow, so it makes sense that it gets caught most often within that code, as the system loops and repeats, recording every step of the way (well, at least the big steps).

Vidar Berg said:
Briefly about the vector table, first word is always the initial stack pointer, second is the reset handler address + the arm thumb bit.

Aha, here's something new. First I've heard of the "arm thumb bit". A bit of reading suggests that the LSB of the reset handler address, being unused since instructions always begin on an even address, is used to flag the 16-bit instruction set. Did I interpret that correctly?

Vidar Berg said:
The build code differences were answered already.

I'm unclear what this refers to.

Thank you very much for the pointers on the use of nrfutil. That's a very powerful tool that I need to get better acquainted with.

But in the meantime, my client is approaching panic mode because he can't ship product and I'm still spinning my wheels on this problem. Back to the ballgame....

SteveHx said:
1) What is different (and why) in the reset process from drag&drop vs. powerup?

2) What is different between the E1 and E0 revs of the nRF52832QFAA chips?

I have to leave the office now for a dentist appointment. Back in an hour or so, to receive the magic wand you're going to hand me, to solve this whole thing! <grin>

SteveHx 10 hours ago in reply to SteveHx

I tried every RESET_ listed for nrfutil device reset --reset-kind. SYSTEM, HARD, SOFT, DEBUG, PIN and DEFAULT do nothing; VIA_SECDOM returns an error message as would be expected.

In the course of testing this, I made an interesting discovery: With the device programmed, if I power-cycle it, it stays in that infinite reset loop. However, if I disconnect USB power from the devkit (which is still connected), the reset happens perfectly and the device comes up.

Danged if I can figure out why that works, but surely it points to something, even if only a band-aid, that we could use to get past this problem. The Reset/P0.21 line already has a 10K pullup to Vcc, just as it has on all past designs where we've used this processor. Hints??

Vidar Berg 4 hours ago in reply to SteveHx

SteveHx said:
Apparently I was unclear. I have never suspected a reset in a busy loop. I did at one point try a busy loop to shift the timing, to eliminate the possibility of some kind of watchdog reset. The resets have always come from the same point within the softdevice, in sd_softdevice_enable().

How can you be certain that the device is indeed going into a reset loop? Again, there is no code in the softdevice that can trigger a reset, and this also contradicts what you said here:

SteveHx said:
I have been experimenting with different delays near the beginning of main(), before calling any of the BLE initialization. With no delay, or with a null for() loop of 10000 cycles, a reset happens sometime after calling sd_softdevice_enable() and before it returns. If I increase the for() loop to 20000 or higher (up to 10000000), it seems to reset before the loop completes.

SteveHx said:
Vidar Berg said:
The build code differences were answered already.

I'm unclear what this refers to.

This is what I answered earlier:

Vidar Berg said:
Regarding the build code difference, please have a look at the PCN here: https://nrfconnectdocs.nordicsemi.com/pdf/PCN/pcn_106_v1.0.pdf. It's the same silicon. It's just that it's tested and assembled at different locations.

Turbo J 1 hour ago in reply to SteveHx

Try disabling the drive in the Segger J-Link Configuration.

I suspect that you have software on the PC that tries to write a hidden file to the "new" USB drive that pops up, which the debugger will then try to flash as firmware...

That would explain why the file method works: it overwrites the junk data in flash.

You should not need the drive at all, flashing can be done with nrfutil/nrfjprog or with the Segger tools directly.


SteveHx 18 minutes ago in reply to Turbo J

I'm running Windows 10, as I have from the very first time I used Nordic and the Segger tools, which have always worked fine. Highly unlikely that anything is writing to that virtual drive; I have scanned for malware within the past few days. Also, I did a complete dump of the flash both after programming via build&run, and after programming via drag&drop. A file compare utility declared the two files to be identical.

SteveHx 3 minutes ago in reply to SteveHx

Aha, a glimmer of daylight....

It appears that I need to investigate my power supply, even though it's identical to that on the previous rev. Please confirm that the following waveform would be causing resets. Steady at 1.8 V, rising to 3.35 V over the course of 9.5 ms, immediately falling to 2.2 V over 5.2 ms, then more steeply to the baseline steady 1.8 V over 600 µs. The cycle repeats every 37.8 ms, which seems reasonable for the reset loop I'm seeing.
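A quick arithmetic check of that waveform, as a minimal Python sketch. It uses only the voltages and durations quoted above; the segment labels and the function name summarize are made up for illustration. It works out how long each cycle spends away from the baseline and the average slope of each segment, and says nothing about what causes the excursion.

```python
# (label, start voltage in V, end voltage in V, duration in ms)
segments = [
    ("rise", 1.8, 3.35, 9.5),
    ("slow fall", 3.35, 2.2, 5.2),
    ("fast fall", 2.2, 1.8, 0.6),
]
period_ms = 37.8


def summarize(segments, period_ms):
    """Return (time off baseline, time at baseline, slopes in V/ms)."""
    active_ms = sum(duration for _, _, _, duration in segments)
    slopes = {name: (v1 - v0) / duration for name, v0, v1, duration in segments}
    return active_ms, period_ms - active_ms, slopes


active_ms, baseline_ms, slopes = summarize(segments, period_ms)
print(f"off baseline: {active_ms:.1f} ms, at baseline: {baseline_ms:.1f} ms")
for name, slope in slopes.items():
    print(f"{name}: {slope:+.3f} V/ms")
```

With these numbers the excursion lasts about 15.3 ms, which leaves about 22.5 ms per cycle at the 1.8 V baseline.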

I don't know how the supply can be doing that, but clearly it is. The previous rev does not show this behavior, despite being by design identical. Something must be different in the build, that is causing a power supply disruption as the softdevice starts up. But why would it behave differently when programmed via drag&drop?