Failure in ble_stack_init during startup

We are trying to bring up a new run of an existing PCB design. No changes in the area of the processor, but a different assembly house. The first call inside ble_stack_init, a call to nrf_sdh_enable_request(), fails and returns an error code of 8. No reference anywhere gives a usable explanation.

Strangely, if started by "copying" the hex file onto the J-Link "drive" (the nRF52832 DevKit), it succeeds.

What should we be looking for?

  • Hello,

    It seems like the issue might be with how the firmware is programmed, since it appears to work when you use drag&drop programming. How are you programming the device when it fails? Is the same error also returned after a reset? Note that the "SoftDevice enable" function may return NRF_ERROR_INVALID_STATE (8) if the debugger forces execution to start from the application start address instead of address 0x0. This prevents the SoftDevice's reset handler from running on startup.

    Best regards,

    Vidar
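
For readers who hit a bare numeric return code like the 8 above, a small host-side lookup can help. This is a minimal sketch in Python, not part of the SDK: the table reflects the generic error codes as I recall them from the nRF5 SDK header nrf_error.h, so check it against the header in your own SDK version. The helper name describe_nrf_error is hypothetical.

```python
# Generic nRF error codes, as recalled from nrf_error.h in the nRF5 SDK.
# Assumption: verify these values against the header shipped with your SDK.
NRF_ERROR_NAMES = {
    0: "NRF_SUCCESS",
    1: "NRF_ERROR_SVC_HANDLER_MISSING",
    2: "NRF_ERROR_SOFTDEVICE_NOT_ENABLED",
    3: "NRF_ERROR_INTERNAL",
    4: "NRF_ERROR_NO_MEM",
    5: "NRF_ERROR_NOT_FOUND",
    6: "NRF_ERROR_NOT_SUPPORTED",
    7: "NRF_ERROR_INVALID_PARAM",
    8: "NRF_ERROR_INVALID_STATE",
    9: "NRF_ERROR_INVALID_LENGTH",
    10: "NRF_ERROR_INVALID_FLAGS",
    11: "NRF_ERROR_INVALID_DATA",
    12: "NRF_ERROR_DATA_SIZE",
    13: "NRF_ERROR_TIMEOUT",
    14: "NRF_ERROR_NULL",
    15: "NRF_ERROR_FORBIDDEN",
    16: "NRF_ERROR_INVALID_ADDR",
    17: "NRF_ERROR_BUSY",
}


def describe_nrf_error(code: int) -> str:
    """Return the symbolic name for a generic nRF error code (hypothetical helper)."""
    return NRF_ERROR_NAMES.get(code, f"unknown (0x{code:X})")


# The code reported in this thread:
print(describe_nrf_error(8))
```

The name alone does not say which state is invalid. It only confirms that the call was rejected because of the current state, not because of a bad parameter.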

  • I was starting it from SES (Segger) just as I have on dozens of previous nRF52832 projects, and indeed successfully on previous runs of this board. So there's something different in the hardware, and I'm asking where to look. What sort of "invalid state" is likely?

  • Please confirm whether you changed your code so that it is no longer clearing the RESETREAS register. This step is essential for this test. The second line there is the serial number of the debugger and is not a valid memory address.
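
The reason for leaving RESETREAS untouched is that the register accumulates reset causes until the application clears it, so a value read after the failure tells you which kind of reset occurred. Below is a minimal Python sketch for decoding a value read back from the chip. The bit positions are as I recall them from the POWER chapter of the nRF52832 Product Specification and should be checked there. The function name decode_resetreas is hypothetical.

```python
# RESETREAS bit positions for the nRF52832, as recalled from the Product
# Specification. Assumption: confirm against the datasheet before relying on it.
RESETREAS_BITS = {
    0: "RESETPIN",   # reset from pin-reset
    1: "DOG",        # reset from watchdog
    2: "SREQ",       # reset from soft reset (e.g. NVIC_SystemReset)
    3: "LOCKUP",     # reset from CPU lock-up
    16: "OFF",       # wake-up from System OFF by GPIO DETECT
    17: "LPCOMP",    # wake-up from System OFF by LPCOMP
    18: "DIF",       # wake-up from System OFF by debug interface
    19: "NFC",       # wake-up from System OFF by NFC field
}


def decode_resetreas(value: int) -> list:
    """List the reset causes flagged in a raw RESETREAS value (hypothetical helper)."""
    flags = [name for bit, name in RESETREAS_BITS.items() if value & (1 << bit)]
    # With no flag set, the reset came from the on-chip reset generator,
    # which means a power-on reset or a brownout reset.
    return flags or ["POR/BOR (no flags set)"]


print(decode_resetreas(0x00000004))
print(decode_resetreas(0x00000000))
```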

  • Yes, and I deleted all extraneous code prior to the BLE initialization. It still crashes in the same place and resets.

  • I find that the Beacon example (hex files) runs fine both from drag&drop and from power up. Thus I looked at the source code in main.c for that example, and tried to make my startup identical. It was a struggle, with unresolved references, but by including a bunch of extraneous (to our application) headers, and ble_lbs.c, I was able to get it to work - sort of.

    The main additions were log_init(), leds_init(), and buttons_init(), none of which are relevant to our application. If I comment out the call to buttons_init(), it runs exactly as before - just fine from drag&drop, hangs in ble_stack_init() from power up. With the call to buttons_init() included, ble_stack_init() trips two errors. The calls to both nrf_sdh_enable_request() and nrf_sdh_ble_enable() return an error code of 8. I have bypassed the APP_ERROR_CHECK calls because they do nothing useful for me, so it's not surprising that the second call above would return the same error code as the first.

    So the two bottom line questions:

    1) What is different (and why) in the reset process from drag&drop vs. powerup?

    2) What is different between the E1 and E0 revs of the nRF52832QFAA chips?

  • Further research: The vector table at 0x00000000 looks like this:

    20000400 00000A81 00000715 00000A61

    How can the reset start address not be a multiple of 4? If I understand correctly, all instruction words are dword aligned.

    Clearly the drag&drop downloader sets up something that the normal reset code does not, and whatever that is, is cleared by a power cycle, only to be reset by another drag&drop, or by starting the debugger. And it's something the E1 version hard reset sets differently than the E0 version.

    Also, studying the .map file leaves me even more confused. There appears to be a vector table at 0x0000000000026200, but the entries are only 2 bytes each. Good trick, if those are 32-bit addresses. This appears to not be the vector table that goes at 0x00000000, because it has things like NMI_Handler, HardFault_Handler, etc. Our code appears to load just above this table, because I see names of my routines here. They are word-aligned, refuting my assertion above that instructions are 32 bits; that makes more sense. Still, that "reset" address of A81 does not make sense.

    I also see NVIC_SystemReset at 0x0000000000029208, and the system boots correctly via that routine. However, whatever was set up in drag&drop has already been done and remains in effect; again, the reset code is missing something important, but what? Also interesting is that there are multiple instances of NVIC_SystemReset, apparently because that's actually a macro, parsed in different files. The one that I've been calling from my code (after an erase operation on NV memory) lives at 0x000000000002ADB4.

    Is there any other useful information to be gleaned from the .map file? How about answers to question 1 & 2 in my previous post?

  • The debug information is again too inconsistent to start speculating on any root cause. Having the code hang "somewhere" or fail with a logic error is an entirely different symptom from the whole system suddenly resetting with a POR/BOR while executing a busy loop. And the latter seems very unlikely for several reasons, so I still question if this observation can be accurate. I suggest we take a step back and try to focus more on the debugging instead. If you want to verify what was programmed onto the chip, you can read back the flash contents and compare the content. 

    # Read the entire FLASH of the chip
    nrfutil device read --address 0 --bytes 0x80000 > flash_content.txt

    You can also experiment with resetting the device from nrfutil

    nrfutil device reset --reset-kind <RESET_SYSTEM or RESET_PIN>

    And use the "cpu-register-read" command if you want to connect to a running device and find out where the program counter (PC) is:

    nrfutil device cpu-register-read

    Briefly about the vector table, first word is always the initial stack pointer, second is the reset handler address + the arm thumb bit. The build code differences were answered already.
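
To make the vector table explanation concrete, here is a minimal Python sketch that decodes the four words quoted earlier in the thread (20000400 00000A81 00000715 00000A61). It assumes the standard Cortex-M layout, where the words after the initial stack pointer are the Reset, NMI, and HardFault handlers, and it treats bit 0 of each handler entry as the Thumb bit. The function name decode_vector_table is hypothetical.

```python
def decode_vector_table(words):
    """Split the first four Cortex-M vector table words into SP and handlers.

    Hypothetical helper: bit 0 of each handler entry is the Thumb bit, so the
    actual code address is the entry with that bit cleared.
    """
    decoded = {"initial_sp": words[0]}
    for name, word in zip(("reset", "nmi", "hardfault"), words[1:4]):
        decoded[name] = {"address": word & ~1, "thumb": bool(word & 1)}
    return decoded


# Words as dumped from address 0x00000000 in this thread.
words = [int(w, 16) for w in "20000400 00000A81 00000715 00000A61".split()]
table = decode_vector_table(words)

print(f"initial SP : 0x{table['initial_sp']:08X}")
for name in ("reset", "nmi", "hardfault"):
    entry = table[name]
    print(f"{name:<10} : 0x{entry['address']:08X} (thumb={entry['thumb']})")
```

Read this way, the odd-looking entry 0x00000A81 is a handler at the even address 0x00000A80 with the Thumb bit set, which is what a Cortex-M vector table is expected to contain.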

  • The debug information is again too inconsistent to start speculating on any root cause. Having the code hang "somewhere" or fail with a logic error is an entirely different symptom from the whole system suddenly resetting with a POR/BOR while executing a busy loop.

    Apparently I was unclear. I have never suspected a reset in a busy loop. I did at one point try a busy loop to shift the timing, to eliminate the possibility of some kind of watchdog reset. The resets have always come from the same point within the softdevice, in sd_softdevice_enable().

    read back the flash contents and compare the content. 

    A quick look at the captured data shows nothing amiss. I can see the vector table, identical to what my code reported, as well as constant strings that are created within my source code. I also programmed via drag&drop, captured the contents, then programmed via F5 and captured the contents. A file compare utility declared the two files to be identical, so it's apparently not a problem with the programming, but a difference in the way the reset is handled - some register within the chip being set differently?
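
If a plain "identical / not identical" verdict is not enough, a comparison that reports where two images differ can narrow things down (for example, to UICR or to a settings page). The following is a minimal Python sketch, assuming both dumps have already been converted to raw bytes. The text format written by the nrfutil read command is not parsed here, and the function name differing_ranges is hypothetical.

```python
def differing_ranges(a: bytes, b: bytes):
    """Return (start, end) offset pairs, end exclusive, where two images differ.

    Hypothetical helper: a length mismatch is reported as one extra range
    covering the tail of the longer image.
    """
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i          # a differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        ranges.append((common, max(len(a), len(b))))
    return ranges


# Small made-up example: bytes at offsets 1 and 2 differ.
image_dragdrop = bytes([0x00, 0x04, 0x00, 0x20])
image_debugger = bytes([0x00, 0xFF, 0xFF, 0x20])
for start, end in differing_ranges(image_dragdrop, image_debugger):
    print(f"differs at 0x{start:08X}..0x{end - 1:08X}")
```

An empty result supports the conclusion above that the flash contents are the same for both programming methods.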

    You can also experiment with resetting the device from nrfutil

    Neither of those resets appears to recover the system. That's consistent with the fact that a power cycle does not recover it.

    use the "cpu-register-read" command

    This shows it to be within my code that reports to external nonvolatile memory, which is consistent with what I see. The external memory is relatively slow, so it makes sense that it gets caught most often within that code, as the system loops and repeats, recording every step of the way (well, at least the big steps).

    Briefly about the vector table, first word is always the initial stack pointer, second is the reset handler address + the arm thumb bit.

    Aha, here's something new. First I've heard of the "arm thumb bit". A bit of reading suggests that the LSB of the reset handler address, being unused since instructions always begin on an even address, is used to flag the 16-bit instruction set. Did I interpret that correctly?

    The build code differences were answered already.

    I'm unclear what this refers to.

    Thank you very much for the pointers on the use of nrfutil. That's a very powerful tool that I need to get better acquainted with.

    But in the meantime, my client is approaching panic mode because he can't ship product and I'm still spinning my wheels on this problem. Back to the ballgame....

    1) What is different (and why) in the reset process from drag&drop vs. powerup?

    2) What is different between the E1 and E0 revs of the nRF52832QFAA chips?

    I have to leave the office now for a dentist appointment. Back in an hour or so, to receive the magic wand you're going to hand me, to solve this whole thing! <grin>
