Failure in ble_stack_init during startup

We are trying to bring up a new run of an existing PCB design. No changes in the area of the processor, but a different assembly house. The first call inside ble_stack_init, a call to nrf_sdh_enable_request(), fails and returns an error code of 8. No reference anywhere gives a usable explanation.

Strangely, if started by "copying" the hex file onto the Jlink "drive" (the nRF52832 DevKit), it succeeds.

What should we be looking for?

Parents
  • Hello,

    It seems like the issue might be with how the firmware is programmed since it appears to work when you use drag&drop programming. How are you programming the device when it fails? Is the same error also returned after a reset? Note that the "SoftDevice enable" function may return NRF_ERROR_INVALID_STATE (8) if the debugger forces execution to start from the application start address instead of address 0x0. This prevents the softdevice's reset handler from running on startup.

    Best regards,

    Vidar

  • I was starting it from SES (Segger) just as I have on dozens of previous nRF52832 projects, and indeed successfully on previous runs of this board. So there's something different in the hardware, and I'm asking where to look. What sort of "invalid state" is likely?

  • In nrf_sdh_enable_request(), it's resetting in this critical region:

        CRITICAL_REGION_ENTER();
    #ifdef ANT_LICENSE_KEY
        ret_code = sd_softdevice_enable(&clock_lf_cfg, app_error_fault_handler, ANT_LICENSE_KEY);
    #else
        ret_code = sd_softdevice_enable(&clock_lf_cfg, app_error_fault_handler);
    #endif
        m_nrf_sdh_enabled = (ret_code == NRF_SUCCESS);
        CRITICAL_REGION_EXIT();

    According to the highlighting by SES, ANT_LICENSE_KEY is not defined, so it's resetting semoewhere in sd_softdevice_enable, but apparently without getting to app_error_fault_handler. I have no visibility into the softdevice, so please wave your magic wand over it and tell me hwat I'm doing wrong.

  • The limitation when debugging with the softdevice is that you generally cannot continue execution after halting the CPU (for example, when hitting a breakpoint), as the softdevice will detect that it has failed to meet its timing requirements and trigger an error (app_error_fault_handler() is called with id=NRF_FAULT_ID_SD_ASSERT). But I don't see why this should be a problem here.

    In any case, based on what you said earlier, the problem is not reproducible in debug mode. So you would need to use nrfjprog --memrd <register address> from the command line or similar approach to read the register from the running target after the problem has occurred. The challenge with this here is that you may end up reading the register after it has already been cleared by your FW. So I suggest commenting the NRF_POWER -> RESETREAS = ResetReason; line first, then see if it still reads as 0 when it is going in the boot loop. 

    n nrf_sdh_enable_request(), it's resetting in this critical region:

    There is nothing that can lead to NVIC_SystemReset() being called within this function. You can search through your project for NVIC_SystemReset() to see if is called in any other places than the error handler to verify.

    I added the code you suggested above, and never got it to output (never reached).

    Does this mean that you have redefined the  app_error_fault_handler() in your project? The picture I posted is showing the default handler used in the SDK examples.

    I will be signing off now (I'm based in Norway, so I'm a few hours ahead), but I hope you will be able to gather some more concrete debugging evidence of what is happening.

  • The debugger is so cumbersome as to be completely useless with a softdevice. Hence I have devised my own debugging methods. I have nonvolatile memory available, and a mechanism whereby to record any events of interest, even in a failure to start, such that subsequently getting a successful startup allows me to read the whole history. Very flexible, with recording of whatever plain text is most illuminating for the problem being diagnosed, and it works quite well.

    I have added code within app_error_fault_handler() to record all the details if that handler is called. I just now ran another complete test:

    1) Boot properly via drag&drop, run erase command on NV memory.
    2) Boot via SES F5, observe LED showing repeated resets.
    3) Boot properly via drag&drop, download saved journal.

    I could send the full journa/, but probably more useful to summarize.

    1st boot (1 above) is shown properly.

    2nd attempt, using F5:
       Starts up, gets thru services_init() in main(), then crashes somewhere in advertising_init() with the following message:

       Fault 00004001 00000000 2000FE8C ERROR 4 [NRF_ERROR_NO_MEM] at (null):0
       PC at: 0x00000000

    The above message was formatted by these two lines (in the fault handler). Infotext does just what's expected; these two sprintf's are physically in different routines:

        sprintf (InfoText, "ERROR %u [%s] at %s:%u\r\nPC at: 0x%08x",
                              p_info->err_code,
                              nrf_strerror_get(p_info->err_code),
                              p_info->p_file_name,
                              p_info->line_num,
                              pc);
        sprintf (Str, "\r\nFault %08X %08X %08X %s\r\n", id, pc, info, InfoTxt);

    Thus the values are:

    id - 00004001
    pc - 00000000 (?!?!?)
    info - 2000FE8C
    err_code - 4
    err_string - NRF_ERROR_NO_MEM
    file_name - null
    line_num - 0
    pc - 00000000

    So clearly something went off the rails badly.

    Next the system reboots, and gets as far as calling nrf_sdh_enable_request() from within ble_stack_init(). This drills down to nrf_sdh_enable_request(), which runs to a print just prior to the CRITICAL_REGION, and then reboots from within the CRITICAL_REGION. The only thing called from within there is sd_softdevice_enable(), so the reboot is happening from inside the softdevice. Recall that this is after the above no-memory error.

    From here, it repeatedly reboots and crashes in the CRITICAL_REGION until I intervene and cause a correct reboot via drag&drop in order to read out the record.

    Are you as baffled as me yet?!?

  • Bottom line is:

    The system starts up fine from either loading firmware via drag&drop or from a programmed NVIC_SystemReset. If started from a powerup or from SES pressing F5, it resets somewhere inside the CRITICAL_REGION within nrf_sdh_enable_request(). The only code that could do that is in either CRITICAL_REGION_ENTER(), sd_softdevice_enable(), or CRITICAL_REGION_EXIT(). Those routines tinker heavily with various interrupt flags. Is it possible that some kind of timing or race condition could be causing this?

  • Some more digging:

    Per the above, the first time in from a startup via either power cycle or SES F5, both nrf_sdh_enable_request() and nrf_sdh_ble_enable() return an error code of 8, and sometime after, the error handler gets called with the report above - no pc, nothing to indicate from what it was called.

    I have also added probes that show that the reset is coming from the call to sd_softdevice_enable(), but that is probably just a symptom of the soft device not being properly started above. Once it happens, though, I get continual resets, entering nrf_sdh_enable_request() and crashing with a reset in sd_softdevice_enable() every time.

    I'll be asleep when you come in to work this morning (Monday) as we're 6 hours behind you. But I should be online by about noon your time.

Reply
  • Some more digging:

    Per the above, the first time in from a startup via either power cycle or SES F5, both nrf_sdh_enable_request() and nrf_sdh_ble_enable() return an error code of 8, and sometime after, the error handler gets called with the report above - no pc, nothing to indicate from what it was called.

    I have also added probes that show that the reset is coming from the call to sd_softdevice_enable(), but that is probably just a symptom of the soft device not being properly started above. Once it happens, though, I get continual resets, entering nrf_sdh_enable_request() and crashing with a reset in sd_softdevice_enable() every time.

    I'll be asleep when you come in to work this morning (Monday) as we're 6 hours behind you. But I should be online by about noon your time.

Children
  • The inconsistencies are making it difficult to troubleshoot this issue. For example, earlier you indicated that the problem did not occur while debugging:

    SteveHx said:
    If run with Build and Debug, it runs fine

    But now it fails advertising_init() with NRF_ERROR_NO_MEM, or in sd_softdevice_enable() with NRF_ERROR_INVALID_STATE.

    SteveHx said:
    I have also added probes that show that the reset is coming from the call to sd_softdevice_enable(), but that is probably just a symptom of the soft device not being properly started above

    The softdevice call can return an error code, but it cannot trigger a system reset. Assuming this function is only called once during startup, the only reason it can return NRF_ERROR_INVALID_STATE is that the softdevice's reset handler was not executed during the boot sequence due to the debugger.

    Please confirm that you have the following debugger setting in your project configuration 

    Just to illustrate, here is what it looks like if I enable this setting on my side:

    Crash log if I let it run to error handler:

    If I place a breakpoint at the line number reported by the crashlog:

    And when I step into the nrf_sdh_enable_request() function:

    SteveHx said:
    The system starts up fine from either loading firmware via drag&drop or from a programmed NVIC_SystemReset

    I assume you are calling NVIC_SystemReset() from your error handler, since the device is entering a boot loop. In that case, any runtime error should have causes the device to recover from this state.

  • Nothing inconsistent except the assumption that by F5 I mean Build&Debug. F5 on SES is Build&Run. To summarize:

    If run via drag&drop firmware loading, it runs fine for hours.

    If run via F5 (Build&Run), or if run from a power cycle, it resets continuously.

    When I trigger an erase of my external NV memory via a BLE command, the last step is to do NVIC_SystemReset, and that reset results in a successful run.

    If I could get it to run from power-up, I'd be done.

    Running identical firmware on a previous rev board does not result in these resets, so clearly there's some hardware difference. Is it possible that a reset could be caused by the 32 KHz oscillator not yet being up and stable at the call to start up BLE? I'm well aware that the osc is sensitive to loading, so directly probing with my scope is not conclusive. How best to determine if the 32 KHz osc is running, via some section of the chip I can access in my code?

  • Probing the idea of the LF clock further. On a board that starts up fine from F5 or power up, I sample LFCLKSTAT early in the startup - all zeroes. I attempt to start it with

    NRF_CLOCK -> TASKS_LFCLKSTOP;
    NRF_CLOCK -> LFCLKSRC = 1;
    NRF_CLOCK -> TASKS_LFCLKSTART;

    And then test LFCLKSTAT up to 1 million times, always zero. On the good board, proceeding thru the BLE initialization and then testing LFCLKSTAT yields the expected 00010001. So I seem to have added another mystery: Why can't I start the clock? And the fact that it hasn't started prior to calling ble_stack_init() doesn't seem to be a problem. But if ble_stack_init() can't get it started, that could be the source of the hangups / resets on the new rev board.

    Am I at all close here?

  • F5 has always been assigned to "Debug->Go" on my setups (linux, macOS, and Windows with SES v5.x), hence my assumption that you were starting a debug session. "Debug->Go" is equivalent to "Build -> Build&Debug". Either way, thanks for clearing this up.

    I would say the expected failure mode for an LF clock issue is either that the program hangs forever inside sd_softdevice_enable() because the oscillator fails to start, or BLE connectivity issues due to frequency drift. What it cannot do is cause the invalid state error you are seeing.

    I'm still convinced that the issue is that the softdevice's reset handler was skipped during boot. Again, this is expected and a known limitation when using build&run. I do not know why you have not experienced this before with other boards, but the problem can easily get masked as I mentioned earlier. If you want to test for this hypothesis you can add the code below temporary to your project.

    #define RESET_HANDLER_ADDR (*(volatile uint32_t *) 0x4)
    #define SOFTDEVICE_RESET_HANDLER_EXECUTED 0xAA
    
    typedef void (*reset_handler_t)(void);
    
    static void run_reset_handler_once(void)
    {
        if (NRF_POWER->GPREGRET == SOFTDEVICE_RESET_HANDLER_EXECUTED)
        {
            /* Already ran fon af previous reset — skip */
            NRF_POWER->GPREGRET = 0xFF;
            return;
        }
    
        /* Mark it before jumping, in case reset handler doesn't return */
        NRF_POWER->GPREGRET = SOFTDEVICE_RESET_HANDLER_EXECUTED;
    
        ((reset_handler_t)(RESET_HANDLER_ADDR))();
    }
    
    /**@brief Function for application main entry.
     */
    int main(void)
    {
        /* For debugging purposes - manually invoke the MBR->Softdevice's reset handler */
        run_reset_handler_once();
        

  • I'm not sure that would cause the continuous resets I'm seeing, but that's just the code I needed to test for a proper reset. Will run and get back to you shortly.

Related