Debugging a Hardfault

RE: Debugging a Hardfault

c cook — Mon, 30 May 2016 15:25:49 GMT

I have Joseph Yiu's book on order and hopefully that will help me better understand the clues that are available. Any additional suggestions would be greatly appreciated.

RE: Debugging a Hardfault

c cook — Mon, 30 May 2016 15:17:29 GMT

RK - Short answer: no. My timer function, which was not using the scheduler, is where the majority of my application executes, either in the handler itself or in calls made by the handler. I agree that pinpointing the exact cause is needed. With your help, I believe the hard fault is being generated while servicing SWI2. When I break on the hard fault, the call stack often lists tx_buffer_process(). If tx_buffer_process() is not listed, then it's usually my event handler for BLE events, which in my application calls tx_buffer_process() and makes multiple calls to sd_ble_gattc_write(). What's my next step to pinpoint the cause? I have code in place to check for null pointers, and I've tried to trace back to the offending calls, but I can't seem to pinpoint anything. Moving the timer to use the scheduler seems to have stopped the hard faults, which supports the SWI2 theory.

RE: Debugging a Hardfault

RK — Sat, 28 May 2016 00:34:18 GMT

But did you actually pinpoint what it was about the timer that was causing the hardfault? Timers are not themselves prone to causing them. If you weren't calling sd_* functions from the wrong interrupt level, that wasn't it. Did you actually work out what the problem was?
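For anyone wondering what "the wrong interrupt level" means in practice, here is a rough sketch of the rule. It assumes the nRF51 SoftDevice priority layout as I understand it (0 = SoftDevice high, 1 = application high, 2 = SVC and SoftDevice low, 3 = application low), so check those numbers against your SoftDevice specification. The enum and function names are made up for illustration and are not SDK APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED nRF51 + SoftDevice priority levels (lower number = more urgent).
 * Verify against the SoftDevice specification for your version. */
enum {
    PRIO_SD_HIGH  = 0,
    PRIO_APP_HIGH = 1,
    PRIO_SVC      = 2,  /* sd_* calls are implemented as SVC instructions */
    PRIO_APP_LOW  = 3,
    PRIO_THREAD   = 4   /* stand-in for thread mode (no active exception) */
};

/* A new exception preempts only if it is strictly more urgent than the
 * one currently running. Equal priority just stays pending. */
bool can_preempt(uint8_t new_prio, uint8_t current_prio)
{
    return new_prio < current_prio;
}

/* An SVC that cannot preempt the current context is escalated to a
 * HardFault by the core. */
bool sd_call_would_hardfault(uint8_t current_prio)
{
    return !can_preempt(PRIO_SVC, current_prio);
}
```

Under these assumptions an sd_* call from thread mode or from an application-low handler is fine, while the same call from an application-high handler would fault. A second interrupt at the same priority as the running one does not fault, it simply waits its turn.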

RE: Debugging a Hardfault

RK — Fri, 27 May 2016 03:19:39 GMT

What's at 0x01EB40, or at the instruction just before it? The PC moves on before the instruction is actually executed, and the stacked address is the return PC, which you aren't actually going to return to.

What is m_evt_schedule_func in your code, are you using the scheduler?

What it looks like here is that you're executing tx_buffer_process() in thread mode when the softdevice interrupt occurs (SWI2), which is fine, it's taken that interrupt and fallen over in the handler code.

If you don't have an event scheduler and are processing events directly inside the SWI2 interrupt then .. it's not possible to be where you are as you cannot have one interrupt interrupt itself.

RE: Debugging a Hardfault

c cook — Thu, 26 May 2016 12:38:56 GMT

Based on your comments in: How can I distinguish the reason for hardfault? I looked at SP+0x18, which is the PC, right? The word following the program counter is the xPSR, and it has the value 0x20000026. The link register has 0xFFFFFFF1, which signifies a return from an interrupt. Taking the low byte of the xPSR (0x26 = 38), 38 - 16 = interrupt 22, which is the SWI2_IRQHandler.
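The arithmetic in this post can be written down as a small helper. This is only a sketch: the function names are invented here, and it assumes the word at SP+0x1C of the exception frame is the stacked xPSR, whose low bits hold the active exception number.

```c
#include <stdint.h>

/* The exception number sits in the low bits of the stacked xPSR
 * (the IPSR field). ARMv6-M (Cortex-M0) uses bits [5:0]. */
uint32_t exception_number(uint32_t stacked_xpsr)
{
    return stacked_xpsr & 0x3Fu;
}

/* Exceptions 0..15 are system exceptions, so the external interrupt
 * number is (exception - 16). Returns -1 when no external interrupt
 * was active. */
int32_t irq_number(uint32_t stacked_xpsr)
{
    uint32_t exc = exception_number(stacked_xpsr);
    return (exc >= 16u) ? (int32_t)(exc - 16u) : -1;
}

/* EXC_RETURN 0xFFFFFFF1 means "return to handler mode on the main
 * stack", i.e. the fault was taken while another handler was running. */
int faulted_in_handler_mode(uint32_t exc_return)
{
    return exc_return == 0xFFFFFFF1u;
}
```

With the values quoted here, irq_number(0x20000026) gives 22 and faulted_in_handler_mode(0xFFFFFFF1) is true, which matches the conclusion that the fault happened while the SWI2 handler was active.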

Based on what I know about ARM, is it possible that my code was executing SWI2_IRQHandler, and a higher priority interrupt occurred?

Now where do I go? nrf_soc.h defines this software interrupt as: #define SD_EVT_IRQHandler (SWI2_IRQHandler)

The interrupt handler is in softdevice_handler.c and looks like:

void SOFTDEVICE_EVT_IRQHandler(void)
{
    if (m_evt_schedule_func != NULL)
    {
        uint32_t err_code = m_evt_schedule_func();
        APP_ERROR_CHECK(err_code);
    }
    else
    {
        intern_softdevice_events_execute();
    }
}

So based on ARM's documentation I think another interrupt with the same or higher priority tried to execute. What can I look for to verify this is the cause, and how to prevent it?

My app is a central multilink device. Often when the fault occurs, I get a 0x3004 error from sd_ble_gattc_write(). I stop writing more data until I get a BLE_EVT_TX_COMPLETE event. When I resume calling sd_ble_gattc_write(), the first call returns 0x0000 and then the crash occurs. Could the S120 stack be generating a second SWI2 while I'm still servicing an earlier SWI2 interrupt, and because this second interrupt has the same priority, the ARM hard faults?
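The pause-and-resume logic described here looks roughly like the sketch below. Everything in it is a stand-in: the write callback represents a wrapper around sd_ble_gattc_write(), the error constant just reuses the 0x3004 value seen in this thread, and the two stub writers exist only so the logic can be tried on a PC.

```c
#include <stdbool.h>
#include <stdint.h>

#define ERR_SUCCESS        0x0000u
#define ERR_NO_TX_BUFFERS  0x3004u  /* value from the thread; name is illustrative */

/* Hypothetical wrapper type around the real write call. */
typedef uint32_t (*write_fn_t)(const uint8_t *data, uint16_t len);

typedef struct {
    bool       waiting_for_tx_complete;
    write_fn_t write;
} tx_flow_t;

/* Try to send one packet. Returns true if the stack accepted it. */
bool tx_flow_send(tx_flow_t *f, const uint8_t *data, uint16_t len)
{
    if (f->waiting_for_tx_complete) {
        return false;                       /* paused: do not call the stack */
    }
    uint32_t err = f->write(data, len);
    if (err == ERR_NO_TX_BUFFERS) {
        f->waiting_for_tx_complete = true;  /* pause until TX complete */
        return false;
    }
    return err == ERR_SUCCESS;
}

/* Call this when the TX complete event arrives. */
void tx_flow_on_tx_complete(tx_flow_t *f)
{
    f->waiting_for_tx_complete = false;
}

/* Stub writers for host-side testing only. */
uint32_t stub_write_full(const uint8_t *data, uint16_t len)
{
    (void)data; (void)len;
    return ERR_NO_TX_BUFFERS;
}

uint32_t stub_write_ok(const uint8_t *data, uint16_t len)
{
    (void)data; (void)len;
    return ERR_SUCCESS;
}
```

Keeping the flag and the write call in one place makes it easier to see whether the send path can be entered from two different contexts at once, which is worth ruling out when chasing a fault like this.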

EDIT: RK - thanks for the help! I think I've got it fixed.

I have an application timer that was running in the IRQ. I installed the app timer scheduler library and moved the timer to run there. So far I have not seen any more hard faults.
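The idea behind moving the timer to the scheduler is to have the interrupt only record that work is pending and let the main loop run it. The SDK's scheduler library plays that role; the sketch below is not that library, just a minimal single-producer queue with invented names to show the pattern. The on_timer_tick handler and its counter are placeholders.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 8u   /* power of two so index wrap-around stays consistent */

typedef void (*event_handler_t)(void);

typedef struct {
    event_handler_t   slots[QUEUE_SIZE];
    volatile uint32_t head;   /* advanced by the interrupt side */
    volatile uint32_t tail;   /* advanced by the main-loop side */
} event_queue_t;

/* Called from interrupt context: record the work, do not run it.
 * Assumes a single producer; several interrupt levels putting events
 * would need a critical section around this. */
bool queue_put(event_queue_t *q, event_handler_t h)
{
    if (q->head - q->tail == QUEUE_SIZE) {
        return false;                 /* full, caller decides what to do */
    }
    q->slots[q->head % QUEUE_SIZE] = h;
    q->head++;
    return true;
}

/* Called from the main loop (thread mode): run everything pending.
 * Returns how many handlers were executed. */
uint32_t queue_execute(event_queue_t *q)
{
    uint32_t ran = 0;
    while (q->tail != q->head) {
        event_handler_t h = q->slots[q->tail % QUEUE_SIZE];
        q->tail++;
        h();
        ran++;
    }
    return ran;
}

/* Placeholder handler standing in for the application's timer work. */
int g_ticks = 0;

void on_timer_tick(void)
{
    g_ticks++;
}
```

The timer interrupt would call queue_put(&q, on_timer_tick) and the main loop would call queue_execute(&q) on each pass, so the heavy work, including any sd_* calls, always runs in thread mode.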

RE: Debugging a Hardfault

RK — Thu, 26 May 2016 03:23:47 GMT

Start again time. I just read your original post again and noticed SP+0x14, that's not right, it's SP+0x18 for the PC, SP+0x14 is the LR at the time of the hardfault, ie where you came from before you hit the routine you hardfaulted in. So actually you hardfaulted in tx_buffer_process(), or possibly in APP_ERROR_CHECK(). Given the DEADBEEF in r2 something fun has happened.

So .. how about a dump of the 16 32bit words starting at the SP, ie 0x20003fe0 to 0x20004010 and let's try again.
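For reference, the first eight words of that dump should be the frame the core pushes on exception entry, which is where the SP+0x14 and SP+0x18 offsets come from. A sketch of the layout follows; the type and function names are made up, and it assumes the SP you read is the one at fault entry (if the handler has pushed anything of its own, the frame sits higher up).

```c
#include <stddef.h>
#include <stdint.h>

/* Registers pushed by a Cortex-M core on exception entry,
 * lowest address first. */
typedef struct {
    uint32_t r0;    /* SP + 0x00 */
    uint32_t r1;    /* SP + 0x04 */
    uint32_t r2;    /* SP + 0x08 */
    uint32_t r3;    /* SP + 0x0C */
    uint32_t r12;   /* SP + 0x10 */
    uint32_t lr;    /* SP + 0x14: where the faulting code was called from */
    uint32_t pc;    /* SP + 0x18: return address for the faulting context */
    uint32_t xpsr;  /* SP + 0x1C: flags plus active exception number */
} exception_frame_t;

/* Build a frame from the first eight words of a raw memory dump
 * taken at the stack pointer. */
exception_frame_t frame_from_words(const uint32_t words[8])
{
    exception_frame_t f;
    f.r0   = words[0];
    f.r1   = words[1];
    f.r2   = words[2];
    f.r3   = words[3];
    f.r12  = words[4];
    f.lr   = words[5];
    f.pc   = words[6];
    f.xpsr = words[7];
    return f;
}
```

The remaining eight words of a 16-word dump are whatever the interrupted code had on its stack, which is often where the caller's saved registers and return address show up.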

It's hardfault week. I spent a long time on one yesterday, and ended up cursing the memory watch unit on the nRF52.

RE: Debugging a Hardfault

c cook — Wed, 25 May 2016 22:02:44 GMT

RK - Thanks for your help. When the hardfault occurs, the PC is pointing to my hardfault handler. The 16 bytes before the SP are: 00 00 00 00 F8 60 02 00 A4 60 02 00 9C 60 02 00. In the Keil register window the SP = 0x200045F0, but the memory window, with SP as the address, shows 0x00003004 (04 30 00 00), which is the error code I get back from a call to sd_ble_gattc_write() just before the fault occurs. It seems too much of a coincidence that the memory window shows my error code. Is this a clue as to what's happening? You stated that the 4 values before the SP are R4-R7. So the first four bytes are the old R4? Are these the addresses that were in the SP, LR and PC?
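Since the memory window shows bytes in little-endian order, it can help to turn them back into 32-bit words before comparing them with register values. A small sketch, with a helper name invented for this post:

```c
#include <stdint.h>

/* Assemble four bytes, in the order the memory window shows them,
 * into the 32-bit word the CPU would load. */
uint32_t le32(const uint8_t b[4])
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

Applied to the bytes quoted here, the four words before the SP come out as 0x00000000, 0x000260F8, 0x000260A4 and 0x0002609C, and 04 30 00 00 is 0x00003004.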

RE: Debugging a Hardfault

RK — Wed, 25 May 2016 00:31:02 GMT

Doesn't look that close to me - but map files aren't easy to read, especially when all on one line and there could be other entries other places (where's the heap for a start, perhaps you don't use a heap). Try the other things on the list, see where you usually return to from that code, see what's left just off the stack when you hardfault to see if the address is different, or invalid. If you're sure it's the pop, and it looks quite plausible, something corrupted the stack.

As for that post, he gets 0x800 from the fact his stack is 2048 bytes, it's right there in a comment on the line.

RE: Debugging a Hardfault

c cook — Tue, 24 May 2016 17:14:26 GMT

I was looking at this post and want to use it in my application. When he calculates the stack size it's stackStart - 0x800. Where did the 0x800 come from? Where is my stack size set? FYI: I'm running an nRF51422 with 32K RAM, S120 V2.1, and Target settings IRAM1: 0x20002800/0x5800.

RE: Debugging a Hardfault

c cook — Tue, 24 May 2016 16:49:41 GMT

RK - thanks for the response. I'm using nRF51422QFAC with 32K RAM with S120 V2.1
The last few entries in my map file are:

m_dfus                           0x20003140   Data   48   add_gatts.o(.bss)
m_connection_table               0x20003410   Data   96   device_manager_central_cc.o(.bss)
__libspace_start                 0x200035d4   Data   96   libspace.o(.bss)
__temporary_stack_top$libspace   0x20003634   Data    0   libspace.o(.bss)

I'm not sure if that counts as "just around" 0x20003FE0.
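The distance between the end of the last symbol listed and the stack pointer can be worked out directly. This is just the arithmetic on the numbers quoted in this thread; the macro and function names are invented, and the map file may contain other regions (stack, heap) that are not shown in this excerpt.

```c
#include <stdint.h>

/* Values quoted in the thread. */
#define LAST_SYMBOL_ADDR  0x200035D4u  /* __libspace_start */
#define LAST_SYMBOL_SIZE  96u
#define SP_AT_FAULT       0x20003FE0u

/* First address past a symbol of the given size. */
uint32_t symbol_end(uint32_t addr, uint32_t size)
{
    return addr + size;
}

/* Bytes between the end of the listed data and the stack pointer. */
uint32_t gap_to_sp(uint32_t data_end, uint32_t sp)
{
    return sp - data_end;
}
```

With these numbers the listed data ends at 0x20003634 and the gap to 0x20003FE0 is 0x9AC (2476) bytes.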

RE: Debugging a Hardfault

RK — Tue, 24 May 2016 00:01:40 GMT

First guess would be that your stack's too small and you're corrupting the end of it during the tx_buffer_process() code on occasion and so the PC value loaded is wrong and you hardfault. That stack pointer looks low enough to make you wonder.

Things I'd try next. You probably get to this routine via the same path every time, breakpoint in it and see what the link register is on the way in, or the PC on the way out, is it different from 0x1D1F0 (which should be 0x1D1F1 on the stack) which I think is where you've returned to.

At the point of the hardfault you've just popped 4 things off the stack, they should still be in memory just under where the SP is now, what are they? You should recognise the values popped into r4-r6 and the PC. Is the PC value even?
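The "is it even?" question is about the Thumb bit: a return address saved from LR and later popped into the PC should have bit 0 set, which is why 0x1D1F0 appears as 0x1D1F1 on the stack. A quick sanity check might look like the sketch below. The function name is invented, and the 256 kB flash limit is an assumption about the part, so adjust it for yours.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED flash size: 256 kB starting at address 0. Adjust for your part. */
#define FLASH_END 0x00040000u

/* A value about to be popped into the PC should have the Thumb bit
 * (bit 0) set and, with the bit masked off, point into flash. */
bool plausible_return_address(uint32_t stacked_value)
{
    bool thumb_bit_set = (stacked_value & 1u) != 0u;
    bool in_flash      = (stacked_value & ~1u) < FLASH_END;
    return thumb_bit_set && in_flash;
}
```

This applies to return addresses pushed by a function prologue. The PC stored in an exception frame is a different case and is not expected to pass this check.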

Look at your memory map. Is any data, or the heap, just around the 0x20003FE0 area? That would point to a stack overflow.
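Another common way to test the stack overflow theory is to paint the stack with a known pattern at startup and later see how much of it is still untouched. A minimal sketch, with invented names and an arbitrary fill value; on a real target you would only paint the part of the stack below the current SP.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* arbitrary pattern, unlikely to occur by chance */

/* Paint a stack region with the pattern. 'lowest' is the lowest
 * address of the region, i.e. the end a descending stack grows towards. */
void stack_paint(uint32_t *lowest, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        lowest[i] = STACK_FILL;
    }
}

/* Count words at the low end that still hold the pattern. Zero means
 * the stack has reached, or run past, the bottom of its region. */
size_t stack_unused_words(const uint32_t *lowest, size_t words)
{
    size_t n = 0;
    while (n < words && lowest[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

If the unused count ever drops to zero, or data just below the stack region changes unexpectedly, that supports the overflow theory. A healthy margin points the search elsewhere.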