Running into a SIGTRAP, backtrace shows only main(), apparently happening in libgloss/arm/crt0.S

I'm using SDK v16 on an nRF52832 and occasionally run into a SIGTRAP with my custom application, which mainly forwards BLE traffic to and from UART.

It works fine at first, but after quite a few interactions (can be 3, can be 30) initiated by a BLE UART client running on Android, it stops being responsive.

Attached debugger (Black Magic Probe) provides me with the following output:

Starting program: /data/src/nrf5x-sdk-vanilla/projects/[..]/s132/armgcc/_build/nrf52832_xxaa.out
Program received signal SIGTRAP, Trace/breakpoint trap.
warning: while parsing target memory map (at line 1): Required element <memory> is missing
0x0002be5c in main ()
(gdb) l
1	../../../../../../../../../libgloss/arm/crt0.S: No such file or directory.
(gdb) bt
#0  0x0002be5c in main ()
(gdb)

I'd be happy for any hint or idea. I could imagine this being arbitrary memory corruption. However, I'm wondering about the SIGTRAP (not SEGV), about libgloss/arm/crt0.S (no user code), and about consistently ending up in this very state.

0 daten over 5 years ago

Compiling with -DDEBUG, -g3 and -O0 reveals some more:

Program received signal SIGTRAP, Trace/breakpoint trap.
warning: while parsing target memory map (at line 1): Required element <memory> is missing
0x0002ce36 in app_error_fault_handler (id=16385, pc=225711, info=536936400)
    at ../../../../../../components/libraries/util/app_error_weak.c:100
100	    NRF_BREAKPOINT_COND;
(gdb) bt
#0  0x0002ce36 in app_error_fault_handler (id=16385, pc=225711, info=536936400)
    at ../../../../../../components/libraries/util/app_error_weak.c:100
#1  0x0002ccc4 in app_error_handler (error_code=16385, line_num=225711, 
    p_file_name=0x4001 "\211\240\201hh\200\211\340\201\233\346\020&O\360#\b")
    at ../../../../../../components/libraries/util/app_error_handler_gcc.c:49
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

0 daten over 5 years ago in reply to daten
So error_code=16385 is 0x4001 which according to components/libraries/util/app_error.h is

(NRF_FAULT_ID_SDK_RANGE_START + 1) /**< An error stemming from a call to @ref APP_ERROR_CHECK or @ref APP_ERROR_CHECK_BOOL. The info parameter is a pointer to an @ref error_info_t variable. */

which is already bringing me closer - telling me it's a result from an APP_ERROR_CHECK() call (not explaining the corrupted stack yet, though). Now trying to figure out which APP_ERROR_CHECK() call.

Unfortunately the info appears to be screwed. According to the comment on the define above, info=536936400 is supposed to be a pointer to an instance of struct error_info_t, which contains the information I'm looking for. Trying to access it via GDB, however, results in:

(gdb) p *((error_info_t*)(info))
Cannot access memory at address 0x2000ffd0

Besides, I do wonder about p_file_name=0x4001. How did the error_code end up as the p_file_name argument, which is supposed to hold a pointer?
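
For reference, the numbers above are easier to compare in hex. The following is a small host-side sketch (not SDK code) that decodes the fault id and shows how a handler would read the error details through info. The struct layout is my assumption of what app_error.h declares (line number, file name pointer, error code), so check it against your SDK copy. The range start of 0x4000 follows from 0x4001 being NRF_FAULT_ID_SDK_RANGE_START + 1. All names in the sketch are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Derived from the thread: 0x4001 == NRF_FAULT_ID_SDK_RANGE_START + 1. */
#define FAULT_ID_SDK_RANGE_START 0x4000u

/* Assumed layout of the SDK's error_info_t; verify against app_error.h. */
typedef struct
{
    uint32_t       line_num;
    const uint8_t *p_file_name;
    uint32_t       err_code;
} error_info_sketch_t;

/* Offset of a fault id within the SDK fault id range. */
uint32_t fault_id_offset(uint32_t id)
{
    return id - FAULT_ID_SDK_RANGE_START;
}

/* Hypothetical helper: print what a fault handler could read via 'info'. */
void dump_fault(uint32_t id, const error_info_sketch_t *p_info)
{
    printf("fault id 0x%04" PRIx32 " = SDK range start + %" PRIu32 "\n",
           id, fault_id_offset(id));
    printf("error 0x%08" PRIx32 " at %s:%" PRIu32 "\n",
           p_info->err_code, (const char *)p_info->p_file_name,
           p_info->line_num);
}
```

With id=16385 the offset is 1, and info=536936400 is 0x2000ffd0, which is exactly the address GDB refuses to read.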

0 daten over 5 years ago in reply to run_ar

Ok, so after having debugged and fixed the firmware of my debugger, back to the actual issue:

When setting a breakpoint on app_error_fault_handler I get the following:

Starting program: /data/src/nrf5x-sdk-vanilla/[..]/s132/armgcc/_build/nrf52832_xxaa.out 
Note: automatically using hardware breakpoints for read-only addresses.

Breakpoint 1, app_error_fault_handler (id=16385, pc=225911, info=536936392) at ../../../../../../components/libraries/util/app_error_weak.c:58
58	    __disable_irq();

while the backtrace still looks messy:

(gdb) bt
#0  app_error_fault_handler (id=16385, pc=225911, info=536936392) at ../../../../../../components/libraries/util/app_error_weak.c:58
#1  0x0002cce4 in app_error_handler (error_code=16385, line_num=225911, p_file_name=0x2000ffc8 "\226\006")
    at ../../../../../../components/libraries/util/app_error_handler_gcc.c:49
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
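
One thing that stands out in this backtrace: the three arguments GDB prints for app_error_handler in frame #1 are numerically the same as id, pc and info in frame #0 (0x2000ffc8 is 536936392). That would fit GDB showing current register contents rather than the caller's original arguments, but that is an assumption on my side, not something the output proves. A trivial host-side check of the numbers, with hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Three argument values as GDB prints them for one frame. */
typedef struct
{
    uint32_t arg0;
    uint32_t arg1;
    uint32_t arg2;
} frame_args_t;

/* True if both frames show the same three values in the same order. */
bool frames_show_same_values(frame_args_t inner, frame_args_t outer)
{
    return inner.arg0 == outer.arg0 &&
           inner.arg1 == outer.arg1 &&
           inner.arg2 == outer.arg2;
}
```

Feeding in frame #0 (16385, 225911, 536936392) and frame #1 (16385, 225911, 0x2000ffc8) gives a match, so the arguments shown for frame #1 probably say little about what was originally passed to APP_ERROR_CHECK().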

So I again went for the noop(), which results in the following code:

            ret = ble_nus_data_send(&m_nus, m_uart_buf[uart_buf_id], (uint16_t *)(&(m_uart_buf_pos[uart_buf_id])), m_conn_handle);
            if ((ret != NRF_ERROR_INVALID_STATE) &&
                (ret != NRF_ERROR_BUSY) &&
                (ret != NRF_ERROR_NOT_FOUND))
            {
                /* Debug hook: a breakpoint on noop() shows ret before APP_ERROR_CHECK() traps. */
                if (ret != 0)
                {
                    noop(ret);
                }
                APP_ERROR_CHECK(ret);
            }

resulting in:

Starting program: /data/src/nrf5x-sdk-vanilla/sensorberg/projects/smartspaces/sdg03/s132/armgcc/_build/nrf52832_xxaa.out 
Note: automatically using hardware breakpoints for read-only addresses.

Breakpoint 1, noop (err_code=1 '\001') at ../../../main.c:1568
1568	}

with a stack that does not (yet) seem to be messed up:

(gdb) bt
#0  noop (err_code=1 '\001') at ../../../main.c:1568
#1  0x00037266 in main () at ../../../main.c:1686

So at that point we know that ble_nus_data_send() sometimes returns 0x01.
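
One caveat on the `err_code=1 '\001'` line: GDB uses that quoted-character notation for character-sized values, so noop() seems to take an 8-bit argument. If that is the case, only the low byte of ret is visible in the breakpoint output, and a larger 32-bit code ending in 0x01 would look identical. A minimal sketch of the effect, with hypothetical function names and a made-up example value:

```c
#include <stdint.h>

/* Hypothetical hook with an 8-bit parameter, as the GDB output suggests.
   It returns the value it received so the truncation can be observed. */
uint8_t noop_u8(uint8_t err_code)
{
    return err_code;
}

/* Same hook, but taking the full 32-bit return code. */
uint32_t noop_u32(uint32_t err_code)
{
    return err_code;
}
```

Passing a made-up 32-bit value such as 0x3401 to noop_u8() yields 0x01, while noop_u32() keeps all bits. Changing the parameter to uint32_t (or printing ret itself in the main() frame) would settle whether the code really is 0x01.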

Continuing execution now leads me right into the NRF_BREAKPOINT_COND with the corrupt stack:

(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0002ce56 in app_error_fault_handler (id=16385, pc=225917, info=536936392) at ../../../../../../components/libraries/util/app_error_weak.c:100
100	    NRF_BREAKPOINT_COND;
(gdb) bt
#0  0x0002ce56 in app_error_fault_handler (id=16385, pc=225917, info=536936392)
    at ../../../../../../components/libraries/util/app_error_weak.c:100
#1  0x0002cce4 in app_error_handler (error_code=16385, line_num=225917, 
    p_file_name=0x4001 "\211\240\201hh\200\211\340\201\233\346\020&O\360#\b")
    at ../../../../../../components/libraries/util/app_error_handler_gcc.c:49
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

So there are two questions now:

a) Why does ble_nus_data_send() sometimes return 0x01?
b) Why does APP_ERROR_CHECK() /appear/* to end up with a corrupted stack?

*Obviously the actual root cause can be somewhere else, and APP_ERROR_CHECK() only accesses memory that was already corrupted somewhere (or sometime) before.

0 run_ar over 4 years ago in reply to daten

a) Are you sure ret is actually 0x01 in this case? Here are the error codes that can be returned by sd_ble_gatts_hvx

b) This will happen because the timing in the SoftDevice gets messed up when you halt at a breakpoint, i.e. the timers will continue to run and the event manager will be lost.
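
Regarding a), it can help to look at the full 32-bit value and see which error range it falls into. Below is a rough sketch; the base values are how I remember them from nrf_error.h (global 0x0000, SDM 0x1000, SoC 0x2000, stack 0x3000), so treat them as assumptions and verify them against your SDK. The function name is hypothetical.

```c
#include <stdint.h>

/* Classify an nRF error code by its assumed 0x1000-wide base range.
   Range bases are assumptions to be verified against nrf_error.h. */
const char *err_range_name(uint32_t err_code)
{
    switch (err_code >> 12)
    {
        case 0x0u: return "global";
        case 0x1u: return "sdm";
        case 0x2u: return "soc";
        case 0x3u: return "stack";
        default:   return "unknown";
    }
}
```

A plain 0x0001 lands in the global range, whereas a made-up value like 0x3401 would land in the stack range, where the BLE-specific codes live.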

0 daten over 4 years ago in reply to run_ar

Re a) I don't know what else to read from the GDB output, so yes, fairly sure.

Re b) It's the same corrupted stack trace I get /without/ the breakpoint. Compare the initial post (corrupted stack trace with no breakpoints set) with the later one, where the breakpoint in noop() is hit only one line before APP_ERROR_CHECK(). It's the same corrupted backtrace within APP_ERROR_CHECK(). That doesn't look like a coincidence or a GDB/breakpoint-related (timing) issue to me.

0 run_ar over 4 years ago in reply to daten

Are you able to recreate this issue on a Nordic DK? And what hardware are you currently running your code on?

0 daten over 4 years ago in reply to run_ar

Hardware is an nRF52832.

I have now ordered a PCA10040 and will try to reproduce the issue on it. Can you elaborate on why you think this might be hardware specific?

0 run_ar over 4 years ago in reply to daten

Sorry for the late reply, just back from vacation.

daten said:
Hardware is an nRF52832.

Ok, so custom board then I suppose. I do not necessarily think it is a hardware issue, but it could be related to clock source/timing. Asking in case we would like to try to recreate the issue here.