Error handling in production - saving the address of the function containing an APP_ERROR_CHECK call

Hello!

We are trying to improve the in-production error handling for one of our products. In the past, we have always saved the LSB of the error code in the GPREGRET register, but that is not always enough information to characterize or solve complicated problems.

We have an external EEPROM and have confirmed that saving records to it in our app_error_fault_handler function does not create any problems. The issue we are having is with getting the information that we need to effectively deal with the errors.

We know that when DEBUG=TRUE, APP_ERROR_CHECK will call app_error_handler with the line number and filename magically obtained from the __LINE__ and __FILE__ macros, but when we build without DEBUG, these are not available. We would prefer to build without DEBUG=TRUE, both to keep the application size manageable and because we don't want most of the other effects of DEBUG=TRUE.

If we just change the definition of APP_ERROR_HANDLER so that it always calls app_error_handler instead of app_error_handler_bare, it sort of gets us what we want, but the application still gets bloated with all the extra file paths:

#ifdef DEBUG
#define APP_ERROR_HANDLER(ERR_CODE)                                    \
    do                                                                 \
    {                                                                  \
        app_error_handler((ERR_CODE), __LINE__, (uint8_t*) __FILE__);  \
    } while (0)
#else
#define APP_ERROR_HANDLER(ERR_CODE)                                    \
    do                                                                 \
    {                                                                  \
        app_error_handler((ERR_CODE), __LINE__, (uint8_t*) __FILE__);  \
    } while (0)
#endif

What we would like to do is to save the address of the function containing the APP_ERROR_CHECK call so that we can look it up in the .map file when we get the log report. 

I was hoping that we would be able to call __builtin_return_address(0) in app_error_handler_bare:

void app_error_handler_bare(ret_code_t error_code)
{
    error_info_t error_info =
    {
        .line_num    = (uint32_t)__builtin_return_address(0),
        .p_file_name = NULL,
        .err_code    = error_code,
    };

    app_error_fault_handler(NRF_FAULT_ID_SDK_ERROR, 0, (uint32_t)(&error_info));

    UNUSED_VARIABLE(error_info);
}

to get the address of the thing that ran before it, but the addresses that I get doing that are all over the place. Sometimes I get a different address for subsequent APP_ERROR_CHECK calls at the same location. Is there any way to reliably get the address of the function that called APP_ERROR_CHECK?

Relevant info: building with gcc using nrf5 sdk version 15.2

Thanks in advance

  • Hi jrowe,

    I find the topic you raise here very interesting.

    Several years ago, I had this exact same need. I experimented with assigning a number to each application source file. I wasn't very happy with it because it felt too non-standard and not very maintainable. For the small project I had at the time, it worked, but at that scale the cost of using __FILE__ is very small anyway, so the benefit was moot...
    I think your idea of logging the address of the function is a lot nicer, though, and should also scale better.
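
The per-file numbering approach mentioned above might look roughly like the sketch below. The names `APP_FILE_ID`, `APP_LOCATION`, `app_location_pack` and the 16/16-bit split are assumptions made for illustration; none of them are part of the nRF5 SDK.

```c
#include <stdint.h>

/* Hypothetical: every .c file defines its own small ID before using APP_LOCATION(). */
#define APP_FILE_ID 0x0012u

/* Pack a file ID into the upper 16 bits and a line number into the lower 16 bits. */
static inline uint32_t app_location_pack(uint16_t file_id, uint32_t line)
{
    return ((uint32_t)file_id << 16) | (line & 0xFFFFu);
}

/* Expands at the call site, so __LINE__ is the line of the failing check. */
#define APP_LOCATION() app_location_pack(APP_FILE_ID, (uint32_t)__LINE__)

/* Helpers for decoding a logged location on the host side. */
static inline uint16_t app_location_file(uint32_t loc) { return (uint16_t)(loc >> 16); }
static inline uint16_t app_location_line(uint32_t loc) { return (uint16_t)(loc & 0xFFFFu); }
```

The packed value fits in the existing `line_num` field of `error_info_t`, at the cost of maintaining the ID table by hand, which is the maintainability concern raised above.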

    Firstly, I see that you are trying to use __builtin_return_address(0) to replace the use of __LINE__. This doesn't save you as much space as replacing the use of __FILE__. As you have found, the file names are what consume the memory here. Meanwhile, __LINE__ is pretty important for figuring out where exactly in a function things go wrong.

    Next, regarding the return value of __builtin_return_address(0): that function gives you the address the current function will return to. That is an address somewhere inside the caller function, not the address of the caller function itself.
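
A minimal host-side sketch of that point, assuming GCC or Clang (the function and variable names here are hypothetical): the value read in the callee lies a few bytes past the start of the caller, not at the caller's entry point.

```c
#include <stdint.h>

/* Last return address seen by callee(); a global so the caller can inspect it. */
static volatile uintptr_t g_last_return_address;

/* noinline keeps this a real call, so there is a real return address to read. */
__attribute__((noinline)) void callee(void)
{
    g_last_return_address = (uintptr_t)__builtin_return_address(0);
}

/* The subtraction after callee() keeps the call out of tail position. */
__attribute__((noinline)) uintptr_t caller(void)
{
    callee();
    /* Offset of the return address from the start of caller(). */
    return g_last_return_address - (uintptr_t)&caller;
}
```

The test output later in this reply shows the same pattern: a() is at 0000125d and the first call to b() reports 00001265, eight bytes in.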

    You said that the address __builtin_return_address() returns looks completely random. Could you please elaborate on how random it is?

    In particular, when you match the return value against the MAP file, is the return value within the range of the caller function at all?

    Similarly, when several of your app_error_handler_bare() calls are made in succession from the same place, is the return value of __builtin_return_address() in the later calls just a few addresses after the previous one?

    I tested the theory on the UART example and got the expected result. Below is my test code and output.

    I don't have the same environment as you do, so I went ahead with SEGGER Embedded Studio v5.42a (which supposedly uses ARM GCC for compiling) and nRF5 SDK v17.1.0. 

    // Need to increase TX buffer size to fit the test output
    #define UART_TX_BUF_SIZE 4096
    
    ...
    
    void a(void);
    void b(uint8_t);
    
    void a(void) {
        uint8_t i;
        b(1);
    
        printf("addr of a(): %p\r\n", &a);
        printf("addr of b(): %p\r\n", &b);
        
        for (i = 0; i < 2; i++) {
            printf("a() - loop %d\r\n", i);
            b(2);
            b(3);
        }
    
    }
    
    void b(uint8_t num) {
        uint8_t i;
        void* p;
        for (i = 0; i < 2; i++) {
            printf("num = %d | __builtin_return_address(0) returns %p\r\n", num, __builtin_return_address(0));
            printf("num = %d | __builtin_return_address(0) returns %p\r\n", num, __builtin_return_address(0));
        }
    }
    
    int main(void) {
        ...
        a();
        ...
    }

    Output:

    num = 1 | __builtin_return_address(0) returns 00001265
    num = 1 | __builtin_return_address(0) returns 00001265
    num = 1 | __builtin_return_address(0) returns 00001265
    num = 1 | __builtin_return_address(0) returns 00001265
    addr of a(): 0000125d
    addr of b(): 00001221
    a() - loop 0
    num = 2 | __builtin_return_address(0) returns 00001285
    num = 2 | __builtin_return_address(0) returns 00001285
    num = 2 | __builtin_return_address(0) returns 00001285
    num = 2 | __builtin_return_address(0) returns 00001285
    num = 3 | __builtin_return_address(0) returns 0000128b
    num = 3 | __builtin_return_address(0) returns 0000128b
    num = 3 | __builtin_return_address(0) returns 0000128b
    num = 3 | __builtin_return_address(0) returns 0000128b
    a() - loop 1
    num = 2 | __builtin_return_address(0) returns 00001299
    num = 2 | __builtin_return_address(0) returns 00001299
    num = 2 | __builtin_return_address(0) returns 00001299
    num = 2 | __builtin_return_address(0) returns 00001299
    num = 3 | __builtin_return_address(0) returns 000012fd
    num = 3 | __builtin_return_address(0) returns 000012fd
    num = 3 | __builtin_return_address(0) returns 000012fd
    num = 3 | __builtin_return_address(0) returns 000012fd
    

    Hieu

  • Hello Hieu,

    You are right, I had the wrong idea about the __builtin_return_address function. It still isn't quite behaving the way that I would expect, but it makes much more sense.

    It is pretty rare for the address that I get to actually be in the range of the calling function. More often it is within the range of a function one or two levels up the call stack.

    For example, if I put an APP_ERROR_CHECK call in a button event handler in our application, I get the address 0x00042639.

    In the .map file, the address range for this function is:

    .text.mode_ble_button_cb
    0x000357a0 0x34 ./src/modes/mode_ble.o
    0x000357a0 mode_ble_button_cb

    mode_ble_button_cb is calling app_error_handler_bare, and I am calling __builtin_return_address(0) inside of app_error_handler_bare. The interesting thing is that while the address it returns is not in the range of mode_ble_button_cb, it is within the range of the function two steps up the call stack:

    .text.button_event_handler
    0x00042630 0x2c ./src/main.o

    In our application, button_event_handler calls a router called mode_specific_buttn_event, which (in this scenario) calls mode_ble_button_cb.

    If I actually put the test exception in the button_event_handler function, I get the address 0x00042637, which is within the range of button_event_handler.
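
The .map lookups above amount to a range check against each symbol's start address and size. A small sketch of that check follows; `addr_in_function` is a hypothetical helper for a host-side log decoder, not an SDK function. On Cortex-M the return address held in LR has bit 0 set to mark Thumb state, which is why the logged values are odd, so the sketch clears that bit before comparing.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a logged return address falls inside [start, start + size). */
static inline bool addr_in_function(uint32_t logged, uint32_t start, uint32_t size)
{
    uint32_t addr = logged & ~UINT32_C(1);   /* drop the Thumb bit */
    return (addr >= start) && ((addr - start) < size);
}
```

With the numbers from this post, `addr_in_function(0x00042639, 0x00042630, 0x2c)` is true (button_event_handler), while `addr_in_function(0x00042639, 0x000357a0, 0x34)` is false (mode_ble_button_cb).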

    None of the intermediate functions are inline, but our optimization settings are set to -Os. I strongly suspect that the optimizer is condensing our functions more than I expected.

    To test this, I tried calling __builtin_return_address(0) inside of nested dummy functions. It didn't matter how many layers of dummy functions I put in. It would always give me an address in the range of the top level function.
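
One possible explanation for the behavior described above, not confirmed in this thread, is tail-call (sibling-call) optimization, which GCC enables at -Os. When a call is the last thing a function does, the compiler may emit a plain branch instead of a call, so the return address seen by the callee still belongs to a function further up the stack. The sketch below is a host-side illustration under that assumption; `log_call_site`, `LOG_CALL_SITE`, `leaf`, `middle` and `top` are hypothetical names. Placing an empty asm statement after the call is a commonly used trick to keep the call out of tail position.

```c
#include <stdint.h>

static volatile uintptr_t g_logged_address;   /* stand-in for the saved error record */

/* Hypothetical stand-in for app_error_handler_bare(): records where it was called from. */
__attribute__((noinline)) void log_call_site(void)
{
    g_logged_address = (uintptr_t)__builtin_return_address(0);
}

/* The empty asm after the call means the call is no longer the last operation
 * in the enclosing function, so GCC should not turn it into a tail call.
 * This is an assumption about compiler behavior, not a documented guarantee. */
#define LOG_CALL_SITE()                         \
    do                                          \
    {                                           \
        log_call_site();                        \
        __asm__ volatile ("" ::: "memory");     \
    } while (0)

__attribute__((noinline)) void leaf(void)   { LOG_CALL_SITE(); }
__attribute__((noinline)) void middle(void) { leaf(); }    /* may be compiled as a branch */
__attribute__((noinline)) void top(void)    { middle(); }  /* may be compiled as a branch */
```

After calling `top()`, the logged address is expected to fall inside `leaf()` rather than `top()`. Building with `-fno-optimize-sibling-calls` is another option that may be worth trying, at some cost in code size and stack use.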

    I also re-ran the test that I thought was giving me different addresses for an exception at the same line. What was happening was that the fake exception that I was putting in to test the error reporting system was causing a different real exception to fire in a different place. 

    While the address that we get isn't perfectly clear, it is still pretty useful information. If we also include the information from the __LINE__ macro, I think it will give us enough information to find out where our errors are coming from.
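
A record combining both pieces of information could be laid out as in the sketch below. The struct, its field names and the 10-byte little-endian layout are assumptions for illustration; the thread does not describe the actual EEPROM record format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compact fault record; not an SDK type. */
typedef struct
{
    uint32_t err_code;     /* full error code, not just the LSB */
    uint32_t return_addr;  /* value of __builtin_return_address(0) */
    uint16_t line_num;     /* __LINE__ at the failing check */
} fault_record_t;

#define FAULT_RECORD_SIZE 10u

/* Serialize field by field (little-endian) so the stored layout
 * does not depend on struct padding. */
static inline void fault_record_encode(const fault_record_t *rec,
                                       uint8_t out[FAULT_RECORD_SIZE])
{
    for (size_t i = 0; i < 4; i++)
    {
        out[i]     = (uint8_t)(rec->err_code >> (8u * i));
        out[4 + i] = (uint8_t)(rec->return_addr >> (8u * i));
    }
    out[8] = (uint8_t)(rec->line_num & 0xFFu);
    out[9] = (uint8_t)(rec->line_num >> 8);
}
```

The encoded buffer can then be handed to whatever EEPROM write routine the application already uses inside app_error_fault_handler.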

    Thanks for explaining the __builtin_return_address function and for running the tests.

    Justin
