Erroneous BLE disconnection associated with trivial, unrelated code changes with nRF51822

Hi all,

I am continuing a university project, now in its 4th year of development, for a wearable leg motion measurement application. Two trackers with identical hardware but slightly differing firmware are used; they are called the femur and tibia trackers because they are placed on the upper and lower parts of the leg, respectively. The goal of the project is to have the two trackers communicate with each other over BLE to generate a knee angle measurement, and then to use BLE again to communicate with an application on a phone.
The PCBs are custom designed, and each has an nRF51822 plus the recommended antenna and related antenna circuitry.
The existing firmware is written in C using the Nordic BLE SoftDevice API, and I believe it was originally developed from a BLE heart rate service demo application. The Nordic PCA10028 development kit is used to upload the code to the trackers, along with the Keil uVision IDE (see below for the hardware and software details). The femur tracker is configured as a GAP central and the tibia tracker as a GAP peripheral. After a connection is established with the mobile app (either the LightBlue app or my partner's custom application), the app acts as the GATT client and the femur tracker acts as the GATT server. Initializing the trackers involves a two-step calibration process, which must be completed before data can be sent from the femur tracker to the phone. At each stage of the calibration, the mobile application writes a '1' to the calibration characteristic of the BLE profile. The tracker firmware initiates the respective calibration step once this write event has been detected.

I have been using the LightBlue app on my iOS device to stand in for my partner's application, i.e. as a mobile device for the femur tracker to connect to and send notifications to. However, I have come across a seemingly inexplicable bug. Changing the structure of an if statement inside the update_knee_angle() function, which calculates the knee angle, triggers a disconnection between the femur tracker and the LightBlue iOS app after the first calibration is initiated (by writing a '1' to the calibration characteristic, as explained above). I am confident that this code change triggers the disconnection because it is repeatable: the faulty code consistently causes a disconnection within five seconds and the working code does not. My team partner's application also disconnects from the femur tracker when the faulty firmware is flashed onto the trackers, which suggests that this is most likely a firmware problem rather than a problem with either the LightBlue app or my partner's app. Below is a comparison of the code that does not cause a disconnection and the code that does:

Working code (in update_knee_angle() ):
if ((tibia_info.timestamp - prev_tibia_timestamp) == (leg_data.timestamp - prev_femur_timestamp)) {...}

Problematic code #1:
int tibia_dt = tibia_info.timestamp - prev_tibia_timestamp;
int femur_dt = leg_data.timestamp - prev_femur_timestamp;
if ((tibia_dt) == (femur_dt)) {...}

Note that I have deliberately omitted the contents of the if statement block since the code contents in the above comparison are identical. In the above comparison, the logic is essentially identical, the only difference being the definition of two new integers. To further complicate the behaviour at hand, it was also found that even with the if statement format from the working code, if a SEGGER_RTT_printf() statement was placed inside the if statement block, this would also cause the trackers to disconnect after the first calibration. This code is shown below (in this case I do show some of the contents of the if statement to illustrate the point):

Problematic code #2:
if ((tibia_info.timestamp - prev_tibia_timestamp) == (leg_data.timestamp - prev_femur_timestamp))
{
    SEGGER_RTT_printf( ... arguments etc etc ...);
    // Not an actual comment, but other code lies within this if statement
    ....
    ....
}

Subsequent code development in update_knee_angle() also causes connectivity problems, such as the femur tracker no longer being visible in the LightBlue app's device list. That later development is intended to remove the original if statement, which seems partly responsible for the erroneous disconnections. It seems that modifying the code in update_knee_angle() in any significant way from the last known working commit is either partly or wholly responsible for these disconnections. I have not tried modifying other functions in our source code to see whether this also causes connectivity issues. To eliminate some easier possibilities, I increased the RAM size allocation from 0x7000 to 0x8000 (in case there was a stack/heap overflow) and reduced the compiler optimization level from O3 to O1. Neither of these adjustments, together or separately, resolved the disconnection behaviour.


Unfortunately, my understanding is that it is difficult to debug the code with breakpoints, or at least there seems to be no convenient way to set breakpoints to find the issue in this SoC ecosystem (as discussed in DevZone case ID 100476). I cannot see how a trivial code change in the custom section of the source code could possibly trigger invocations of sd_ble_gap_disconnect() or other BLE GAP/GATTS-related API calls that ultimately lead to the disconnection between the femur tracker and the phone. Could it possibly be related to timeouts?

Hardware information:
PCA10028 development kit with motherboard firmware 7000
nRF51822 on custom application PCB

Software information:
SDK v11.0.0 or v12.2.0 (here I'm not sure which SDK exactly it is, and I do not know a way of being able to find out)
Keil uVision V5.26.2.0 with MDK-ARM Professional Version: 5.26.2.0
Target DLL: Segger\JL2CM3.dll
Dialog DLL: TARMCM1.DLL
SoftDevice version S130_nRF51_2.0.1
iOS 12.4

Could some people provide advice on what else I can investigate or what I can do to eliminate this erroneous behaviour? I understand that I have only included code snippets, but if it is necessary for me to upload the source code and/or project files to ease the debugging process, please suggest this on this thread.

This would be much appreciated,
Cheers,

inf_sup_bus

  • Hello,

    SDK v11.0.0 or v12.2.0 (here I'm not sure which SDK exactly it is, and I do not know a way of being able to find out)
    Keil uVision V5.26.2.0 with MDK-ARM Professional Version: 5.26.2.0
    Target DLL: Segger\JL2CM3.dll
    Dialog DLL: TARMCM1.DLL
    SoftDevice version S130_nRF51_2.0.1

     If you are using S130 v2.0.1, then you are probably using SDK12.2.0 if SDK11.0.0 and SDK12.2.0 are your only candidates. SDK11.0.0 uses S130 v2.0.0, while SDK 12.X.0 uses S130 v2.0.1.

    So, the softdevice will not call sd_ble_gap_disconnect() on its own. Not without telling you.

    My initial guess is that when you change your function, it somehow causes the application to crash, and the other device receives a timeout, because the first device's radio goes silent.

    It is hard to say exactly what the issue is. Do you have any APP_ERROR_CHECK(err_code) that is called within the functions that you check? Have you tried to check if the error handler fires? Try to define "DEBUG" in your preprocessor defines, and disable optimization in the compiler (let me know if you are not sure how to do this), and set a breakpoint on line 48 in app_error.c. Does that breakpoint ever get hit if you are debugging?
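To illustrate why the error handler is worth checking first: an APP_ERROR_CHECK-style macro hands the failing code, together with the file and line of the call site, to a handler. The sketch below is a self-contained mock of that idea, not the SDK's implementation; every `demo_`/`DEMO_` name and the structure layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of the last failure; field names are assumptions. */
typedef struct
{
    uint32_t    err_code;
    uint32_t    line_num;
    const char *p_file_name;
} demo_error_info_t;

demo_error_info_t demo_last_error;
unsigned int      demo_error_count;

/* Mock handler: stores where the failure came from. On target, a breakpoint
 * placed here would let the debugger show the code, line and file. */
void demo_error_handler(uint32_t err_code, uint32_t line_num, const char *p_file_name)
{
    assert(p_file_name != NULL);
    demo_last_error.err_code    = err_code;
    demo_last_error.line_num    = line_num;
    demo_last_error.p_file_name = p_file_name;
    demo_error_count++;
}

/* Calls the handler only when the checked expression is non-zero. */
#define DEMO_ERROR_CHECK(ERR_CODE)                                      \
    do                                                                  \
    {                                                                   \
        const uint32_t local_err_code = (ERR_CODE);                     \
        if (local_err_code != 0u)                                       \
        {                                                               \
            demo_error_handler(local_err_code, __LINE__, __FILE__);     \
        }                                                               \
    } while (0)

/* Hypothetical API call that fails with a non-zero code. */
uint32_t demo_failing_call(void)
{
    return 8u;
}

void demo_run(void)
{
    DEMO_ERROR_CHECK(0u);                  /* success: handler is not called */
    DEMO_ERROR_CHECK(demo_failing_call()); /* failure: call site is recorded */
}
```

A return value that is never passed through such a check fails silently, which is why unchecked calls are worth reviewing as well.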

    Another possibility, but this is a shot in the dark:

    int tibia_dt = tibia_info.timestamp - prev_tibia_timestamp; 

    What type of variables are tibia_info.timestamp and prev_tibia_timestamp?

    int is not universally defined, so maybe you can try to set it to the same as tibia_info.timestamp and prev_tibia_timestamp, and see if that helps.

    But try to see if the error handler is catching something.

    Best regards,

    Edvin

  • Hello Edvin,

    Thanks for your response. I redefined the integer type of tibia_dt and femur_dt to uint32_t as the .timestamp field of the struct is defined as uint32_t, but this did not resolve the connection issues. Your suggestion that the application is crashing when it enters update_knee_angle() and consequently causes the connection to timeout seems like the most likely scenario here.
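For reference, the timestamp comparison can be sketched in isolation. This is a minimal sketch assuming both timestamps are uint32_t tick counters; the `demo_` names are hypothetical and are not taken from the project. It suggests that holding the deltas in uint32_t locals gives the same result as the inline comparison, including across counter wrap-around, which would be consistent with the fault having another cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Elapsed ticks between two timestamps. Unsigned subtraction wraps modulo
 * 2^32, so the result stays correct after the counter rolls over. */
uint32_t demo_elapsed(uint32_t now, uint32_t prev)
{
    return now - prev;
}

/* Comparison written inline, as in the working code. */
bool demo_same_interval_inline(uint32_t tibia_now, uint32_t tibia_prev,
                               uint32_t femur_now, uint32_t femur_prev)
{
    return (tibia_now - tibia_prev) == (femur_now - femur_prev);
}

/* Comparison written with named locals, as in the refactored code. */
bool demo_same_interval_locals(uint32_t tibia_now, uint32_t tibia_prev,
                               uint32_t femur_now, uint32_t femur_prev)
{
    const uint32_t tibia_dt = demo_elapsed(tibia_now, tibia_prev);
    const uint32_t femur_dt = demo_elapsed(femur_now, femur_prev);
    return tibia_dt == femur_dt;
}

/* Small self-check: both forms agree, with and without wrap-around. */
void demo_check_equivalence(void)
{
    assert(demo_same_interval_inline(110u, 100u, 60u, 50u) ==
           demo_same_interval_locals(110u, 100u, 60u, 50u));
    assert(demo_same_interval_inline(5u, 0xFFFFFFFBu, 60u, 50u) ==
           demo_same_interval_locals(5u, 0xFFFFFFFBu, 60u, 50u));
}
```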

    APP_ERROR_CHECK(err_code) invocations are used throughout the application when a SD or nRF API function is called, such as during the general BLE initialisation and transmit/receive routines. However, I do not currently call any API functions in update_knee_angle(). It is my understanding the err_code is a value generally returned from an API function call, so could you elaborate how I could use the error code return value to check what kind of error is being caused in update_knee_angle(), given that I don't have any API function calls in update_knee_angle()? Do you suggest that I check the error codes in the other routines outside of update_knee_angle()?

    The minimum level of optimization available in the C/C++ tab of the Project Options appears to be O0, which doesn't fully disable optimization but provides the minimum level of optimization. Is there a way to fully disable optimization here?

    I will be able to do some further debugging once my above queries have been clarified,

    Best regards,

  • inf_sup_bus said:
    I redefined the integer type of tibia_dt and femur_dt to uint32_t as the .timestamp field of the struct is defined as uint32_t

     Ok. This was a shot in the dark, but I have seen something like this before: writing a value that is too long for its destination (e.g. a uint8_t) can cause hard faults, because the area it overwrites may be on the stack, and then there is no way of knowing what can happen.

     

    inf_sup_bus said:
    Do you suggest that I check the error codes in the other routines outside of update_knee_angle()?

     Yes. Most definitely. And the easiest way to do this is:

     

    Edvin said:
    Have you tried to check if the error handler fires? Try to define "DEBUG" in your preprocessor defines, and disable optimization in the compiler (let me know if you are not sure how to do this), and set a breakpoint on line 48 in app_error.c. Does that breakpoint ever get hit if you are debugging?

     Setting O0 should be sufficient. With optimization enabled you can still set the breakpoint on this line, but you may not be able to see the values of the variables that tell you where the APP_ERROR_CHECK() that triggered is located. You should be able to see the variable values when you use O0.

    So does the breakpoint trigger after defining DEBUG in your preprocessor defines?

    BR,

    Edvin

  • So does the breakpoint trigger after defining DEBUG in your preprocessor defines?

    The debugger does not halt the program at L48 in app_error.c. However, I did notice that after stopping the debugger, the 'code pointer' (yellow arrow in uVision) was pointing to line 154 of arm_startup_nrf51.s. This section is pertinent to the HardFault_Handler. The caller code for the HardFault_Handler appears to be an ARM ABI helper-function (page 17 of http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf). The assembler code disassembly window shows that it's occurring around SEGGER_RTT_printf() and ble_central_adv_information_update() function calls. I've also attached a snippet of the suspected location where the hard fault is occurring. Doing some research, I found the following resources useful:

    1. http://www.keil.com/appnotes/files/apnt209.pdf pp. 17-18
    2. Susheel Nuguru's first response to Nordic Devzone case 107574  

    I've attached a screenshot of the debugging view in Keil showing the debugger view. Following the Keil application note advice, the program counter (0x20005E60) contains value 0xB6D70200. Wanting to find out what the next instruction was that caused the fault, I searched the memory to find the value at 0xB6D70200. The result is unusual, as attached: the value here is just question marks. I'm not sure what this means; am I accessing an invalid memory location, or is the value here unknown or undefined?

    Following the advice from case 107574, the address at SP+0x14 is 0x20005E5C which has value 0xB7D70200. My project directory doesn't seem to contain a .map file as Susheel was suggesting, so I couldn't attempt to find the function associated with that location. The value at the location 0xB7D70200 was also displayed as question marks.

    I did note a discrepancy between the two sources I've consulted: the Keil application note states that the program counter is located at an offset of 0x18 from the main stack pointer, but case 107574 implies that the program counter is offset by 0x14. Which should I use?
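On the two offsets: as far as I know, exception entry on the Cortex-M0 (ARMv6-M) pushes eight words in the order R0, R1, R2, R3, R12, LR, PC, xPSR. If that is right, SP+0x14 holds the stacked LR (the return address of the interrupted function's caller) and SP+0x18 holds the stacked PC (the faulting instruction), so the two sources may simply be reading different registers. The sketch below only encodes that layout as a struct; the `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the frame pushed on exception entry (ARMv6-M). */
typedef struct
{
    uint32_t r0;   /* SP + 0x00 */
    uint32_t r1;   /* SP + 0x04 */
    uint32_t r2;   /* SP + 0x08 */
    uint32_t r3;   /* SP + 0x0C */
    uint32_t r12;  /* SP + 0x10 */
    uint32_t lr;   /* SP + 0x14 */
    uint32_t pc;   /* SP + 0x18 */
    uint32_t xpsr; /* SP + 0x1C */
} demo_exception_frame_t;

/* Read the stacked PC from a word-addressed copy of the stack. */
uint32_t demo_stacked_pc(const uint32_t *p_stack)
{
    assert(p_stack != NULL);
    return p_stack[offsetof(demo_exception_frame_t, pc) / sizeof(uint32_t)];
}

/* Read the stacked LR from a word-addressed copy of the stack. */
uint32_t demo_stacked_lr(const uint32_t *p_stack)
{
    assert(p_stack != NULL);
    return p_stack[offsetof(demo_exception_frame_t, lr) / sizeof(uint32_t)];
}
```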

  • Are you using FreeRTOS in your project?

    What does your ble_central_adv_update() look like?

  • This project is not using FreeRTOS. The snippet of ble_central_adv_update() is attached.

    Would you agree that the application is likely crashing because '(int16_t) knee_angle_deg' is being copied into an unallocated space in memory, which is an illegal memory access, thus firing the HardFault_Handler?

    My reasoning is that the handler is more likely to be fired by an illegal memory access than by copying a 16-bit value into memory pointed to by a pointer to uint8_t (which is knee_angle in this case), since the word length is 32 bits on the Cortex-M0. As far as I know, no packing is used in the source code.
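As a side note on the 16-bit copy: in standard C, assigning an int16_t to a single uint8_t element converts the value modulo 256, so only the low byte is kept and no neighbouring memory is written. A minimal sketch of carrying both bytes in a byte buffer follows, assuming a little-endian payload and a buffer of at least two bytes; the `demo_` names are hypothetical and not from the project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a plain element assignment does: keeps the low 8 bits only. */
uint8_t demo_truncate_to_byte(int16_t value)
{
    return (uint8_t)value;
}

/* Write a signed 16-bit value into a byte buffer, low byte first.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t demo_pack_int16_le(int16_t value, uint8_t *p_buf, size_t buf_len)
{
    assert(p_buf != NULL);
    if (buf_len < 2u)
    {
        return 0u;
    }
    const uint16_t bits = (uint16_t)value;
    p_buf[0] = (uint8_t)(bits & 0xFFu);
    p_buf[1] = (uint8_t)(bits >> 8);
    return 2u;
}

/* Rebuild the signed value from two little-endian bytes. */
int16_t demo_unpack_int16_le(const uint8_t *p_buf)
{
    assert(p_buf != NULL);
    const uint16_t bits = (uint16_t)((uint16_t)p_buf[0] | ((uint16_t)p_buf[1] << 8));
    return (bits & 0x8000u) ? (int16_t)((int32_t)bits - 65536) : (int16_t)bits;
}
```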

  • This is your peripheral, right? I believe it is, but I get a bit confused by the name (ble_central_adv_information_update).

    You should check the return values of your softdevice function calls, such as sd_ble_gatts_hvx();

    I suggest that you change the function from void ble_central_adv_information_update() to uint32_t ble_central_adv_information_update(), and return the value from last function call:

    uint32_t ble_central_adv_information_update(ble_mpu_c_t *p_mpu, uint8_t *knee_angle)
    {
        if (p_mpu->conn_handle != BLE_CONN_HANDLE_INVALID)
        {
            ... // All the things that you already had
            hvx_params.p_data = (uint8_t *)knee_angle;
            // Propagate the SoftDevice return value to the caller
            return sd_ble_gatts_hvx(p_mpu->conn_handle, &hvx_params);
        }
        else // p_mpu->conn_handle == BLE_CONN_HANDLE_INVALID
        {
            return NRF_ERROR_INVALID_STATE;
        }
    }

    and then check the return value from your function:

    uint32_t err_code;
    
    knee_angle[0] = (int16_t) knee_angle_deg;
    err_code = ble_central_adv_information_update(&m_ble_mpu_c, knee_angle);
    
    if (err_code != NRF_ERROR_INVALID_STATE)
    {
        APP_ERROR_CHECK(err_code);
    }

    This way it is easier to check what the application is doing. What is the return value from ble_central_adv_information_update(&m_ble_mpu_c, knee_angle)?

  • Hello Edvin,

    I have adjusted the code to how you have suggested. These are the values of the error code, line number and file names in the error_info_t after the application stopped at the breakpoint in app_error_handler.

    The error code has value 0x3401 or 13313 decimal.

    Ok. Closing in. Can you hover over p_file_name and see if you can find the name of the file that it is pointing to? Alternatively, copy the Memory 1 values that you have in your screenshot and convert them to ASCII/char values to see the path. It is somewhere in components\Leg<something-something>. Should be a .c file.

    If you remove the first breakpoint and add the second, it should update the error_info variable, but you can see that the line number "line_num" that is passed onto this function is 0x196 = decimal: 406. So in one of the files in the path components\Leg... there is a APP_ERROR_CHECK(err_code) on line 406. This one receives an err_code != 0. What is the function that returned this err_code?

    EDIT:
    I checked your path, but I believe you can see it if you hover the mouse over it. In case not:

    So your file knee_angle.c has an APP_ERROR_CHECK(err_code); on line 406. What function returned this value? If it is not a direct softdevice call, what function inside this function returned this return value?

    Best regards,

    Edvin

  • Hello,

    Yes there is definitely an APP_ERROR_CHECK(err_code) function call in line 406 in knee_angle.c. Removing the first breakpoint and adding the second breakpoint (where the second breakpoint is on line 54 of app_error.c) returns the same error information in the error_info_t structure as in my previous post.

    The err_code in this context is not directly returned by a SoftDevice call; it is returned by ble_central_adv_information_update(...), which is defined exactly as you recommended in your response on the 12th August.

    The err_code variable is being returned from sd_ble_gatts_hvx(...), which is called within ble_central_adv_information_update(...). The value of the error code is 13313 decimal or 0x3401, which is the same error code as my last post (essentially, it's the same error that's occurring). This sd_ble_gatts_hvx(...) call is performing a notification, where the knee angle is being transmitted to a remote mobile phone.
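For what it is worth, the value 0x3401 can be decoded by hand. The constants in the sketch below are assumptions recalled from the S130 v2.x headers (nrf_error.h and ble_gatts.h) and should be checked against the headers in the project; the `demo_`/`DEMO_` names are hypothetical. If they hold, 0x3401 is the GATTS error base plus one, which would correspond to BLE_ERROR_GATTS_SYS_ATTR_MISSING, i.e. sd_ble_gatts_hvx() reporting that the system attributes for this connection have not been set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; verify against nrf_error.h and ble_gatts.h. */
#define DEMO_NRF_ERROR_STK_BASE_NUM            0x3000u
#define DEMO_NRF_GATTS_ERR_BASE                (DEMO_NRF_ERROR_STK_BASE_NUM + 0x400u)
#define DEMO_BLE_ERROR_GATTS_INVALID_ATTR_TYPE (DEMO_NRF_GATTS_ERR_BASE + 0x000u)
#define DEMO_BLE_ERROR_GATTS_SYS_ATTR_MISSING  (DEMO_NRF_GATTS_ERR_BASE + 0x001u)

/* Map a code to the name it would have under the assumptions above. */
const char *demo_gatts_error_name(uint32_t err_code)
{
    switch (err_code)
    {
        case DEMO_BLE_ERROR_GATTS_INVALID_ATTR_TYPE:
            return "BLE_ERROR_GATTS_INVALID_ATTR_TYPE";
        case DEMO_BLE_ERROR_GATTS_SYS_ATTR_MISSING:
            return "BLE_ERROR_GATTS_SYS_ATTR_MISSING";
        default:
            return "unknown";
    }
}

/* True when the code matches the assumed SYS_ATTR_MISSING value. */
bool demo_is_sys_attr_missing(uint32_t err_code)
{
    return err_code == DEMO_BLE_ERROR_GATTS_SYS_ATTR_MISSING;
}

/* Self-check with the value reported in this thread (13313 decimal). */
void demo_check_reported_code(void)
{
    assert(demo_is_sys_attr_missing(13313u));
}
```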
