Instruction Access Violation when sending data over Bluetooth

Hi Nordic Devzone,

I am trying out a custom board based on the nRF52840. I have already implemented a few things, mostly basic Bluetooth communication and gathering data from the sensors on the board, and everything has worked fine so far. Now I would like to send the gathered data to a phone app using a process developed for another project, so I adapted that code to work on Zephyr. However, I keep getting the following error right after I send data over Bluetooth:

[00:00:58.587,066] <err> os: ***** MPU FAULT *****
[00:00:58.587,463] <err> os:   Instruction Access Violation
[00:00:58.587,951] <err> os: r0/a1:  0x20003260  r1/a2:  0x20009988  r2/a3:  0x00000000
[00:00:58.588,592] <err> os: r3/a4:  0x200099c0 r12/ip:  0x0000001e r14/lr:  0x0003df3d
[00:00:58.589,233] <err> os:  xpsr:  0x60000000
[00:00:58.589,721] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x00000000  s[ 2]:  0x00000000  s[ 3]:  0x00000000
[00:00:58.590,484] <err> os: s[ 4]:  0x00000000  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
[00:00:58.591,247] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:58.592,010] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:58.592,773] <err> os: fpscr:  0x2000d1d9
[00:00:58.593,200] <err> os: Faulting instruction address (r15/pc): 0x200099c0
[00:00:58.593,780] <err> os: >>> ZEPHYR FATAL ERROR 20: Unknown error on CPU 0
[00:00:58.594,360] <err> os: Current thread: 0x20003138 (BT RX)
[00:00:58.599,853] <err> fatal_error: Resetting system

I can't figure out why I get this error, because I use the same function to send data over Bluetooth in other parts of the code and it works fine. I based my code on the NUS service sample. Here are some extracts of the parts that cause problems:

sync_manager.c :

uint8_t data[6];
data[0] = (uint8_t)evt->area_id;  // Data char. ID (manual or automatic)
data[1] = (uint8_t)(pkt_data[evt->area_id].timestamp & 0xFF);
data[2] = (uint8_t)((pkt_data[evt->area_id].timestamp >> 8) & 0xFF);
data[3] = (uint8_t)((pkt_data[evt->area_id].timestamp >> 16) & 0xFF);
data[4] = (uint8_t)((pkt_data[evt->area_id].timestamp >> 24) & 0xFF);

if (evt->area_id == AREA_STORAGE) {
    data[5] = pkt_data[evt->area_id].end_reason;
} else {
    data[5] = (uint8_t)ds_get_setting(DS_END_PROCESSED_REAS_INDEX);
}

ble_eqss_control_queue(evt->area_id == AREA_STORAGE ? PERIPHERAL_SYNC_SESSION_REQ : PERIPHERAL_SYNC_REQ, data, sizeof(data));

The error occurs when ble_eqss_control_queue is called. It is defined as follows:

static uint8_t _queue_command(uint8_t opcode, uint8_t *data, uint8_t length)
{
    if (length > EQS_CONTROL_DATA_MAX_LEN)
    {
        LOG_ERR("Data length too long");
        return 0;
    }
    else
    {
        uint8_t dest[EQS_CONTROL_DATA_MAX_LEN];
        memcpy(&dest[0], &opcode, sizeof(uint8_t));
        memcpy(&dest[1], &length, sizeof(uint8_t));
        memcpy(&dest[2], data, length);
        char str[60];
        sprintf(str, "Building Opcode %.2x - Length %.2x - Payload", opcode, length);
        LOG_HEXDUMP_INF(data, length, str);

        int8_t ret = control_send(NULL, dest, length + 2);
        if (ret == 0)
        {
            return 1;
        }
        else
        {
            LOG_ERR("Failed to send opcode %.2x (%i)", opcode, ret);
            return 0;
        }
    }
}

uint32_t ble_eqss_control_queue(uint8_t opcode, uint8_t* data, uint8_t len){
    return _queue_command(opcode, data, len);
}

And finally, control_send, which does the actual sending, is almost identical to bt_nus_send from the NUS sample:

int control_send(struct bt_conn *conn, const uint8_t *data, uint16_t len)
{
	struct bt_gatt_indicate_params params = {0};
	const struct bt_gatt_attr *attr = &eqs_service.attrs[2];

	params.attr = attr;
	params.data = data;
	params.len = len;
	params.func = control_on_sent;

	if (!conn)
	{
		LOG_DBG("Indication sent to all connected peers (control)");
		return bt_gatt_indicate(NULL, &params);
	}
	else if (bt_gatt_is_subscribed(conn, attr, BT_GATT_CCC_INDICATE))
	{
		return bt_gatt_indicate(conn, &params);
	}
	else
	{
		return -EINVAL;
	}
}

All these functions work properly in other parts of the code, but for some reason I get the error above when using them in sync_manager. This is confusing, because I don't think I am using them any differently than before. Some examples:

uint8_t data[AREA_COUNT-1];
for (int ii = 0; ii < AREA_COUNT-1; ii++) {
    data[ii] = storage_buffer_get_nb_area(ii);
}
EQS_INFO("Number of sessions in RAW_IMU: %u - RAW_ECG: %u - PROCESSED: %u", data[AREA_RAW_IMU], data[AREA_RAW_ECG], data[AREA_PROCESSED]);
ble_eqss_control_queue(RESPONSE_BIT | PERIPHERAL_SYNC_MEM_STATUS, data, sizeof(data));

static void _resp_sys_serial()
{
    uint8_t temp[8] = {0};
    hwinfo_get_device_id(temp, 8);
    _queue_command((uint32_t)(PERIPHERAL_SYS_SERIAL | RESPONSE_BIT), temp, 8);
    return;
}

I used arm-none-eabi-addr2line to look up the faulting instruction address reported in the error, but it only returned "??:0".

When I looked around I found that this type of error is often related to thread stack size, so I used the thread analyzer to see the thread stack usage. At some point, the main stack was reaching 95% stack usage, so I increased CONFIG_MAIN_STACK_SIZE to 4096. Because the error states that the thread in which the issue occurs is BT_RX, I also increased CONFIG_BT_RX_STACK_SIZE and CONFIG_BT_HCI_TX_STACK_SIZE to 4096 to be safe, but it did not solve the problem.

Do you have any idea what I could try next to solve this issue, please?

Thanks in advance for your help.

Nicolas G.

  • At some point, the main stack was reaching 95% stack usage, so I increased CONFIG_MAIN_STACK_SIZE to 4096. Because the error states that the thread in which the issue occurs is BT_RX, I also increased CONFIG_BT_RX_STACK_SIZE and CONFIG_BT_HCI_TX_STACK_SIZE to 4096 to be safe, but it did not solve the problem.

    I think you are still on the right track. This still seems like a buffer/stack issue. Make sure that the buffers (pointers) you pass are valid: make your application buffers static so that their memory stays valid throughout the run.

    Also check whether you need to increase CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE.

  • Hi Susheel, thanks for your reply,

    I looked into the pointers I pass to the ble_eqss_control_queue function, but nothing stood out. Also, since I posted, I noticed that I do get a response, and the data I receive in the nRF Connect Android app is correct, so the data I pass must be okay.

    I ran the Thread Analyzer again to see if the system workqueue stack size needed increasing, but I don't see it in the list of threads reported by the Analyzer. I increased it to 4096 anyway, but there was no improvement.

  • [00:00:58.593,200] <err> os: Faulting instruction address (r15/pc): 0x200099c0

    The faulting instruction address seems to be in RAM. Unless you have explicitly made some code run from RAM, this still looks to me like a bad pointer dereference or a stack overflow.

    You need to figure out which thread owns the memory at address 0x200099c0.

    You should be able to rebuild the project with CONFIG_ARM_MPU=n, which in turn allows code execution from RAM without triggering the MPU fault. That helps us debug further and find the context and nature of this fault. You should then be able to place a breakpoint at the faulting RAM address and (hopefully) find out exactly where the branch to this invalid address occurred in the first place. In the VS Code extension you should be able to set a breakpoint on RAM address access.
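As a rough aid for reading the fault dump, the sketch below classifies an address as flash or RAM. It assumes the nRF52840 memory map of 1 MB flash starting at 0x00000000 and 256 KB RAM starting at 0x20000000. The names classify and thumb_code_addr are hypothetical helpers, not part of Zephyr or the nRF Connect SDK.

```c
#include <stdint.h>

/* Assumed nRF52840 memory map: 1 MB flash at 0x00000000, 256 KB RAM at 0x20000000. */
#define FLASH_BASE 0x00000000u
#define FLASH_SIZE 0x00100000u
#define RAM_BASE   0x20000000u
#define RAM_SIZE   0x00040000u

enum region { REGION_FLASH, REGION_RAM, REGION_OTHER };

/* Hypothetical helper: tell which memory region an address falls in. */
static enum region classify(uint32_t addr)
{
    if (addr - FLASH_BASE < FLASH_SIZE) {
        return REGION_FLASH;
    }
    /* Unsigned wrap-around makes addresses below RAM_BASE fail this test. */
    if (addr - RAM_BASE < RAM_SIZE) {
        return REGION_RAM;
    }
    return REGION_OTHER;
}

/* Hypothetical helper: clear the Thumb bit (bit 0) that Cortex-M sets in
 * code addresses held in lr, to get the address to look up in the ELF. */
static uint32_t thumb_code_addr(uint32_t lr)
{
    return lr & ~1u;
}
```

With the values from the log, classify(0x200099c0) gives REGION_RAM, which would explain why addr2line finds no source line for the pc. The lr value 0x0003df3d falls in flash, so passing thumb_code_addr(0x0003df3d) to arm-none-eabi-addr2line may give a more useful hint, although lr is not guaranteed to point at the code that made the bad branch.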

  • Thanks again for your help. I am not sure how to go about doing what you advise, though.

    Using the nRF Debug tab in VS Code, I get the following when the program crashes:

    What I gather from this is that:
    1/ The stack usage seems to be okay for all the threads at this particular moment.

    2/ After a few modifications to test some things, the faulting instruction is now located at 0x2000bf48. Looking at this address in the memory explorer, I can see that it is part of z_main_stack. I imagine that means the BT RX thread tries to access 0x2000bf48, which it does not have access to, causing the fault. Am I correct? However, I did not find how to place a breakpoint at the faulting RAM address. Do you have any tips on how to do that, please?

  • Following the answer in this post: https://stackoverflow.com/questions/63605692/how-can-i-set-a-c-c-memory-watchpoint-in-vscode

    I was able to set a watchpoint on the address reported in the error. However, the breakpoints seem to interfere with Bluetooth, so I can't really see what triggers the error.

  • Nicolas Goualard said:
    I can see that this address is part of z_main_stack. I imagine that means the BT RX thread tries to access 0x2000bf48, which it does not have access to, causing the fault. Am I correct?

    Sorry for the delayed response, Nicolas. It does not make sense for the BT RX thread to access the memory of the main stack, unless buffers are exchanged between these two threads.

    It is still important for us to get the context of the failure.

    Try running the thread viewer to get an overview of all threads and their memory usage.

  • struct bt_gatt_indicate_params params = {0};

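    Make this variable global (or static), because the params must stay valid after control_send() returns. Try again after making it global; that should work.

    bt_gatt_indicate() is asynchronous. With params on the stack of control_send(), the control_on_sent function pointer stored in it is later read from stack memory that is no longer valid.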
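The lifetime problem can be illustrated with a small host-side model. This is only a sketch: struct indicate_params, model_indicate, model_confirm and control_send_model are hypothetical stand-ins, not the Zephyr types or functions. The assumption, based on the Zephyr documentation note that bt_gatt_indicate() parameters must remain valid while the procedure is active, is that the stack keeps only a pointer to params and calls params->func later from another context.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bt_gatt_indicate_params (not the Zephyr type). */
struct indicate_params {
    void (*func)(int err);   /* completion callback */
    const uint8_t *data;
    uint16_t len;
};

static struct indicate_params *pending;  /* the model keeps only this pointer */
static int confirmed_count;

/* Models an asynchronous indicate call: remember params, return at once. */
static int model_indicate(struct indicate_params *params)
{
    if (params == NULL || params->func == NULL) {
        return -1;
    }
    pending = params;
    return 0;
}

/* Models the peer's confirmation arriving later, in another context. */
static void model_confirm(void)
{
    if (pending != NULL) {
        struct indicate_params *p = pending;
        pending = NULL;
        p->func(0);  /* a dangling pointer here would jump to a garbage address */
    }
}

static void on_sent(int err)
{
    if (err == 0) {
        confirmed_count++;
    }
}

/* Sender with static params: the storage outlives the function call. */
static int control_send_model(const uint8_t *data, uint16_t len)
{
    static struct indicate_params params;

    if (pending != NULL) {
        return -2;  /* do not overwrite params that are still in use */
    }
    params.func = on_sent;
    params.data = data;
    params.len = len;
    return model_indicate(&params);
}
```

Calling control_send_model() and then model_confirm() increments confirmed_count once. If params were an automatic variable instead, model_confirm() would read func from a stack frame that has already been reused, which matches a fault whose pc points into RAM. The busy check matters because a single static params can track only one indication at a time.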
