Data Bus Error

Hello,

IDE - Visual Studio Code

SDK version - 2.9.0

Computer platform - Win11

  I'm at the early stages of a project which first requires to send ~10K of data to the nrf52dk (configured as 52832) in a series of slices between 300 to ~2,000 bytes per slice.  This is done as write without response so there are three stages each with a separate characteristic:

1. Start transfer sending the total number of bytes which will eb sent

2.Repeated transfers until the number of bytes is reached

3. End transfer detailign how many bytes will be expanded to using LZ4 (though this is not implemented yet)

  I've got a GATT characteristic for each item:

Start Transfer

/**
 * @brief This is called by the GATT service definition, it contains the number of bytes to receieve
 * as a 32 bit little endian unsigned number
 */
static ssize_t start_vid_trans(struct bt_conn *conn,
			  const struct bt_gatt_attr *attr,
			  const void *buf,
			  uint16_t len,
			  uint16_t offset, uint8_t flags)
{
	k_mutex_lock(&trans_mutex, K_FOREVER);
	LOG_INF("Attribute write, handle: %u, conn: %p", attr->handle,
		(void *)conn);

    uint8_t *comp_size_data = (uint8_t *) buf;
    uint32_t num_bytes = comp_size_data[0] | (comp_size_data[1] << 8) | (comp_size_data[2] << 16) | (comp_size_data[3] << 24);
    LOG_DBG("num bytes is %" PRIu32 , num_bytes);
	bw_callbacs.bw_start_trans(num_bytes);
	LOG_INF("returning success");
	return BT_GATT_ERR(BT_ATT_ERR_SUCCESS);
}

  This calls a separate method which mallocs the memory for the data:

/**
 * @brief This is called to notify that a series of bytes will be sent
 * which will represent a compressed image
 */
static int app_bw_start_trans(uint32_t compressed_size){

	if(image_data != NULL){
		cleanup_image_data();
	}
    num_compressed_bytes = compressed_size;
	LOG_INF("Allocating memory to image_data with %" PRIu32 " bytes", num_compressed_bytes);
    image_data = k_malloc(num_compressed_bytes * sizeof(uint8_t));
	memset(image_data, 0, compressed_size);
	LOG_INF("Allocated memory to image_data");
	if (image_data == NULL){
		LOG_WRN("Failed to allocate memory to new image_data buffer");
	}
	
	return 0;
}

  Then the data is repeatedly transferred and the image_data variable has the data appended to it until we reach the end:

/**
 * @brief This is called by the GATT service definition,it contains the number of 
 * bytes of the uncompressed data as a 32 bit little endian unsigned number
 */
static ssize_t end_vid_trans(struct bt_conn *conn,
			  const struct bt_gatt_attr *attr,
			  const void *buf,
			  uint16_t len,
			  uint16_t offset, uint8_t flags)
{

	LOG_INF("Attribute write, handle: %u, conn: %p", attr->handle,
		(void *)conn);
        
    uint8_t *uncomp_size_data = (uint8_t *) buf;
    uint32_t uncompressed_size = uncomp_size_data[0] | (uncomp_size_data[1] << 8) | (uncomp_size_data[2] << 16) | (uncomp_size_data[3] << 24);
    LOG_DBG("we have %" PRIu32 " uncompressed bytes", uncompressed_size);

	bw_callbacs.bw_end_trans(uncompressed_size);
	k_mutex_unlock(&trans_mutex);
	LOG_INF("returning success");
	return BT_GATT_ERR(BT_ATT_ERR_SUCCESS);

}

The end basically just frees up the image_data memory (we'll add the implementation to actually  process the data later):

/**
 * @brief This is called at the end of data trasmit and will prompt the image to stored
 * in permanent memory
 */
static int app_bw_end_trans(uint32_t uncompressed_size){
    LOG_INF("End trans with %" PRIu32 " bytes", uncompressed_size);
	
	LOG_INF("cleaning image data");
	k_free(image_data);
	LOG_INF("freed image data memory");
	image_data = NULL;
    image_offset = 0;
	LOG_INF("set image offset to zero");
	
	return 0;
}

  When I execute this I have success for the first ~5K but then I get the following error:

[00:00:16.530,242] <err> os: ***** BUS FAULT *****
[00:00:16.530,303] <err> os:   Precise data bus error
[00:00:16.530,334] <err> os:   BFAR Address: 0x3e00bd1e
[00:00:16.530,364] <err> os: r0/a1:  0x3e00bd1e  r1/a2:  0x20007f00  r2/a3:  0x784283a5
[00:00:16.530,395] <err> os: r3/a4:  0x00000040 r12/ip:  0x00000004 r14/lr:  0x00021207
[00:00:16.530,426] <err> os:  xpsr:  0x21000000
[00:00:16.530,456] <err> os: Faulting instruction address (r15/pc): 0x00023cc0
[00:00:16.530,548] <err> os: >>> ZEPHYR FATAL ERROR 25: Unknown error on CPU 0
[00:00:16.530,609] <err> os: Current thread: 0x200027b8 (unknown)

  I've used coredump to get a stack trace from this which gives:

(gdb) target remote localhost:1234
Remote debugging using localhost:1234
z_spin_lock_valid (l=0x585f0016, l@entry=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/kernel/spinlock_validate.c:11
11              uintptr_t thread_cpu = l->thread_cpu;
(gdb) bt
#0  z_spin_lock_valid (l=0x585f0016, l@entry=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/kernel/spinlock_validate.c:11
#1  0x000211f6 in z_spinlock_validate_pre (l=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/include/zephyr/spinlock.h:136
#2  k_spin_lock (l=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/include/zephyr/spinlock.h:193
#3  k_heap_free (heap=0x0 <store_video>, mem=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/kernel/kheap.c:151
#4  0x000211f6 in z_spinlock_validate_pre (l=0x2000ad28 <net_buf_data_hci_cmd_pool+140>)
    at C:/ncs/v2.9.0/zephyr/include/zephyr/spinlock.h:136
#5  k_spin_lock (l=0x2000ad28 <net_buf_data_hci_cmd_pool+140>) at C:/ncs/v2.9.0/zephyr/include/zephyr/spinlock.h:193
#6  k_heap_free (heap=0xa4ce2045, mem=0x40 <_irq_vector_table>) at C:/ncs/v2.9.0/zephyr/kernel/kheap.c:151
#7  0x200008b0 in fragments ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

  I'm not sure where to go next; do we just have a case that the calls from the other end (currently a test python script sending data using bleak) is running much faster than the nordic and just overruns the nordic as it cannot keep up with the data?  I'm not sure about that though because I'm not sending _huge_ amounts of data.  If that is the case do I setup a notify system between each slice to give a handshake to central so it 'slows down'?  I've tried increasing the stack size up to 16K but it still fails at the same place.

  Thanks for your help.

Cheers,

Neil

Parents
  • Hi Neil,

      When I execute this I have success for the first ~5K but then I get the following error:

    Thank you for a detailed context of your setup. It very much seems like an assert caused by  memory overflow but the logs does not give much contexts and seems to hide away the details we need.

    Can you please enable the below configs in your prj.conf, pristine build, flash and show the fault logs againi, so that we get the context of this assert? Also show the thread analyzer output so that we can see if all the threads have enough stack space available.

    # Debugging configuration
    CONFIG_THREAD_NAME=y
    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_ANALYZER_AUTO=y
    CONFIG_THREAD_ANALYZER_RUN_UNLOCKED=y
    CONFIG_THREAD_ANALYZER_USE_PRINTK=y
    
    # add asserts
    CONFIG_ASSERT=y
    CONFIG_ASSERT_VERBOSE=y
    CONFIG_ASSERT_NO_COND_INFO=n
    CONFIG_ASSERT_NO_MSG_INFO=n
    CONFIG_RESET_ON_FATAL_ERROR=n
    CONFIG_THREAD_NAME=y

  • Hello,

      Thanks for waiting, unfortunately the fault log from gdb gave exactly the same thing:

    (gdb) target remote localhost:1234
    Remote debugging using localhost:1234
    z_spin_lock_valid (l=0x26000014, l@entry=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/kernel/spinlock_validate.c:11
    11              uintptr_t thread_cpu = l->thread_cpu;
    (gdb) bt
    #0  z_spin_lock_valid (l=0x26000014, l@entry=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/kernel/spinlock_validate.c:11
    #1  0x00021666 in z_spinlock_validate_pre (l=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/include/zephyr/spinlock.h:136
    #2  k_spin_lock (l=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/include/zephyr/spinlock.h:193
    #3  k_heap_free (heap=0x0 <store_video>, mem=0x0 <store_video>) at C:/ncs/v2.9.0/zephyr/kernel/kheap.c:151
    #4  0x00021666 in z_spinlock_validate_pre (l=0x2000b2b8 <_net_buf_hci_rx_pool+28>)
        at C:/ncs/v2.9.0/zephyr/include/zephyr/spinlock.h:136
    #5  k_spin_lock (l=0x2000b2b8 <_net_buf_hci_rx_pool+28>) at C:/ncs/v2.9.0/zephyr/include/zephyr/spinlock.h:193
    #6  k_heap_free (heap=0x90d42017, mem=0x40 <_irq_vector_table>) at C:/ncs/v2.9.0/zephyr/kernel/kheap.c:151
    #7  0x200008e4 in hci_cmd_pool ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

      I assume the output from the thread analyzer is what is printed out at the start when the device boots up; the log is as follows:

    Thread analyze:
     BT CTLR ECDH        : STACK: unused 256 usage 704 / 960 (73 %); CPU: 14 %
                         : Total CPU cycles used: 814
     BT RX WQ            : STACK: unused 2000 usage 240 / 2240 (10 %); CPU: 0 %
                         : Total CPU cycles used: 1
     thread_analyzer     : STACK: unused 600 usage 424 / 1024 (41 %); CPU: 11 %
                         : Total CPU cycles used: 770
     BT LW WQ            : STACK: unused 1152 usage 192 / 1344 (14 %); CPU: 0 %
                         : Total CPU cycles used: 2
     sysworkq            : STACK: unused 8720 usage 496 / 9216 (5 %); CPU: 0 %
                         : Total CPU cycles used: 63
     MPSL Work           : STACK: unused 736 usage 288 / 1024 (28 %); CPU: 1 %
                         : Total CPU cycles used: 89
     logging             : STACK: unused 1552 usage 496 / 2048 (24 %); CPU: 48 %
                         : Total CPU cycles used: 3883
     idle                : STACK: unused 272 usage 48 / 320 (15 %); CPU: 0 %
                         : Total CPU cycles used: 0
     ISR0                : STACK: unused 1792 usage 256 / 2048 (12 %)

      In terms of memory I've got :

    CONFIG_HEAP_MEM_POOL_SIZE=9216
    CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=9216

      The total amount of data transferred is 13376 bytes in 19 'slices' and the data from each slice is thrown away after it has been received (for now; later it'll be stored in flash).  The smallest slice is 127 bytes and the largest is 1627 bytes.

      I'm trying to find a way to get from the debugger what the memory allocation is after the error but not sure how.  Note that if I want to analyse the coredump I can't build for debug as there is an error in the coredump parser which blocks that.
      I've also tried to slow the sender down by putting a 100ms pause before every write to see what happens and the error is different:
    [00:00:13.475,646] <inf> brikwiz_base: Blueooth connected
    [00:00:14.243,560] <inf> brikwiz_base: Attribute write, handle: 0, conn: 0x20001df0
    [00:00:14.243,591] <inf> brikwiz_base: Allocating memory to image_data with 334 bytes
    [00:00:14.243,652] <inf> brikwiz_base: Allocated memory to image_data
    [00:00:14.243,652] <inf> brikwiz_base: returning success
    [00:00:14.468,902] <wrn> bt_hci_core: Unhandled event 0x00 len 10: 0200180276000602001f
    [00:00:14.993,438] <inf> brikwiz_base: Attribute write, handle: 0, conn: 0x20001df0
    [00:00:14.993,469] <inf> brikwiz_base: End trans with 10240 bytes
    [00:00:14.993,469] <inf> brikwiz_base: cleaning image data
    [00:00:14.993,499] <inf> brikwiz_base: freed image data memory
    [00:00:14.993,499] <inf> brikwiz_base: set image offset to zero
    [00:00:14.993,530] <inf> brikwiz_base: returning success
    [00:00:15.113,433] <inf> brikwiz_base: Attribute write, handle: 0, conn: 0x20001df0
    [00:00:15.113,433] <inf> brikwiz_base: Allocating memory to image_data with 838 bytes
    [00:00:15.113,555] <inf> brikwiz_base: Allocated memory to image_data
    [00:00:15.113,555] <inf> brikwiz_base: returning success
    [00:00:16.538,269] <err> os: ***** USAGE FAULT *****
    [00:00:16.538,299] <err> os:   Illegal use of the EPSR
    [00:00:16.538,330] <err> os: r0/a1:  0x2000b29c  r1/a2:  0x2000581c  r2/a3:  0xffffffff
    [00:00:16.538,330] <err> os: r3/a4:  0xffffffff r12/ip:  0x00000004 r14/lr:  0x00013d53
    [00:00:16.538,360] <err> os:  xpsr:  0x20000000
    [00:00:16.538,360] <err> os: Faulting instruction address (r15/pc): 0x000133dc
    [00:00:16.538,391] <err> os: >>> ZEPHYR FATAL ERROR 35: Unknown error on CPU 0
    [00:00:16.538,452] <err> os: Current thread: 0x200029a0 (MPSL Work)
    This gave me a different trace on gdb:
    (gdb) target remote localhost:1234
    Remote debugging using localhost:1234
    arch_swap (key=536916636) at C:/ncs/v2.9.0/zephyr/arch/arm/core/cortex_m/swap.c:49
    49              return _current->arch.swap_return_value;
    (gdb) bt
    #0  arch_swap (key=536916636) at C:/ncs/v2.9.0/zephyr/arch/arm/core/cortex_m/swap.c:49
    #1  0x00013d52 in data_alloc (timeout=..., size=0x200057fc <mpsl_work_stack+892>, buf=0x0 <store_video>)
        at C:/ncs/v2.9.0/zephyr/lib/net_buf/buf.c:215
    #2  net_buf_alloc_len (pool=0xffffffe4, size=<optimized out>, timeout=...) at C:/ncs/v2.9.0/zephyr/lib/net_buf/buf.c:322
    #3  0x00000000 in ?? ()
      Though the trace still seems to be 'lost in the weeds'.

      Is there any advice you could offer please?

    Cheers,

    Neil
  • Hi Neil,

    [00:00:16.538,452] <err> os: Current thread: 0x200029a0 (MPSL Work)

    We have rarely seen this faults in the MPSL thread and when ever we see it, we tried to fix it like this one

    Also I see that your BT CTRL ECDH might be running close to limits in some scenarios.

    Try using CONFIG_BT_CTLR_ECDH_IN_MPSL_WORK=n in your prj.conf and I think this most likely will fix it  If you do not use this at all I suggest disabling it CONFIG_BT_CTLR_ECDH=n in your prj.conf

Reply
  • Hi Neil,

    [00:00:16.538,452] <err> os: Current thread: 0x200029a0 (MPSL Work)

    We have rarely seen this faults in the MPSL thread and when ever we see it, we tried to fix it like this one

    Also I see that your BT CTRL ECDH might be running close to limits in some scenarios.

    Try using CONFIG_BT_CTLR_ECDH_IN_MPSL_WORK=n in your prj.conf and I think this most likely will fix it  If you do not use this at all I suggest disabling it CONFIG_BT_CTLR_ECDH=n in your prj.conf

Children
  • Hello,

      Thanks but I had the same error, however I've fixedit and it was a silly typo in the code.  The method that copies the data into the image_data array is as follows:

    /**
     * @brief This will represent a series of bytes for an image
     * @param data the data to be stored
     * @param len the number of bytes in the data
     */
    static int app_bw_vid_trans(uint8_t *data, uint32_t len){
    	
    	k_mutex_lock(&trans_mutex, K_FOREVER);
        memcpy(data + image_offset, data, (size_t) len);
    	k_mutex_unlock(&trans_mutex);
        image_offset += len;
        LOG_DBG("got data of %" PRIu32 " bytes total received is %" PRIu32 " bytes", len, image_offset);
    	
    	return 0;
    }
    

      If you look closely at the memcpy line; it isn't copying the incoming data to the image_data array but over the data array itself which goes back to the buffer into the bluetooth stack which is likely causing the memory overflow.

      I basically lead myself down a blind alley rather than reading the code properly!  Thank yo for your help and apologies for wasting your time - at least I've learned some debugging tricks Slight smile

    Cheers,

    Neil

Related