Corrupted Heap Bounds Error

Hello,

To provide some context, I am using an ANNA-B112 with nRF52832 on a custom PCB board we manufactured for evaluation. The nRF52832 is setup with the NUS service with MTU changed to 251 chars. 

I have been noticing strange behavior on 1/10 of the boards. On this board, if a message is sent longer than around 70 chars, this is causing the system to have a bus fault. Here are the RTT logs from the bus fault:

ASSERTION FAIL [left_chunk(h, right_chunk(h, c)) == c] @ WEST_TOPDIR/zephyr/lib/os/heap.c:183
        corrupted heap bounds (buffer overflow?) for memory at 0x2000a364
[00:01:45.722,381] <err> os: r0/a1:  0x00000004  r1/a2:  0x000000b7  r2/a3:  0x00000002
[00:01:45.722,412] <err> os: r3/a4:  0x20001d58 r12/ip:  0x0000000c r14/lr:  0x000125d1
[00:01:45.722,412] <err> os:  xpsr:  0x41000000
[00:01:45.722,442] <err> os: Faulting instruction address (r15/pc): 0x0002a3b2
[00:01:45.722,473] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
[00:01:45.722,503] <err> os: Current thread: 0x200033d8 (unknown)
[00:01:45.957,092] <err> fatal_error: Resetting system
*** Booting nRF Connect SDK v2.5.0 ***

I turned on assert and this told me there is a corrupted heap bounds error, and that it failed at the heap.c file on line 183. Before I turned on assert, the error stated a faulting instruction address which, using addr2line, lead me to heap.c file on line 58. The functions that faulted on heap.c are below:

heap.c line 183:

void sys_heap_free(struct sys_heap *heap, void *mem)
{
	if (mem == NULL) {
		return; /* ISO C free() semantics */
	}
	struct z_heap *h = heap->heap;
	chunkid_t c = mem_to_chunkid(h, mem);

	/*
	 * This should catch many double-free cases.
	 * This is cheap enough so let's do it all the time.
	 */
	__ASSERT(chunk_used(h, c),
		 "unexpected heap state (double-free?) for memory at %p", mem);

	/*
	 * It is easy to catch many common memory overflow cases with
	 * a quick check on this and next chunk header fields that are
	 * immediately before and after the freed memory.
	 */
	__ASSERT(left_chunk(h, right_chunk(h, c)) == c,
		 "corrupted heap bounds (buffer overflow?) for memory at %p",
		 mem);

	set_chunk_used(h, c, false);
#ifdef CONFIG_SYS_HEAP_RUNTIME_STATS
	h->allocated_bytes -= chunksz_to_bytes(h, chunk_size(h, c));
#endif

#ifdef CONFIG_SYS_HEAP_LISTENER
	heap_listener_notify_free(HEAP_ID_FROM_POINTER(heap), mem,
				  chunksz_to_bytes(h, chunk_size(h, c)));
#endif

	free_chunk(h, c);
}

heap.c line 58:

static void free_list_remove_bidx(struct z_heap *h, chunkid_t c, int bidx)
{
	struct z_heap_bucket *b = &h->buckets[bidx];

	CHECK(!chunk_used(h, c));
	CHECK(b->next != 0);
	CHECK(h->avail_buckets & BIT(bidx));

	if (next_free_chunk(h, c) == c) {
		/* this is the last chunk */
		h->avail_buckets &= ~BIT(bidx);
		b->next = 0;
	} else {
		chunkid_t first = prev_free_chunk(h, c),
			  second = next_free_chunk(h, c);

		b->next = second;
		set_next_free_chunk(h, first, second);
		set_prev_free_chunk(h, second, first);
	}

#ifdef CONFIG_SYS_HEAP_RUNTIME_STATS
	h->free_bytes -= chunksz_to_bytes(h, chunk_size(h, c));
#endif
}

I am starting to notice these errors more often after I had increased the heap size from 2048 to 4096. Here is my prj.conf file for context.

CONFIG_ASSERT=y

# Logging Module
CONFIG_LOG=y
CONFIG_UART_CONSOLE=n
CONFIG_USE_SEGGER_RTT=y
CONFIG_LOG_BACKEND_RTT=y

# Enable Thread debugging
# CONFIG_DEBUG_THREAD_INFO=y
# CONFIG_DEBUG_OPTIMIZATIONS=y

# RAM
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=2048
CONFIG_HEAP_MEM_POOL_SIZE=4096
CONFIG_MAIN_STACK_SIZE=4096

# BLE Settings
CONFIG_BT=y
CONFIG_BT_PERIPHERAL=y
CONFIG_BT_DEVICE_APPEARANCE=833
CONFIG_BT_DEVICE_NAME="A040000010"
CONFIG_BT_MAX_PAIRED=1
CONFIG_BT_RX_STACK_SIZE=2048

# For Requesting ATT_MTU Update
CONFIG_BT_GATT_CLIENT=y
CONFIG_BT_USER_DATA_LEN_UPDATE=y
CONFIG_BT_CTLR_DATA_LENGTH_MAX=251
CONFIG_BT_BUF_ACL_TX_SIZE=251
CONFIG_BT_BUF_ACL_RX_SIZE=251
CONFIG_BT_L2CAP_TX_MTU=247

# BLE NUS Service
CONFIG_BT_NUS=y
CONFIG_BT_NUS_AUTHEN=n

# Buttons and LEDs Library
CONFIG_DK_LIBRARY=y

# Use External LFCLK for UART RX Byte Counting
CONFIG_UART_0_NRF_HW_ASYNC=y
CONFIG_UART_0_NRF_HW_ASYNC_TIMER=2

# Async UART
CONFIG_UART_ASYNC_API=y
CONFIG_NRFX_UARTE0=y
CONFIG_SERIAL=y
CONFIG_GPIO=y
CONFIG_CONSOLE=y
CONFIG_UART_0_NRF_PARITY_BIT=n
CONFIG_BT_NUS_UART_BUFFER_SIZE=200

# Analog Pins
CONFIG_ADC=y

# Power Management 
CONFIG_PM=y
CONFIG_PM_DEVICE=y
CONFIG_PM_DEVICE_RUNTIME=y

# NVS Memory
CONFIG_FLASH=y
CONFIG_FLASH_MAP=y
CONFIG_FILE_SYSTEM=y
CONFIG_FILE_SYSTEM_LITTLEFS=y

The error messages lead me to believe that I made some sort of memory allocation error, but since this is only affecting one device I am a little confused. I have also noticed that the one affected device is draining the LiPo battery connected to the custom board quicker than the other devices. 

How can I debug this issue based on these error messages?

Thank you!

Crisvin Kadambathil 

  • Hello Sigurd, 

    You are right. I went back to making a copy of the nus_command->data using memcpy. Here is the code:

    int main(void)
    {
    	LOG_INF("Starting AWAREv4.0_AlphaV0.4");
    	
    	int err;
    	char cmd_buf[251] = {0};
    	char measTimeSec_buf[4] = {0};
    	char frequencyEIS_buf[4] = {0};
    	char filename_buf[100] = {0};
    	bool measOngoing = false;
    	bool fileDataCleared = false;
    	int batVoltage;
    	uint8_t adcMeas[30];
    
    	// Initiate peripherals, BLE, file system, workqueue, and work items
    	err = my_init();
    	if (err)
    	{
    		return err;
    	}
    
    	// Check file storage status: data in file/data not in file flag
    	struct fs_dirent entry;
    	err = fs_stat("/lfs/test.txt", &entry);
    	if (err)
    	{
    		fileDataCleared = true;
    	}
    
    	// Startup Sequence: 
    	// Turn on blue LED, reset pico, hibernate pico, turn off blue LED
    
    	dk_set_led_off(RED_LED);
    	dk_set_led_off(GREEN_LED);
    	dk_set_led_on(BLUE_LED);
    
    	uart_resume();
    
    	k_msleep(100);
    	uart_tx(uart, RESET_PICO, strlen(RESET_PICO), SYS_FOREVER_MS);
    	k_msleep(500);
    	uart_tx(uart, ALL_IN_HIBERNATE, strlen(ALL_IN_HIBERNATE), SYS_FOREVER_MS);
    	k_msleep(500);
    
    	uart_suspend();
    
    	dk_set_led_off(BLUE_LED);
    
    	for (;;)
    	{
    		// Wait for command from NUS
    		struct uart_data_t *nus_command;
    		nus_command = k_fifo_get(&nus_receive_data, K_FOREVER);
    
    		// Save command in command buffer
    		memcpy(cmd_buf, nus_command->data, nus_command->len);
    
    		k_free(nus_command);
    
    		switch (*cmd_buf)
    		{
    			// CMD 'o': Turn on blue LEDs
    			case 'o':
    				dk_set_led_off(RED_LED);
    				dk_set_led_off(GREEN_LED);
    				dk_set_led_on(BLUE_LED);
    
    				break;
    
    			// CMD 'i': Get information on device (hardwareID, fimrwareID, battery voltage, storage used)
    			case 'i':
    				dk_set_led_off(BLUE_LED);
    
    				// Send hardware and firmware ID
    				bt_nus_send(NULL, deviceHardwareID, strlen(deviceHardwareID));
    				bt_nus_send(NULL, annaFirmwareID, strlen(annaFirmwareID));
    
    				// Get and send battery voltage
    				get_bat_voltage();
    				k_msleep(100); 
    				batVoltage = get_bat_voltage(); 
    				sprintf(adcMeas, "Battery Voltage: %d mV\n", batVoltage);
    				bt_nus_send(NULL, adcMeas, strlen(adcMeas));
    
    				// Schedule littleFS_dir_handler() -> littleFS_dir_handler() will send storage used
    				k_work_schedule_for_queue(&file_work_q, &directory_work_item.d_work, K_NO_WAIT);
    
    				break;
    
    			// CMD 'e': Erase all data in file storage
    			case 'e':
    				dk_set_led_off(BLUE_LED);
    
    				// ERROR a!1001: This CMD is locked during ongoing measurements
    				if (measOngoing)
    				{
    					bt_nus_send(NULL, "a!1001\n", 7);
    					break;
    				}
    
    				// Schedule littleFS_erase_handler()
    				k_work_schedule_for_queue(&file_work_q, &erase_work_item.d_work, K_NO_WAIT);
    
    				fileDataCleared = true;
    
    				break;
    
    			// CMD 'x': Erase all data in file storage; override any checks for ongoing measurements
    			case 'x':
    				dk_set_led_off(BLUE_LED);
    
    				// Schedule littleFS_erase_handler()
    				k_work_schedule_for_queue(&file_work_q, &erase_work_item.d_work, K_NO_WAIT);
    
    				fileDataCleared = true;
    
    				break;
    
    			// CMD 'r': Read all data in file
    			case 'r':
    				dk_set_led_off(BLUE_LED);
    
    				// Schedule littleFS_read_handler()
    				k_work_schedule_for_queue(&file_work_q, &read_work_item.d_work, K_NO_WAIT);
    
    				break;
    
    			// CMD 'h': Read header data from file
    			case 'h':
    				dk_set_led_off(BLUE_LED);
    
    				// Schedule littleFS_read_header_handler()
    				k_work_schedule_for_queue(&file_work_q, &read_header_work_item.d_work, K_NO_WAIT);
    
    				break;
    
    			// CMD 'a': Start a continous measurement; measurement timer, EIS frequency, and file header are argument
    			case 'a':
    				dk_set_led_off(BLUE_LED);
    
    				// ERROR a!1001: This CMD is locked during ongoing measurements
    				if (measOngoing)
    				{
    					bt_nus_send(NULL, "a!1001\n", 7);
    					break;
    				}
    
    				// ERROR a!1002: This CMD is locked if there is already data in the file
    				else if (!fileDataCleared)
    				{
    					bt_nus_send(NULL, "a!1002\n", 7);
    					break;
    				}
    
    				parse_cmd_a(cmd_buf, measTimeSec_buf, frequencyEIS_buf, filename_buf);
    
    				// ERROR a!1003: This CMD has an inadaquate number of or incorrectly formatted arguments
    				if (measTimeSec_buf[0] == '\0' || frequencyEIS_buf[0] == '\0' || filename_buf[0] == '\0')
    				{
    					bt_nus_send(NULL, "a!1003\n", 7);
    					break;
    				}
    
    				measTimeSec = atoi(measTimeSec_buf);
    				frequencyEIS = atoi(frequencyEIS_buf);
    
    				// ERROR a!1004: The measurement rate cannot be less than 15 seconds
    				if (measTimeSec < 15)
    				{
    					bt_nus_send(NULL, "a!1004\n", 7);
    					break;
    				}
    				// ERROR a!1005: The EIS frequency cannot be less than 100 Hz
    				else if (frequencyEIS < 100)
    				{
    					bt_nus_send(NULL, "a!1005\n", 7);
    					break;
    				}
    
    				// Set the EIS frequency
    				snprintf(EIS_pt2_freq, 15, " %i %i 4 0m\n", frequencyEIS, frequencyEIS);
    
    				// Write file header to file: put header in fifo -> schedule littleFS_write_header_handler()
    				struct uart_data_t *store_buf;
    				store_buf = k_malloc(sizeof(*store_buf));
    
    				// make sure that k_malloc() does not return NULL ptr
    				if (store_buf)
    				{
    					snprintf(store_buf->data, CONFIG_BT_CTLR_DATA_LENGTH_MAX, ":%s;\n", filename_buf);
    					store_buf->len = strlen(store_buf->data);
    
    					k_fifo_put(&fifo_uart_tx_data, store_buf);
    					k_work_schedule_for_queue(&file_work_q, &write_header_work_item.d_work, K_NO_WAIT);
    
    					k_msleep(50);
    
    					measOngoing = true;
    					fileDataCleared = false;
    
    					// Initiate measurements: start measurement timer -> activate run_meas_thread()
    					write_work_item.meas_count = 0;
    					k_timer_start(&meas_timer, K_SECONDS(measTimeSec), K_SECONDS(measTimeSec));
    					k_thread_resume(run_meas_thread_id);
    					k_sem_give(&run_meas_sem);
    				}
    				// ERROR a!1008: Could not allocate memory
    				else
    				{
    					bt_nus_send(NULL, "a!1008\n", 7);
    					break;
    				}
    
    				break;
    
    			// CMD 'b': Stop an ongoing continous measurement
    			case 'b':
    				dk_set_led_off(BLUE_LED);
    				
    				bt_nus_send(NULL, "Measurements Stopped\n", 21);
    
    				measOngoing = false;
    
    				// Stop measurements: stop timer -> suspend run_meas_thread()
    				k_timer_stop(&meas_timer);
    				k_thread_suspend(run_meas_thread_id);
    
    				break;
    
    			default:
    				break;
    		}
    
    		LOG_INF("Clearing CMD Buffers!");
    
    		measTimeSec_buf[0] = '\0';
    		frequencyEIS_buf[0] = '\0';
    		filename_buf[0] = '\0';
    		memset(cmd_buf, 0, sizeof(cmd_buf));
    
    		LOG_INF("Done Clearing CMD Buffers!");
    	}
    
    	return 0;
    }

    I think I also discovered what was causing the bus fault. In the previous response, I found out that the bus fault was occurring exactly at the "k_free()" line that freed the memory allocated during the bt_receive_cb callback. In the callback I was writing the data to the nus_command->data field by using "snprintf(nus_buf->data, CONFIG_BT_CTLR_DATA_LENGTH_MAX, "%s", data);" I changed this line to use memcpy instead based on an example I saw. Here is the bt_receive_cb after I changed it:

    static void bt_receive_cb(struct bt_conn *conn, const uint8_t *const data, uint16_t len)
    {	
    	struct uart_data_t *nus_buf;
    
    	nus_buf = k_malloc(sizeof(*nus_buf));
    	if (nus_buf == NULL) 
    	{
    		// Unable to allocate memory from the heap
    		LOG_ERR("Unable to allocate memory (bt_receive_cb)");
    		return;
    	}
    
    	// Changed for testing
    	memcpy(nus_buf->data, data, len);
    	//snprintf(nus_buf->data, CONFIG_BT_CTLR_DATA_LENGTH_MAX, "%s", data);
    	nus_buf->len = len;
    	
    	k_fifo_put(&nus_receive_data, nus_buf);
    }

    This change from snprintf to memcpy has seemed to fix all the crashing issues. I even noticed that switching back to snprintf from memcpy would directly result in the bus fault.

    I am still a little confused as to why this problem didn't manifest earlier when I was developing this code, or why the problem with snprintf only seems to cause issues on some devices some of the times. However, using memcpy appears to have fixed the issue completely. Is this behavior expected?

    Thank you!

  • Crisvin_K said:
    Is this behavior expected?

    I dont know from the top of my head. I suspect it is something specific to the snprint function, but after looking a bit around I cannot find anything that stands out. Since using memcpy works for you, I do not think I will spend more time trying to find out exactly what happened here.

Related