NCS Nordic UART service: BT RX HCI timeout

Hi!

I am working on a customer project where up to 20 BLE peripherals are connected to a single central device. The code of the central is based on the Multi-NUS sample from this Nordic-Blog-Post and has been updated to NCS Version v2.0.0.

The NUS client has been optimized for data throughput. Therefore, the following additions have been made to the nus_client.c file:

/* Heap region to hold TX context */
struct tx_ctx_t {
    struct bt_nus_client *nus_c;
    const uint8_t *data;
    uint32_t length;
};

K_HEAP_DEFINE(tx_ctx_heap, CONFIG_BT_MAX_CONN * CONFIG_BT_CONN_TX_MAX * sizeof(struct tx_ctx_t));

static void on_sent_wo_rsp(struct bt_conn *conn, void *user_data)
{
	// get tx context from user data
	struct tx_ctx_t *context = (struct tx_ctx_t*) user_data;

	// make a copy of volatile data that is required by the callback
	struct bt_nus_client *nus_c = context->nus_c;
	const void *data = context->data;
	uint16_t length = (uint16_t) context->length;

	// release allocated tx context
	k_heap_free(&tx_ctx_heap, user_data);

	if (nus_c->cb.sent) {
		nus_c->cb.sent(nus_c, 0, data, length);
	}
}

int bt_nus_client_send(struct bt_nus_client *nus_c, const uint8_t *data, uint16_t len)
{
	int err;

	if (!nus_c->conn) {
		return -ENOTCONN;
	}

	// allocate TX context	
	struct tx_ctx_t* context = k_heap_alloc(&tx_ctx_heap, sizeof(struct tx_ctx_t), K_NO_WAIT);
	if (context == NULL) {
		LOG_ERR("TX context allocation failed");
		return -ENOMEM;
	}
	context->nus_c = nus_c;
	context->data = data;
	context->length = len;

	// throughput optimized GATT write request
	err = bt_gatt_write_without_response_cb(nus_c->conn, nus_c->handles.rx, data, len, false, on_sent_wo_rsp, context);
	if (err) {
		// release allocated tx context
		k_heap_free(&tx_ctx_heap, context);
	}
	
	return err;
}

Here is my project Kconfig configuration file:

#
# Copyright (c) 2018 Nordic Semiconductor
#
# SPDX-License-Identifier: LicenseRef-Nordic-5-Clause
#

# C++ Support
CONFIG_CPLUSPLUS=y
CONFIG_NEWLIB_LIBC=y
CONFIG_STD_CPP17=y
#CONFIG_LIB_CPLUSPLUS=y

# Configure Serial driver (additional configuration is done in overlay-<usb/uart>.conf)
CONFIG_STDOUT_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_UART_INTERRUPT_DRIVEN=y
CONFIG_UART_LINE_CTRL=y

# Enable the BLE stack with GATT Client configuration
CONFIG_BT=y
CONFIG_BT_CENTRAL=y
CONFIG_BT_SMP=y
CONFIG_BT_GATT_CLIENT=y
CONFIG_BT_MAX_CONN=20
CONFIG_BT_MAX_PAIRED=20
CONFIG_BT_CONN_CTX=y

# Advanced BLE configuration
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_CONN_RSSI=y
CONFIG_BT_CTLR_TX_PWR_DYNAMIC_CONTROL=y
CONFIG_BT_USER_PHY_UPDATE=y
CONFIG_BT_CTLR_PHY_CODED=y

# Enable the BLE modules from NCS
#CONFIG_BT_LOG_LEVEL_DBG=y
CONFIG_BT_NUS=y
CONFIG_BT_SCAN=y
CONFIG_BT_SCAN_FILTER_ENABLE=y
CONFIG_BT_SCAN_UUID_CNT=1
CONFIG_BT_GATT_DM=y
CONFIG_HEAP_MEM_POOL_SIZE=16384

# BLE Scanner configuration
CONFIG_BT_SCAN_ADDRESS_CNT=20
CONFIG_BT_SCAN_NAME_CNT=1

# MTU configuration taken from throughput example (adapted)
CONFIG_BT_BUF_ACL_RX_SIZE=251
CONFIG_BT_BUF_ACL_TX_SIZE=251
CONFIG_BT_BUF_ACL_TX_COUNT=40
CONFIG_BT_L2CAP_TX_BUF_COUNT=40
CONFIG_BT_ATT_PREPARE_COUNT=2
CONFIG_BT_L2CAP_TX_MTU=247
CONFIG_BT_CTLR_RX_BUFFERS=2
CONFIG_BT_CTLR_DATA_LENGTH_MAX=251

# This example requires more workqueue stack
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=4096
CONFIG_BT_RX_STACK_SIZE=4096

# Enable bonding
CONFIG_BT_SETTINGS=y
CONFIG_FLASH=y
CONFIG_FLASH_PAGE_LAYOUT=y
CONFIG_FLASH_MAP=y
CONFIG_NVS=y
CONFIG_SETTINGS=y

# Config logger
CONFIG_LOG=y
CONFIG_USE_SEGGER_RTT=y
CONFIG_LOG_BACKEND_RTT=y
CONFIG_LOG_BACKEND_UART=n

# App specific config
CONFIG_BASE64=y

# Bootloader
CONFIG_BOOTLOADER_MCUBOOT=y


The central throws a Zephyr Kernel oops error, when data is sent to and response is received from more than 10 peripherals.

[00:02:10.709,259] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000000  r2/a3:  0x00000000
[00:02:10.709,289] <err> os: r3/a4:  0x20019646 r12/ip:  0x00000000 r14/lr:  0x000351bf
[00:02:10.709,320] <err> os:  xpsr:  0x61000000
[00:02:10.709,320] <err> os: Faulting instruction address (r15/pc): 0x000351ca
[00:02:10.709,350] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:02:10.709,381] <err> os: Current thread: 0x20002c98 (BT RX)
[00:02:10.906,616] <err> fatal_error: Resetting system

Resolving the PC shows that the error is a HCI_CMD_TIMEOUT in function bt_hci_cmd_send_sync of the hci_core.c file.

  err = k_sem_take(&sync_sem, HCI_CMD_TIMEOUT);
  BT_ASSERT_MSG(err == 0, "k_sem_take failed with err %d", err);


I think the problem is related to a deadlock of the BLE driver. Is there something wrong with my code or configuration?


Best regards,

Thomas

  • Hi Thomas, 

    To isolate the issue if it's a memory overflow issue or not, could you try to reduce the rate of sending data from each of the peripheral down and check if you can handle above 10  connections ? 


    Could you also try to increase CONFIG_BT_CTLR_RX_BUFFERS to 10 for example ? 

    Please provide the information on: 

    - Connection interval

    - How long it would take for the error to occur ? 

  • Thanks for the quick reply, I have tested your suggestion. 


    I think it is good to describe the problem in more detail. The connected peripherals have a continuous data stream to the centrail, and the error occurs when a user-initiated request is sent from the central to the connected peripherals and they respond. The error occur imedently after the request is sent, I think when the response is received.

    To reduce the rate of sending data from each of the peripherals down, I have reduced the request rate by inserting a 200 ms delay between the requests.


    The Firmware runs on two different systems with different configurations. Here are the configuration and test results:


    System 1:

    • connection interval 75 ms (min & max)
    • packet generation with fixed rate of ~3.5 Hz
      • approx. 220 packets per minute
      • approx. 6 kbit/s per connection
    • scan parameters:
      • Intervall: 500 ms
      • Window: 40 ms
    • Test 1 - Reduced request rate: works with a request delay of 200 ms for 20 connections
    • Test 2 - CONFIG_BT_CTLR_RX_BUFFERS=10: does not work

    System 2:

    • connection interval 25 ms (min & max)
    • none deterministic packet generation
      • approx. 40 packets per minute
      • approx. 0.4 kbit/s per connection
    • scan parameters:
      • Intervall: 250 ms
      • Window: 31.25 ms
    • Test 1 - Reduced request rate: does not work (request delay of 200 ms)
    • Test 2 - CONFIG_BT_CTLR_RX_BUFFERS=10: does not work
    • Error occur more often than with system 1

    Or could it somehow be related with the context management in my multi-NUS service implementation. This function is called when a request is sent to the connected devices.

    int multi_nus_t::send_broadcast(uint8_t* buf, size_t len) {
        int err = 0;
    
        /* How many connections are there in the Connection Context Library */
        const size_t num_nus_conns = bt_conn_ctx_count(&conns_ctx_lib);
    
        LOG_DBG("Send Broadcast");
        for (size_t i = 0; i < num_nus_conns; i++) {
            const struct bt_conn_ctx* ctx = bt_conn_ctx_get_by_id(&conns_ctx_lib, i);
    
            if (ctx) {
                struct bt_nus_client* nus_client = (bt_nus_client*) ctx->data;
    
                if (nus_client) {
                    // copy payload so it persists throught the BLE stack
                    uint8_t* ble_data = (uint8_t*) k_malloc(len);
    
                    if (ble_data == NULL) {
                        LOG_ERR("Could not allocate broadcast data connection %d", i);
                        goto release_ctx;
                    }
                    memcpy(ble_data, buf, len);
    
                    err = bt_nus_client_send(nus_client, ble_data, len);
                    if (err) {
                        LOG_ERR("Failed to send data over BLE connection %d (err %d)", i, err);
                        // return the semaphore as sending was not successful
                        k_free(ble_data);
                    } else {
                        LOG_INF("Sent broadcast to connection %d: %d bytes", i, len);
                        k_msleep(200); // delay to reduce the request rate
                    }
                }
                release_ctx:
                bt_conn_ctx_release(&conns_ctx_lib, (void*) ctx->data);
            }
        }
        return err;
    }

    Do you have any further suggestions?

  • Hi Thomas, 
    I'm checking internally to see if R&D has any advice on this. 

    Have you checked if you use bt_gatt_write_without_response() do you have any issue ? 
    I suspecting there is a messing up when the central waiting for the response from each of the peripheral. 

    Please try with CONFIG_ASSERT=y so we have more log.

  • Thank you!

    The CONFIG_ASSERT option is already set in an overlay file for my debug build and the error above is from a debug build.

    # Optimize for debug
    CONFIG_DEBUG_OPTIMIZATIONS=y
    CONFIG_DEBUG_THREAD_INFO=y
    CONFIG_ASSERT=y
    
    # Enable Thread awareness for Debugging in Segger Ozone
    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_NAME=y
    


    I have tested bt_gatt_write_without_response() and it shows the same behaviour as bt_gatt_write_without_response_cb(). It only works with reduced request rate.

    Update:
    I have also tested disabling the BT scanner when maximum number of 20 connections are established. And added a log message when the tx_ctx get released in the on_sent_wo_rsp() method. Unfortunately, the app crashes in the same way. Here ist the log output.

  • Hi Thomas, 

    We are suspecting you are having the same issue as reported here: https://github.com/zephyrproject-rtos/zephyr/issues/25917#issuecomment-639348125
    I'm checking if there is any workaround. 
    Could you confirm that at 75ms interval , the problem is not reproducible ? 

Related