Crash on connect with NCS 2.6.0

m1cha over 1 year ago

You can find all information about the issue and how to produce it on GitHub: github.com/.../ncs-2.6.0-connect-crash

Top Replies

Edvin 7 months ago in reply to Edvin +1 verified

Hello, I have some good news! Our SoftDevice team have found the cause of this bug, and the fix is merged into our next release in v2.8.0. The issue was related to a lot of UART events being generated…

Parents

0 Simonr over 1 year ago
Hi

Okay, the BUS error seems to come from the SoftDevice handler, and after some discussing with colleagues, we are back at the UART likely being the blocker here and it stops the SoftDevice controller from handling the MTU exchange and callbacks correctly. I reproduced it here as well, but have not had time to test this out yet on my end. Try moving uart_rx_enable() into a workqueue as is done in the peripheral_uart sample, with the following:

k_work_init_delayable(&uart_work, uart_work_handler); static void uart_work_handler(struct k_work *item) { struct uart_data_t *buf; buf = k_malloc(sizeof(*buf)); if (buf) { buf->len = 0; } else { LOG_WRN("Not able to allocate UART receive buffer"); k_work_reschedule(&uart_work, UART_WAIT_FOR_BUF_DELAY); return; } uart_rx_enable(uart, buf->data, sizeof(buf->data), UART_WAIT_FOR_RX); }

Best regards,

Simon
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

0 m1cha over 1 year ago in reply to Simonr

first: We don't need a workaround, because we are willing to wait a little bit for a proper fix. We simply don't want to be stuck with 2.5.1 forever(where the bug doesn't exist).

I implemented your suggested change and pushed it to GitHub. It did not make a difference:

[00:00:00.012,939] <inf> main: BT ready
[00:00:00.037,963] <dbg> bt_hci_core: bt_recv: buf 0x200187ec len 68
[00:00:00.037,994] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
[00:00:00.038,024] <dbg> bt_hci_core: rx_work_handler: buf 0x200187ec type 1 len 68
[00:00:00.038,024] <dbg> bt_hci_core: hci_event: event 0x3e
[00:00:00.038,055] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x08
[00:00:00.038,085] <dbg> bt_ecc: bt_hci_evt_le_pkey_complete: status: 0x00
[00:00:01.000,488] <inf> main: UART initialized
[00:00:05.054,046] <dbg> bt_hci_core: bt_recv: buf 0x200187ec len 33
[00:00:05.055,511] <err> os: ***** BUS FAULT *****
[00:00:05.055,511] <err> os:   Imprecise data bus error
[00:00:05.055,541] <err> os: r0/a1:  0x0014043e  r1/a2:  0x200125bf  r2/a3:  0x00000006
[00:00:05.055,572] <err> os: r3/a4:  0x0014043f r12/ip:  0x00000003 r14/lr:  0x00004e27
[00:00:05.055,572] <err> os:  xpsr:  0xa1000000
[00:00:05.055,572] <err> os: Faulting instruction address (r15/pc): 0x0000bb74
[00:00:05.055,633] <err> os: >>> ZEPHYR FATAL ERROR 26: Unknown error on CPU 0
[00:00:05.055,664] <err> os: Current thread: 0x20012958 (MPSL Work)

0 Edvin over 1 year ago in reply to m1cha

Hello,

Simon is out of office, and asked if I could look into your ticket.

So the MPSL issue is because of an error in the UART handler. Your UART handler was empty, so you probably didn't notice it being called, but it was being called with event->type = UART_RX_BUF_REQUEST. As you can see from uart.h:

	/**
	 * @brief Driver requests next buffer for continuous reception.
	 *
	 * This event is triggered when receiving has started for a new buffer,
	 * i.e. it's time to provide a next buffer for a seamless switchover to
	 * it. For continuous reliable receiving, user should provide another RX
	 * buffer in response to this event, using uart_rx_buf_rsp function
	 *
	 * If uart_rx_buf_rsp is not called before current buffer
	 * is filled up, receiving will stop.
	 */
	UART_RX_BUF_REQUEST,

So you need to provide a buffer via uart_rx_buf_rsp().

The reason you need to provide another buffer is that the uart is double buffered.

Try something like this:

uint8_t my_uart_buffer[100];

static void uart_callback(const struct device *const uart_dev, struct uart_event *const event,
			  void *const user_data)
{
    
    switch (event->type) {
        case UART_RX_BUF_REQUEST:
            LOG_INF("uart cb type UART_RX_BUF_REQUEST");
            uart_rx_buf_rsp(uart_dev, my_uart_buffer, sizeof(my_uart_buffer));
            break;
        default:
            LOG_INF("uart cb type %d", event->type);
            break;
    }

}

static void work_handler(struct k_work *work)
{
	ARG_UNUSED(work);

	int ret;

	ret = uart_callback_set(uart_dev, uart_callback, NULL);
	if (ret) {
		LOG_ERR("Failed to set UART driver callback: %d", ret);
		return;
	}

	//static uint8_t memory[100];
    ret = uart_rx_enable(uart_dev, my_uart_buffer, sizeof(my_uart_buffer), SYS_FOREVER_US);
	if (ret) {
		LOG_ERR("Failed to enable UART RX: %d", ret);
		return;
	}

	LOG_INF("UART initialized");
}

That worked in my case.

You may notice that I am just using the same buffer again, so this means that it will use the same buffer when the buffer is full. So if you receive data too fast, it may start overwriting the start of the buffer before you get to read it out. To work around this, either see how it is done in the peripheral_uart sample, or use a double buffer, where you change what buffer is being provided in the UART_RX_BUF_REQUEST event.

Best regards,

Edvin

0 m1cha over 1 year ago in reply to Edvin

1) It's interesting that it worked for you because for me, it doesn't. See the updated GitHub repository.

2) It's not true, that you have to provide a second buffer with uart_rx_buf_rsp. This is totally optional and only needed if you want to continue receiving after the current buffer has filled up.
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Edvin over 1 year ago in reply to m1cha

Sorry. I forgot to mention, but I also changed your timeout for the UART. Try to increase it from 100 to at least 1000. I would also recommend that you change the memory[100] buffer with a buffer defined outside your work function, so it is valid after this work function returns.

I am not sure why it fails when the timeout is 100µs. I have forwarded this to our SoftDevice controller team.

Best regards,

Edvin
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Edvin over 1 year ago in reply to Edvin

Hello,

Edit:

After talking to our SoftDevice team, they said that they are not able to reproduce the issue on the current main branch (which will turn into 2.7.0 at some point), but they are not yet sure what caused the issue. They are still looking into it. In the meantime, for further development, you can increase the uart inactivity timeout from 100µs to 1000µs, until we find the cause for the issue.

Best regards,

Edvin
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 m1cha 9 months ago in reply to Edvin

I can still reproduce the issue (of the peripheral crashing) if I use 2.6.0(or 2.6.2) for the central and 2.7.0 for the peripheral. That means a slight difference in the behavior of the central just makes it less likely for the issue to appear, but it's not fixed yet, unfortunately.
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Reply

0 m1cha 9 months ago in reply to Edvin

I can still reproduce the issue (of the peripheral crashing) if I use 2.6.0(or 2.6.2) for the central and 2.7.0 for the peripheral. That means a slight difference in the behavior of the central just makes it less likely for the issue to appear, but it's not fixed yet, unfortunately.
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Children

0 Edvin 8 months ago in reply to m1cha

Hello,

Thanks for coming back to us. I remember this issue. I see that there is no update on my internal ticket, but I will ping it. Can you please let me know whether you needed to do any changes to the application to reproduce it using 2.7.0? And just out of curiousity - Does it matter if the central is using 2.7.0 as well? Still reproducible? I am not saying that it is a fix to make sure that both are 2.7.0, but it can be a place to start to know whether it is reproducible when the central is 2.7.0 or not.

Best regards,

Edvin
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 m1cha 8 months ago in reply to Edvin

All I had to do was to change the nRF SDK version in west.yml and run `west update`.

For the central, only 2.7.0 makes the bug disappear. With 2.6.0 and 2.6.2, it still happens.
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 m1cha 8 months ago in reply to Edvin

I did some additional testing(and have updated the repository) and have some more info to share:

- In my previous post I was testing with peripheral_hr(from my repo, not the original) but I missed that the problem disappeared from peripheral_hr_coded(from my repo, not the original). Enabling CONFIG_BT_HCI_CORE_LOG_LEVEL_DBG made the problem reappear. That further hints at a race condition.

- While changing the UART timeout from 100us to 1000us makes the problem disappear from peripheral_hr it does NOT make it disappear from peripheral_hr_coded, which is interesting to say the least. That means that the suggested workaround does not work reliably, unfortunately.

- I have never been able to reproduce this with CONFIG_BT_EXT_ADV enabled, so this might be the real workaround here. I'm a little too scared to rely on this though, given that this might just be making the race condition harder to trigger, like the other changes I made.
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Edvin 8 months ago in reply to m1cha

Hello,

Sorry for the late replies. Someone in our SoftDevice team is looking into this, but they have not yet figured out the source of the issue.

Just wanted to let you know that this ticket is not forgotten or ignored. I will let you know as soon as I have more details.

Best regards,

Edvin
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
+1 Edvin 7 months ago in reply to Edvin

Hello,

I have some good news! Our SoftDevice team have found the cause of this bug, and the fix is merged into our next release in v2.8.0. The issue was related to a lot of UART events being generated, which caused some issues with some internal buffers and parameters not being correctly cleared.

I believe v2.8.0 is right around the corner, so you can test the fix when it is released. (I am not certain it is part of the release candidates that are already available).

Best regards,

Edvin
Cancel
Vote Up +1 Vote Down

Sign in to reply

Reject Answer

Cancel