Crash on connect with NCS 2.6.0

You can find all information about the issue and how to reproduce it on GitHub: github.com/.../ncs-2.6.0-connect-crash

  • Hi

    Okay, the bus fault seems to come from the SoftDevice handler. After some discussion with colleagues, we are back to the UART likely being the blocker here: it stops the SoftDevice Controller from handling the MTU exchange and callbacks correctly. I reproduced the crash here as well, but have not had time to test the following on my end. Try moving uart_rx_enable() into a work item, as is done in the peripheral_uart sample:

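    /* Once during initialization, e.g. in uart_init(): */
    k_work_init_delayable(&uart_work, uart_work_handler);

    /* Work handler: allocates a receive buffer and enables UART reception. */
    static void uart_work_handler(struct k_work *item)
    {
        struct uart_data_t *buf;

        buf = k_malloc(sizeof(*buf));
        if (buf) {
            buf->len = 0;
        } else {
            /* No memory available: retry after UART_WAIT_FOR_BUF_DELAY. */
            LOG_WRN("Not able to allocate UART receive buffer");
            k_work_reschedule(&uart_work, UART_WAIT_FOR_BUF_DELAY);
            return;
        }

        uart_rx_enable(uart, buf->data, sizeof(buf->data), UART_WAIT_FOR_RX);
    }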

    Best regards,

    Simon

  • First: we don't need a workaround, because we are willing to wait a little for a proper fix. We simply don't want to be stuck with 2.5.1 (where the bug doesn't exist) forever.

    I implemented your suggested change and pushed it to GitHub. It did not make a difference:

    [00:00:00.012,939] <inf> main: BT ready
    [00:00:00.037,963] <dbg> bt_hci_core: bt_recv: buf 0x200187ec len 68
    [00:00:00.037,994] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
    [00:00:00.038,024] <dbg> bt_hci_core: rx_work_handler: buf 0x200187ec type 1 len 68
    [00:00:00.038,024] <dbg> bt_hci_core: hci_event: event 0x3e
    [00:00:00.038,055] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x08
    [00:00:00.038,085] <dbg> bt_ecc: bt_hci_evt_le_pkey_complete: status: 0x00
    [00:00:01.000,488] <inf> main: UART initialized
    [00:00:05.054,046] <dbg> bt_hci_core: bt_recv: buf 0x200187ec len 33
    [00:00:05.055,511] <err> os: ***** BUS FAULT *****
    [00:00:05.055,511] <err> os:   Imprecise data bus error
    [00:00:05.055,541] <err> os: r0/a1:  0x0014043e  r1/a2:  0x200125bf  r2/a3:  0x00000006
    [00:00:05.055,572] <err> os: r3/a4:  0x0014043f r12/ip:  0x00000003 r14/lr:  0x00004e27
    [00:00:05.055,572] <err> os:  xpsr:  0xa1000000
    [00:00:05.055,572] <err> os: Faulting instruction address (r15/pc): 0x0000bb74
    [00:00:05.055,633] <err> os: >>> ZEPHYR FATAL ERROR 26: Unknown error on CPU 0
    [00:00:05.055,664] <err> os: Current thread: 0x20012958 (MPSL Work)

  • Sorry, I forgot to mention that I also changed your UART timeout. Try increasing it from 100 µs to at least 1000 µs. I would also recommend replacing the memory[100] buffer with a buffer defined outside your work function, so that it is still valid after the work function returns.

    I am not sure why it fails when the timeout is 100µs. I have forwarded this to our SoftDevice controller team.
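
    As a rough illustration of those two changes, here is a minimal sketch. The names uart_rx_buf, UART_RX_BUF_SIZE and UART_RX_TIMEOUT_US are made up for this example and are not taken from the project. The stand-in declarations at the top, including the stub uart_rx_enable(), exist only so the snippet compiles on a host; the stub mirrors the signature of Zephyr's uart_rx_enable() and simply records its arguments.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch builds on a host. In a Zephyr build these come
 * from <zephyr/kernel.h> and <zephyr/drivers/uart.h> instead. */
struct device;
struct k_work;
static const struct device *uart;   /* assumed UART device handle */
static uint8_t *stub_buf;           /* arguments recorded by the stub */
static size_t stub_len;
static int32_t stub_timeout_us;

/* Stub with the same signature as Zephyr's uart_rx_enable(). */
static int uart_rx_enable(const struct device *dev, uint8_t *buf,
                          size_t len, int32_t timeout)
{
    (void)dev;
    stub_buf = buf;
    stub_len = len;
    stub_timeout_us = timeout;
    return 0;
}

/* Hypothetical names for this example. */
#define UART_RX_BUF_SIZE   100   /* same size as the original memory[100] */
#define UART_RX_TIMEOUT_US 1000  /* inactivity timeout, raised from 100 */

/* File scope: the storage outlives the work handler, so the driver can
 * keep writing into it after the handler has returned. */
static uint8_t uart_rx_buf[UART_RX_BUF_SIZE];

static void uart_work_handler(struct k_work *item)
{
    (void)item;
    (void)uart_rx_enable(uart, uart_rx_buf, sizeof(uart_rx_buf),
                         UART_RX_TIMEOUT_US);
}
```

    The point of the static buffer is lifetime: a local array inside the work function would be handed to the UART driver and then go out of scope as soon as the function returns.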

    Best regards,

    Edvin

  • Hello,

    Edit:

    After talking to our SoftDevice team, they said that they are not able to reproduce the issue on the current main branch (which will turn into 2.7.0 at some point), but they are not yet sure what caused it. They are still looking into it. In the meantime, for further development, you can increase the UART inactivity timeout from 100µs to 1000µs until we find the cause of the issue.

    Best regards,

    Edvin

  • I can still reproduce the issue (the peripheral crashing) if I use 2.6.0 (or 2.6.2) for the central and 2.7.0 for the peripheral. That means a slight difference in the behavior of the central just makes the issue less likely to appear, but unfortunately it's not fixed yet.

  • Hello,

    Thanks for coming back to us. I remember this issue. I see that there is no update on my internal ticket, but I will ping it. Can you please let me know whether you needed to make any changes to the application to reproduce it using 2.7.0? And just out of curiosity: does it matter if the central is using 2.7.0 as well? Is it still reproducible? I am not saying that making sure both are on 2.7.0 is a fix, but knowing whether it is reproducible when the central is on 2.7.0 can be a place to start.

    Best regards,

    Edvin

  • All I had to do was to change the nRF SDK version in west.yml and run `west update`.

    For the central, only 2.7.0 makes the bug disappear. With 2.6.0 and 2.6.2, it still happens.
