nRF5340 Missing Disconnect Event (Supervision Timeout)

Hi,

I have an nRF5340 using hci_ipc with the following features:

  • Central Role + Scanning (Multi-connection to nRF52840 peripherals)
  • Peripheral Role
  • QoS Channel Scanning
  • Nordic UART Service (Server + Client)

I am missing Bluetooth supervision timeout events on the nRF53 with both NCS 2.7.0 and the upstream Zephyr Bluetooth stack (no disconnected callback when using the SoftDevice Controller).

Problem Flow from Central perspective:
1) Scan and connect
2) Subscribe to NUS (cached handles - no discovery)
3) MTU Exchange
4) Send NUS data back and forth (~1.2Kb DTLS handshake)
5) Disconnect the peripheral while a central NUS GATT write is pending (breakpoint on the peripheral nRF52 / power off / reset)
6) Observe no supervision timeout (4s) [This event is expected, but doesn't happen]
7) Observe GATT write error 30s later
8) GATT error cleans up the NUS subscription; manually issue a disconnect here
9) Observe no .disconnected callback

Setups used:

1) Initially we were on NCS 2.7.0 with the included hci_ipc, using the ipc_radio sample with a few upstream network-related (not Bluetooth) cherry-picks. This configuration missed disconnect events.
2) Now we're on a heavily cherry-picked SDK that gets as close as possible to the upstream Zephyr main Bluetooth stack (as of the last couple of days), plus the newest nrfxlib SoftDevice Controller and hci_ipc. This configuration still misses disconnect events.
3) With the same heavily cherry-picked SDK, we enabled LL_SW_SPLIT (CONFIG_BT_LL_SW_SPLIT) and no longer miss these supervision timeout events with the same application code.


I am curious which layers these disconnect events are supposed to propagate through (for example `zephyr/drivers/bluetooth/ipc/ipc.c`) when using `hci_ipc`.

I have not yet been able to make a minimal reproducible example, but I am working towards one.

For additional context, I also posted about this in #nordic on the Zephyr Discord; I have tried to summarize that post here.
https://discord.com/channels/720317445772017664/883445320812466209/1283513590606860318

Let me know if I can provide additional information,
Jeff

  • Hi Jeff,

    Thank you for the detailed summary of your setup. It helped me create minimal peripheral and central projects based on our peripheral UART and central UART samples, which reproduce the issue within a couple of minutes (GATT timeout followed by a missed disconnect event). I have handed the projects over to our Bluetooth team for further investigation. Regarding the Zephyr controller, it uses two RX threads, one of which is high priority, and that may be what unblocks the host. I will keep you updated on the progress.

    Thank you,

    Vidar

  • Vidar,

    That's great news!
    Is it possible to share the sample projects? (No need to spend time cleaning up the code.)

    We're more than happy to test and debug any potential solutions as this is our biggest blocker for customer sampling.

    Very much appreciate the help,
    --Jeff

  • Jeff,

    Of course, please see attached. Note that I have only tested these with v2.6.1 thus far.

    deadlock_gatt_timeout.zip

    Best regards,

    Vidar

  • Perfect! Thanks for the quick response.

    I was able to reproduce it with v2.7.0 as well.
    Note: nRF5340 DK central --> nRF52840 peripheral

    *** Booting nRF Connect SDK v2.7.0-5cb85570ca43 ***
    *** Using Zephyr OS v3.6.99-100befc70c74 ***
    ...
    [00:00:15.873,535] <inf> central_uart: 100 packets sent.
    [00:00:45.870,544] <err> bt_att: ATT Timeout for device D4:BC:F8:83:7F:87 (random)
    [00:00:45.870,605] <inf> central_uart: Received data

    --Jeff

  • Thanks for confirming that it also fails with v2.7.0. It seems we may have isolated the problem to inter-core communication, but this still needs to be confirmed. We have observed that the HCI disconnect event does not reach the BT receive thread, despite the netcore HCI trace showing that it was sent. I hope to have more information to share next week.

    -Vidar

  • Hi Jeff,

    Could you try disabling HCI ACL flow control (CONFIG_BT_HCI_ACL_FLOW_CONTROL=n) in the Bluetooth host? This won't eliminate the problem, but it should help mitigate it.

    Vidar

  • Vidar,

    Right now our mitigation is still LL_SW_SPLIT, which sacrifices channel selection and temperature-compensated RSSI but doesn't seem to miss disconnect events.

    I have been tracking new BLE host PRs on GitHub, but nothing fruitful so far.

    --Jeff

  • Jeff,

    Unfortunately, it is becoming clear that it will take some time to fully resolve the issues identified in the Bluetooth stack, and changes to the SoftDevice Controller may be required. However, I would still recommend trying the SoftDevice Controller without flow control in your application to see whether it fixes the deadlock in your case.

    Best regards,

    Vidar