nRF5340 Missing Disconnect Event (Supervision Timeout)

Hi,

I have a nRF5340 using hci_ipc that has the following features:

Central Role + Scanning (Multi-connection to nRF52840 peripherals)
Peripheral Role
QoS Channel Scanning
Nordic Uart (Server + Client)

I am missing bluetooth supervision timeout events with the nRF53 with both ncs 2.7.0 and the upstream Zephyr BLE stack. (No disconnected callback using the Softdevice).

Problem Flow from Central perspective:
1) Scan and connect
2) Subscribe to NUS (cached handles - no discovery)
3) MTU Exchange
4) Send NUS data back and forth (~1.2Kb DTLS handshake)
5) Disconnect Peripheral with pending Central NUS GATT write (breakpoint peripheral nrf52 / power off / reset)
6) Observe no supervision timeout (4s) [This event expected, but doesn't happen]
7) Observe GATT write error 30s later
8) GATT error cleans up nus subscription, manually issue disconnect here
9) Observe no .disconnected callback

Setups used:

1) Initially we were on NCS 2.7.0 with the included hci_ipc using the ipc_radio sample with a few upstream network (not bluetooth) related cherrypicks. We had problems with this configuration missing disconnect events.
2) Now we're on a heavily cherrypicked SDK to get as close to Zephyr main upstream Bluetooth stack (from the last couple days) & the newest nrfx-lib softdevice+hci_ipc. We still had problems with this.
3) With the heavily cherrypicked SDK, we enabled SPLIT_SW_LL and do not miss these supervision timeout events with the same application code.

I am curious what layer these disconnect events are supposed to get propagated through `zephyr/drivers/bluetooth/ipc/ipc.c` when using `hci_ipc`.

I have not yet been able to make a minimum viable example, but am working towards this.

I made a Zephyr discord post regarding this on #nordic for additional context, which I tried to summarize here.
https://discord.com/channels/720317445772017664/883445320812466209/1283513590606860318

Let me know if I can provide additional information,
Jeff

Top Replies

Parents

0 Vidar Berg over 1 year ago

Hi Jeff,

Thank you for the detailed summary of your setup. This helped me create minimal peripheral and central projects based on our peripheral and central UART, allowing me to reproduce the issue within a couple of minutes (GATT timeout followed by a missed disconnect event). I have handed over the projects to our Bluetooth team for further investigation. Regarding the Zephyr controller, it uses two RX threads with one high priority thread, which may unblock the host. I will keep you updated on the progress.

Thank you,

Vidar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 JeffW over 1 year ago in reply to Vidar Berg

Vidar,

That's great news!
Is it possible to share the sample projects? (No need to spend time cleaning up code)

We're more than happy to test and debug any potential solutions as this is our biggest blocker for customer sampling.

Very much appreciate the help,
--Jeff
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Vidar Berg over 1 year ago in reply to JeffW

Jeff,

Does the peripheral use different addresses for its advertisements? If not, what could be triggering the central device to reconnect to the peripheral when it has not received a disconnect callback? Another approach might be to use write or notify with callbacks and trigger a disconnect if the callback is not received within a certain time. I would not expect this to have a noticeable impact on your power budget.

As I understand from the developers, the proper fix for this will likely require additional changes to the controller itself.

Vidar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 JeffW over 1 year ago in reply to Vidar Berg

Vidar,

I think there's a small misunderstanding.
The Central device wrongly stays in a "connected state" when the link should have been terminated due to missing Link Layer packets.

1) Central connects to peripheral
2) DTLS over NUS until handshake is completed. NUS data stops. Only thing going is LL packets.
3) Battery is pulled on peripheral device (proxy for real world disconnect due to range)
4) Supervision timeout does not hit after 6 seconds. Wait around 30s-1 minute at this point.
5) Battery is replaced, device advertises (with same address)
6) Central tries to reconnect to the device but bt_conn_le_create fails with -22 because bt_conn_exists_le fails and prints "<wrn> ble_conn: Found valid connection (0x2000f540) with address F1:61:8:6E:9A:B2 (random) in connected state.

This prevents reconnecting to the peripheral device because internally there is still a conn found by bt_conn_lookup_addr_le where the state is still connected in some combination of the host/controller, even though it has not gotten any connection LL packets past the supervision timeout.

The nRF52 peripheral correctly gets the supervision timeout in all real world (non-battery-pull) cases.

I appreciate the suggestions, and some methods like failing a GATT write from the controller might cause a disconnect after a timeout (30s, but recently configurable iirc), but we're in a bluetooth proximity tethering application that requires being fast acting, so actually getting the supervision timeout disconnected event is ideal.

Happy to clarify further, also available on the Zephyr Discord (Jeff Welder) if that line of communication proves more efficient,
--Jeff
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Vidar Berg over 1 year ago in reply to JeffW

Jeff,

JeffW said:
I think there's a small misunderstanding.
The Central device wrongly stays in a "connected state" when the link should have been terminated due to missing Link Layer packets.

I was wondering what prompted the central to reconnect when it still believed it was connected to the peripheral. However, I had a single link GAP central in mind when I asked this. For a multi-link central, you would continue scanning after the connection has been established, so I think this part makes sense now.

As a workaround, would it make sense to check if you already have a valid connection for a given address before calling bt_conn_le_create()? This seems like an easy way to detect a missed disconnect event. You could then call bt_conn_disconnect() on the stale connection object to attempt to free it before establishing the connection again.

JeffW said:
Only thing going is LL packets.

The flow control bug we attempted to address in v2.8.0 should only occur when the stack needs to wait on resources (buffers) to be freed and is not expected to happen during low BLE traffic.

Vidar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 JeffW over 1 year ago in reply to Vidar Berg

Vidar,

I appreciate the suggestions, but both the central and peripheral are multi-connection. Unfortunately due to this topology, just because the peripheral is advertising again does not mean that the central missed a disconnect event.

The connection pointer needs to have the correct state. I don't know how the supervision timeout does not get processed when almost nothing besides scanning is happening.

For the time being we're using the Split LL again, but ideally we move back to Nordic for a more familiar path to SIG certification.

--Jeff
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Vidar Berg over 1 year ago in reply to JeffW

Jeff,

It is surprising that this occurs when almost nothing is happening besides scanning. The issues we found earlier would only occur during high BLE activity. I have reported this as new issue internally. I also found out that we already include tests that simulate sudden disconnects by resetting the peer device. However, it seems we are not able to replicate the same conditions seen with your application.

Vidar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Reply

0 Vidar Berg over 1 year ago in reply to JeffW

Jeff,

It is surprising that this occurs when almost nothing is happening besides scanning. The issues we found earlier would only occur during high BLE activity. I have reported this as new issue internally. I also found out that we already include tests that simulate sudden disconnects by resetting the peer device. However, it seems we are not able to replicate the same conditions seen with your application.

Vidar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Children

0 JeffW over 1 year ago in reply to Vidar Berg

Vidar,

My colleague was able to create a barebones sample to demonstrate this missing disconnect.

We modified your sample in these ways-
Peripheral:
On send enabled (this is after MTU update), we send 1000 bytes of zeros.
Central responds with 1000 bytes of ones.
Repeats sending ones/twos/threes.
After we receive the last packet, we update the connection interval parameters and go idle.

Central:
Switch to IPC_RADIO
No auto connect on filter match
Vendor specific connection event report is enabled
Scanning is only disabled while connecting (code in main.c does not support multiple connections, but it is modeled after our application)

Inside the zip is a readme.
Central was a nRF5340dk
Peripheral was a nRF52840dk

MissingDisconnectModifiedSamples.zip

[00:00:23.747,192] <dbg> bt_conn: tx_notify_process: conn 0x200018c8
[00:00:23.747,222] <dbg> bt_conn: tx_notify_process: tx 0x2000297c cb 0 user_data 0
[00:00:23.747,222] <dbg> bt_conn: tx_free: 0x2000297c

-- NOT ORIGINAL LOG --
-- BEGIN WINDOW WHERE I TURN OFF POWER TO 52840DK PERIPHERAL --

[00:00:23.747,253] <dbg> bt_conn: tx_notify_process: raise TX IRQ
[00:00:23.747,283] <dbg> bt_conn: bt_conn_tx_processor: start
[00:00:23.747,283] <dbg> bt_conn: bt_conn_tx_processor: no connection wants to do stuff
[00:00:24.047,241] <dbg> bt_conn: bt_conn_ref: handle 10 ref 2 -> 3
[00:00:24.047,271] <dbg> bt_conn: bt_conn_unref: handle 10 ref 3 -> 2

-- NOT ORIGINAL LOG --
-- END WINDOW --
-- IF DISCONNECTED, TURN ON AND TRY AGAIN, VARY WHEN POWER IS TURNED OFF IN THIS WINDOW --
-- IF UNABLE TO REPRODUCE, TRY SLIGHTLY BEFORE OR AFTER WINDOW--

[00:09:06.285,400] <inf> central_uart: Filters matched. Address: D4:BC:F8:83:7F:87 (random) connectable: 1
[00:09:06.285,583] <dbg> bt_conn: bt_conn_tx_processor: start
[00:09:06.285,583] <dbg> bt_conn: bt_conn_tx_processor: no connection wants to do stuff
[00:09:06.289,550] <dbg> bt_conn: bt_conn_ref: handle 10 ref 2 -> 3
[00:09:06.289,672] <wrn> bt_conn: Found valid connection (0x200018c8) with address D4:BC:F8:83:7F:87 (random) in connected state 
[00:09:06.289,672] <dbg> bt_conn: bt_conn_unref: handle 10 ref 3 -> 2
[00:09:06.289,703] <dbg> bt_conn: conn_le_create_common_checks: Conn check failed: ACL connection already exists.
[00:09:06.289,703] <err> central_uart: bt_conn_le_create failed (err -22)

--Jeff

0 Vidar Berg over 1 year ago in reply to JeffW

Jeff,

Thank you very much for creating this sample. We will use it to reproduce the issue on our end and hope to identify the root cause soon.

Vidar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 JeffW over 1 year ago in reply to Vidar Berg

Hey Vidar,

Not expecting any solutions -- just curious if any teams were able to reproduce the problem given the sample, or not yet. :)

--Jeff
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Vidar Berg over 1 year ago in reply to JeffW

Hi Jeff,

I have requested a status update on this issue.

--Vidar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Vidar Berg over 1 year ago in reply to Vidar Berg

Hi Jeff,

Sorry for the delayed response. Both R&D and I have made several attempts to reproduce the issue with the missed disconnect, but without any luck. Were you able to reproduce this fairly consistently on your side?

Below is the central log from the nRF5340 after having reset the 52840 peripheral. Since the scanner is always active in this sample and the peripheral starts advertising immediately after a reset before the supervision timer on the central has expired, I do see repeated warnings showing that the peripheral is already connected before the link is actually terminated by the supervision timeout.

--Vidar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel