nRF52 dual role: while connected as central and scanning, connection as peripheral fails

We have an nRF52 with dual roles: peripheral for the connection to an app, and central, connecting to multiple third party peripheral devices.

We have come across a very weird issue - it took a while to reproduce, but here it is:

If our device as central is connected to a Holux GPS receiver, and it is currently scanning for more devices, and we then connect our app to our devices peripheral role, it fails. Well, actually it's worse: it does connect, but during the next few steps (MTU & parameter negotiation) suddenly disconnects with reason 0x08

What's odd is that everything else being the same, with a QStarz GPS receiver connected instead, this is absolutely no problem. The only difference being that the connection parameters are different (lower connection interval)

Holux:

min: 320 (400ms)
max: 520 (650ms)
latency: 0
sup_timout: 400 (4000ms)

They agree on 520 (650ms)

The Holux, btw has a Nordic Chipset.

Qstarz:

min: 16 (20ms)
max: 60 (75ms)
latency: 0
sup_timout: 400 (4000ms)

They agree on 60 (75ms)

our ble_gap_scan_params_t remain the same:

.active = 1,
.interval = MSEC_TO_UNITS(100, UNIT_0_625_MS),
.window = MSEC_TO_UNITS(30000, UNIT_10_MS),
.timeout = MSEC_TO_UNITS(50, UNIT_0_625_MS),

when we receive a connection on our peripheral role, we immediately (try to) stop scanning, and restart it once the connection is established, but this seems not to work as expected if we're connected with long connection intervals.

Sniffing the traffic with the nRF sniffer does not reveal anything special - our device simply stops responding in the middle of service discovery, about 1.2 seconds after the initial connection.
I've attached a sniffing session of the problem (2 successful connections while not scanning, one fails when scanning) and 2 sessions of connecting the holux accessory and the qstarz accessory.

Do you have any ideas what we should try or look into?

Thanks already!

3124.unleasshed-holux-app-connect-scan-disconnect-app-reconnect-fail.pcapng

1070.unleashed-holux.pcapng

2133.unleashed-qstarz.pcapng

Parents

0 Einar Thorsrud over 4 years ago

Hi,

Looking at unleasshed-holux-app-connect-scan-disconnect-app-reconnect-fail.pcapng I see the intended disconnect in #3228, and then a re-connection, data exchange, and a new sudden disconnect at #3486. Looking at the packets before there you see that there is a lot of retransmissions, so the disconnect here with reason 8 (timeout) is expected given that. There is no telling why there are all these retransmissions, though. Are you able to debug both devices? It seems clear that the peripheral stopped receiving and acknowledging the packets. Perhaps it is in a bad state? The sniffer trace does not give information about this, though I suggest you debug on the peripheral side to see what state it is in.
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 thefool over 4 years ago in reply to Einar Thorsrud

Hi,

yes, exactly. That's what we were observing too. Our device is not showing any signs of getting stuck, so we're assuming it's the soft device. After all it stops responding in the middle of service discovery. Can we poll the softdevice state for debugging? Any specific info we should try to get?

The odd thing is that this only happens if our device is simultaneously connected as central with a 650ms connection interval and is scanning as central while we're trying to connect to its peripheral instance.

If it's connected as central with a 75ms connection interval, and is scanning, then there's no problem!
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Reply

0 thefool over 4 years ago in reply to Einar Thorsrud

Hi,

yes, exactly. That's what we were observing too. Our device is not showing any signs of getting stuck, so we're assuming it's the soft device. After all it stops responding in the middle of service discovery. Can we poll the softdevice state for debugging? Any specific info we should try to get?

The odd thing is that this only happens if our device is simultaneously connected as central with a 650ms connection interval and is scanning as central while we're trying to connect to its peripheral instance.

If it's connected as central with a 75ms connection interval, and is scanning, then there's no problem!
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Children

0 Einar Thorsrud over 4 years ago in reply to thefool

Hi,

thefool said:
Our device is not showing any signs of getting stuck, so we're assuming it's the soft device.

I am confused about this. By "our device", so you mean the central (in the unleasshed-holux-app-connect-scan-disconnect-app-reconnect-fail.pcapng)? And by "soft device" do you mean the peripheral as a whole, or do you mean the SoftDevice as in the BLE stack we provide?

thefool said:
Can we poll the softdevice state for debugging? Any specific info we should try to get?

The SoftDevice generates events for most scenarios, and also use asserts for most unthinkable scenarios so it is very rare that it files silently. This is the default behavior and there is no special "debug mode". I suggest you enable logging on the peripheral to see what happens there. Generally in the vast majority of cases the issue is in the application code and not in the SoftDevice. The first step needs to be to debug the behavior on the peripheral side (which for some reason has stopped to acknowledge packets).

thefool said:
The odd thing is that this only happens if our device is simultaneously connected as central with a 650ms connection interval and is scanning as central while we're trying to connect to its peripheral instance.

If it's connected as central with a 75ms connection interval, and is scanning, then there's no problem!

Is it just in this specific connection interval (650 ms) it fails, or is there a more general tendency here? Generally, the SoftDevice will drop some packets now and then when there are collisions. That is controlled by the SoftDevice scheduling priorities and so the connection should not timeout. Can you try to reduce the scan window size and and perhaps also the frequency and see if that helps? Normally that should not be needed though, so I am not very optimistic this would.

Did you make the peripheral here, so that you can debug it? As mentioned, that seems like the obvious way forward. If that is not possible (so that the peripheral - where the failure seems to be - is a black box), then the approach would have to be mostly trial and error.
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 thefool over 4 years ago in reply to Einar Thorsrud
Just to clarify: This is our setup:

A:iOS App (central) -> B:Unleashed (dual role) -> C:Holux (peripheral)
or
A:iOS App (central) -> B:Unleashed (dual role) -> D:QStarz (peripheral)

The problem is only with the connection between A and B's peripheral role!

But the odd thing is that it is only problematic when B's central role is connected to C and B is scanning.
There is no problem when B is connected to D, and it is scanning,
nor is there a problem when B is not connected to C or D, and is scanning or not
nor if it is connected to either C or D, but not scanning.

The only difference between C and D is the unusually long connection interval of C!

We have full control over A and B, including debug output, but not over C or D. But I don't think there is anything to debug here - as the B->C or B->D connections work flawlessly - but C has a weird influence on the connection A->B.

Here's Debug output of the Events:

ble_stack.c : 525: 38734: BLE event: 29
ble_stack.c : 525: 38735: BLE event: 29
ble_stack.c : 525: 38946: BLE event: 29
ble_stack.c : 525: 38947: BLE event: 29
ble_stack.c : 525: 39006: BLE event: 57
ble_stack.c : 525: 39798: BLE event: 29
ble_stack.c : 525: 39799: BLE event: 29
ble_stack.c : 525: 39894: BLE event: 29
ble_stack.c : 525: 39894: BLE event: 29
ble_stack.c : 525: 39901: BLE event: 29
nrf_ble_gatt.c : 154: 39952: Requesting to update ATT MTU to 247 bytes on connection 0x3
nrf_ble_gatt.c : 77: 39952: Updating data length to 251 on connection 0x3.
ble_conn_params.c : 163: 39952: Connection interval (handle 3) not acceptable actual max 30, preferred min 10 max 15
ble_stack.c : 282: 39952: Connection parameters rejected - will retry negotiation
ble_stack.c : 525: 39952: BLE event: 16
ble_stack.c : 426: 39952: PERIPHERAL: Connected
ble_stack.c : 427: 39952: Connection interval: 30 ms
accessories_core.c : 174: 39952: Scanning started

unleashed_ble.c : 102: 39952: App connected
app_state.c : 195: 39955: App state changed to BACKGROUND

ble_srv_utils.c : 225: 39959: Updating system attributes [0x00]
nrf_ble_gatt.c : 307: 40029: Data length updated to 251 on connection 0x3
nrf_ble_gatt.c : 311: 40029: Max RX octets: 251
nrf_ble_gatt.c : 313: 40029: Max TX octets: 251
nrf_ble_gatt.c : 315: 40029: Max RX time: 2120
nrf_ble_gatt.c : 317: 40029: Max TX time: 2120
ble_stack.c : 525: 40029: BLE event: 36
ble_stack.c : 525: 40082: BLE event: 29
ble_stack.c : 525: 40083: BLE event: 29
nrf_ble_gatt.c : 214: 40155: ATT MTU updated to 185 bytes on connection 0x3 (response).
ble_stack.c : 165: 40155: GATT ATT MTU changed to 185
ble_stack.c : 525: 40155: BLE event: 58
ble_stack.c : 525: 40250: BLE event: 17
ble_stack.c : 437: 40250: PERIPHERAL: Disconnected
ble_stack.c : 452: 40250: - UNKNOWN REASON [0x08]

From the Sniffing session, we can see that B's peripheral role stops responding in the middle of service discovery (initiated by A). The responses to Service discovery is handled entirely by the soft device, is it not? Can we get any debug output of an "internal state" during service discovery? There are many steps involved, so I'm sure you're going through a state machine...

I can try edit the scanning parameters, but since they work perfectly when connected to D, I can't see how that would help.

I would understand problems with dropped packets if the connection interval to C would be the shortest possible, and we're be maximising data transfer there, and would be scanning at the same time. Then I get that the radio would be at capacity. But the odd thing is that the really long interval seems to have the negative effect, whereas it works perfectly with a shorter interval...
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Einar Thorsrud over 4 years ago in reply to thefool

Hi,

I agree it is interesting that this seems worse with longer connection interval, and also that you only see issues in the link between A and B when B is connected to C and not D. There is still not much concrete to go on, but there are a few things worth checking out.

From this log everything looks good until you get a disconnect due to timeout (0x08) on the peripheral in this link (device B). The events you log here show nothing else that seems relevant. This makes me wonder - what is the 32.768 clock source? Is it RC or crystal? And how have you configured it in SW (including accuracy in PPM)? If you use crystal, can you share your exact crystal part number and which load cap values you use? This may be a long shot, as if there was an issue here I would have expected the communication between B and C or B and D to show issues as well, but that does not seem to be the case (if it is from the same time period).

Which SDK version and SoftDevice version do you use?
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 thefool over 4 years ago in reply to Einar Thorsrud
We're using a Module with a built-in XTAL. It's a crystal with max 40 ppm, but I don't know the part number or load caps.

sdk_config.h:

NRF_SDH_CLOCK_LF_SRC 1 //XTAL

NRF_SDH_CLOCK_LF_ACCURACY 5 //50ppm

We're using SDK 17.0 and SD s132_nrf52_7.0.1
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 thefool over 4 years ago in reply to thefool

I think I actually just managed to work around the bug by stopping scanning as soon as possible when we get a connection to our peripheral role.

Still I'd like to understand if this is a bug in the softdevice, and/or what the reasoning behind this is.

I guess it would be worth, checking if you can reproduce the setup with example projects.
Instead of our Holux: One Blinky with preferred connection parameters: min: 320 (400ms), max: 520 (650ms), latency: 0, sup_timout: 400 (4000ms).
Instead of our Unleashed: one dual role with preferred connection parameters: 8(10ms)-12(15ms), 2mbps PHY and MTU/Datalength: 247/251
Instead of our App: nRF Connect on an iPhone.

I already tested using nRF connect with our devices and I see the same error, so it has nothing to do with our app.
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel