Beware that this post is related to an SDK in maintenance mode
More Info: Consider nRF Connect SDK for new designs

Out of range catastrophe

Hello everyone!

This thread is related to Indoor BLE Range Improvements, as the same system is used. There we discussed an S140 SoftDevice scheduling conflict; here we want to talk about a problem caused by a device going out of range.

Background 

We are in the development phase for indoor household appliances with many wirelessly connected entities. The required topology is 8 battery-powered devices and a single mains-powered device. As the target environment is indoors and only a low data rate (low bandwidth) is needed, we decided to test the BLE 5 LE Coded PHY (S=8, 125 kbit/s), also known as Long Range, using the Minew MS88SF3 module, which features the nRF52840 chipset.

So, we prepared a simple mock-up test application to check how the system behaves in real-life scenarios. The test application is very simple: the central BLE device scans continuously and stops once all 8 connections are established. Peripheral BLE devices advertise whenever they are not in a connection. With this approach the system should always converge to having all devices connected: if one device gets disconnected, advertising/scanning is restarted and the dropped connection should be re-established. Every second, 64 bytes of dummy data are exchanged between the central and each peripheral device using server-initiated updates (notifications). Tx power for advertising and connections is set to 8 dBm. The connection interval is set to 1500 ms, slave latency to 2 and the supervision timeout to 15000 ms. Well, it works perfectly on a desk!
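For readers who want to sanity-check these connection parameters: the Link Layer expresses the connection interval in 1.25 ms units and the supervision timeout in 10 ms units, and the Bluetooth Core Specification requires the timeout to be larger than (1 + latency) * interval * 2. The following is a minimal, SDK-independent sketch of that arithmetic; the type and function names are hypothetical and are not taken from ble_c.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Connection parameters in Link Layer units (hypothetical helper type). */
typedef struct {
    uint16_t interval_units; /* connection interval, 1.25 ms per unit */
    uint16_t latency;        /* slave latency, in connection events   */
    uint16_t timeout_units;  /* supervision timeout, 10 ms per unit   */
} conn_params_t;

/* Milliseconds -> 1.25 ms units, using integer math (ms * 4 / 5). */
static uint16_t ms_to_interval_units(uint32_t ms)
{
    return (uint16_t)(ms * 4u / 5u);
}

/* Milliseconds -> 10 ms units. */
static uint16_t ms_to_timeout_units(uint32_t ms)
{
    return (uint16_t)(ms / 10u);
}

/* Core spec rule: timeout > (1 + latency) * interval * 2.
 * Compared in microseconds so that no fractions are needed. */
static bool timeout_is_valid(const conn_params_t *p)
{
    assert(p != NULL);
    uint64_t interval_us = (uint64_t)p->interval_units * 1250u;
    uint64_t timeout_us  = (uint64_t)p->timeout_units * 10000u;
    return timeout_us > (1u + p->latency) * interval_us * 2u;
}

/* How many whole connection intervals fit into the supervision timeout,
 * i.e. roughly how long a silent link is kept before it is dropped. */
static uint32_t intervals_before_timeout(const conn_params_t *p)
{
    assert(p != NULL && p->interval_units != 0);
    uint64_t interval_us = (uint64_t)p->interval_units * 1250u;
    uint64_t timeout_us  = (uint64_t)p->timeout_units * 10000u;
    return (uint32_t)(timeout_us / interval_us);
}
```

With the values above (1500 ms, latency 2, 15000 ms) this gives 1200 interval units and 1500 timeout units, the rule is satisfied (15000 ms > 9000 ms), and ten connection intervals fit into one supervision timeout.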

BLE settings

Platform description:

  • IC:               nRF52840
  • Module:       Minew MS88SF3
  • SDK:            nRF5_SDK_17.1.0_ddde560
  • Softdevice:  s140_nrf52_7.2.0 
  • IDE:             SEGGER Embedded Studio for ARM Release 7.10a Build 2022121504.52072
  • OS:              Windows 10

"Out of range" problem

We ran a couple of tests and found that:

  • If we take Device A (one of the 8 peripheral devices, all connected) out of range, it disconnects. When it is moved back into range, it re-connects without any problems - NORMAL EXPECTED OPERATION,
  • If we take Device A (one of the 8 peripheral devices, all connected) out of range, it disconnects. If we then move Device B (another of the 8 peripheral devices, all connected except Device A) out of range, it disconnects too. Device B then does not re-connect when moved back into range. Only after Device A is moved back into range do both Devices A & B re-connect. It is as if Device A is blocking Device B from re-connecting, even though Device B is in range! - ABNORMAL OPERATION,
  • We observed that only the first disconnected device (due to out of range) blocks the others from re-connecting. Moving the first disconnected device back into range triggers all other devices to re-connect,
  • We suspected that Device A (the one that goes out of range first) blocks Device B at the advertising level, as both devices are peripherals. But that is not the case: we ran separate tests to eliminate that possibility, in which the first disconnected device did not start advertising after disconnection. The same effect was observed, the other devices did not re-connect. We therefore conclude that advertising plays no role in this effect, meaning that the problem lies on the Central device.
  • If we disable data transmission on the Central Device and Device A and repeat point 2., Device B re-connects when moved back into range. In that test case Device A did not block Device B from re-connecting.

We ran a couple of tests addressing the "blocking problem" and the outcome was consistent. The following picture shows the problem described above on a real mock-up test system with 9 peripheral devices. For that test we disabled data transmission on the Central device, Dev#6 and Dev#16. During the test the following events took place:

1. Dev#14 lost connection (not on purpose, possibly caused by people moving around it or doors closing) and was automatically reconnected - Not expected, but NO PROBLEM!
2. Dev#6 was disconnected on purpose, to test that the central device is working OK. Reconnected OK!
3. Dev#16 was moved out of range and got disconnected - OK, expected!
4. Battery removed from Dev#6 to test whether it reconnects when the battery is put back - OK, RECONNECTS!
5. Repeat of point 4. - RECONNECTS!
6. Repeat of point 4. - RECONNECTS! --> Consistent reconnection-OK!
7. Dev#16 moved back into range and the device reconnects! It disconnects and reconnects 2x due to moving the device! - OK, expected!
8. Dev#15 moved out of range. That device does not have tx disabled. Device disconnects. - OK, expected!
9. Repeat of point 4. - DOESN'T RECONNECT! ABNORMAL BEHAVIOUR!
10. Dev#15 moved back into range. It connects back! - OK, expected for Dev#15 to re-connect!
11. Dev#6 connects back right after Dev#15 reconnects! STRANGE BEHAVIOUR, as if Dev#15 blocked Dev#6 from re-connecting!
12. Dev#15 lost connection (not on purpose, possibly caused by people moving around it or doors closing) and was automatically reconnected. - Not expected, but NO PROBLEM!

All events are shown on the picture:

Therefore, the following questions arise:

  1. Why does Device A block Device B from re-connecting as described in point 2. (as said, we think it is a Central Device issue)? What is the rational explanation for that?
  2. Why is there a different behaviour between point 2. & 3.? As transmission is the only difference, it must be the source of the problem, right?!
  3. How can we mitigate this "blocking problem", where Device A blocks Device B from re-connecting while it is within valid range?
  4. Have you received any similar problem reports? If so, how were they solved?

Thank you for all the help!

BR, Žiga

 

  • Hi

    I've read through the case, and this behavior seems very strange indeed. Thank you for the thorough explanation! Can you explain a bit more on how the central device behaves when a peripheral is disconnected? It almost seems like when a device is disconnected, the central device will wait for that specific device to reconnect before allowing/searching for any other disconnected peripherals. 

    Does the central scan for each specific device using a whitelist, or does it put the disconnected devices in a queue for example so that device B will not be considered to connect before device A is reconnected? To me it doesn't really seem like an issue on the peripheral, but rather with how the central device handles disconnects and reconnections.

    Are you able to share a sniffer trace using a dedicated sniffer or the nRF Sniffer so we can see what messages are being sent over the air when this issue occurs?

    Best regards,

    Simon

  • Hello,

    Can you explain a bit more on how the central device behaves when a peripheral is disconnected?

    As described, the central device starts scanning right after a peripheral device disconnects, in order to re-establish the connection:

    So, we prepared a simple mock-up test application to check how the system behaves in real-life scenarios. The test application is very simple: the central BLE device scans continuously and stops once all 8 connections are established. Peripheral BLE devices advertise whenever they are not in a connection. With this approach the system should always converge to having all devices connected: if one device gets disconnected, advertising/scanning is restarted and the dropped connection should be re-established. Every second, 64 bytes of dummy data are exchanged between the central and each peripheral device using server-initiated updates (notifications). Tx power for advertising and connections is set to 8 dBm. The connection interval is set to 1500 ms, slave latency to 2 and the supervision timeout to 15000 ms. Well, it works perfectly on a desk!
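The intended scan policy (scan until all 8 links are up, resume scanning as soon as any link drops) can be sketched as a small piece of bookkeeping. This is only an illustration of the policy as described in this thread, with hypothetical names; it is not the code from ble_c.c and it does not call the SoftDevice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PERIPHERALS 8u /* topology from this thread: 8 peripherals */

/* Hypothetical bookkeeping for the central's scan policy. */
typedef struct {
    uint8_t connected; /* number of currently active links          */
    bool    scanning;  /* true while the scanner should be running  */
} central_state_t;

/* At start-up nothing is connected, so the central scans. */
static void central_init(central_state_t *c)
{
    assert(c != NULL);
    c->connected = 0;
    c->scanning  = true;
}

/* A link came up: keep scanning only while links are still missing. */
static void central_on_connected(central_state_t *c)
{
    assert(c != NULL && c->connected < MAX_PERIPHERALS);
    c->connected++;
    c->scanning = (c->connected < MAX_PERIPHERALS);
}

/* A link dropped: scanning must run again so the device can return. */
static void central_on_disconnected(central_state_t *c)
{
    assert(c != NULL && c->connected > 0);
    c->connected--;
    c->scanning = true;
}
```

Under this policy nothing depends on which device dropped first, which is why the observed order-dependent behaviour is unexpected.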


    To me it doesn't really seem like an issue on the peripheral, but rather with how the central device handles disconnects and reconnections.

    Yes, we also think that the problem is on the central side, as said:

    We suspected that Device A (the one that goes out of range first) blocks Device B at the advertising level, as both devices are peripherals. But that is not the case: we ran separate tests to eliminate that possibility, in which the first disconnected device did not start advertising after disconnection. The same effect was observed, the other devices did not re-connect. We therefore conclude that advertising plays no role in this effect, meaning that the problem lies on the Central device.


    It almost seems like when a device is disconnected, the central device will wait for that specific device to reconnect before allowing/searching for any other disconnected peripherals.

    Exactly! That behaviour is repeatable and easily reproduced. But note that it occurs only when data transmission is enabled! As said:

    • If we take Device A (one of the 8 peripheral devices, all connected) out of range, it disconnects. When it is moved back into range, it re-connects without any problems - NORMAL EXPECTED OPERATION,
    • If we take Device A (one of the 8 peripheral devices, all connected) out of range, it disconnects. If we then move Device B (another of the 8 peripheral devices, all connected except Device A) out of range, it disconnects too. Device B then does not re-connect when moved back into range. Only after Device A is moved back into range do both Devices A & B re-connect. It is as if Device A is blocking Device B from re-connecting, even though Device B is in range! - ABNORMAL OPERATION,
    • We observed that only the first disconnected device (due to out of range) blocks the others from re-connecting. Moving the first disconnected device back into range triggers all other devices to re-connect,
    • We suspected that Device A (the one that goes out of range first) blocks Device B at the advertising level, as both devices are peripherals. But that is not the case: we ran separate tests to eliminate that possibility, in which the first disconnected device did not start advertising after disconnection. The same effect was observed, the other devices did not re-connect. We therefore conclude that advertising plays no role in this effect, meaning that the problem lies on the Central device.
    • If we disable data transmission on the Central Device and Device A and repeat point 2., Device B re-connects when moved back into range. In that test case Device A did not block Device B from re-connecting.


    Does the central scan for each specific device using a whitelist, or does it put the disconnected devices in a queue for example so that device B will not be considered to connect before device A is reconnected?

    The central device uses a whitelist, but a custom-implemented one. It simply checks the manufacturer data inside a connectable advertisement packet, and if that data matches, it tries to connect to that peripheral (see the attached code in "ble_c.c", line 1175: ble_c_evt_on_adv_report).
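To make the idea concrete for readers without the attachment, a filter of this kind can be sketched as a walk over the AD structures of the advertising payload, looking for the Manufacturer Specific Data field (AD type 0xFF). This is a minimal sketch and not the actual code from ble_c.c: the company ID and the two "magic" bytes are hypothetical placeholders, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AD_TYPE_MANUFACTURER_DATA 0xFFu /* Manufacturer Specific Data */

/* Hypothetical filter values, NOT taken from the attached ble_c.c. */
#define EXAMPLE_COMPANY_ID 0xFFFFu /* placeholder ID, for testing only */
static const uint8_t example_magic[] = { 0xA5, 0x5A };

/* Returns true if the advertising payload carries manufacturer data
 * with the expected company ID followed by the expected magic bytes. */
static bool adv_matches_filter(const uint8_t *adv, size_t adv_len)
{
    assert(adv != NULL);
    size_t i = 0;
    while (i < adv_len) {
        uint8_t field_len = adv[i]; /* covers type byte + payload */
        if (field_len == 0) {
            break; /* zero length marks the end of the data */
        }
        if (i + 1u + field_len > adv_len) {
            break; /* malformed: field runs past the buffer */
        }
        uint8_t        type        = adv[i + 1u];
        const uint8_t *payload     = &adv[i + 2u];
        size_t         payload_len = (size_t)field_len - 1u;

        if (type == AD_TYPE_MANUFACTURER_DATA &&
            payload_len >= 2u + sizeof(example_magic)) {
            /* Company ID is transmitted little-endian. */
            uint16_t company = (uint16_t)(payload[0] | (payload[1] << 8));
            if (company == EXAMPLE_COMPANY_ID &&
                memcmp(&payload[2], example_magic,
                       sizeof(example_magic)) == 0) {
                return true;
            }
        }
        i += 1u + field_len; /* jump to the next AD structure */
    }
    return false;
}
```

A filter like this is stateless: it decides per advertising report and keeps no memory of which device disconnected first.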

    There is no mechanism implemented that blocks devices from re-connecting in any way. This was also verified in the described test (see test points 2, 3, 4). Under normal conditions (all devices in range) everything works as expected.


    Are you able to share a sniffer trace using a dedicated sniffer or the nRF Sniffer so we can see what messages are being sent over the air when this issue occurs?

    Unfortunately, we don't own a BLE sniffer. Does nRF Sniffer support Coded PHY? I need to check it and will then provide you with the logged data. That will take me some time, so expect an answer at the beginning of next week.


    Here are also our sdk_config and LL BLE driver code, in case you spot any obvious problems:

    ble_c.c

    ble_c.h

    4477.sdk_config.h

    I hope the given information is sufficient to start investigating the problem. As we do not have any clue how to solve/mitigate it, we would be very glad to get any ideas/recommendations on where to start.

    Thank you for all the help!

    BR, Žiga

  • Ziga Miklosic said:
    The central device uses a whitelist, but a custom-implemented one. It simply checks the manufacturer data inside a connectable advertisement packet, and if that data matches, it tries to connect to that peripheral (see the attached code in "ble_c.c", line 1175: ble_c_evt_on_adv_report).

    Okay, and there is the same manufacturer data on all devices, so it shouldn't matter what manufacturer data the advertising devices transmit? Since I don't know how this custom whitelist is implemented my thoughts go to the central scanning for data that the first disconnected device (A) is advertising, and thus doesn't really "look for" the device B. Any chance you can upload the file where the whitelist implementation is done so we can take a look at how you've implemented this?

    As of version 4.0.0, the nRF Sniffer supports Coded PHY, so that should be fine.

    Best regards,

    Simon

  • Sorry for the late response, I was out of office on Friday!

    Answers

    Okay, and there is the same manufacturer data on all devices, so it shouldn't matter what manufacturer data the advertising devices transmit? Since I don't know how this custom whitelist is implemented my thoughts go to the central scanning for data that the first disconnected device (A) is advertising, and thus doesn't really "look for" the device B.

    Yes and no. All peripheral devices advertise manufacturer data, that is true. But the content of the manufacturer data does matter, as a connection is initiated only when it contains valid values. Please look at the attached code in "ble_c.c", line 1175: ble_c_evt_on_adv_report


    Any chance you can upload the file where the whitelist implementation is done so we can take a look at how you've implemented this?

    I have a feeling that you read my post/answers superficially, as I already uploaded the code and pointed out where to look for the whitelist implementation:

    The central device uses a whitelist, but a custom-implemented one. It simply checks the manufacturer data inside a connectable advertisement packet, and if that data matches, it tries to connect to that peripheral (see the attached code in "ble_c.c", line 1175: ble_c_evt_on_adv_report).


    As of version 4.0.0, the nRF Sniffer supports Coded PHY, so that should be fine.

    OK, great! Then I will take the time to repeat the experiment and prepare the sniffer capture files. This might take me some time. In the meantime I would be very glad if we could continue to search for the obvious reason.

    Thank you again for all the help. We really appreciate your help!

    BR, Žiga

  • Hello,

    Please find the sniffed BLE communication and the test description (README.txt) in the attachment:

    ========================================================================
    File: OutOfRange_test_1__29_05_2023.pcapng
    Date: 29.05.2023, 09:52
    
    Test description:
    	Test was done with 1 Central and 2 Peripheral devices (DevA & B). 
    	
    	Test steps:
    	 1. Start all three devices. Both DevA & DevB get connected.
    	 2. Remove battery from DeviceA -> gets disconnected
    	 3. Remove battery from DeviceB -> gets disconnected
    	 4. Put battery back into DeviceB -> device re-connects - OK, AS EXPECTED
    	 5. Put battery back into DeviceA -> device re-connects - OK, AS EXPECTED
    	 6. Move DeviceA out of range -> device disconnects - OK, AS EXPECTED
    	 7. Remove battery from DeviceB -> gets disconnected - OK, AS EXPECTED
    	 8. Put battery back into DeviceB -> device DOES NOT re-connect - FAULTY BEHAVIOUR!
    	
    	
    MAC Addresses:
        Central: 		E2:1E:EA:19:77:C2
        Peripheral DevA: 	DD:5F:43:78:58:0E
        Peripheral DevB: 	CE:AA:75:A4:1C:F1
    
    Interesting packets:
     - No. 3392: Connection request from Central Device
     - No. 3394: First connection event
     - No. 4049: Removed battery from Device A and put it back in
     - No. 4066: Device A starts to advertise
     - No. 4096: Connection re-established (between Central and DevA)
     - No. 4259: Connection to DevA drops due to out of range
     
    
    
    ========================================================================
    
    
    
    

    OutOfRange_test_1__29_05_2023.pcapng 

    I also added packet comments to the .pcapng file to make the major events easier to track.

    I hope this will speed up solving the out of range problem! 

    BR, Žiga
