Central device claims to have Connected but no Connection event is fired on the peripheral

I'm developing a multirole device to connect to a peripheral device that we developed earlier.

When acting as a central, the multirole device scans for the peripheral and, if found, connects, secures the connection, and bonds. After several successful connections, the central reports that it has connected to the peripheral, but it disconnects before it can secure the connection or discover the services.

I've enabled the logger on both devices and added logs to see the progress of the connection and the events.

I've seen that when the problem happens, the central shows BLE_GAP_EVT_CONNECTED but the peripheral doesn't show anything. It's as if the central device is firing a false BLE_GAP_EVT_CONNECTED event.

I've tried with scan_params.extended both on and off, and the behavior is the same.

I've confirmed that connecting from another central works fine (using a phone with the nRF Connect application): the peripheral shows the events as expected.

Both devices are using the nRF52832.

The peripheral is using SDK 15.0.0 and SoftDevice S132 6.0.0.

The device acting as a central is using SDK 16.0.0 and SoftDevice S132 7.0.1.

  • You don't mention how frequently this occurs, but it is possible that the connection request packet from the central to the peripheral is occasionally lost (e.g. due to interference). In that case the central will first get a BLE_GAP_EVT_CONNECTED event, then shortly after a BLE_GAP_EVT_DISCONNECTED event with disconnect reason BLE_HCI_CONN_FAILED_TO_BE_ESTABLISHED.
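
    A quick way to confirm this on the central is to log the disconnect reason in the BLE event handler. A minimal sketch, assuming an SDK-style observer (ble_evt_handler is the handler name typically registered with NRF_SDH_BLE_OBSERVER in the examples):

        #include "ble.h"
        #include "ble_hci.h"
        #include "nrf_log.h"

        // Sketch only: log the reason for every disconnect so a lost connection
        // request shows up as BLE_HCI_CONN_FAILED_TO_BE_ESTABLISHED.
        static void ble_evt_handler(ble_evt_t const * p_ble_evt, void * p_context)
        {
            switch (p_ble_evt->header.evt_id)
            {
                case BLE_GAP_EVT_CONNECTED:
                    NRF_LOG_INFO("Connected, conn_handle 0x%04x",
                                 p_ble_evt->evt.gap_evt.conn_handle);
                    break;

                case BLE_GAP_EVT_DISCONNECTED:
                {
                    uint8_t reason = p_ble_evt->evt.gap_evt.params.disconnected.reason;
                    if (reason == BLE_HCI_CONN_FAILED_TO_BE_ESTABLISHED)
                    {
                        NRF_LOG_WARNING("Connection failed to be established (0x%02x)", reason);
                    }
                    else
                    {
                        NRF_LOG_INFO("Disconnected, reason 0x%02x", reason);
                    }
                } break;

                default:
                    break;
            }
        }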

    Best regards,
    Kenneth

  • Hello Kenneth, as I said before, the connection works fine for a while: I can turn off, turn on, disconnect, and reconnect several times, and then suddenly it stops working and there are no more connections to that particular peripheral. I don't believe it's related to interference; both devices are close, and it's not as if it sometimes connects and sometimes fails. Once the issue happens, it never connects again. However, without moving anything, just deleting bonds or doing a mass erase makes it start working again.

    I'm going to try to sniff the packets to check whether there's something wrong that I'm not seeing in the logs; however, I need to get an nRF52 DK or the dongle to do it.

    Summarizing:

    1. After a fresh mass erase or deleting the bonds, everything works fine: the central always finds the peripheral, connects, bonds, and reconnects if disconnected; you can power off and on several times and everything keeps working.

    2. Suddenly the peripheral doesn't connect anymore. The central can still see the advertising packets from the peripheral, but as soon as it tries to connect, BLE_GAP_EVT_CONNECTED is received on the central while the peripheral doesn't receive anything; shortly after, the central disconnects.

    3. The peripheral is connectable through another central

    4. Scanning for and connecting to another peripheral works fine (creating and using a new bond and a new peer entry).

    5. Deleting all bonds on the central and allowing the peripheral to refresh the bond information fixes the connection issues (see the sketch at the end of this post).

    6. Repeat from step 1

    I haven't counted how many times the connection works before the issue appears, but I'm sure it's at least 10 times, and probably a lot more.
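
    For reference, the bond deletion mentioned in step 5 goes through the Peer Manager. A minimal sketch, assuming the application already registers a Peer Manager event handler (the function names below are the ones used in the SDK examples):

        #include "app_error.h"
        #include "peer_manager.h"
        #include "nrf_log.h"

        // Sketch only: ask the Peer Manager to delete all bonds; the operation is
        // asynchronous and completes with PM_EVT_PEERS_DELETE_SUCCEEDED.
        static void delete_bonds(void)
        {
            ret_code_t err_code = pm_peers_delete();
            APP_ERROR_CHECK(err_code);
        }

        static void pm_evt_handler(pm_evt_t const * p_evt)
        {
            switch (p_evt->evt_id)
            {
                case PM_EVT_PEERS_DELETE_SUCCEEDED:
                    NRF_LOG_INFO("All bonds deleted.");
                    break;

                default:
                    break;
            }
        }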

  • nvelozsavino said:
    I've noticed that the problem happens when the central has more than one device in the whitelist,

    Do you have sniffer logs of both a failing and a working case? The next step would be to look at the content of the connection request packet from the central. The peripheral does not have any knowledge of the whitelist that may be present on the central, so this is indeed very odd.

    Kenneth

  • Here's the trace from when the connection is successful; the packet ID of the connection request is 11199.

    connection_successful.pcapng

  • But these are not the same devices?

    It seems that both the central and peripheral GAP addresses are different between the working and failing traces?

    Are you changing the GAP addresses of peers here?

    Kenneth

  • Hi Kenneth, no, it's not the same device. I have several peripherals and 4 central devices, and it doesn't seem to be related to specific hardware; I've seen the issue in all combinations.

    However, I'll try to reproduce both a successful connection and the issue with the same pair of devices.

  • Hi Kenneth, today I got the error again. I'm attaching the sniffer traces from when it fails.

    Within the file connection_unsuccessful_2.pcapng:

    At line 854 you can see the central trying to connect and the peripheral not responding; both are using their whitelists.

    At line 2012 there's another connection attempt, but this time both devices have their whitelists disabled; here I can see that the peripheral is disconnecting, at line 2020.

    After that, you will find lots of attempts with the whitelist enabled. Then I did a factory reset (mass erase) on the peripheral, and starting at line 4163 you can see the connection attempt and the peripheral disconnecting.

    The problem is solved when I perform a mass erase on the central, as you can see in the successful file connection_successful_3.pcapng, at line 981.

  • I assume that the central is f5:c6:75:d8:88:50 and the peripheral is e5:fa:95:ec:dc:61.

    From both logs I can see that the central sends several scan requests, but none of them are answered with a scan response from the peripheral. All the scan requests from this specific central have an invalid CRC according to the sniffer logs, which leads me to believe you may have marginal hardware and/or some noise in the system. Are there any serial interfaces and/or GPIOs active with high drive strength that may impact the modulation of the scan request packets on the central? Have you done any DTM testing on your central hardware to verify modulation characteristics etc.? Can you use an nRF52-DK for comparison?

    I can also see that other central devices (e.g. 74:33:b4:05:18:b9 and 76:d6:47:c6:8e:94) seem to be able to send scan request packets successfully (some errors, but not all) and receive scan responses from the peripheral during this time. Have you tried to connect from any of these centrals for comparison?

    The connection request packets don't have any CRC errors in the logs, so this leads me to believe that the whitelist on the peripheral may be wrong and/or there is some noise in the system, but of course it is difficult to know. Can you set a breakpoint in sd_ble_gap_whitelist_set() just to check that the address pointers contain the central address (f5:c6:75:d8:88:50)?
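
    If a breakpoint is inconvenient, the same check could be done from the log right before the SoftDevice call. A minimal sketch, where whitelist_set_logged is a hypothetical wrapper (when the Peer Manager is used the call is normally made internally via pm_whitelist_set, so the breakpoint may still be the simpler option):

        #include "ble_gap.h"
        #include "nrf_log.h"

        // Sketch only: log every whitelist address handed to the SoftDevice so it
        // can be compared against the central address seen in the sniffer trace.
        static uint32_t whitelist_set_logged(ble_gap_addr_t const * const * pp_addrs,
                                             uint8_t len)
        {
            for (uint8_t i = 0; i < len; i++)
            {
                NRF_LOG_INFO("Whitelist[%d]: %02x:%02x:%02x:%02x:%02x:%02x",
                             i,
                             pp_addrs[i]->addr[5], pp_addrs[i]->addr[4],
                             pp_addrs[i]->addr[3], pp_addrs[i]->addr[2],
                             pp_addrs[i]->addr[1], pp_addrs[i]->addr[0]);
            }
            return sd_ble_gap_whitelist_set(pp_addrs, len);
        }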

    Have you tried to call ble_advertising_restart_without_whitelist() just to check?
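
    For reference, a minimal sketch of that call on the peripheral, assuming the advertising instance was created with BLE_ADVERTISING_DEF(m_advertising):

        #include "app_error.h"
        #include "ble_advertising.h"

        // Sketch only: restart advertising with the whitelist temporarily disabled.
        // BLE_ADVERTISING_DEF(m_advertising) is assumed to exist in the application.
        static void advertising_restart_no_whitelist(void)
        {
            ret_code_t err_code = ble_advertising_restart_without_whitelist(&m_advertising);
            if (err_code != NRF_ERROR_INVALID_STATE)
            {
                APP_ERROR_CHECK(err_code);
            }
        }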

    I still find it slightly inconclusive whether the issue is on the peripheral or the central side.

    Best regards,
    Kenneth

  • Hi Kenneth, I believe the problem is definitely the central, and I strongly doubt it has anything to do with noise. If that were the case, the probability of a successful connection should be higher than zero after the problem manifests for the first time. What I've seen is that once the problem occurs, the central device never reconnects, while other centrals can connect successfully to the same peripheral.

    I've tested deleting all bonds on the central without success; the only way to make it work again without re-flashing the MCU is to erase all the data in flash (only the data stored by the peer manager and the app, not the actual running program).

    My guess is that somehow the flash gets corrupted, something related to the peer information saved by the peer manager. The central device goes to System OFF if there's no activity registered by an accelerometer; it might be possible that the device shuts down while a flash operation is in progress and that this causes the corruption, maybe between invalidating a record and adding the new updated one.

    A while ago we had another issue with different hardware, also a central device: after about 100 connections to different peripherals (allowing only 1 at a time in the whitelist), the central stopped connecting and an erase of the flash was required. We haven't had any new reports from users about this problem, but I don't know if that's because we added an option to factory reset the central, which solves the issue. Our guess at the time was that the flash filled up and there was a problem with the garbage collector that damaged the peer information. Back then we were using a different SDK and SoftDevice version (S132 v5) and different hardware, but now that I remember, it was the same behavior.

  • nvelozsavino said:
    I don't know if it could be that the central device goes to System OFF if there's no activity registered by an accelerometer, It might be possible that the device shuts down while a flash operation is in progress and this could cause the corruption, maybe between invalidating a file before adding the new updated one.

    Do you have any logic that makes sure nrf_fstorage_is_busy(NULL) does not return true before you go to sleep?
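
    A minimal sketch of such a check before entering System OFF could look like this (the surrounding function name is illustrative; nrf_fstorage_is_busy and the SoftDevice calls are from the SDK/SoftDevice API):

        #include "nrf_fstorage.h"
        #include "nrf_soc.h"

        // Sketch only: wait for all fstorage instances to finish their pending flash
        // operations before powering off, so a write or erase is never cut short.
        static void enter_system_off(void)
        {
            while (nrf_fstorage_is_busy(NULL))
            {
                // Sleep until the SoftDevice signals an event (e.g. flash completion).
                (void) sd_app_evt_wait();
            }

            (void) sd_power_system_off();
        }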

    nvelozsavino said:
    My guess is that somehow the flash gets corrupted, something related to the info of the peer saved by the peer manager.

    The problem with that theory is that the devices would still connect if it were the case; only encryption would fail. I am thinking that the real problem here is that the two devices are not bonding at all. Do you have an on-air sniffer log that shows bonding between the two peers before the re-connection fails?

  • Hi Kenneth,

    No, I don't have any logic to make sure the fstorage is not busy before going to sleep. I'll try to implement something like that to see if it helps.

    I've been noticing that the issue is happening more often now (the battery voltage is lower). I'm not sure if this is important, but we are not using the DC-DC regulator. Is there a minimum voltage needed to ensure that the flash writes properly?

    If the problem is that the devices are not bonding, then 2 questions arise:

    1. How do they lose the bonding information? Both the central and the peripheral, or only the central? Correct me if I'm wrong, but this could happen if the flash gets corrupted, right?

    2. The bonding information should be replaced when I delete the bonds on the central and re-enable the whitelist on both devices; however, just deleting the bonds doesn't work, and you have to do a flash erase for it to work again. As far as I understand, deleting bonds just invalidates the records stored by the peer manager in fstorage. Again, my hypothesis still holds: if the flash is corrupted somehow, invalidating the records containing the bonds doesn't seem to work.
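
    One way to test this hypothesis might be to dump the FDS statistics when the issue appears, to see whether the bond records were really invalidated and whether garbage collection can reclaim them. A minimal sketch, assuming FDS is already initialized by the Peer Manager:

        #include "app_error.h"
        #include "fds.h"
        #include "nrf_log.h"

        // Sketch only: log FDS statistics and trigger garbage collection. fds_gc()
        // is asynchronous and completes with an FDS_EVT_GC event.
        static void fds_diagnostics(void)
        {
            fds_stat_t stat = {0};
            ret_code_t err_code = fds_stat(&stat);
            APP_ERROR_CHECK(err_code);

            NRF_LOG_INFO("FDS: valid records %d, dirty records %d, freeable words %d",
                         stat.valid_records, stat.dirty_records, stat.freeable_words);

            err_code = fds_gc();
            APP_ERROR_CHECK(err_code);
        }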

  • nvelozsavino said:
    I've been noticing that the issue is happening more often now (the battery voltage is lower), I'm not sure if this could be important but we are not using the DC-DC. Is there a minimum voltage to ensure that the flash writes properly?

    Yes, the flash can be read, written and erased over the entire supply voltage range. Though you may want to look into the stability of VDD if you think it is fluctuating a lot, because very large ripples (or a VDD step-up) can cause the power-on reset to trigger, which will prevent any ongoing flash operation or queue from finishing.

    You should also check whether you have any asserts or hardfaults, since those will typically also do a soft reset of the chip (the default behavior of the error handlers), which will likewise prevent any ongoing flash operation or queue from finishing. For instance, a flash erase operation can block CPU execution for ~90 ms, which may cause multiple interrupts to be pending for a long period of time.
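
    One way to catch this during development could be to override the SDK's weak fault handler so the chip halts and logs instead of silently resetting. A minimal sketch (the default weak implementation lives in app_error_weak.c):

        #include "app_error.h"
        #include "nrf_log.h"
        #include "nrf_log_ctrl.h"

        // Sketch only: log the fault and halt instead of doing the default soft
        // reset, so asserts and hardfaults become visible during debugging.
        void app_error_fault_handler(uint32_t id, uint32_t pc, uint32_t info)
        {
            NRF_LOG_ERROR("Fatal error: id 0x%08x, pc 0x%08x, info 0x%08x", id, pc, info);
            NRF_LOG_FINAL_FLUSH();

            for (;;)
            {
                // Halt here so the fault can be inspected with a debugger.
            }
        }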

    Have you tried to dump the entire content of the flash before and after it fails? Then you can, for instance, look at the flash page tags and record layout:
    https://infocenter.nordicsemi.com/topic/com.nordic.infocenter.sdk5.v15.3.0/lib_fds_format.html#lib_fds_format_page

    Please follow up on the question: "Do you have an on-air sniffer log that shows bonding between the two peers before the re-connection fails?"
