Connections drop and don't come back

This continues my previous thread, "Connections drop and don't come back (and "SoftDevice Controller ASSERT: 53, 296")" (Case ID: 341823).

The project is a star network in which the Central is housed in a steel enclosure, with a patch antenna perhaps 1 cm from the steel surface. The network currently runs on Coded PHY. The Central scans continuously for Peripherals. When it detects one with the required device name and service UUID, it stops scanning, establishes the connection, and reads several characteristics containing metadata and control information; it then turns scanning back on. Thereafter, the Peripheral reports its environmental readings via Notify every 60 seconds. If a connection is lost for any reason, the Peripheral resumes advertising and the connection process repeats.
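The scan/connect cycle described above can be sketched as a small state machine. This is only an illustration that runs on a host PC: the state and event names are hypothetical and do not come from the Zephyr API, and my actual code is structured around callbacks rather than an explicit table like this. The point it shows is that every exit from the connect and setup steps, including the failure paths, should lead back to scanning.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; none of these come from the Zephyr API. */
enum central_state {
    ST_SCANNING,   /* scanning on, waiting for a matching advertiser */
    ST_CONNECTING, /* scanning off, connection being created */
    ST_SETUP       /* scanning off, discovery and characteristic reads running */
};

enum central_event {
    EV_MATCH_FOUND,    /* advertiser with the required name and service UUID */
    EV_CONNECTED,
    EV_CONNECT_FAILED,
    EV_SETUP_DONE,     /* metadata and control characteristics were read */
    EV_SETUP_FAILED    /* discovery or read error; the link is dropped */
};

/* Return the next state for a given state and event. */
static enum central_state central_step(enum central_state state,
                                       enum central_event ev)
{
    switch (state) {
    case ST_SCANNING:
        return (ev == EV_MATCH_FOUND) ? ST_CONNECTING : ST_SCANNING;
    case ST_CONNECTING:
        if (ev == EV_CONNECTED) {
            return ST_SETUP;
        }
        return (ev == EV_CONNECT_FAILED) ? ST_SCANNING : ST_CONNECTING;
    case ST_SETUP:
        /* Success and failure both end with scanning turned back on. */
        if (ev == EV_SETUP_DONE || ev == EV_SETUP_FAILED) {
            return ST_SCANNING;
        }
        return ST_SETUP;
    }
    return ST_SCANNING; /* unknown state: fall back to scanning */
}

/* Scanning is only active while waiting for a matching advertiser. */
static bool scanning_enabled(enum central_state state)
{
    return state == ST_SCANNING;
}
```

If any failure path left the Central in a state where scanning stays off, the symptom would look like Peripherals that never return, so this is one of the things worth checking in the real callbacks.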

When the Central is facing the Peripherals, everything works great. Occasionally a connection drops, and after about 20 seconds the Peripheral comes back online. However, when the enclosure is flipped over so that the steel sits between the antenna and the Peripherals, the reported RSSI is worse by 5 to 10 dB and the rate at which the Peripherals lose their connections increases significantly. This is not surprising. However, hours to days later, some unknown condition prevents the Central from seeing the Peripherals at all: they drop and don't come back. A DK running the Central software right next to the box (chip antenna, not shadowed by steel) can see the Peripherals advertising.

The action recommended from the previous thread was to update the NCS. I did so, migrating from 2.6.0 to 2.9.1. Under 2.6.0, I would also see the occasional SoftDevice assertion. Under 2.9.1, I don't see the assertion anymore. Apparently, the assertion was not the root cause of the "drop and don't come back" phenomenon.

The other error conditions previously observed are still present:

  • Failure of Discovery to obtain a characteristic handle
  • Failure of a characteristic Read operation
  • Failure reported in a callback following bt_conn_le_create() (error code 2; BT_HCI_ERR_UNKNOWN_CONN_ID)
  • Failure reported in a callback following bt_gatt_read() (error code 14; BT_ATT_ERR_UNLIKELY)

These errors keep being reported until the "drop and don't come back" phenomenon happens. When one of them occurs while a new connection is being set up, the connection is dropped and allowed to re-establish through the process described above.
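As a small aid when reading the logs, note that the two numeric codes above live in different error spaces: going by the names, code 2 from the connection callback is an HCI status, while code 14 from the read callback is an ATT error. A minimal sketch of a lookup that keeps the two apart follows. The `SKETCH_` constants, `enum err_source`, and `describe_error()` are hypothetical names made up for this example; real code would use the Zephyr definitions instead of local copies.

```c
#include <stdint.h>

/* Local copies of the two values quoted above (2 and 14). The SKETCH_ prefix
 * marks them as illustrative; real code would use the Zephyr definitions. */
#define SKETCH_HCI_ERR_UNKNOWN_CONN_ID 0x02
#define SKETCH_ATT_ERR_UNLIKELY        0x0e

/* Which callback reported the code (hypothetical enum for this sketch). */
enum err_source {
    ERR_FROM_CONNECTED_CB, /* callback following bt_conn_le_create() */
    ERR_FROM_GATT_READ_CB  /* callback following bt_gatt_read() */
};

/* Map a (source, code) pair to a short description for the console log. */
static const char *describe_error(enum err_source src, uint8_t code)
{
    if (src == ERR_FROM_CONNECTED_CB &&
        code == SKETCH_HCI_ERR_UNKNOWN_CONN_ID) {
        return "HCI: unknown connection identifier";
    }
    if (src == ERR_FROM_GATT_READ_CB && code == SKETCH_ATT_ERR_UNLIKELY) {
        return "ATT: unlikely error";
    }
    return "unrecognized";
}
```

Logging the source together with the code avoids mistaking an ATT error for an HCI status with the same number.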

  • Some more information:

    I have been capturing the error codes that the Central generates internally as it runs, and noticed a pattern just before the "drop and don't come back" event occurs. I've got a Discovery process that looks for the UUID of my custom service, then the characteristics, and then the CCCD attribute; if something prevents the service discovery step from completing, the connection is dropped. This is the second-to-last error reported. Due to a logic error in my code, the connection is being dropped _twice_. After the first drop, the bt_conn pointer is unreferenced and set to NULL. Then the second call to bt_conn_disconnect() occurs, with that NULL pointer.

    I'm looking at the bt_conn_disconnect() function in zephyr/subsys/bluetooth/host/conn.c. It doesn't appear to check conn for NULL before using it, so I shouldn't be surprised that passing a NULL pointer would cause a fault. This could be killing the entire Bluetooth thread, in which case it isn't just advertising that stops working but all of the BLE functionality. That would also explain why the interrupt-driven console part of my program continues to run.

    I can believe that a poor-quality RF scenario could cause an incident during Discovery; this is why I never see it except when the case is flipped over, and only rarely even then.

    I've fixed my logic bug and am rerunning the test to see if the "drop and don't come back" problem recurs.  If it does, I will attempt to enable full logging to console and try to figure out how to capture it in this embedded application.
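For anyone hitting the same thing, here is a minimal sketch of the kind of guard that prevents the double drop. It is not my actual code and it runs on a host PC: `struct fake_conn`, `fake_disconnect()`, and `fake_unref()` are hypothetical stand-ins for the Zephyr connection object, `bt_conn_disconnect()`, and `bt_conn_unref()`, and `drop_connection()` is a made-up helper name. The idea is simply to clear the stored pointer before acting on it, so that a second call finds NULL and does nothing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the Zephyr connection object (host-side mock). */
struct fake_conn {
    int refs;             /* outstanding references */
    int disconnect_calls; /* how many times a disconnect was requested */
};

/* Mock of the disconnect request; the real call would be bt_conn_disconnect(). */
static int fake_disconnect(struct fake_conn *conn)
{
    conn->disconnect_calls++;
    return 0;
}

/* Mock of releasing a reference; the real call would be bt_conn_unref(). */
static void fake_unref(struct fake_conn *conn)
{
    conn->refs--;
}

/* Drop the link at most once. Returns true only if a disconnect was issued. */
static bool drop_connection(struct fake_conn **slot)
{
    if (slot == NULL || *slot == NULL) {
        return false; /* already dropped: never pass NULL further down */
    }

    struct fake_conn *conn = *slot;

    *slot = NULL; /* clear first so a second caller sees NULL */
    (void)fake_disconnect(conn);
    fake_unref(conn);
    return true;
}
```

With this shape, an error path that tries to drop the same connection a second time becomes a no-op instead of a NULL pointer being handed to the stack.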
