Central resets upon BLE Read following forced disconnect from Peripheral

This is a long shot, but I'll ask; maybe you all have seen something like that or can give me some debug/troubleshoot tips.

NCS v2.6.0

Project is a Central node that can connect to up to 20 Peripheral nodes.  Central has a NVS partition for pair/bond info and a separate NVS partition for NVS storage (I've already gotten lots of help from you all to set up separate NVS partitions under Partition Manager, and it's working).  Central has a command/response UART interface (essentially a shell), and I can printf info to the terminal as well as issue commands.  Peripherals have their UARTs enabled so I can printf info from them, too.  I'm using nRF52840DKs for development purposes.

User sends a command to Central to register a Peripheral; Central stores the info (essentially the advertising name) in NVS.  If it sees a device advertising with this info, it establishes a connection and performs Discovery to verify the Peripheral has the Custom Service and to obtain handles to the characteristics.  It then calls bt_conn_set_security(conn, BT_SECURITY_L2) to pair/bond with the Peripheral.  After this, it reads some characteristics from the Peripheral with bt_gatt_read(), passing a callback function in the bt_gatt_read_params argument.

So far, so good.  Works fine.

The user can also send a command to Central to unregister a Peripheral.  Central removes the info from NVS, calls bt_unpair() to remove the pair/bond info from NVS, and then calls bt_conn_disconnect(conn, BT_HCI_ERR_REMOTE_USER_TERM_CONN) to disconnect.  Peripheral has a disconnected() callback; if reason == BT_HCI_ERR_REMOTE_USER_TERM_CONN, it also calls bt_unpair() to wipe the pair/bond info from its NVS.

So far, so good.  A Peripheral can be added and connects and reads successfully.  It can be removed, and goes back to advertising.  It can then be added again, and it connects, discovers, pair/bonds, and reads.  Just fine.

The system is intended to handle inadvertent disconnects.  The info on a Peripheral, including its connection pointer, the characteristic handles, etc., is an object essentially in an array.  If a Peripheral goes out of range, or loses power, the Central detects the loss of connection, and calls bt_conn_unref() on the Peripheral's connection pointer before setting it to NULL to indicate an unused connection entry.  It does the same thing when unregistering a Peripheral.
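
To make the slot handling described above concrete, here is a minimal host-side sketch of such a table.  It is not the project's code: struct fake_conn, fake_conn_ref() and fake_conn_unref() are stand-ins for Zephyr's struct bt_conn, bt_conn_ref() and bt_conn_unref() so that it compiles without the SDK, and the slot_* names and the char_handle field are hypothetical.  The point is only that each slot releases its reference exactly once, and that a pointer can be checked against the slot before it is used for a read.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Zephyr's opaque struct bt_conn (assumption: only the
 * reference count is modeled here). */
struct fake_conn {
    int refs;
};

/* Stand-ins for bt_conn_ref() / bt_conn_unref(). */
static struct fake_conn *fake_conn_ref(struct fake_conn *c)
{
    c->refs++;
    return c;
}

static void fake_conn_unref(struct fake_conn *c)
{
    c->refs--;
}

#define MAX_PERIPHERALS 20

struct periph_slot {
    struct fake_conn *conn;      /* NULL marks an unused entry */
    unsigned short char_handle;  /* hypothetical cached characteristic handle */
};

static struct periph_slot slots[MAX_PERIPHERALS];

/* On connect: take our own reference and store it in the first free slot.
 * Returns the slot index, or -1 if the table is full. */
int slot_on_connected(struct fake_conn *conn)
{
    for (int i = 0; i < MAX_PERIPHERALS; i++) {
        if (slots[i].conn == NULL) {
            slots[i].conn = fake_conn_ref(conn);
            slots[i].char_handle = 0; /* handles must be rediscovered */
            return i;
        }
    }
    return -1;
}

/* On disconnect: drop the reference exactly once and clear cached state.
 * Returns the slot index, or -1 if no slot holds this connection (so a
 * second call cannot unref again). */
int slot_on_disconnected(struct fake_conn *conn)
{
    for (int i = 0; i < MAX_PERIPHERALS; i++) {
        if (slots[i].conn == conn && conn != NULL) {
            fake_conn_unref(slots[i].conn);
            slots[i].conn = NULL;
            slots[i].char_handle = 0;
            return i;
        }
    }
    return -1;
}

/* Before issuing a read: confirm the slot still holds this exact pointer. */
int slot_is_current(int idx, const struct fake_conn *conn)
{
    assert(idx >= 0 && idx < MAX_PERIPHERALS);
    return slots[idx].conn != NULL && slots[idx].conn == conn;
}
```

In the real application, whether an extra bt_conn_ref() is needed depends on how the connection was created and which reference the application already owns, so the counts above are illustrative only.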

The problem is, when a Peripheral that was connected and working fine is inadvertently disconnected - the DK board is reset - most of the logic works.  Central detects the dropped connection, and unrefs and NULLs the connection pointer.  It then detects that Peripheral advertising again, finds the Peripheral's info in NVS, and proceeds along the sequence: It establishes the connection.  It discovers the characteristics and gets the handles.  It elevates the security level according to the keys that it and the Peripheral have stored in their NVS settings.  *It's talking to the Peripheral*.

Then it attempts to read a characteristic.  A printf after the call to bt_gatt_read() indicates it got this far, and that bt_gatt_read() does not return an error.  The Central resets.  A printf in the callback function is never called, indicating that the reset occurs before the callback could be called.  A printf in the Peripheral's read handler, registered with BT_GATT_SERVICE_DEFINE and BT_GATT_CHARACTERISTIC, is never called, indicating that the reset occurs before the Read operation actually begins (and so the Peripheral is above suspicion... or is it?).

Needless to say, after the Central resets (and the Peripheral drops the connection again), it finds the Peripheral advertising and successfully connects to it.  The reset is (unsurprisingly) resetting whatever condition caused the reset.

I've done some troubleshooting already, such as printing out the connection info just before the bt_gatt_read() call.  It is the same, whether from a successful connection/communication run or a run aborted by a reset:

periph_ready_task: error code 0, conn 0x20002670, bt_gatt_read returns 0
conn info:
  type = 1
  role = 0
  id = 0
  state = 2
  sec level = 2
  sec keysize = 16
  sec flags = 1
  dst = CC:70:8F:77:49:97 (random)
  src = EB:42:8F:DF:B1:A6 (random)
  local = EB:42:8F:DF:B1:A6 (random)
  remote = CC:70:8F:77:49:97 (random)
  interval = 320
  latency = 0
  timeout = 200

It appears to me that something in the Zephyr/Nordic BLE subsystem is stuck with bad info after the dropped connection and it's not being handled.  I'd appreciate some advice on how to proceed from here.

  • Hi David,

    From your description, I think the issue is more likely with the handling of the bt_conn struct.

    Have you checked that the logic that handles the removal of the old peripheral, and the re-addition of that peripheral, is working correctly?

    I think it might be worth it to check the address in memory of the bt_conn struct at various points to be sure that the correct one is in use. Definitely check at the connected callback and right before the bt_gatt_read() call.

    "Then it attempts to read a characteristic.  A printf after the call to bt_gatt_read() indicates it got this far, and that bt_gatt_read() does not return an error.  The Central resets.  A printf in the callback function is never called, indicating that the reset occurs before the callback could be called.  A printf in the Peripheral's read handler, registered with BT_GATT_SERVICE_DEFINE and BT_GATT_CHARACTERISTIC, is never called, indicating that the reset occurs before the Read operation actually begins (and so the Peripheral is above suspicion... or is it?)."

    Do you mean printk() here?

    printf() is not supported natively; depending on the implementation, its output could be deferred, so I cannot be sure it is a good indicator of code progression.

    The surest method, I think, is to use a debugger and set a breakpoint.

    Another thing that helps with debugging is to disable CONFIG_RESET_ON_FATAL_ERROR (set it to n) so that execution halts at the error rather than resetting.
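
    As a rough sketch, a prj.conf fragment along these lines should make a fatal error halt instead of rebooting, so the fault dump stays visible on the console. Which of these options your build already sets is an assumption on my part; the logging lines are optional extras and not something I know to be missing from your project.

    ```
    # Halt in the fatal error handler instead of rebooting (NCS option)
    CONFIG_RESET_ON_FATAL_ERROR=n

    # Enable kernel and subsystem assertions to catch misuse earlier
    CONFIG_ASSERT=y

    # Emit log messages synchronously so none are lost when the fault hits
    CONFIG_LOG=y
    CONFIG_LOG_MODE_IMMEDIATE=y
    ```

    With the reset disabled, the register dump and the faulting thread printed at the error are the first things to look at.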

    Hieu
