This is a long shot, but I'll ask; maybe you all have seen something like this or can give me some debug/troubleshooting tips.
NCS v2.6.0
The project is a Central node that can connect to up to 20 Peripheral nodes. Central has an NVS partition for pair/bond info and a separate NVS partition for NVS storage (I've already gotten lots of help from you all to set up separate NVS partitions under Partition Manager, and it's working). Central has a command/response UART interface (essentially a shell), and I can printf info to the terminal as well as issue commands. Peripherals have their UARTs enabled so I can printf info from them, too. I'm using nRF52840DKs for development purposes.
User sends a command to Central to register a Peripheral; Central stores the info (essentially the advertising name) in NVS. If it sees a device advertising with this info, it establishes a connection, performs Discovery to verify the Peripheral has the Custom Service and to obtain handles to the characteristics. It then calls bt_conn_set_security(conn, BT_SECURITY_L2) to pair/bond with the Peripheral. After this, it reads some characteristics from Peripheral with bt_gatt_read() and a callback function in the bt_gatt_read_params argument.
So far, so good. Works fine.
The user can also send a command to Central to unregister a Peripheral. Central removes the info from NVS, calls bt_unpair() to remove the pair/bond info from NVS, and then calls bt_conn_disconnect(conn, BT_HCI_ERR_REMOTE_USER_TERM_CONN) to disconnect. Peripheral has a disconnected() callback; if reason == BT_HCI_ERR_REMOTE_USER_TERM_CONN, it also calls bt_unpair() to wipe the pair/bond info from its NVS.
So far, so good. A Peripheral can be added and connects and reads successfully. It can be removed, and goes back to advertising. It can then be added again, and it connects, discovers, pair/bonds, and reads. Just fine.
The system is intended to handle inadvertent disconnects. The info on a Peripheral, including its connection pointer, the characteristic handles, etc., is an object essentially in an array. If a Peripheral goes out of range, or loses power, the Central detects the loss of connection, and calls bt_conn_unref() on the Peripheral's connection pointer before setting it to NULL to indicate an unused connection entry. This is the same thing it does when unregistering a Peripheral.
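To make the bookkeeping concrete, here is a minimal sketch of the slot handling described above. It is not my actual code and it does not use the Zephyr headers: struct mock_conn and mock_conn_unref() are stand-ins for struct bt_conn and bt_conn_unref(), and periph_slot, slot_find() and slot_release() are hypothetical names. Clearing the stored handle on release is an assumption added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PERIPHERALS 20

/* Stand-in for Zephyr's opaque struct bt_conn; only a refcount is modeled. */
struct mock_conn {
    int refs;
};

/* Stand-in for bt_conn_unref(): drops one reference. */
static void mock_conn_unref(struct mock_conn *conn)
{
    assert(conn->refs > 0);
    conn->refs--;
}

/* Hypothetical per-Peripheral record kept by the Central. */
struct periph_slot {
    struct mock_conn *conn; /* NULL marks an unused connection entry */
    uint16_t char_handle;   /* characteristic handle found by discovery */
};

static struct periph_slot slots[MAX_PERIPHERALS];

/* Find the slot that owns a given connection, or NULL if none does. */
static struct periph_slot *slot_find(const struct mock_conn *conn)
{
    if (conn == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < MAX_PERIPHERALS; i++) {
        if (slots[i].conn == conn) {
            return &slots[i];
        }
    }
    return NULL;
}

/* Called on disconnect or unregister: unref, then mark the slot unused.
 * Returns 0 on success, -1 if the connection is not tracked. */
static int slot_release(struct mock_conn *conn)
{
    struct periph_slot *slot = slot_find(conn);

    if (slot == NULL) {
        return -1;
    }
    mock_conn_unref(slot->conn);
    slot->conn = NULL;
    /* Assumption: drop the old handle so the next connection rediscovers it. */
    slot->char_handle = 0;
    return 0;
}
```

In the real firmware, slot_release() would correspond to what runs from the disconnected() callback, and a second release of the same connection returns -1 instead of unreffing twice.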
The problem is, when a Peripheral that was connected and working fine is inadvertently disconnected - the DK board is reset - most of the logic works. Central detects the dropped connection, and unrefs and NULLs the connection pointer. It then detects that Peripheral advertising again, finds the Peripheral's info in NVS, and proceeds along the sequence: It establishes the connection. It discovers the characteristics and gets the handles. It elevates the security level according to the keys that it and the Peripheral have stored in their NVS settings. *It's talking to the Peripheral*.
Then it attempts to read a characteristic. A printf after the call to bt_gatt_read() shows that it got this far and that bt_gatt_read() did not return an error. Then the Central resets. The printf in the callback function never prints, indicating that the reset occurs before the callback could be called. The printf in the Peripheral's read handler, registered with BT_GATT_SERVICE_DEFINE and BT_GATT_CHARACTERISTIC, never prints either, indicating that the reset occurs before the Read operation actually begins (and so the Peripheral is above suspicion... or is it?).
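For anyone following the sequence, here is a minimal mock of the read flow as I understand it: the read call only queues the request and keeps a pointer to the params, and the callback runs later when the response arrives. Nothing here is the Zephyr API. mock_gatt_read() and mock_stack_deliver() are stand-ins for bt_gatt_read() and the stack's response path, and read_params, on_read() and start_read() are hypothetical names. The sketch assumes the params struct has static storage so it is still valid when the callback fires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mock_read_params;

typedef void (*mock_read_cb)(int err, struct mock_read_params *params,
                             const void *data, uint16_t len);

/* Stand-in for struct bt_gatt_read_params: callback plus target handle. */
struct mock_read_params {
    mock_read_cb func;
    uint16_t handle;
};

/* The mock stack keeps a pointer to the caller's params, not a copy. */
static struct mock_read_params *pending;

/* Stand-in for bt_gatt_read(): 0 means "request accepted", not "read done". */
static int mock_gatt_read(struct mock_read_params *params)
{
    if (params == NULL || params->func == NULL || pending != NULL) {
        return -1;
    }
    pending = params;
    return 0;
}

/* Stand-in for the stack delivering the read response some time later. */
static void mock_stack_deliver(const void *data, uint16_t len)
{
    struct mock_read_params *p = pending;

    if (p == NULL) {
        return;
    }
    pending = NULL;
    p->func(0, p, data, len);
}

/* Static storage: still valid when the callback eventually runs. */
static struct mock_read_params read_params;
static int cb_calls;
static uint16_t cb_len;

static void on_read(int err, struct mock_read_params *params,
                    const void *data, uint16_t len)
{
    (void)err;
    (void)params;
    (void)data;
    cb_calls++;
    cb_len = len;
}

/* Hypothetical helper: fill in the params and queue the read. */
static int start_read(uint16_t handle)
{
    read_params.func = on_read;
    read_params.handle = handle;
    return mock_gatt_read(&read_params);
}
```

The point of the sketch is the gap between start_read() returning 0 and on_read() running; in my case the reset lands somewhere inside that gap.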
Needless to say, after the Central resets (and the Peripheral drops the connection again), it finds the Peripheral advertising and successfully connects to it. The reset is (unsurprisingly) resetting whatever condition caused the reset.
I've done some troubleshooting already, such as printing out the connection info just before the bt_gatt_read() call. It is the same, whether from a successful connection/communication run or a run aborted by a reset:
periph_ready_task: error code 0, conn 0x20002670, bt_gatt_read returns 0
conn info:
type = 1
role = 0
id = 0
state = 2
sec level = 2
sec keysize = 16
sec flags = 1
dst = CC:70:8F:77:49:97 (random)
src = EB:42:8F:DF:B1:A6 (random)
local = EB:42:8F:DF:B1:A6 (random)
remote = CC:70:8F:77:49:97 (random)
interval = 320
latency = 0
timeout = 200
It appears to me that something in the Zephyr/Nordic BLE subsystem is stuck with bad info after the dropped connection and it's not being handled. I'd appreciate some advice on how to proceed from here.