Hello,
I'm facing an issue dealing with the peer manager module on the NRF52840 chip using the SDK version 17 and softdevice 7.2.0.
The steps to reproduce the issue are detailed here :
1. A BLE connection and a bonding procedure has been done between 2 devices with numerical comparison method
2. A device acts as a GATT server while the other device is a GATT client.
3. The client subscribes to notifications on some characteristics writing 1 to all needed CCCD.
4. Then the bonding information is erased on the client, the LTK is lost and a disconnection occurs between the 2 devices. The server still keeps the old bonding information in the NVM.
5. The server, acting as peripheral, is then advertising and the server as a central is scanning and trying to connect to the peripheral.
6. Once the connection is established a PM_EVT_BONDED_PEER_CONNECTED event is immediately generated by the peer manager followed by a PM_EVT_LOCAL_DB_CACHE_APPLIED event meaning that the previous database configuration is provided to the softdevice thanks to the peer manager, the gatts_cache_manager module and finally the calls to the softdevice using the api sd_ble_gatts_sys_attr_set ... It also means that the CCCD configuration for all the services has been provided to the softdevice, so the server still considers that the notifications are enabled for the involved ble characteristics.
7. Then we receive a PM_EVT_CONN_SEC_START event followed by a PM_EVT_CONN_SEC_CONFIG_REQ because the peer device requires a new bonding procedure as he lost the LTK. We allow the re-bonding by calling the api pm_conn_sec_config_reply with the required parameters. We receive the event PM_EVT_CONN_SEC_SUCCEEDED, the bonding succeeded.
8. Here the problem becomes clear : the client has not yet discover all services nor enabled all CCCD for the needed characteristics. But in the meantime we call that API sd_ble_gatts_hvx to indicate something to the peer and the call succeeds ! ... But finally we don't receive the confirmation in due time because the peer considers that notications have not been enabled yet for that connection, which then leads to the generation of a BLE_GATTS_EVT_TIMEOUT event, and finally we need to terminate the connection.
You may say we could wait till the client enables again the notifications but in our software, we deeply rely on the NRF_ERROR_BUSY or NRF_ERROR_INVALID_STATE returned by the softdevice to enqueue our commands so that we can retry when either the softdevice is available or when the state is valid. But because the PM_EVT_LOCAL_DB_CACHE_APPLIED has been generated and the local obsolete database transmitted to the softdevice, no error is thrown by the softdevice leading to the mentionned GATTS timeout.
I joined here the RTT traces I captured with some comments that might help understand the flow of execution :
>>> Here the central has lost the LTK and tries to connect again to the peripheral
00> [12:00:03.240,234] <info> MODULE_BLE_TASK: peripheral is advertising
00> [12:00:04.027,343] <info> MODULE_BLE_TASK: Connected.
00> [12:00:04.027,343] <info> MODULE_BLE_TASK: pm_peer_count = 2
00> [12:00:04.029,296] <info> MODULE_BLE_TASK: Connected to previously bonded device
00> [12:00:04.029,296] <info> MODULE_BLE_TASK: PM_EVT_LOCAL_DB_CACHE_APPLIED >>> localdatabase transmitted to the softdevice
00> [12:00:04.029,296] <info> MODULE_BLE_TASK: pm_peer_data_bonding_load: 0x00000000
00> [12:00:04.029,296] <info> MODULE_BLE_TASK: BLE LTK: 0x071EE9A6317E85C92404231461AC348F
...
00> [12:00:05.574,218] <info> MODULE_BLE_TASK: PM_EVT_CONN_SEC_START
00> [12:00:05.574,218] <info> MODULE_BLE_TASK: PM_EVT_CONN_SEC_CONFIG_REQ >>> request for rebonding procedure
00> [12:00:05.605,468] <info> MODULE_BLE_TASK: Handling LESC service request.
00> [12:00:05.718,750] <info> MODULE_BLE_TASK: Passkey: 652846
00> [12:00:06.294,921] <info> MODULE_BLE_TASK: sd_ble_gap_auth_key_reply: [OK]
00> [12:00:06.714,843] <info> MODULE_BLE_TASK: Connection secured: role: 1, conn_handle: 0x0, procedure: 1.
00> [12:00:06.714,843] <info> MODULE_BLE_TASK: Security mode: 1, level: 4.
00> [12:00:06.714,843] <info> MODULE_BLE_TASK: New bond information stored.
...
00> [12:00:06.861,328] <info> MODULE_BLE_TASK: PM_EVT_PEER_DATA_UPDATE_SUCCEEDED
00> [12:00:06.888,671] <info> GATT_SERVICE_X_MODULE: SD_HVI call : - 0x00000000 >>> call to sd_ble_gatts_hvx which return 0 == No ERROR !
....
00> [12:00:07.750,000] <info> GATT_SERVICE_X_MODULE: HVI enabled by the client. >>> Here the client has subscribed to the notification but it is already too late
00> [10:09:29.500,000] <info> GATT_SERVICE_Y_MODULE: time update >>> the time has been updated with another service, thus the RTT log timestamps are modified
>>> The GATT TIMEOUT event is 30 seconds after the first command is sent. We then need to advertise again.
00> [10:09:57.388,671] <info> MODULE_BLE_TASK: GATTS timeout.
00> [10:09:57.388,671] <info> MODULE_BLE_TASK: BLE off.
00> [10:09:57.410,156] <info> MODULE_BLE_TASK: Disconnected. Reason: 0x00000016 (BLE_HCI_LOCAL_HOST_TERMINATED_CONNECTION)
00> [10:09:57.410,156] <info> MODULE_BLE_TASK: Tx power: 0 dBm
00> [10:09:57.412,109] <info> MODULE_BLE_TASK: Fast advertising, id=723
After that very long introduction, I wanted your help/opinion about the fix I came up with : it just consists in modifying directly the SDK and precisely that module gatt_cache_manager.c by not executing the local_db_apply_in_event function to make sure that we do not feed the softdevice with obsolete information.
But but but, I'm not 100% sure about the consequences of that change because even if in our system a full service discovery + enabling of notifications are performed each time there's a reconnection, I'm afraid it's not following the ble specs or can have some side effects on the behaviour of the SDK ( which I didn't notice so far but who knows ... ) .
So my questions are :
1. Is that a known problem or could it be recognized as a problem ?
2. Do you think it's an acceptable fix or could it lead to some unexpected behaviour due to non-respect of Bluetooth specifications ?
Thank you for your help!