Peer Manager deletes only peer while we still need it -- failed flash write race?

Question

We are using nrf52840 as a peripheral, Nordic SDK 15.3.0_59ac345.

We are experiencing issues in the field where users are unable to reconnect with our device from our app on iOS phones. App logs indicate that link encryption is failing because Status: PIN or Key Missing (0x06). What follows is my interpretation of what's going on and a potential solution. Looking for feedback from y'all about the problem statement and any better solutions that may exist.

The garbage collection function, pm_handler_flash_clean (which we run from our register PM event handler), keeps a local flag to track that the next flash write succeeded after a garbage collection occurs. If it detects that a flash write after garbage collection fails it assumes that more drastic measures are needed and will mark the oldest peer for deletion and rerun garbage collection, which will result in bond information deletion for the selected peer. What I see happening is a race between multiple instances of flash writes failing because the flash is full (which can be detected synchronously, as apposed to actual flash write attempts).

flash write A starts and fails for flash full
GC A starts, sets flash_write_after_gc = false , successful write after GC will set back to true
flash write B starts and fails for flash full
GC B detects that flash_write_after_gc == false and marks the oldest peer for deletion and then starts
GC A completes
GC B completes, peer bond information has been deleted

In our case the oldest peer is the only peer and we fail to encrypt the link on the central with Status: PIN or Key Missing (0x06).

Here are pm some logs to demonstrate

00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <info> peer_manager_handler: Attempting to clean flash.
00> <debug> peer_manager_handler: Running flash garbage collection.
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <warning> peer_manager_gcm: Flash full. Could not store data for conn_handle: 0
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <debug> peer_manager_handler: Event PM_EVT_FLASH_GARBAGE_COLLECTED
00> <debug> peer_manager_handler: Flash garbage collection complete.
00> <debug> peer_manager_handler: Event PM_EVT_FLASH_GARBAGE_COLLECTED
00> <debug> peer_manager_handler: Flash garbage collection complete.
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <info> peer_manager_handler: Attempting to clean flash.
00> <info> peer_manager_handler: Deleting lowest ranked peer (peer_id: 0)
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <debug> peer_manager_handler: Event PM_EVT_PEER_DATA_UPDATE_SUCCEEDED
00> <debug> peer_manager_handler: Peer data updated in flash: peer_id: 0, data_id: Local database, action: Update
00> <debug> peer_manager_handler: Event PM_EVT_PEER_DATA_UPDATE_SUCCEEDED
00> <debug> peer_manager_handler: Peer data updated in flash: peer_id: 0, data_id: Local database, action: Update
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <info> peer_manager_handler: Attempting to clean flash.
00> <debug> peer_manager_handler: Running flash garbage collection.
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <debug> peer_manager_handler: Event PM_EVT_PEER_DELETE_SUCCEEDED
00> <error> peer_manager_handler: Peer deleted successfully: peer_id: 0
00> <debug> peer_manager_handler: Event PM_EVT_PEER_DELETE_SUCCEEDED
00> <error> peer_manager_handler: Peer deleted successfully: peer_id: 0

If I understand the code and sequence of events correctly, garbage collection without peer deletion would actually be sufficient. My current plan is to disable peer ranking in our SDK configuration so that the peer manager will never auto delete any peers. We have internal handling that will mark peers for deletion when they are no longer needed.

Any suggestions or corrections based on the summary above?

Einar Thorsrud · Accepted Answer

Hi, 
 If I understand the code and sequence of events correctly, garbage collection without peer deletion would actually be sufficient. 
 That is the case if you have deleted or updated some records without doing garbage collection before. If so, garbage collection will free the old dirty records, so that you get available space. If not, and you actually have too much data stored (typically also from other FDS data, not related to the peer manger), it will not be enough. 
 There are a few things to note here, some of which are related to a few things not handled very well in the peer manager. First of all, you cannot delete the bond for a peer that is currently being used. That will lead to undefined behavior in the peer manager. However, the peer manager does not prevent deleting the bond for a peer that is currently connected. Secondly, as you note, the peer manger PM_EVT_STORAGE_FULL will trigger deletion of the oldest peer even if it is the only peer. 
 My current plan is to disable peer ranking in our SDK configuration so that the peer manager will never auto delete any peers. We have internal handling that will mark peers for deletion when they are no longer needed. 
 
 Any suggestions or corrections based on the summary above? 
 That makes sense. Either that, or improve the peer manager handling of this scenario. Altos, another approach sometimes used is to regularly check the amount of free space in FDS, and do a garbage collection if there is a need for that (there is memory to be freed, and amount of free space is limited). Potentially also delete old and no longer needed data before doing the GC. 
 Einar

Einar Thorsrud · Answer

Hi, 
 Jackson Burnette said: It seems like there is a lot of GATT DB cache updates that result in flash changes. 
 Yes, you are right. The peer manager will write to flash every time a bonded peer connects, as discussed in this thread . This issue will be resolved in the next nRF5 SDK release which is right around the corner. 
 Jackson Burnette said: Does the peer manager take care of deleting the stale GATT DB caches so that they are cleaned up on GC? 
 Yes, old caches are deleted, so GC will free that space.

Einar Thorsrud · Answer

Yes, this is resolved in 17.1.0. You can see the diff here:

Fullscreen 
 cache_fix.diff 
 Download 
 
 diff --git a/components/ble/peer_manager/gatt_cache_manager.c b/components/ble/peer_manager/gatt_cache_manager.c
index e20bdc5..b5be461 100644
--- a/components/ble/peer_manager/gatt_cache_manager.c
+++ b/components/ble/peer_manager/gatt_cache_manager.c
@@ -250,6 +250,8 @@ static bool local_db_update_in_evt(uint16_t conn_handle)
 success = true;
 break;
 
+ case NRF_ERROR_INVALID_DATA:
+ /* Fallthrough */
 case BLE_ERROR_INVALID_CONN_HANDLE:
 /* Do nothing */
 break;
diff --git a/components/ble/peer_manager/gatts_cache_manager.c b/components/ble/peer_manager/gatts_cache_manager.c
index d666537..c0919b5 100644
--- a/components/ble/peer_manager/gatts_cache_manager.c
+++ b/components/ble/peer_manager/gatts_cache_manager.c
@@ -228,7 +228,44 @@ ret_code_t gscm_local_db_cache_update(uint16_t conn_handle)
 
 if (err_code == NRF_SUCCESS)
 {
- err_code = pdb_write_buf_store(peer_id, PM_PEER_DATA_ID_GATT_LOCAL, peer_id);
+ pm_peer_data_flash_t curr_peer_data;
+
+ err_code = pdb_peer_data_ptr_get(peer_id,
+ PM_PEER_DATA_ID_GATT_LOCAL,
+ &curr_peer_data);
+
+ if ((err_code != NRF_SUCCESS) && (err_code != NRF_ERROR_NOT_FOUND))
+ {
+ NRF_LOG_ERROR("pdb_peer_data_ptr_get() returned %s for conn_handle: %d",
+ nrf_strerror_get(err_code),
+ conn_handle);
+ return NRF_ERROR_INTERNAL;
+ }
+
+ if((err_code == NRF_ERROR_NOT_FOUND)
+ || (p_local_gatt_db->len != curr_peer_data.p_local_gatt_db->len)
+ || (memcmp(p_local_gatt_db->data, curr_peer_data.p_local_gatt_db->data,
+ p_local_gatt_db->len) != 0))
+ {
+ err_code = pdb_write_buf_store(peer_id, PM_PEER_DATA_ID_GATT_LOCAL, peer_id);
+ }
+ else
+ {
+ NRF_LOG_DEBUG("Local db is already up to date, skipping write.");
+ ret_code_t err_code_release = pdb_write_buf_release(peer_id, PM_PEER_DATA_ID_GATT_LOCAL);
+ if (err_code_release == NRF_SUCCESS)
+ {
+ err_code = NRF_ERROR_INVALID_DATA;
+ }
+ else
+ {
+ NRF_LOG_ERROR("Did another thread manipulate PM_PEER_DATA_ID_GATT_LOCAL for "\
+ "peer_id %d at the same time? pdb_write_buf_release() returned %s.",
+ peer_id,
+ nrf_strerror_get(err_code_release));
+ err_code = NRF_ERROR_INTERNAL;
+ }
+ }
 }
 else
 {
diff --git a/components/ble/peer_manager/gatts_cache_manager.h b/components/ble/peer_manager/gatts_cache_manager.h
index c271667..1356eea 100644
--- a/components/ble/peer_manager/gatts_cache_manager.h
+++ b/components/ble/peer_manager/gatts_cache_manager.h
@@ -78,6 +78,8 @@ ret_code_t gscm_init(void);
 * @retval NRF_SUCCESS Store operation started.
 * @retval BLE_ERROR_INVALID_CONN_HANDLE conn_handle does not refer to an active connection with a
 * bonded peer.
+ * @retval NRF_ERROR_INVALID_DATA Refused to update because the GATT database is already up
+ * to date.
 * @retval NRF_ERROR_BUSY Unable to perform operation at this time. Reattempt later.
 * @retval NRF_ERROR_DATA_SIZE Write buffer not large enough. Call will never work with
 * this GATT database.

Peer Manager deletes only peer while we still need it -- failed flash write race?

Top Replies