This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts
This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Peer Manager deletes only peer while we still need it -- failed flash write race?

We are using nrf52840 as a peripheral, Nordic SDK 15.3.0_59ac345.

We are experiencing issues in the field where users are unable to reconnect with our device from our app on iOS phones. App logs indicate that link encryption is failing because Status: PIN or Key Missing (0x06). What follows is my interpretation of what's going on and a potential solution. Looking for feedback from y'all about the problem statement and any better solutions that may exist.

The garbage collection function, pm_handler_flash_clean (which we run from our register PM event handler), keeps a local flag to track that the next flash write succeeded after a garbage collection occurs. If it detects that a flash write after garbage collection fails it assumes that more drastic measures are needed and will mark the oldest peer for deletion and rerun garbage collection, which will result in bond information deletion for the selected peer. What I see happening is a race between multiple instances of flash writes failing because the flash is full (which can be detected synchronously, as apposed to actual flash write attempts).
  • flash write A starts and fails for flash full
  • GC A starts, sets flash_write_after_gc = false , successful write after GC will set back to true
  • flash write B starts and fails for flash full
  • GC B detects that flash_write_after_gc == false and marks the oldest peer for deletion and then starts
  • GC A completes
  • GC B completes, peer bond information has been deleted
In our case the oldest peer is the only peer and we fail to encrypt the link on the central with Status: PIN or Key Missing (0x06).
Here are pm some logs to demonstrate

00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <info> peer_manager_handler: Attempting to clean flash.
00> <debug> peer_manager_handler: Running flash garbage collection.
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <warning> peer_manager_gcm: Flash full. Could not store data for conn_handle: 0
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <debug> peer_manager_handler: Event PM_EVT_FLASH_GARBAGE_COLLECTED
00> <debug> peer_manager_handler: Flash garbage collection complete.
00> <debug> peer_manager_handler: Event PM_EVT_FLASH_GARBAGE_COLLECTED
00> <debug> peer_manager_handler: Flash garbage collection complete.
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <info> peer_manager_handler: Attempting to clean flash.
00> <info> peer_manager_handler: Deleting lowest ranked peer (peer_id: 0)
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <debug> peer_manager_handler: Event PM_EVT_PEER_DATA_UPDATE_SUCCEEDED
00> <debug> peer_manager_handler: Peer data updated in flash: peer_id: 0, data_id: Local database, action: Update
00> <debug> peer_manager_handler: Event PM_EVT_PEER_DATA_UPDATE_SUCCEEDED
00> <debug> peer_manager_handler: Peer data updated in flash: peer_id: 0, data_id: Local database, action: Update
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <info> peer_manager_handler: Attempting to clean flash.
00> <debug> peer_manager_handler: Running flash garbage collection.
00> <debug> peer_manager_handler: Event PM_EVT_STORAGE_FULL
00> <warning> peer_manager_handler: Flash storage is full
00> <debug> peer_manager_handler: Event PM_EVT_PEER_DELETE_SUCCEEDED
00> <error> peer_manager_handler: Peer deleted successfully: peer_id: 0
00> <debug> peer_manager_handler: Event PM_EVT_PEER_DELETE_SUCCEEDED
00> <error> peer_manager_handler: Peer deleted successfully: peer_id: 0
If I understand the code and sequence of events correctly, garbage collection without peer deletion would actually be sufficient. My current plan is to disable peer ranking in our SDK configuration so that the peer manager will never auto delete any peers. We have internal handling that will mark peers for deletion when they are no longer needed.
Any suggestions or corrections based on the summary above?
  • Einar Thorsrud said:
    Jackson Burnette said:
    It seems like there is a lot of GATT DB cache updates that result in flash changes.

    Yes, you are right. The peer manager will write to flash every time a bonded peer connects, as discussed in this thread. This issue will be resolved in the next nRF5 SDK release which is right around the corner.

    Hi Einar Wave Has the issue been resolved in SDK 17.1.0? If so, is there a diff you can provide of the specific change?

  • Yes, this is resolved in 17.1.0. You can see the diff here:

    diff --git a/components/ble/peer_manager/gatt_cache_manager.c b/components/ble/peer_manager/gatt_cache_manager.c
    index e20bdc5..b5be461 100644
    --- a/components/ble/peer_manager/gatt_cache_manager.c
    +++ b/components/ble/peer_manager/gatt_cache_manager.c
    @@ -250,6 +250,8 @@ static bool local_db_update_in_evt(uint16_t conn_handle)
                 success = true;
                 break;
     
    +        case NRF_ERROR_INVALID_DATA:
    +            /* Fallthrough */
             case BLE_ERROR_INVALID_CONN_HANDLE:
                 /* Do nothing */
                 break;
    diff --git a/components/ble/peer_manager/gatts_cache_manager.c b/components/ble/peer_manager/gatts_cache_manager.c
    index d666537..c0919b5 100644
    --- a/components/ble/peer_manager/gatts_cache_manager.c
    +++ b/components/ble/peer_manager/gatts_cache_manager.c
    @@ -228,7 +228,44 @@ ret_code_t gscm_local_db_cache_update(uint16_t conn_handle)
     
                     if (err_code == NRF_SUCCESS)
                     {
    -                    err_code = pdb_write_buf_store(peer_id, PM_PEER_DATA_ID_GATT_LOCAL, peer_id);
    +                    pm_peer_data_flash_t curr_peer_data;
    +
    +                    err_code = pdb_peer_data_ptr_get(peer_id,
    +                                                PM_PEER_DATA_ID_GATT_LOCAL,
    +                                                &curr_peer_data);
    +
    +                    if ((err_code != NRF_SUCCESS) && (err_code != NRF_ERROR_NOT_FOUND))
    +                    {
    +                        NRF_LOG_ERROR("pdb_peer_data_ptr_get() returned %s for conn_handle: %d",
    +                                        nrf_strerror_get(err_code),
    +                                        conn_handle);
    +                        return NRF_ERROR_INTERNAL;
    +                    }
    +
    +                    if((err_code == NRF_ERROR_NOT_FOUND)
    +                        || (p_local_gatt_db->len != curr_peer_data.p_local_gatt_db->len)
    +                        || (memcmp(p_local_gatt_db->data, curr_peer_data.p_local_gatt_db->data,
    +                                    p_local_gatt_db->len) != 0))
    +                    {
    +                        err_code = pdb_write_buf_store(peer_id, PM_PEER_DATA_ID_GATT_LOCAL, peer_id);
    +                    }
    +                    else
    +                    {
    +                        NRF_LOG_DEBUG("Local db is already up to date, skipping write.");
    +                        ret_code_t err_code_release = pdb_write_buf_release(peer_id, PM_PEER_DATA_ID_GATT_LOCAL);
    +                        if (err_code_release == NRF_SUCCESS)
    +                        {
    +                            err_code = NRF_ERROR_INVALID_DATA;
    +                        }
    +                        else
    +                        {
    +                            NRF_LOG_ERROR("Did another thread manipulate PM_PEER_DATA_ID_GATT_LOCAL for "\
    +                                          "peer_id %d at the same time? pdb_write_buf_release() returned %s.",
    +                                          peer_id,
    +                                          nrf_strerror_get(err_code_release));
    +                            err_code = NRF_ERROR_INTERNAL;
    +                        }
    +                    }
                     }
                     else
                     {
    diff --git a/components/ble/peer_manager/gatts_cache_manager.h b/components/ble/peer_manager/gatts_cache_manager.h
    index c271667..1356eea 100644
    --- a/components/ble/peer_manager/gatts_cache_manager.h
    +++ b/components/ble/peer_manager/gatts_cache_manager.h
    @@ -78,6 +78,8 @@ ret_code_t gscm_init(void);
      * @retval NRF_SUCCESS                    Store operation started.
      * @retval BLE_ERROR_INVALID_CONN_HANDLE  conn_handle does not refer to an active connection with a
      *                                        bonded peer.
    + * @retval NRF_ERROR_INVALID_DATA         Refused to update because the GATT database is already up
    + *                                        to date.
      * @retval NRF_ERROR_BUSY                 Unable to perform operation at this time. Reattempt later.
      * @retval NRF_ERROR_DATA_SIZE            Write buffer not large enough. Call will never work with
      *                                        this GATT database.
    

Related