This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Bug in peer manager in SDK 14.2

I have recently ported my nRF52832 application from SDK 12.3 to SDK 14.2 and have been disappointed to find that the preripheral is locking up after a period in operation.

In order to track down the issue I left the debugger running until the application froze (a few hours) and found that peer manager seems to be caught in a infinite loop trying to clear some flash. Exactly what it is trying to store is unclear - I only have one central bonded.

I have since built a short central application that repeatedly connects and disconnects from the peripheral at 2 second intervals which has made this bug repeatable - here is an example stack trace when I hit pause in the debugger after the application has frozen:

image description

I have tried to work through the peer manager code and it appears that pds_peer_data_store attempts to use FDS to write to flash but gets FDS_ERR_NO_SPACE_IN_FLASH. FDS then attempts to do come garbage collection after which the peer manager attempts again to write to flash but again gets the no space error. This continues indefinitely.

Is this a known issue and if so is there any workaround recommended?

Update: I have tracked the memory leak to the function gscm_local_db_cache_update:

image description

Prior to this function being executed I get the following debug output to the console:

app: Connected to a previously bonded device.
app: Connected
app: Connection secured. Role: 1. conn_handle: 0x0, Procedure: 0
ble_gatt: Peer on connection 0x0 requested an ATT MTU of 185 bytes.
ble_gatt: Updating ATT MTU to 23 bytes (desired: 23) on connection 0x0.

I'm not sure what is causing this MTU negotiation but it appears that the SDK is trying to cache the connection parameter. Rather than updating the existing record for the peer it is creating a new one, so after a few connect/disconnect cycles the flash fills up and it gets stuck in the garbage collecting loop.

So I guess my question is:

  • What might be causing this MTU negotiation from the peer in the first place?

  • What needs to be done to fix the memory leak?

FURTHER UPDATE: I have added some additional logging to give visibility of the event flow:

nrf_sdh_ble: BLE event: 0x10.  // BLE_GAP_EVT_CONNECTED
app: Connected to a previously bonded device.
app: Connected
nrf_sdh_ble: BLE event: 0x14.  // BLE_GAP_EVT_SEC_INFO_REQUEST
nrf_sdh_ble: BLE event: 0x1A.  // BLE_GAP_EVT_CONN_SEC_UPDATE
app: Connection secured. Role: 1. conn_handle: 0x0, Procedure: 0
nrf_sdh_ble: BLE event: 0x21.  // BLE_GAP_EVT_SCAN_REQ_REPORT
app: PHY update request.
nrf_sdh_ble: BLE event: 0x22.  // BLE_GAP_EVT_PHY_UPDATE_REQUEST
nrf_sdh_ble: BLE event: 0x23.  // BLE_GAP_EVT_PHY_UPDATE
ble_gatt: Peer on connection 0x0 requested a data length of 251 bytes.
ble_gatt: Updating data length to 27 bytes on connection 0x0.
nrf_sdh_ble: BLE event: 0x24.  // BLE_GAP_EVT_DATA_LENGTH_UPDATE_REQUEST
ble_gatt: Data length updated to 27 on connection 0x0.
ble_gatt: max_rx_octets: 27
ble_gatt: max_tx_octets: 27
ble_gatt: max_rx_time: 328
ble_gatt: max_tx_time: 328
nrf_sdh_ble: BLE event: 0x55. // BLE_GATTS_EVT_EXCHANGE_MTU_REQUEST
ble_gatt: Peer on connection 0x0 requested an ATT MTU of 293 bytes.
ble_gatt: Updating ATT MTU to 23 bytes (desired: 23) on connection 0x0.
nrf_sdh_ble: BLE event: 0x50. // BLE_GATTS_EVT_WRITE
***** DATA GETS WRITTEN TO FLASH HERE via pds_peer_data_store -> fds_record_update *****
nrf_sdh_ble: BLE event: 0x12. // BLE_GAP_EVT_CONN_PARAM_UPDATE

I'm not sure if this event flow is usual for connection to a bonded device? It seems to be a lot of data to be exchanged each time a connection is established!

What I have managed to confirm is that the updated peer data is written to the same FDS file_id and key each time, so at least from that angle the peer data is being stored correctly. So if the above event flow is expected then I guess the issue is not with peer manager but with the garbage collection in FDS?

It looks like others may also have experienced this problem:

devzone.nordicsemi.com/.../

  • You say GC is attempted? Could you elaborate?

    Do you get the PM_EVT_STORAGE_FULL event in pm_evt_handler()? If you do, do fds_gc() return 0x00000000 (NRF_SUCCESS)? Do you get the PM_EVT_FLASH_GARBAGE_COLLECTED event in pm_evt_handler()? Or the FDS_EVT_GC event in fds_evt_handler()?

    Could you run fds_stat() and see what it returns?

    The ATT MTU request is probably done by the GATT module, I think you can just comment out gatt_init() in main() and this will not happen.

    I'm guessing that the BLE_GATTS_EVT_WRITE is a write to a CCCD? Every time a CCCD is written to, the PM_PEER_DATA_ID_GATT_LOCAL_V2 record will be updated. This means that a new record will be written, and the old should be invalidated. Then when doing GC this invalidated record should be collected.

    Would it be possible for you to upload the complete projects so I can test them here?

  • Hi Petter - thanks for your help. Yes I am writing to (several) CCCDs immediately when the connection is established so my central can receive notifications. A few updates.

    First I tried commenting out gatt_init from main as you suggested and this makes the problem go away - the flash writes simply don't happen anymore. I'm not sure if there are other implications with doing that though?

    Secondly I implemented some logging around the PM_EVT_STORAGE_FULL event. Calling fds_gc() returns zero but the GC does not actually complete - there are still over a hundred dirty pages when I run fds_stat() immediately after. Interestingly I also added some logging to PM_EVT_LOCAL_DB_CACHE_APPLIED so I could see the cache filling up and the problem went away - GC did clear out the dirty records. The delay associated with the extra logging clearly helped so I think this must be a race condition - perhaps the previous flash write is not completing before fds_gc() is executed?

  • Hmm. I didn't expect the writes to go away by commenting out gatt_init(). I expected MTU exchange to go away. Which central example are using? Maybe the write is trigger by the MTU exchange or something.

    If you don't want to use the GATT module you should also remove the files from the project, and/or disable the module in sdk_config.h see NRF_BLE_GATT_ENABLED. Implications will be that some events will not be handled anymore, see nrf_ble_gatt_on_ble_evt().

    When you run fds_stat() immediately after calling fds_gc()? That is expected, the GC is not complete until you get PM_EVT_FLASH_GARBAGE_COLLECTED and FDS_EVT_GC. Do you get these events? What do fds_stat() return then?

    I'm not sure why the logging changes anything. Would it be possible to upload your complete projects? And tell me how I can reproduce this? And which logging removes removes the issue?

  • Unfortunately my central is an unpublished iOS app so although I'd be happy to share the project with you I don't think that's going to be much help without the app to test with.

    However, I think I may now know what is happening. I do get PM_EVT_FLASH_GARBAGE_COLLECTED and FDS_EVT_GC after the first CCCD write and fds_stat() shows the dirty records have gone at that point. However, I can see three flash writes going out before this happens (as my app requests notification on three CCCDs at connection). My guess is that we have contention here on the flash buffer - the CCCD writes are coming in while the GC process is still in flight and this is somehow screwing up the FDS state and causing problems . Is that plausible?

    Out of interest why is persistent storage for the CCCD state required? Am I doing the right thing by writing to the CCCD each time the app connects to ensure I get notifications or should I be reading the state and only updating if notifications are not already enabled?

  • Ok, so this is confirmed, if I limit my app to only writing to one CCCD at connection time then everything works as expected - if the flash buffer is full GC completes fine and everything runs normally. So the issue is that the peer manager/FDS is not correctly managing the situation where multiple CCCD writes are made simultaneously and FDS needs to do garbage collection to store their values in the local cache.

    I also note that if I disable the gatt module my peripheral no longer receives BLE_GATTS_EVT_WRITE events - so turning it off seems like a bad idea!

Related