FDS_EVT_GC results in NRF_SUCCESS but with remaining freeable words

Hello, 

We have nRF52832 in production on a custom board. Recently, we created a new functionality with the aim to handle flash more efficiently, and to avoid the case where peer manager will not have enough space and it will erase bonds. So, every time before saving any kind of record, words_used + new record size is checked that is less than the determined limit and then saving in flash is allowed. We also call two times a day the fds_gc, as long as there is no BLE connection at the time. 

The problem is that from many clients devices we receive logs that we have set in order to track gc behaviour. This kind of log is set after FDS_EVT_GC and if this event returns NRF_SUCCESS. In this case we call fds_stat to witness the result of garbage collector, but unfortunately, it returns the same number of freeable words at every gc call. We have this log from many users for days, and we do not know why this happens, and why it doesn't delete all freeable words but still returns nrf_success? 
When I say the same number I mean same number again and again for each user, e.g. one user has 1458, another 2012, etc. 

Why does this happen?
How can we solve this problem? 

Thank you in advance.

Dimitra

Parents
  • Hi,

    Which nRF5SDK version are you using?

    Would it be possible to share a minimal sample that would reproduce this issue? Alternatively share a flash dump before and after garbage collection using:

    nrfjprog --readcode dump.hex

    I can make the case private first if you prefer it. 

    regards

    Jared

  • Hello, 

    We use nRF5 SDK 17.1.0, softdevice s132, and SEGGER Embedded Studio. 

    I am currently running tests to reproduce it myself, as I said we have seen it mostly on release on client's devices in which we have some logs to check the functionality remotely. 
    As soon as I confirm how to reproduce the problem, I will share it with you. 

    In the meanwhile, could you please explain to me what you mean by flash dump and how to do what you describe with this spesific command too? (I haven't used command line, I edit on SEGGER and erase/write hex through nrf Connect app programmer). 

    Regards, 
    Dimitra

  • DimitraN said:
    In the meanwhile, could you please explain to me what you mean by flash dump and how to do what you describe with this spesific command too? (I haven't used command line, I edit on SEGGER and erase/write hex through nrf Connect app programmer). 

    nRF Command Line tools is a stand alone executable that contains several options to communicate with the debug interface on the nRF52. One of these commands --readcode allows you to read the flash on the nRF52 and stores the content in a file. You can find documentation for it here.

  • Hello, 

    I have been running tests to reproduce the log but I am still not sure what causes it. 

    Let me explain more:
    We have made a test function that saves multiple fake samples to force flash getting full in order to test the operation when it gets full. 

    Two times I saw this error with GC:
    1. I did this test , the flash got full, meaning we stop attempting to save anything in flash, we call GC in case there are freeable words, we free up some space, GC worked fine, and then we continue to save some more records, until there is nothing to delete and there are no freeable. 
    Then, I powered down my board, trying to see if running out of battery might be relevant to the case. 
    Turned the board back on, and connected with mobile to collect data. When we collect data, the records are deleted from flash. After getting all data, I disconnected, and GC is called which now had over 22000 words to free. The result was unfortunately, that GC finished successfully with 2040 freeable not erased. 
    I did hard reset and GC was called again and this time it resulted in 0 freeable words. 

    2. The second time the bug occurred I have no idea why it occurred. I was repeating this test, and when the flash got full and GC was called it returned successfully with 528 freeable words not erased, before I turn off the power... And I don't see anything different on the logs that would help me understand why now it occurred earlier. 
    I connected to collect data, after receiving them, GC was called again and now it had 2040 freeable left in memory again... 

    Do you think I should do soft-reset when it gc returns SUCCESS but there are freeable remaining records?

    Finally, I tried running the command you told me but I get this: 
    [error] [ Worker] - BinaryImage access error, could not open dump.hex.
    [error] [ Client] - Encountered error -162: Command read_to_file executed for 6401 milliseconds with result -162
    ERROR: The file specified is not a valid hex file, has data outside valid areas
    ERROR: or does not have data in valid areas.
    NOTE: For additional output, try running again with logging enabled (--log).
    NOTE: Any generated log error messages will be displayed.

    What am I doing wrong?

    If I run device version it seems connected... 
    PS C:\Program Files\Nordic Semiconductor\nrf-command-line-tools> nrfjprog --deviceversion
    682245179 - NRF52832_xxAA_REV1

    Thank you for your support.

  • Hi,

    Can you double check that you have read write access in the path that you are trying to run that command from? Alternatively, try changing the filename to something else. 

    regards

    Jared 

  • Ok, I run a test where the error occurred because I turned off power after flash was full. 
    Here are the files: 
    before1.hex

    after.hex

    I hope it helps. 

    Do you think I should do soft-reset when it gc returns SUCCESS but there are freeable remaining records?

    If you cannot find something that would help me, do you think it is ok to do soft reset when it happens?

    Thank you
    Dimitra

  • Hi,

    Similar to your observation, based on your flash dump, I can see that the GC for some reason is not performed 100%. This can happen if a record fail CRC. Do you have CRC enabled and could you try to disable it and see if you're still able to reproduce this issue?

    DimitraN said:
    Do you think I should do soft-reset when it gc returns SUCCESS but there are freeable remaining records?

    If you cannot find something that would help me, do you think it is ok to do soft reset when it happens?

    I don't see how a soft-reset or any reset would have any effect on the garbage collection after the fds event handler has retuned with a FDS_EVT_GC.

    regards

    Jared 

Reply
  • Hi,

    Similar to your observation, based on your flash dump, I can see that the GC for some reason is not performed 100%. This can happen if a record fail CRC. Do you have CRC enabled and could you try to disable it and see if you're still able to reproduce this issue?

    DimitraN said:
    Do you think I should do soft-reset when it gc returns SUCCESS but there are freeable remaining records?

    If you cannot find something that would help me, do you think it is ok to do soft reset when it happens?

    I don't see how a soft-reset or any reset would have any effect on the garbage collection after the fds event handler has retuned with a FDS_EVT_GC.

    regards

    Jared 

Children
  • In my sdk_config.h file I have these settings:

    FDS_CRC_CHECK_ON_READ 0
    FDS_CRC_CHECK_ON_WRITE 0
    CRC16_ENABLED 1
    CRC32_ENABLED 0

    Is this the one you are talking about or there is another configuration missing?
    I will disable CRC16_ENABLED and try to reproduce it. 

    I don't see how a soft-reset or any reset would have any effect on the garbage collection after the fds event handler has retuned with a FDS_EVT_GC.

    I was thinking about calling fds_gc again after a reset which would hopefully work normally, but maybe it is not a safe approach..


    Kind regards, 
    Dimitra

  • Hi,

    Yes and set FDS_CRC_CHECK_ON_READ to 0.

    DimitraN said:
    I was thinking about calling fds_gc again after a reset which would hopefully work normally, but maybe it is not a safe approach..

    I'm not sure if it has any effect but I can't see how calling gc after a reset would effect the flash negatively except for the added time. What's important is that you wait for the FDS events to be generated before you reset as the module is asynchronous. 

    regards
    Jared

  • Hello, 

    It looks like it didn't work.. 

    I run the test again and after getting full flash I turned off the power in order to try to reproduce the error. I turned on the power again, connected, collected data and then gc was called and returned success with 2002 freeable words. 

    But this time I noticed that after collecting all data, peer manager tried to clean flash. So could it be maybe that I call gc at the same time peer manager does or right after and there is a conflict or something? 

    		// mobile is connected
    00> <info> app: No more records were found to send!  Valid Measurements: 0
    		// sampling
    00> <info> app: Event: FDS_EVT_UPDATE received (NRF_SUCCESS)
    00> <info> peer_manager_handler: Peer data updated in flash: peer_id: 0, data_id: Local database, action: Update
    00> <info> app: Successfully wrote Unix time: 1739524052
    00> <info> app: Gmt: 2
    00> <info> app: Event: FDS_EVT_UPDATE received (NRF_SUCCESS)
    00> <info> app: ## FULL FLASH ##		// log that states full flash state - records were just deleted, in some seconds this goes away
    00> <info> app: Event: FDS_EVT_UPDATE received (NRF_SUCCESS)
    00> <info> peer_manager_handler: Peer data updated in flash: peer_id: 0, data_id: Local database, action: Update
    00> <info> app: Valid Measurements: 0. Errors saved: 0
    00> <info> app: Event: FDS_EVT_UPDATE received (NRF_SUCCESS)
    00> <info> peer_manager_handler: Peer data updated in flash: peer_id: 0, data_id: Local database, action: Update
    00> <info> app: Disconnected, reason 0x13.		
    		// mobile diconnected: this means that GC is called by me at this point if there are freeable words
    00> Logs dropped (9)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> Logs dropped (9)
    00> <info> app: Event: FDS_EVT_GC received (FDS_ERR_OPERATION_TIMEOUT)
    00> Logs dropped (9)
    00> <warning> peer_manager_handler: Flash garbage collection failed with error Unknown error code.
    00> Logs dropped (9)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> Logs dropped (9)
    00> <info> app: Event: FDS_EVT_GC received (FDS_ERR_OPERATION_TIMEOUT)
    00> Logs dropped (9)
    00> <warning> peer_manager_handler: Flating to clean flash.
    00> Logs dropped (9)
    00> <info> app: Event: FDS_EVT_GC received (FDS_ERR_OPERATION_TIMEOUT)
    00> Logs dropped (9)
    00> <warning> peer_manager_handler: Flash garbage collection failed with error Unknown error code.
    00> Logs dropped (9)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> Logs dropped (9)
    00> <info> app: Event: FDS_EVT_GC received (FDS_ERR_OPERATION_TIMEOUT)
    00> ting to clean flash.
    00> Logs dropped (9)
    00> <info> app: Event: FDS_EVT_GC received (FDS_ERR_OPERATION_TIMEOUT)
    00> Logs dropped (9)
    00> <warning> peer_manager_handler: Flash garbage collection failed with error Unknown error code.
    00> Logs dropped (9)
    00> <info> peer_manager_handler: Attempash garbage collection failed with error Unknown error code.
    00> Logs dropped (10)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> Logs dropped (10)
    00> <info> app: Event: FDS_EVT_GC received (FDS_ERR_OPERATION_TIMEOUT)
    00> Logs dropped (10)
    00> <warning> peer_manager_handler: Flash garbage collection failed with error Unknown error code.
    00> Logs dropped (10)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> ived (FDS_ERR_OPERATION_TIMEOUT)
    00> ash garbage collection failed with error Unknown error code.
    00> Logs dropped (10)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> Logs dropped (10)
    00> <info> app: Event: FDS_EVT_GC received (FDS_ERR_OPERATION_TIMEOUT)
    00> Logs dropped (10)
    00> <warning> peer_manager_handler: Flash garbage collection failed with error Unknown error code.
    00> Logs dropped (10)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> ived (FDS_ERR_OPERATION_TIMEOUT)
    00> Logs dropped (10)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> Logs dropped (10)
    00> <info> app: Event: FDS_EVT_GC received (FDS_ERR_OPERATION_TIMEOUT)
    00> Logs dropped (10)
    00> <warning> peer_manager_handler: Flash garbage collection failed with error Unknown error code.
    00> Logs dropped (10)
    00> <info> peer_manager_handler: Attempting to clean flash.
    00> Logs dropped (10)
    00> <info> app: Event: FDS_EVT_GC rece<info> app: Fast advertising.
    00> <info> app: Event: FDS_EVT_GC received (NRF_SUCCESS)
    		// below logs: (freeable_words, words_used)
    00> <info> app: AFTER GC STAT: (2002, 2092)  (origin 4)
    00> <info> app: AFTER GC STAT UNDER LIMIT: (2002, 2092)
    		// sampling continues... 
    00> <info> app: No 2. call of GC	// GC is called again because of timestamp (this happens at midnight)
    00> <info> app: Unix limit is: 1739570400
    00> <info> app: Event: FDS_EVT_UPDATE received (NRF_SUCCESS)
    00> <info> app: Event: FDS_EVT_GC received (NRF_SUCCESS)
    00> <info> app: AFTER GC STAT: (2008, 2098)  (origin 4)
    00> <info> app: AFTER GC STAT UNDER LIMIT: (2008, 2098)
    		// sampling
    00> <info> app: Event: FDS_EVT_UPDATE received (NRF_SUCCESS)
    		// sampling
    00> <info> app: Event: FDS_EVT_UPDATE received (NRF_SUCCESS)


    Is there a way that I have knowledge that gc has been already called from peer manager and so I do not call it again while it is running ? 
    Or perhaps this happens if GC is called while there is another active action in flash like writing/updating a record? 

    Finally, could you explain me more about CRC? Why should I set it enabled or disabled? and now that it was reproduced should I enable CRC16_ENABLED again?

    Kind regards, 
    Dimitra
     

  • Hi,

    DimitraN said:
    Is there a way that I have knowledge that gc has been already called from peer manager and so I do not call it again while it is running ? 

    I don't think that is possible with the peer manager. 

    DimitraN said:
    Is there a way that I have knowledge that gc has been already called from peer manager and so I do not call it again while it is running ? 
    Or perhaps this happens if GC is called while there is another active action in flash like writing/updating a record? 

    Essentially you should only call the garbage collection once the filesystem is full to minimize flash wear. 

    Another plausible reason for the GC skipping pages is if the record is already open when the garbage collection is started.

    From the documentation for freeable words:

        * The actual amount of space freed by garbage collection might be less than this value if
        * records are open while garbage collection is run.

    DimitraN said:
    Finally, could you explain me more about CRC? Why should I set it enabled or disabled? and now that it was reproduced should I enable CRC16_ENABLED again?

    It's an extra added layer to enable integrity of data to ensure that the data hasn't been corrupted. While we debug this issue, you can have it disabled. 

    regards

    Jared 

  • Hello, 

    Another plausible reason for the GC skipping pages is if the record is already open when the garbage collection is started.

    Usually the number of freeable left is a lot larger than a record, but it is close to the size of a virtual page which is 2048 words. 
    I found this at fds.h at function fds_record_close

    * Closing a record allows garbage collection to run on the virtual page in which the record is
    * stored (if no other records remain open on that page).

    I think I found what the problem was! 
    After a reset (or a power-up) I call some function to retrieve data from flash to set some variables. There we didn't close one record so the GC would not run on its page. 
    Probably this happens to clients that leave the device for a long time without charge and it eventually turns off.
    Also I had a call of GC right after updating another record, which probably make them run simultaneously or at least GC might run before finishing the update and maybe this causes the problem again. 
    I have confirmed at least twice that this error has not occured after correcting the above . 

    Thank you very much for your help! 

    Regards, 
    Dimitra 

Related