Fstorage randomly erasing

Our team has been using Fstorage for a long time to store customer data in the device for customization. At random times, a very small number of units get erased completely when Fstorage is executed. It looks as if the erase is processed but the write is not: all the registers read 0xFF.

When the device is reprogrammed, it functions normally and to date, has never reoccurred on the same unit.  

Can you offer any insight as to what is happening? The device works on a coin cell battery and happens even if the battery is new. Since it is a very rare occurrence, we cannot duplicate it readily to try and debug. 

  • void update_data(uint8_t *device_data)
    {
        ret_code_t rc;

        /* Erase one page starting at 0x4c000. */
        rc = nrf_fstorage_erase(&fstorage, 0x4c000, 1, NULL);
        APP_ERROR_CHECK(rc);

        /* Write the new data to the start of the same page. */
        rc = nrf_fstorage_write(&fstorage, 0x4c000, device_data, data_size, NULL);
        APP_ERROR_CHECK(rc);
    }

    Here is how we are writing to the flash.

  • I have a "hunch" that you are possibly doing an erase early in main()? In that case, if there is some ripple/bounce on VDD when the coin cell battery is inserted, it's not unlikely that the MCU has enough time to start an erase, but the bounce then causes the voltage to fall below minimum operating conditions. What happens in that case is somewhat undefined; it can cause the erase operation to "malfunction". Can you add a 100 ms delay at the start of main() to ensure that the battery is stable before you start running code (specifically, code related to flash operations)? If you have a bootloader, make sure to add this delay at the start of the bootloader.

    Kenneth

  • Slightly running out of ideas. An erase operation is typically a time-consuming task that blocks all other MCU execution. Do you have any indication that you may (in a corner case) be drawing a lot of current while the erase operation is ongoing? E.g. an erase operation can take up to 100 ms; is it possible that you have LEDs or other circuitry that may draw excessive current during this period, enough for the coin cell to drop below operating conditions?

    Kenneth

  • That is a good idea. We don't have LED's or high current consuming devices. The coin cell voltage to the MCU is regulated so even if the coin cell drops, the regulator holds the supply at 3V. We are at a loss too since it is rare and not repeatable.

  • I would suggest that the flash erase works fine and it is the flash write that fails, hence the flash appears to get erased unexpectedly. Without delving into the flash code (which should monitor VDD before a write), I would recommend monitoring both VDD and the coin cell voltage before issuing the write. The erase may leave the supply still recovering from the power surge it caused, which lowers the coin cell voltage at the regulator input (so less headroom, less available energy) and may even lower the regulator output. A simple workaround is to never follow a flash erase with an immediate flash write without allowing a Lithium Recovery Time; say a delay of at least 200 ms, more if possible. The voltage dip on the coin cell will be visible on a 'scope. Why no error code? I doubt that the write function realises the flash chip blocked the write due to falling voltage; a read-back is required.
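The read-back suggested above could look roughly like the following. This is only a minimal sketch: `region_is_erased` and `write_verified` are hypothetical helper names, not part of fstorage, and the sketch assumes the flash page can be read through an ordinary pointer once the write has completed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true when every byte in the region reads 0xFF,
 * i.e. the page looks erased and nothing was written afterwards. */
static bool region_is_erased(const uint8_t *region, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (region[i] != 0xFF) {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: true when the region holds exactly the bytes
 * that were supposed to be written. */
static bool write_verified(const uint8_t *region, const uint8_t *expected, size_t len)
{
    return memcmp(region, expected, len) == 0;
}
```

On the target, `region` would presumably be `(const uint8_t *)0x4c000`, checked only after the fstorage write event has arrived. A failed check could then trigger a retry instead of leaving the page blank.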

  • That is a great suggestion. We did just that experiment and VDD seems to be holding fairly well. The regulator is a boost with ultra-low dropout. I believe you are correct that the erase is being executed and the write is not. We are now thinking that the erase/write sequence is being interrupted between the two operations by another call to the erase/write function. The erase/write is inside a function which can be called from an interrupt. Could this be the problem?

  • Having the erase/write within an interrupt-callable function sounds risky without some kind of semaphore lock to prevent nested calls. One assumes you are using the SoftDevice version of fstorage; what is SD_MAX_WRITE set to? Is it more than the data size? Note that if it is not, there is no safe return code indicating success; a read-back after write is required for that.

    "When using SoftDevice implementation, the data is written by several calls to sd_flash_write if the length of the data exceeds NRF_FSTORAGE_SD_MAX_WRITE_SIZE bytes. Only one event is sent upon completion

    Note
    The data to be written to flash must be kept in memory until the operation has terminated and an event is received."

    Also what version of SDK, and is FreeRTOS being used?
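A lock of the kind mentioned above has to be non-blocking if it can be taken from an interrupt, since waiting inside the handler would never let the holder finish. A minimal sketch using C11 atomics follows; `flash_update_try_begin` and `flash_update_end` are hypothetical names, and it assumes the toolchain provides `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set while one erase+write sequence is in flight. */
static atomic_flag m_flash_busy = ATOMIC_FLAG_INIT;

/* Try to claim the flash for one erase+write sequence.
 * Returns false immediately if a sequence is already in progress. */
static bool flash_update_try_begin(void)
{
    return !atomic_flag_test_and_set(&m_flash_busy);
}

/* Release the claim. Intended to be called only once the write has
 * actually finished, e.g. from the fstorage event handler. */
static void flash_update_end(void)
{
    atomic_flag_clear(&m_flash_busy);
}
```

The caller would skip (or defer) the update when `flash_update_try_begin()` returns false. Releasing the flag on the completion event rather than right after the write call also keeps the data buffer from being reused while the operation is still queued.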

  • There is a lock to prevent nested calls. I was concerned with just having it inside the interrupt. We're using 15.1.0 and not using FreeRTOS. One thing we did was deliberately overload the MCU with excessive erase/writes, and it did cause the issue. Increasing NRF_FSTORAGE_SD_QUEUE_SIZE from 4 to 5 solved it. SD_MAX_WRITE is 4096 and we are not even sending half of that, so I think we are good.

    I guess my next question is whether it is OK to increase NRF_FSTORAGE_SD_QUEUE_SIZE a bit more to be safe. Any adverse effects of doing that?

  • Don't see why not, unless Nordic team have input on this. All examples in all SDK releases up to 17.1 use only 4, not sure why.

    set OutputFile=f.txt
    for /R %%f in (*.c *.h *.asm *.s *.src *.eww *.ewp *.uvmpw *.uvproj *.xml *.emProject) DO type %%f | find /I "NRF_FSTORAGE_SD_QUEUE_SIZE" >> %OutputFile%

    #define DFU_FLASH_OPERATION_OP_ENTRIES NRF_FSTORAGE_SD_QUEUE_SIZE
    NRF_ATFIFO_DEF(m_fifo, nrf_fstorage_sd_op_t, NRF_FSTORAGE_SD_QUEUE_SIZE);
    // <o> NRF_FSTORAGE_SD_QUEUE_SIZE - Size of the internal queue of operations 
    #ifndef NRF_FSTORAGE_SD_QUEUE_SIZE
    #define NRF_FSTORAGE_SD_QUEUE_SIZE 4

    The only use I see anywhere of NRF_FSTORAGE_SD_QUEUE_SIZE in the entire SDK is in NRF_ATFIFO_DEF

  • I changed it and so far, so good. I will update if it happens again. I think there was just some kind of overload in the queue and the flash operations didn't execute all the way.
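One way to picture the overload described above is a toy model of a fixed-size operation queue. This is not the SDK's implementation (which, per the grep output earlier in the thread, is built on NRF_ATFIFO_DEF); it only assumes that a full queue rejects new operations. All names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SIZE 4 /* same value as the default NRF_FSTORAGE_SD_QUEUE_SIZE */

typedef enum { OP_ERASE, OP_WRITE } op_type_t;

typedef struct {
    op_type_t ops[QUEUE_SIZE];
    size_t    count; /* number of operations still pending */
} op_queue_t;

/* Queue one operation; returns false when the queue is already full. */
static bool op_queue_push(op_queue_t *q, op_type_t op)
{
    if (q->count >= QUEUE_SIZE) {
        return false;
    }
    q->ops[q->count++] = op;
    return true;
}

/* Request an erase followed by a write and report which were accepted. */
static void request_update(op_queue_t *q, bool *erase_queued, bool *write_queued)
{
    *erase_queued = op_queue_push(q, OP_ERASE);
    *write_queued = *erase_queued && op_queue_push(q, OP_WRITE);
}
```

With three operations already pending, `request_update` gets the erase accepted and the write rejected, which would leave the page reading 0xFF once the queue drains. A larger queue makes that less likely but does not rule it out, so checking the write's return code and reading back still seems worthwhile.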
