littlefs directory disappears after ~20-30 hours of use

Hi,

We are working on an nRF52840-based device with an external MX25R16 flash (connected via SPI). The flash has three partitions: secondary-image, mcu-scratch-partition, and application-data-partition with littlefs.

Our device is based on Zephyr 2.4 (we know that this is quite old).

Recently we added a flash-file-based event-logging feature to our application, which logs various events to a growing number of event-log files (all located in one log subdirectory).

These event-log files are published to a cloud service one by one about twice a day, and each file is deleted after it has been published successfully.

The implementation seems to work perfectly in a debug build with various LOG_INF() calls.

In a release build, however, we observe that about 40% of our devices stop sending event-log files after running for about 20-30 hours, even though everything worked fine between the reboot and the first occurrence of the problem.

When digging into the problem, we found that none of the affected devices has the logging subdirectory anymore. This subdirectory is created on demand during app initialization after reboot.
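
A runtime check along the following lines could help narrow down when the directory goes missing, because it distinguishes "already present" from "had to be re-created". This is only a minimal host-side sketch using POSIX calls so that it can be tried on a PC; on Zephyr the corresponding calls would be fs_stat() and fs_mkdir(). The names ensure_log_dir and dir_state are hypothetical and do not come from the application discussed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Result of the check, so the caller can report an unexpected re-creation. */
enum dir_state { DIR_ERROR = -1, DIR_PRESENT = 0, DIR_RECREATED = 1 };

/*
 * Make sure `path` exists as a directory.
 * DIR_PRESENT:   it was already there.
 * DIR_RECREATED: it was missing and has been created (worth counting/logging).
 * DIR_ERROR:     any other failure, or `path` exists but is not a directory.
 */
enum dir_state ensure_log_dir(const char *path)
{
    struct stat st;

    assert(path != NULL);

    if (stat(path, &st) == 0) {
        return S_ISDIR(st.st_mode) ? DIR_PRESENT : DIR_ERROR;
    }
    if (errno != ENOENT) {
        return DIR_ERROR;      /* I/O or other failure: do not mask it */
    }
    if (mkdir(path, 0755) != 0) {
        return DIR_ERROR;
    }
    return DIR_RECREATED;
}
```

Calling such a check before every log write, and counting DIR_RECREATED results after the first boot-time creation, would give a rough timestamp for the disappearance even in a release build.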

We reviewed our application very carefully and are sure that no line of application code could delete the logging subdirectory.

Is there any known filesystem / littlefs issue that might cause this kind of problem?

Any help or suggestion is highly appreciated!

Volker

  • Hi Volker,

    I've not found any reports of similar issues for littlefs. It's also strange that you are not seeing this with your debug build. As a start, I would suggest that you 'diff' the generated .config files from your release and debug builds to see if there are other differences that may be relevant, apart from the logger configurations (heap sizes, etc.). I've also asked internally if anyone has suggestions on how you can troubleshoot this.

    Searching the release notes for "littlefs" gave the following results:

    /zephyr/doc/releases$ grep -r "littlefs"
    release-notes-3.3.rst:- :github:`52886` - tests: subsys: fs: littlefs: filesystem.littlefs.default and filesystem.littlefs.custom fails
    release-notes-3.3.rst:* :github:`52602` - tests: subsys: settings: file_littlefs: system.settings.file_littlefs.raw fails
    release-notes-2.0.rst:* File Systems: Added support for littlefs
    release-notes-2.0.rst:* :github:`18664` - [Coverity CID :203416]Uninitialized variables in /home/aasthagr/zephyrproject-external-coverity-new/zephyrproject/modules/fs/littlefs/lfs.c
    release-notes-2.0.rst:* :github:`18663` - [Coverity CID :203413]Null pointer dereferences in /home/aasthagr/zephyrproject-external-coverity-new/zephyrproject/modules/fs/littlefs/lfs.c
    release-notes-2.0.rst:* :github:`18458` - [Coverity CID :203422]Memory - illegal accesses in /tests/subsys/fs/littlefs/src/testfs_util.c
    release-notes-2.0.rst:* :github:`18392` - [Coverity CID :203494]Integer handling issues in /subsys/fs/littlefs_fs.c
    release-notes-2.0.rst:* :github:`5529` - Explore Little File System (littlefs) support
    release-notes-2.4.rst:* CVE-2020-13599: Security problem with settings and littlefs
    release-notes-2.4.rst:* :github:`28540` - littlefs: MPU FAULT and failed to run
    release-notes-2.4.rst:* :github:`26279` - littlefs: Unable to erase external flash.
    release-notes-2.4.rst:* :github:`25728` - [Coverity CID :210050] Unchecked return value in tests/subsys/settings/littlefs/src/settings_setup_littlefs.c
    release-notes-2.4.rst:* :github:`24111` - drivers: flash: littlefs: add sync to flash API & update LittleFS to use it
    release-notes-2.4.rst:* :github:`22340` - Security problem with settings and littlefs
    release-notes-3.5.rst:  * Added support of mounting littlefs on the block device from the shell/fs.
    release-notes-2.7.rst:* :github:`38202` - mbedtls and littlefs on a STM32L4
    release-notes-2.7.rst:* :github:`38059` - automount configuration in nrf52840dk_nrf52840.overlay causes error: mount point already exists!! in subsys/fs/littlefs sample
    release-notes-2.7.rst:* :github:`36851` - FS logging backend assumes littlefs
    release-notes-2.7.rst:* :github:`32990` - FS/littlefs: it is possible to write to already deleted file
    release-notes-3.2.rst:* :github:`50033` - tests: subsys: fs: littlefs: filesystem.littlefs.custom fails to build
    release-notes-2.1.rst:* :github:`18341` - settings: test setting FS back-end using littlefs
    release-notes-2.3.rst:* :github:`24585` - How to read/write an big(>16K) file in littlefs shell sample on native posix board?
    release-notes-3.0.rst:* :github:`41395` - littlefs(external spi flash) + mcuboot can't get right mount area
    release-notes-3.0.rst:* :github:`36962` - littlefs: Too small heap for file cache (again).
    release-notes-2.5.rst:* :github:`32078` - build error with llvm: samples/subsys/fs/littlefs
    release-notes-2.5.rst:* :github:`31669` - [Coverity CID :215715] Unchecked return value in tests/subsys/fs/littlefs/src/testfs_mount_flags.c
    release-notes-2.5.rst:* :github:`31524` - littlefs: Too small heap for file cache.
    release-notes-2.5.rst:* :github:`28309` - Sample/subsys/fs/littlefs with board=nucleo_f429zi  don't work
    release-notes-2.2.rst:* :github:`8242` - File system (littlefs & FAT) examples
    release-notes-3.1.rst:* :github:`43020` - samples/subsys/fs/littlefs does not work with native_posix board on WSL2
    

    Best regards,

    Vidar

  • Hi Vidar,

    Thank you for your suggestions. The diff did not show anything relevant to this issue.

    I would assume that a race condition occurs at some point.

    In a debug build, the timing might be slightly different, which could be why the problem does not occur there.

    Regards

    Volker

  • Hi Volker,

    I see there is a symbol named LFS_THREADSAFE in the littlefs implementation, maybe it would be worth trying to enable that? 

    https://github.com/zephyrproject-rtos/littlefs/commit/00a9ba7826318408d280aafe5dc527a43b2c965d 

    Regards,

    Vidar

  • I searched through our SDK tree (includes the v3.5.99 tag of our Zephyr fork), but I do not see any Kconfig symbol which can be used to enable the thread safe implementation. 

    You may try to include a newer revision of littlefs by changing the revision number in the Zephyr manifest followed by a 'west update': https://github.com/nrfconnect/sdk-zephyr/blob/41095df79d11e081ea96d150fbe3dbd93f73af6c/west.yml#L272 

    Regards,

    Vidar

  • Hi Vidar,
    In the Zephyr 2.4 codebase I cannot find any LFS_THREADSAFE symbols.

    Therefore I assume that the littlefs implementation there is not thread-safe.

    So I wrapped all the application code that modifies the file system. All application code that performs file-create, file-delete, or open-write-close operations now runs within one thread. All open-write-close operations are done in a pseudo-atomic manner to avoid having two files open at the same time.

    Surprisingly, the problem with the disappearing directory still persists.

    Any more ideas?

    Regards,

    Volker

  • Hi Volker,

    Maybe you can try updating the littlefs revision in your west manifest. I'm not sure if this will require other integration changes to Zephyr.

    I have not received any suggestions for troubleshooting this yet, but I recommend that you also post this question on Discord (https://discord.com/invite/Ck7jw53nU2 ) to see if someone else has experienced this.

    Regards,

    Vidar

  • Hi Vidar,

    We are still struggling with this problem.

    I have modified our application code to protect ALL file-access operations with a lock (previously I had only locked the write/delete operations), but the problem still occurs from time to time. We still have trouble reproducing it.
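
    For readers following along, the locking pattern described above can be sketched roughly as below. This is a host-side illustration with a POSIX mutex and stdio, not the actual application code; on Zephyr the lock would typically be a k_mutex taken with k_mutex_lock() and released with k_mutex_unlock(). The names fs_lock, append_req, and append_line_locked are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* One lock shared by every code path that touches the file system. */
static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical request type: append one line to one file. */
struct append_req {
    const char *path;
    const char *line;
};

/* Open, write, and close in one go so no file stays open between calls. */
static int append_line(const struct append_req *req)
{
    FILE *f = fopen(req->path, "a");

    if (f == NULL) {
        return -1;
    }
    int rc = (fputs(req->line, f) >= 0) ? 0 : -1;
    if (fclose(f) != 0) {
        rc = -1;               /* a failed close can mean lost data */
    }
    return rc;
}

/* Same operation, but serialized against all other file-system users. */
int append_line_locked(const struct append_req *req)
{
    assert(req != NULL && req->path != NULL && req->line != NULL);

    if (pthread_mutex_lock(&fs_lock) != 0) {
        return -1;
    }
    int rc = append_line(req);
    pthread_mutex_unlock(&fs_lock);
    return rc;
}
```

    The pattern only helps if reads, directory iteration, and deletes take the same lock as writes; any path that bypasses it reintroduces the potential race.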

    In the meantime I have made an observation that might help to track down the problem...

    I used a different build configuration that does not use the external flash (connected via SPI) but instead uses the nRF52840 internal flash with an application storage partition.

    The application repeatedly writes to the same file with an increasing number of events. Each write completely replaces the file with the new, extended content, using a single open/write/truncate/close sequence.
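
    A replace-and-verify sequence of this kind might look like the sketch below, which checks every return value (including the one from close) and then reads the file back for comparison. It uses POSIX calls so it can be run on a host; on Zephyr the analogous calls would be fs_open(), fs_write(), fs_truncate(), fs_sync(), fs_close(), and fs_read(). The function names replace_file and file_matches are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite `path` with exactly `len` bytes of `data`. Returns 0 or -1. */
int replace_file(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        return -1;
    }

    const char *p = data;
    size_t left = len;
    while (left > 0) {                       /* handle short writes */
        ssize_t n = write(fd, p, left);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    /* Cut off any leftover tail from a previous, longer version. */
    if (ftruncate(fd, (off_t)len) != 0 || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd) == 0 ? 0 : -1;
}

/* Returns 1 if the file holds exactly `expected`, 0 if not, -1 on error. */
int file_matches(const char *path, const void *expected, size_t len)
{
    const char *exp = expected;
    char buf[64];
    size_t off = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0) {
            break;                           /* end of file */
        }
        if (off + (size_t)n > len || memcmp(buf, exp + off, (size_t)n) != 0) {
            close(fd);
            return 0;
        }
        off += (size_t)n;
    }
    close(fd);
    return off == len ? 1 : 0;
}
```

    Running the read-back right after each write would show whether stale content is already present immediately after close, or only appears later in the publishing path.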

    At some point this file is read and transmitted to a cloud service. This publishing sequence repeatedly iterates over all files in the event-logging directory: it reads the oldest file, publishes it to the cloud, and deletes it afterwards.

    With the internal-flash build configuration, the published file content does not contain all events that must have been written to the file by that point. It seems that an older version of the file is read. Alternatively, there might be some Zephyr-internal write-buffering queue that has not written all data to the file by the time the file-close operation returns, and the read operation may not be queued, leading to a read of outdated file content. This part of the problem seems to be reproducible. When the publishing is repeated some time later, the file with the missing events is published again with all the expected content.

    Do you know of any similar Zephyr / littlefs issues?

    Regards,

    Volker

  • The previously described observation with the nRF internal flash might have been caused by the very low number of blocks available in the internal partition.
    With Zephyr 2.4 it seems that no error code is returned when no more free blocks are available.

    In the meantime I have been able to run our application on Zephyr 3.2. This version gives a clear indication that there are no more free blocks in the internal flash partition.
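
    Independent of the Zephyr version, the number of free blocks can also be checked explicitly before writing. The sketch below shows the idea with the POSIX statvfs() call so that it runs on a host; Zephyr offers fs_statvfs() with a struct fs_statvfs that likewise carries an f_bfree field. The function name has_free_blocks and the threshold parameter are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/*
 * Returns 1 if the file system mounted at `mount_point` reports at least
 * `min_free_blocks` free blocks, 0 if it reports fewer, -1 on error.
 */
int has_free_blocks(const char *mount_point, unsigned long min_free_blocks)
{
    struct statvfs sv;

    assert(mount_point != NULL);

    if (statvfs(mount_point, &sv) != 0) {
        return -1;
    }
    return sv.f_bfree >= min_free_blocks ? 1 : 0;
}
```

    A check like this before each file replacement would turn a silent out-of-space condition into an explicit, reportable error, at the cost of one extra file-system query per write.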

    The next step is to test the application on Zephyr 3.2 on a larger number of test devices to see whether the directories disappear again.

  • Thank you for the update. These are interesting findings. Please let me know how it goes when testing your application with Zephyr 3.2. If that works, it may be worth reviewing the changelogs again.

  • In the meantime, 17 test devices have been running Zephyr 3.2 for several days without showing the problem. So it is quite likely that the problem was somewhere in littlefs or the surrounding Zephyr code.

    I tried to inspect the diff, but especially in the littlefs component there are so many differences that I was not able to track down the one that might have caused our problem.

    I cannot believe that we are the first and only ones having this problem ...

  • This sounds very promising, thanks for the update. Yes, it is surprising that we have not found any other reports of this issue. Maybe there is something with the timing in your app that makes it easier to reproduce.
