ZMS Data Loss — ATE ID 0x1A80 Overwritten

Environment: nRF54 SDK v3.0.0, Zephyr Memory Storage (ZMS)
Config: 4 sectors × 4KB = 16 KB partition
Record IDs involved: 0x1A80 (SRAP data, 253 bytes), 0x1A21 (pairing info, 408 bytes)
Observed behavior: Data is correct after power-on. After some runtime, 0x1A80 reads back corrupted data.
Flash Dump — Sector 0 ATE Entries (sector tail)
ATEs at the end of the first 4KB sector, listed by descending address. Because ZMS writes ATEs backward from the sector end, this order runs from oldest to newest:
Address  Content (hex)                                                        Interpretation
-------  -------------------------------------------------------------------  -------------------------------
4080     54 2E FF FF FF FF FF FF 00 00 00 00 01 42 00 00                      Empty ATE (cycle=0x2E)
4064     67 2D 00 00 FF FF FF FF A0 0D 00 00 FF FF FF FF                      ZMS_HEAD_ID (cycle=0x2D, offset=3488)
4048     D0 2E FD 00 80 1A 00 00 00 00 00 00 00 00 00 00                      → ID=0x1A80, len=253, **offset=0**, cycle=0x2E
4032     C7 2E 04 00 05 1A 00 00 01 00 00 00 00 00 00 00                      → ID=0x1A05, len=4 (inline data)
4016     F1 2E 00 00 FF FF FF FF 00 01 00 00 FF FF FF FF                      ZMS_HEAD_ID (cycle=0x2E, **offset=256**)
4000     A1 2E 98 01 21 1A 00 00 00 00 00 00 00 00 00 00                      → ID=0x1A21, len=408, **offset=0**, cycle=0x2E
ATE structure (16 bytes): crc8(1) + cycle_cnt(1) + len(2) + id(4) + data/offset(8) — little-endian.
Observations
  1. Both 0x1A80 and 0x1A21 have offset=0 — both point to sector base address 0 as their data location. Since 0x1A21 (408 bytes) was written after 0x1A80 (253 bytes), 0x1A21's data physically overwrites 0x1A80's data at offset 0.
  2. The ZMS_HEAD_ID entry at offset 4016 records offset=256 — this is a GC-done marker written when data_wra was at position 256 within the sector. However, the subsequent 0x1A21 ATE at offset 4000 records offset=0, meaning data_wra was reset from 256 back to 0 between these two writes.
  3. The ZMS_HEAD_ID at offset 4064 has cycle=0x2D (different from the 0x2E cycle of all other entries) — this is a residual entry from a previous sector cycle that was not fully erased.
  4. The second sector (lines 258-513 of the dump) has been GC'd/erased, yet a 0x1A80 ATE entry still exists in it:
08 2D FD 00 80 1A 00 00 00 00 00 00 00 00 00 00    → ID=0x1A80, len=253, **offset=0**, cycle=0x2D
The actual data at the offset it points to (offset 0 of this sector) does not contain the real 0x1A80 payload — the data at that address belongs to a different record, confirming the data was overwritten there as well.
Question
Could you please help us understand what conditions in ZMS could cause the data_wra pointer to be reset back to 0 after a GC cycle has already advanced it (to 256 in this case), resulting in two different records pointing to the same data offset and overwriting each other? No power loss or system reset occurred during this failure window.
  • Hello,

    I suspect that this is patched in this commit:

    https://github.com/zephyrproject-rtos/zephyr/commit/15cbe9fd18e2a319811a3cd877f238543050da18

    Which is included in NCS v3.1.0.

    Do you have an application where I can reproduce the issue on an nRF54L15 DK? Alternatively, can you see if running your application in v3.1.0 fixes the issue, or if it is still present?

    Best regards,

    Edvin

  • Some more context: there was a bug where mounting a partition again after its initial mount could lead to the behavior you are seeing. I am not 100% sure that this is what you are hitting, though, which is why I would like to verify whether NCS v3.1.0 shows the same behavior.

    BR,
    Edvin

  • Hi,

    My colleague and I are working on a data corruption issue in ZMS together, and we now have a 100% reproducible test case.

    Setup: 3 sectors. Write ID=0x1A80 in the first sector, then keep writing other data entries. Manually monitor free space and trigger GC. After 2 GC cycles complete, power off and restart.

    After reboot, ID=0x1A80 has been correctly migrated to the active sector by GC — reading it returns the correct data. However, if we then write ONE more entry (any new data), the data at 0x1A80 gets corrupted/overwritten.

    We tested this specifically when the code path `gc_done_marker = true` is hit during recovery.

    Looking at the zms code, I noticed something in zms_recover_last_ate() that I think might be related:

    /* skip close and empty ATE */
    *addr -= 2 * fs->ate_size;

    The function is called during recovery with addr pointing to the close_ate position (sector_size - 2*ate_size). After subtracting 2*ate_size, the scan starts at sector_size - 4*ate_size. But this skips one valid ATE position at sector_size - 3*ate_size — right below the close_ate.

    Could this cause data_wra to be reconstructed incorrectly during recovery after GC followed by a power cycle, so that the next data write overwrites existing data? Would the correct fix be to change `*addr -= 2 * fs->ate_size;` to `*addr -= fs->ate_size;`?

    Would appreciate any insights.

    Thanks

  • It could be related, but I can't say 100% whether this will fix the issue. Can you try to run it in NCS v3.1.0 or later?

    BR,

    Edvin

  • Hi,

    I haven’t tested this on NCS v3.1.0 yet. Adopting the change may involve additional evaluation and testing of some related interfaces.

    However, I compared the ZMS-related code between the currently used NCS v3.0.0 (which consistently reproduces the issue) and v3.1.0, and I did not observe any changes in this part.

    The test method described above can reproduce the problem 100% of the time, and after applying the proposed modification, the issue no longer occurs.

    Could you please take a look and let me know if this fix looks reasonable?

    Best regards,

    Pang
