ZMS Data Loss — ATE ID 0x1A80 Overwritten

Environment: nRF54 SDK v3.0.0, Zephyr Memory Storage (ZMS)
Config: 4 sectors × 4 KB = 16 KB partition
Record IDs involved: 0x1A80 (SRAP data, 253 bytes), 0x1A21 (pairing info, 408 bytes)
Observed behavior: Data is correct after power-on. After some runtime, 0x1A80 reads back corrupted data.
Flash Dump — Sector 0 ATE Entries (sector tail)
ATEs at the tail of the first 4 KB sector, listed from the sector end downward (ZMS allocates ATEs backward from the sector end, so entries at lower addresses are newer):
Address  Content (hex)                                                        Interpretation
-------  -------------------------------------------------------------------  -------------------------------
4080     54 2E FF FF FF FF FF FF 00 00 00 00 01 42 00 00                      Empty ATE (cycle=0x2E)
4064     67 2D 00 00 FF FF FF FF A0 0D 00 00 FF FF FF FF                      ZMS_HEAD_ID (cycle=0x2D, offset=3488)
4048     D0 2E FD 00 80 1A 00 00 00 00 00 00 00 00 00 00                      → ID=0x1A80, len=253, **offset=0**, cycle=0x2E
4032     C7 2E 04 00 05 1A 00 00 01 00 00 00 00 00 00 00                      → ID=0x1A05, len=4 (inline data)
4016     F1 2E 00 00 FF FF FF FF 00 01 00 00 FF FF FF FF                      ZMS_HEAD_ID (cycle=0x2E, **offset=256**)
4000     A1 2E 98 01 21 1A 00 00 00 00 00 00 00 00 00 00                      → ID=0x1A21, len=408, **offset=0**, cycle=0x2E
ATE structure (16 bytes): crc8(1) + cycle_cnt(1) + len(2) + id(4) + data/offset(8) — little-endian.
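
For reference when decoding the dump, the 16-byte ATE corresponds roughly to the following packed struct (field names follow the Zephyr ZMS sources; shown only as a decoding aid, not as the authoritative definition):

struct zms_ate {
    uint8_t  crc8;      /* CRC8 over the rest of the entry */
    uint8_t  cycle_cnt; /* sector cycle counter (0x2E for the current cycle here) */
    uint16_t len;       /* data length; data of 8 bytes or less is stored inline */
    uint32_t id;        /* record ID */
    union {
        uint8_t data[8];           /* inline data for small records */
        struct {
            uint32_t offset;       /* data offset within the sector */
            uint32_t data_crc;     /* data CRC (or metadata for special ATEs) */
        };
    };
} __packed;

For example, the entry at 4048 (D0 2E FD 00 80 1A 00 00 00 00 ...) decodes to crc8=0xD0, cycle_cnt=0x2E, len=0x00FD (253), id=0x00001A80, offset=0.
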
Observations
  1. Both 0x1A80 and 0x1A21 have offset=0 — both point to sector base address 0 as their data location. Since 0x1A21 (408 bytes) was written after 0x1A80 (253 bytes), 0x1A21's data physically overwrites 0x1A80's data at offset 0.
  2. The ZMS_HEAD_ID entry at address 4016 records offset=256; this is a GC-done marker written when data_wra was at position 256 within the sector. However, the subsequent 0x1A21 ATE at address 4000 records offset=0, meaning data_wra was reset from 256 back to 0 between these two writes.
  3. The ZMS_HEAD_ID at address 4064 has cycle=0x2D (different from the 0x2E cycle of all other entries); this is a residual entry from a previous sector cycle that was not fully erased.
  4. The second sector (lines 258-513 of the dump) has been GC'd/erased, yet a 0x1A80 ATE entry still exists there:
08 2D FD 00 80 1A 00 00 00 00 00 00 00 00 00 00    → ID=0x1A80, len=253, **offset=0**, cycle=0x2D
However, the data at the offset it points to (offset 0 of that sector) does not contain the real 0x1A80 payload; the data at that address belongs to a different record, confirming the data was overwritten there as well.
Question
Could you please help us understand what conditions in ZMS could cause the data_wra pointer to be reset back to 0 after a GC cycle has already advanced it (to 256 in this case), resulting in two different records pointing to the same data offset and overwriting each other? No power loss or system reset occurred during this failure window.
Parents
  • Hello,

    I suspect that this is patched in this commit:

    https://github.com/zephyrproject-rtos/zephyr/commit/15cbe9fd18e2a319811a3cd877f238543050da18

    Which is included in NCS v3.1.0.

    Do you have an application where I can reproduce the issue on an nRF54L15 DK? Alternatively, can you see if running your application in v3.1.0 fixes the issue, or if it is still present?

    Best regards,

    Edvin

  • Some more context: there was a bug where, if a partition was mounted again after its initial mount, it could lead to the kind of behavior you are seeing. But I am not 100% sure that is actually what you are seeing, which is why I would like to verify whether NCS v3.1.0 shows the same behavior or not.

    BR,
    Edvin

  • Hi,

    My colleague and I are working on a data corruption issue in ZMS together, and we now have a 100% reproducible test case.

    Setup: 3 sectors. Write ID=0x1A80 in the first sector, then keep writing other data entries. Manually monitor free space and trigger GC. After 2 GC cycles complete, power off and restart.

    After reboot, ID=0x1A80 has been correctly migrated to the active sector by GC — reading it returns the correct data. However, if we then write ONE more entry (any new data), the data at 0x1A80 gets corrupted/overwritten.

    We tested this specifically when the code path `gc_done_marker = true` is hit during recovery.

    Looking at the zms code, I noticed something in zms_recover_last_ate() that I think might be related:

    /* skip close and empty ATE */
    *addr -= 2 * fs->ate_size;

    The function is called during recovery with addr pointing to the close_ate position (sector_size - 2*ate_size). After subtracting 2*ate_size, the scan starts at sector_size - 4*ate_size. But this skips one valid ATE position at sector_size - 3*ate_size — right below the close_ate.
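
    To put concrete numbers on that, with the 4096-byte sectors and 16-byte ATEs from our dump:

    close_ate position      = 4096 - 2*16 = 4064
    scan start after "-= 2" = 4096 - 4*16 = 4032
    skipped ATE slot        = 4096 - 3*16 = 4048   (the slot right below the close_ate)

    which matches the addresses 4064/4048/4032 in the flash dump above.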

    Could this cause data_wra to be reconstructed incorrectly during recovery after a GC + power loss, so that the next data write ends up overwriting existing data? Would the correct fix be to change "-= 2" to "-= 1"?

    Would appreciate any insights.

    Thanks

  • It could be related, but I can't say 100% whether this will fix the issue. Can you try to run it in NCS v3.1.0 or later?

    BR,

    Edvin

  • Hi,

    I haven't tested this on NCS v3.1.0 yet; adopting it would require additional evaluation and testing of some related interfaces on our side.

    However, I compared the ZMS-related code between the currently used NCS v3.0.0 (which consistently reproduces the issue) and v3.1.0, and I did not observe any changes in this part.

    The test method described above can reproduce the problem 100% of the time, and after applying the proposed modification, the issue no longer occurs.
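
    For reference, the one-line modification we tested in zms_recover_last_ate() is simply:

    /* original */
    *addr -= 2 * fs->ate_size;

    /* what we tested instead */
    *addr -= 1 * fs->ate_size;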

    Could you please take a look and let me know if this fix looks reasonable?

    Best regards,

    Pang

  • Hello,

    I am not convinced that would be the proper fix. I am more afraid that there is a bug in the state machine, for example if a GC or recovery fails partway, or something like that. Perhaps changing the -2 to -1 masks away the error, but I would not say that it is a proper solution. Particularly since this is still the same value in the latest NCS release (v3.3.0), I wouldn't change that.

    Can you try to apply this commit:

    https://github.com/zephyrproject-rtos/zephyr/commit/a5f0c965c5efa9018d7e899bb7e2a12dd5d22bb1

    And see if the issue still persists? Remember to revert the -1 to -2 in this snippet:
    /* skip close and empty ATE */
    *addr -= 2 * fs->ate_size;

Reply Children
  • Hi,

    Thank you for the suggestion. We tried adding the ebw_required check as you recommended, but unfortunately the data corruption still occurs 100% of the time with our test case.

    Just to clarify, the current code in NCS v3.0.0 already includes the byte-by-byte erase_value check in the while loop (lines 1249-1268 in zms_init). So our test was already running with that logic in place.

    The core issue we're seeing is that after recovery with gc_done_marker = true, writing one more entry overwrites existing data. We've confirmed this is directly related to zms_recover_last_ate() starting its scan at the wrong position.

    Here's the layout we consistently see in the active sector after GC + power loss:

    - sector_end - ate_size: empty_ate
    - sector_end - 2*ate_size: close_ate
    - sector_end - 3*ate_size: valid data ATE ← MISSED by zms_recover_last_ate
    - sector_end - 4*ate_size: scan starts here

    The function receives addr pointing to the close_ate position (sector_end - 2*ate_size), then does "*addr -= 2 * fs->ate_size", which places the scan start at sector_end - 4*ate_size. This skips the valid ATE at sector_end - 3*ate_size every time.

    Here is the relevant code in zms_recover_last_ate():

    /* skip close ATE */
    /* on entry: *addr = sector_end - 2*ate_size (the close_ate position) */
    *addr -= 2 * fs->ate_size;
    /* after the subtraction: *addr = sector_end - 4*ate_size ← scan starts here */
    /* the slot at sector_end - 3*ate_size is never visited ← MISSED! */

    ate_end_addr = *addr;
    data_end_addr = *addr & ADDR_SECT_MASK;
    *data_wra = data_end_addr;

    while (ate_end_addr > data_end_addr) {
        /* scans downward from sector_end - 4*ate_size,
         * so it never reaches sector_end - 3*ate_size */
    }

    Here is the actual flash dump from the active sector after recovery with gc_done_marker = true:

    Address  Content (hex)                                     Interpretation
    -------  ------------------------------------------------  -------------------------------
    4080     54 2E FF FF FF FF FF FF 00 00 00 00 01 42 00 00   Empty ATE (cycle=0x2E)
    4064     67 2D 00 00 FF FF FF FF A0 0D 00 00 FF FF FF FF   close_ATE (cycle=0x2D, offset=3488)
    4048     D0 2E FD 00 80 1A 00 00 00 00 00 00 00 00 00 00   ID=0x1A80, len=253, offset=0, cycle=0x2E ← MISSED by scan
    4032     C7 2E 04 00 05 1A 00 00 01 00 00 00 00 00 00 00   ID=0x1A05, len=4 (inline), cycle=0x2E
    4016     F1 2E 00 00 FF FF FF FF 00 01 00 00 FF FF FF FF   GC_DONE (cycle=0x2E, offset=256)
    4000     A1 2E 98 01 21 1A 00 00 00 00 00 00 00 00 00 00   ID=0x1A21, len=408, offset=0, cycle=0x2E

    Because the ATE at 4048 is missed, zms_recover_last_ate starts at 4032 and only finds the ATEs below it. The last data ATE it finds (0x1A21 at 4000) has offset=0, so data_wra gets reconstructed as aligned_size(408), completely ignoring that 0x1A80's data at offset=0 with len=253 already occupies that region. When the next write happens, data_wra points to the wrong location and overwrites existing data.

    When we change "-= 2" to "-= 1", the scan correctly includes the ATE at 4048, and writing one more entry no longer corrupts data.

    Could you please take another look at this? I understand the code hasn't changed in v3.3.0, but our reproducible test case consistently shows the data ATE at sector_end - 3*ate_size being missed, leading to incorrect data_wra reconstruction and subsequent data overwrite. The dump above clearly shows this happening.

    Thanks again for your time.

  • I see. Thank you for clarifying. It may be that this is in fact a bug that has not yet been patched. However, I can't say without being able to reproduce the issue. Is it possible for you to share an application that reproduces it? You can strip out sensitive information if you prefer.

    Best regards,

    Edvin

  • Got it. Here is the test case I used to reproduce the issue:

    I wrote a function that directly calls our ZMS library APIs.

    Setup:

    • Mount ZMS with 4 sectors.
    • Erase all sector contents before starting.

    Steps to reproduce (a rough code sketch of this sequence follows the list):

    1. After mounting ZMS, write one entry with a length greater than 8 bytes.
    2. Trigger 3 GC cycles, then power off (to force a full remount on the next boot).
    3. After reboot, read the entry — the data is correct at this point.
    4. Write one more different entry.
    5. Result: the previously written entry is corrupted/overwritten.
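
    To make this concrete, here is a minimal sketch of the test sequence using the Zephyr ZMS API (zms_mount / zms_write / zms_read). The partition label, record IDs, buffer sizes and the fixed number of filler writes used here to force GC are illustrative stand-ins rather than our exact application code:

    #include <errno.h>
    #include <stdint.h>
    #include <string.h>
    #include <zephyr/fs/zms.h>
    #include <zephyr/storage/flash_map.h>

    #define VICTIM_ID 0x1A80u
    #define FILLER_ID 0x2000u
    #define EXTRA_ID  0x3000u

    static struct zms_fs fs;

    static int zms_setup(void)
    {
        const struct flash_area *fa;
        int rc = flash_area_open(FIXED_PARTITION_ID(storage_partition), &fa);

        if (rc) {
            return rc;
        }
        fs.flash_device = flash_area_get_device(fa);
        fs.offset = fa->fa_off;
        fs.sector_size = 4096;
        fs.sector_count = 4;
        return zms_mount(&fs);
    }

    /* First boot (partition erased beforehand): write the victim entry, force
     * several GC cycles with throw-away writes, then power off the board. */
    static int first_boot(void)
    {
        uint8_t victim[64], filler[128];
        int rc = zms_setup();

        if (rc) {
            return rc;
        }
        memset(victim, 0xA5, sizeof(victim));
        rc = zms_write(&fs, VICTIM_ID, victim, sizeof(victim)); /* > 8 bytes, so not inline */
        if (rc < 0) {
            return rc;
        }
        for (int i = 0; i < 200; i++) {          /* enough filler writes to cycle GC ~3 times */
            memset(filler, i, sizeof(filler));
            rc = zms_write(&fs, FILLER_ID, filler, sizeof(filler));
            if (rc < 0) {
                return rc;
            }
        }
        return 0; /* power off here */
    }

    /* Second boot: the victim reads back fine, but one more write corrupts it. */
    static int second_boot(void)
    {
        uint8_t victim[64], check[64], extra[32] = {0};
        int rc = zms_setup();

        if (rc) {
            return rc;
        }
        memset(victim, 0xA5, sizeof(victim));
        rc = zms_read(&fs, VICTIM_ID, check, sizeof(check));
        if (rc < 0 || memcmp(check, victim, sizeof(check)) != 0) {
            return -EIO;                         /* step 3: data should still be correct here */
        }
        rc = zms_write(&fs, EXTRA_ID, extra, sizeof(extra)); /* step 4: one more entry */
        if (rc < 0) {
            return rc;
        }
        rc = zms_read(&fs, VICTIM_ID, check, sizeof(check));
        if (rc < 0) {
            return rc;
        }
        return memcmp(check, victim, sizeof(check)) ? -EIO /* step 5: corrupted */ : 0;
    }

    In the real test we monitor free space and count the GC cycles explicitly instead of relying on a fixed number of filler writes, but the sequence of operations is the same.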

    I also ran our real application using the same workflow, and the issue reproduces reliably there as well.

    Best regards,

    Pang

  • Can you upload the application that you are using to reproduce the issue please?
