One of our devices is bricked because fds_init returns FDS_ERR_NO_PAGES. After analyzing the flash, I found that all pages are marked as FDS_PAGE_DATA (there is no FDS_PAGE_SWAP). One of these pages is otherwise erased (it contains just the FDS_PAGE_TAG_MAGIC and FDS_PAGE_DATA header).
In our application, FDS is heavily used. During testing, we perform many power cycles. I do not have a clear reproduction path because we have seen this only once (with more than 200 devices online for a few months).
We are using SDK 14.2.0 with nRF52832 (custom board designs).
FDS flash dump can be found here: https://drive.google.com/drive/folders/1y_KOyIhVw9d-ZAAIc8SSa8k9vUOobDVz?usp=sharing
I see that there are some regions missing from your flashdump. Also, I see that the last data page is moved to a weird location, not page aligned. Do you have any idea of what is present in the flash in the area that is missing?
That is between 0x00073FFF and 0x0007544B. It should have been included "between" lines 1025 and 1026 in the flash dump. Especially interesting is what is present at 0x00074000, since this should be the start of a page.
How many FDS pages do you use in your application? If there are more than 5, then 0x00075000 is also interesting. Actually, for N flash pages, 0x0007(N+1)000 is interesting, that is, the start of each page. Also, as I mentioned, I don't understand why DEADC0DE (DEC0ADDE) is present in a non-page-aligned place. Do you have a record where the actual data is DEADC0DE?
I see the internal ticket that Jørgen created from your original post here on DevZone. There is still no conclusion. I also see from one of the other tickets that you created that you have a theory on why you may have a missing swap page. It is an interesting theory, but without seeing what is in the missing part of the flash dump, it is difficult to say whether this actually is the issue in your case.
Also, you say that the chip has been running for a few months. Do you have any estimation on how much and how often you write and erase your flash? Or in other words, how often you perform the Garbage Collection?
Sorry, this is probably my mistake. I sent you too much. We use 5 fds_pages. FLASH_FDS (rw) : ORIGIN = 0x0006F000, LENGTH = 0x5000. Address 0x74000 is the bootloader's data.
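To make the page boundaries explicit, here is a minimal sketch that derives them from the linker region above. The 4 kB page size is the nRF52832 flash page size; the macro and function names are made up for illustration and are not SDK names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define FDS_REGION_ORIGIN  0x0006F000u  /* ORIGIN of FLASH_FDS in the linker script */
#define FDS_REGION_LENGTH  0x00005000u  /* LENGTH of FLASH_FDS in the linker script */
#define FLASH_PAGE_SIZE    0x00001000u  /* nRF52832 flash page size (4 kB) */

/* Number of whole flash pages inside the FDS region. */
uint32_t fds_page_count(void)
{
    return FDS_REGION_LENGTH / FLASH_PAGE_SIZE;
}

/* Start address of page `index`; index must be below fds_page_count(). */
uint32_t fds_page_start(uint32_t index)
{
    assert(index < fds_page_count());
    return FDS_REGION_ORIGIN + index * FLASH_PAGE_SIZE;
}

/* First address past the FDS region (here: where the bootloader data starts). */
uint32_t fds_region_end(void)
{
    return FDS_REGION_ORIGIN + FDS_REGION_LENGTH;
}

/* Print the start address of every FDS page. */
void fds_print_pages(void)
{
    for (uint32_t i = 0; i < fds_page_count(); i++)
    {
        printf("FDS page %u: 0x%08X\n", (unsigned)i, (unsigned)fds_page_start(i));
    }
}
```

With these values the region holds 5 pages and ends at 0x00074000, which matches the statement that 0x74000 already belongs to the bootloader.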
I have uploaded decoded fds_dump https://drive.google.com/file/d/1R7gSYENEmZZ3l5Fyh95nZNbv7FaCW6TW/view?usp=sharing. Sorry for this mistake.
This issue is not connected with my other 2 tickets. One of them results in a corrupted Record ID and the other in FDS_PAGE_UNDEFINED. In both cases, I believe that I found the root cause and I need a review more than help.
In this case, all 5 pages are tagged FDS_PAGE_DATA. I do not know why it happened.
This corrupted node had exactly 7 runs of GC. Usually tested nodes have between 4-10 runs of GC.
I have been looking at this ticket a lot today, together with Terje, who is handling one of the other tickets.
I have looked through a lot of documentation, done some testing by erasing pages, writing the magic word, but no tag and so on, but I have not found a way to reproduce ending up with all data pages and no swap page, which is what you are facing with this node.
So I have to ask, have you made any changes to the FDS module before programming these devices? Any fixes that were implemented beforehand that can affect the FDS behavior in any way?
That being said, it looks like somehow, the swap page has been tagged as a data page. In this case it is the first FDS page, at 0007D000.
As for a suggested patch: if you end up with the NO_SWAP result from fds_init(), anything is better than just falling into APP_ERROR_CHECK(NO_SWAP);
As a workaround, you can go through your FDS pages and look for a page that contains the magic word and the data tag but no other data (all 0xFFFF). If you find one of these in the NO_SWAP state, it would be safe to assume that this page can be deleted. You can delete this page, then run fds_init() again. Alternatively, call NVIC_SystemReset();
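A minimal sketch of that scan, written so it can be run on a plain array standing in for flash. The tag values and the two-word page tag layout are what I believe fds_internal_defs.h uses (0xDEADC0DE matches the dump), but please verify them against your SDK. The function names are hypothetical, not part of FDS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values and layout; verify against fds_internal_defs.h in your SDK. */
#define PAGE_TAG_MAGIC   0xDEADC0DEu
#define PAGE_TAG_SWAP    0xF11E01FFu
#define PAGE_TAG_DATA    0xF11E01FEu
#define PAGE_TAG_WORDS   2u          /* word 0: magic, word 1: swap/data tag */
#define PAGE_SIZE_WORDS  1024u       /* 4 kB page / 4-byte words */
#define ERASED_WORD      0xFFFFFFFFu

/* True if the page is tagged as the swap page. */
bool page_is_swap(const uint32_t *page)
{
    assert(page != NULL);
    return page[0] == PAGE_TAG_MAGIC && page[1] == PAGE_TAG_SWAP;
}

/* True if the page has a data tag and every other word is still erased. */
bool page_is_blank_data(const uint32_t *page)
{
    assert(page != NULL);
    if (page[0] != PAGE_TAG_MAGIC || page[1] != PAGE_TAG_DATA)
    {
        return false;
    }
    for (size_t i = PAGE_TAG_WORDS; i < PAGE_SIZE_WORDS; i++)
    {
        if (page[i] != ERASED_WORD)
        {
            return false;
        }
    }
    return true;
}

/* Returns the index of the first blank data page if the region has no swap
 * page at all, or -1 if a swap page exists or no blank data page is found. */
int find_recoverable_page(const uint32_t *region, size_t page_count)
{
    int candidate = -1;
    for (size_t p = 0; p < page_count; p++)
    {
        const uint32_t *page = region + p * PAGE_SIZE_WORDS;
        if (page_is_swap(page))
        {
            return -1;  /* not the NO_SWAP state, leave everything alone */
        }
        if (candidate < 0 && page_is_blank_data(page))
        {
            candidate = (int)p;
        }
    }
    return candidate;
}
```

On the device, `region` would be the FDS start address cast to `const uint32_t *`, and the returned page would be erased (for example with nrf_nvmc_page_erase(), or sd_flash_page_erase() when the SoftDevice is enabled) before calling fds_init() again. If the function returns -1, this workaround does not apply.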
If the FDS starts up with one empty page and the rest data pages, it will turn the empty page into a swap page (I have tested this).
What would be interesting is to see what happens if you erase this first page (from 0x0007D000) and reboot the application. Do you get any other errors at this point? Are the rest of the records valid, or do you get any information saying parts of the rest of the FDS are corrupted?
Hi, I started to patch fds last month as a result of these tests. These devices are not affected by my changes. I am pretty sure that your solution will work as a workaround. Anyway, there is still a bug somewhere. I am afraid that it may manifest itself in a different way (for example, some operation on 0x0007D000 might be processed before the reboot). In such a case, this workaround will not work.
All records seem to be fine. Actually, I have erased this page using J-Link, and after a reset the device started to work.
I understand. However, I would say that anything is better than just passing the NO_SWAP into APP_ERROR_CHECK, which will perform a reset, run into the same check after reboot, and continue in that loop.
Of course, not knowing what caused the issue in the first place is a bit worrying. If it is caused by a power cut during a certain point of Garbage Collection (GC), then you will probably have the situation where you have a data page that is blank. (However, I don't see any specific reason why this would have happened due to a power cut during GC.)
But if you want a workaround, it needs to end up getting a new swap page. Maybe you can modify the GC function to copy all valid (not dirty/marked as deleted) records from e.g. the first data page to the current data page. That is, have a special GC function that empties all the records on the first data page. When finished, delete this page and mark it as a swap page.
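A rough sketch of the copy step of such a special GC, operating on RAM buffers rather than real flash. It assumes the record layout I believe FDS uses (a 3-word header whose first word holds the record key in the low 16 bits and the data length in words in the high 16 bits, with key 0x0000 marking a deleted record); verify this against fds_internal_defs.h. All names are hypothetical, and on the device the copy would have to go through fstorage instead of memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout; verify against fds_internal_defs.h in your SDK. */
#define PAGE_SIZE_WORDS   1024u        /* 4 kB page / 4-byte words */
#define PAGE_TAG_WORDS    2u           /* magic word + swap/data tag */
#define HEADER_WORDS      3u           /* key/length, file id/crc, record id */
#define ERASED_WORD       0xFFFFFFFFu
#define RECORD_KEY_DIRTY  0x0000u      /* key of a deleted record */

static uint16_t record_key(uint32_t w0) { return (uint16_t)(w0 & 0xFFFFu); }
static uint16_t record_len(uint32_t w0) { return (uint16_t)(w0 >> 16); }

/* Offset (in words) of the first free word after the last record in a page. */
size_t page_write_offset(const uint32_t *page)
{
    size_t off = PAGE_TAG_WORDS;
    while (off + HEADER_WORDS <= PAGE_SIZE_WORDS && page[off] != ERASED_WORD)
    {
        off += HEADER_WORDS + record_len(page[off]);
    }
    return off < PAGE_SIZE_WORDS ? off : PAGE_SIZE_WORDS;
}

/* Copies every valid record from src to the free space in dst.
 * Pass 0 only measures, pass 1 copies, so dst is untouched on failure.
 * Returns the number of records copied, or -1 if src looks corrupt or the
 * valid records do not fit into dst. */
int evacuate_page(const uint32_t *src, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    int copied = 0;
    for (int pass = 0; pass < 2; pass++)
    {
        size_t dst_off = page_write_offset(dst);
        size_t src_off = PAGE_TAG_WORDS;
        while (src_off + HEADER_WORDS <= PAGE_SIZE_WORDS && src[src_off] != ERASED_WORD)
        {
            size_t total = HEADER_WORDS + record_len(src[src_off]);
            if (src_off + total > PAGE_SIZE_WORDS)
            {
                return -1;  /* record length runs past the page end */
            }
            if (record_key(src[src_off]) != RECORD_KEY_DIRTY)
            {
                if (dst_off + total > PAGE_SIZE_WORDS)
                {
                    return -1;  /* not enough room in dst */
                }
                if (pass == 1)
                {
                    memcpy(&dst[dst_off], &src[src_off], total * sizeof(uint32_t));
                    copied++;
                }
                dst_off += total;
            }
            src_off += total;
        }
    }
    return copied;
}
```

Only if this returns a non-negative count would it be reasonable to erase the source page and let it become the swap page. On -1 the source page should be left untouched.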