FDS_GC causes hard fault

I'm using SDK 16.0, SoftDevice S340 v6.1.1, on an nRF52833. No bootloader is currently loaded, though I normally use the secure bootloader with light modifications. The BLE and ANT stacks are disabled when this occurs, and the system is basically idle.

This is quite odd. This has been working fine up until this morning, and now I have not been able to solve it. As soon as fds_gc is called, I get a hard fault. I'm using fds with the sd backend. The writes/updates all work fine, but as soon as I call gc, fault.

The PC reported in app_error_weak is 0x5450, which is definitely in the softdevice. The id is 1, so soft device assert failed.

[I read a post that said the actual calling function was in the stack, SP + 0x14. I checked that, and the address is also in SD space: 0x2C331] 
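For reference, the SP + 0x14 lookup matches the layout of the Cortex-M exception stack frame, where the stacked LR sits at offset 0x14 and the stacked PC at 0x18. The following is a minimal sketch, assuming the basic eight-word frame (no FPU context) and that SP still points at the frame the core pushed on exception entry (MSP or PSP, depending on which stack was active). The type and function names are hypothetical and not part of the SDK.

```c
#include <stdint.h>

/* Registers pushed by a Cortex-M core on exception entry
 * (basic frame, no floating-point state). */
typedef struct {
    uint32_t r0;   /* SP + 0x00 */
    uint32_t r1;   /* SP + 0x04 */
    uint32_t r2;   /* SP + 0x08 */
    uint32_t r3;   /* SP + 0x0C */
    uint32_t r12;  /* SP + 0x10 */
    uint32_t lr;   /* SP + 0x14: return address of the interrupted function */
    uint32_t pc;   /* SP + 0x18: address that was executing when the fault hit */
    uint32_t xpsr; /* SP + 0x1C */
} stacked_frame_t;

/* Hypothetical helper: copy the eight stacked words into a named struct. */
stacked_frame_t frame_from_words(const uint32_t sp[8])
{
    stacked_frame_t f = { sp[0], sp[1], sp[2], sp[3],
                          sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* Hypothetical helper: the stacked LR carries the Thumb bit (bit 0),
 * so clear it before looking the address up in a map file. */
uint32_t caller_address(const stacked_frame_t *frame)
{
    return frame->lr & ~(uint32_t)1u;
}
```

An odd value such as 0x2C331 is consistent with a Thumb return address; with bit 0 cleared, the address to look up would be 0x2C330.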

Can anyone tell me what assert is at 0x5450 (or 0x2C331)? I've been using this setup for multiple projects, and I have no idea why this one just went south. I'm sure it is something dumb, but I'm beating my head against the wall. Just knowing which assert failed would probably give me the answer.

I'm not writing much data. I have two records: one for settings (40 bytes) and one for statistics (72 bytes), both properly aligned. The records initialize fine. When the first statistic is incremented, the application waits until the system is idle and then updates the record, which succeeds. In the FDS event handler, a successful update sets a flag indicating that GC is needed. When the system is next idle, it calls fds_gc. Instant hard fault.
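The deferred-GC pattern described here could be sketched roughly as follows. This is an illustration only, not the poster's code: the function names are hypothetical, and a function pointer stands in for fds_gc() so the snippet builds without the SDK headers. It assumes 0 means success, as NRF_SUCCESS does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for fds_gc(): returns 0 on success, nonzero otherwise. */
typedef uint32_t (*gc_fn_t)(void);

/* Set from the FDS event handler, consumed from the idle loop. */
static volatile bool m_gc_pending = false;

/* Call from the FDS event handler after a successful update. */
void gc_request(void)
{
    m_gc_pending = true;
}

bool gc_is_pending(void)
{
    return m_gc_pending;
}

/* Call from the idle loop. Returns true if GC was accepted.
 * On failure the flag stays set so the next idle pass retries. */
bool gc_run_if_pending(gc_fn_t start_gc)
{
    if (!m_gc_pending) {
        return false;
    }
    if (start_gc() != 0u) {
        return false;
    }
    m_gc_pending = false;
    return true;
}

/* Test stand-ins only: a GC call that succeeds and one that is refused. */
uint32_t fake_gc_ok(void)   { return 0u; }
uint32_t fake_gc_busy(void) { return 1u; }
```

On target, the idle loop would pass fds_gc itself instead of one of the stand-ins. Keeping the call out of the event callback does not by itself shorten the flash operation that GC triggers.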

Relevant SDK configs:

- virtual pages: 5

- virtual page size: 1024

- using sd backend

- fstorage max write size: 1024

- fstorage max retries: 16

- max users: 4 (only using 1 though)

Appreciate any insight!

  • OK, some quick corrections and updates. 

    First, the error was reported in the hard fault handler, not the app error handler. The forum wouldn't let me edit the original post.

    Second, I realized that I had the wrong soft device. I use a build script, and I originally saw the problem on a device after deploy with the build script. I then did a bunch of testing searching for the problem, using the wrong version of the soft device. The script was using v7.01, but I was trying to debug with 6.1.1. I just switched the debug back to v7.01, and it works.

    Then I ran the build script again, and it worked. I am so confused.

    I'm still curious whether anyone can tell me which assert failed in v6.1.1. I'm worried I have a race condition or something. We are close to going to production, and this is not the time to find this type of problem.

    I'm going to leave this open for now, hoping someone from Nordic wants to chime in.

  • Hi,

    Just to be clear, do you get an assert at 0x5450 or 0x2C331, and with which exact SoftDevice version and variant?

  • Sorry, I realize that was confusing.

    My build script built using S340 v7.0.2. That build failed as soon as I pressed a button. (Button presses are tracked in the statistics record and saved to flash.)

    When I connected a debugger and removed the bootloader (for easier debugging), I mistakenly loaded S340 v6.1.1.

    S340 v6.1.1 reported a PC of 0x5450 in the hard fault handler. I read a post that said the address of the actual calling function is at SP+0x14, and this value was consistently 0x2C331 at the breakpoint in the hard fault handler.

    Things I changed trying to fix it:

    - I was calling fds_gc from the update success event. I moved this to a flag that gets called only when the system is idle. It is no longer being called from the event callback.

    - I decreased the virtual pages to 5 from 10. I really only need 3.

    - I decreased the max write size from 4096 to 1024

    - I increased the max retries from 8 to 16

    None of these things was able to make S340 v6.1.1 happy. When I re-programmed with S340 v7.0.2, it worked.

    I would love to understand what was failing, if possible. Did I fix it? S340 v7.0.2 was failing until I did something, so what did I do that made the difference? I'd like to document it.

    Thank you.

  • Hi,

    The assert at 0x5450 in S340 6.1.1 is an "overstay event", which is typically caused by the application blocking the SoftDevice for too long, for instance (but not limited to) a flash operation taking longer than the time the SoftDevice scheduled for it. The assert does not give any information about what blocked the SoftDevice, though.

    Reducing the maximum write size could make sense if long flash writes are the problem.

  • Thank you for that info! At most, I'm only writing 112 bytes plus record overhead. I guess GC would be erasing a page. 

    Is there a reason that the soft device would assert instead of just returning FDS_ERR_TIMEOUT? It seems odd that it would crash the system. I get it, just wondering the best way to mitigate the issue. I have no problem dropping the max write size down to 256, and allowing it to use multiple cycles to complete the update/gc. Maybe the default should be way lower than 4096?

    Thank you for the discussion. I know I search these pages a lot, and this info will definitely help someone else.

  • brett_anderson said:
    Is there a reason that the soft device would assert instead of just returning FDS_ERR_TIMEOUT?

    The SoftDevice assert happens when it runs and sees that more time has elapsed than it scheduled (so typically it would have missed a BLE or ANT event, or another task that should already have been performed). If for instance the flash.

    brett_anderson said:
    I have no problem dropping the max write size down to 256, and allowing it to use multiple cycles to complete the update/gc. Maybe the default should be way lower than 4096?

    Yes, I believe that the value should be lower in case you write large chunks of data. This has been an issue before as well, where long flash operations would cause SoftDevice asserts. Perhaps you are also seeing the limitation that "Flash write operations may exceed the timeout provided when performed with certain protocol operations (e.g. ANT Continuous Scan)."?
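To make the trade-off concrete, here is a rough sketch of the arithmetic, assuming the backend splits one write into pieces no larger than the configured maximum write size. It is not the fstorage implementation, and the function names are hypothetical. A smaller maximum means more, shorter flash operations; a page erase such as the one GC is presumed to perform is not covered by this arithmetic.

```c
#include <stddef.h>

/* Number of flash operations needed to write `len` bytes when each
 * operation may carry at most `max_chunk` bytes (ceiling division). */
size_t chunk_count(size_t len, size_t max_chunk)
{
    if (max_chunk == 0) {
        return 0; /* invalid configuration: nothing can be written */
    }
    return (len + max_chunk - 1) / max_chunk;
}

/* Size in bytes of chunk number `index` (0-based); 0 if out of range. */
size_t chunk_size(size_t len, size_t max_chunk, size_t index)
{
    if (index >= chunk_count(len, max_chunk)) {
        return 0;
    }
    size_t offset = index * max_chunk;
    size_t remaining = len - offset;
    return (remaining < max_chunk) ? remaining : max_chunk;
}
```

With these helpers, the 112 bytes mentioned in the thread fit in a single operation at a 256-byte limit, while a 4096-byte write at a 1024-byte limit would take four.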
