This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

app_sched_evt_put() & NRF_ERROR_NO_MEM

Hi,

I've stumbled upon a problem that I haven't been able to solve for an extended period of time. I've found several other posts on this forum describing something vaguely similar but none of the solutions apply in our case.

It's a bug with our firmware that I haven't been able to reproduce under test conditions.

We have a custom board based on nrf52832 with S132.

Every once in a while it will die with error code #4 (NRF_ERROR_NO_MEM) during a call to  app_sched_evt_put().

We have implemented a solution for tracking the errors in the wild using RAM retention and the reports show that this error occurs all over the place, namely almost every line in our code where there's a reference to app_sched_evt_put() has produced this error.

Running the scheduler with the profiler enabled shows that under normal operating conditions the maximum queue utilization reaches 5 or 6 scheduled events at most.

I highly doubt that the problem is the OP_QUEUE size as we have it defined at 40 which to me seems like it should be more than enough under normal circumstances.

My suspicion is that the problem lies elsewhere and the queue gets filled up with events queued in the timer & interrupt context and there is something preventing the code execution from reaching app_sched_execute() in the main while loop.

There's one particular behavior that I've observed which occurs after 6 (!) seconds of a particular event not bein executed.

The way i've been trying to debug this is to significantly lower the watchdog timeout - from 10 seconds (on our production devices) to 2 seconds (on our debug devices) in hopes of catching the offending piece of code (my hope is that it triggers a watchdog reset instead of a soft reset) which seems to hang for 6 (!) seconds on certain occasions. I've implemented a ram retention based logging system that reports the PC before a watchdog interrupt has occured but so far I ahven't been able to zero-in on the source of the bug.

I'm mostly interested in strategy suggestions for trying to debug something like this. Or perhaps some tools that can speed up this process.

I would try tackling the problem with a debugger & dissassembly but sometimes it can take days for the problem to manifest

  • The most naïve would be to pause execution and if most of the time is spent doing a particular task you have a good chance of getting lucky and pausing while the stack pointer is in the problematic region. (Perhaps there is a bad state that causes an eternal loop in a low priority interrupt, preventing the main loop from running so that the queue fills up, but higher priority interrupts still occur and fills up the queue. This is just a wild guess though.)

    This is quite close to what is happening. new events that are triggering the no_mem error originat in the timer module and gpiote interrupts. that's a start - to look for lower priority interrupt calls. I'll try that.

    Apparently the device is in a bad state but you continue to feed the watchdog. How do you feed the watchdog? From where? And what is the criteria for feeding it or not? Should that be adjusted? And if you adjust it and then see the watchdog trigger, perhaps that would also give a clue about where there problem is?

    This is also a good point I was under the impression that WDT only gets fed from the same main loop that calls app_sched_execute but seems like i was wrong (I've taken over the codebase a few months ago and it's never ceases to surprise me Upside down). I've found some other places where the watchdog gets fed - and those bits of code are gonna fall under scrutiny next.

    thanks for the debugger pausing suggestion too. I'll see if it yields any interesting results...

    Seems like I've got a lot of stuff to do... but i'll be back

    Thanks Einar

  • Looking at the priority system, I had this thought:

    Are there any downsides to implementing a timer that runs with elevated priority, say priority 3,

    that would occasionally fire and just call app_sched execute, to guarantee that events are cleared.

    obviosly nothing except for softdevice is running with higher priority. Gpiote has priority 3 at the moment (dubious) but surely even gpiote events would be cleared this way even if they are the one's responsible.

    everything else is running with priority 6. But my understanding is that main runs at an even lower priority...

    Is there anything inherently wrong with this approach? other than ofcourse it doesn't solve the underlying problem...

    also following your suggestion I tried pausing the debugger but under normal circumstances the PC  always points at  SEV which is what one would expect. This indicates to me that the offending piece of code doesn't normally take long to execute and only ends up in this state under specific conditions.

  • Hi,

    Main does not run in an interrupt at all, so it is the lowest possible priority.

    I am not sure about your approach. As you say this would not solve the underlying problem, you only hope to mask it. And worst case it could have side effects. The primary reason for using the scheduler in the SDK is to move the priority down, so that you process your event handling (the part that does not need rapid handling) in the main loop instead of in an interrupt. If you start to process this from an interrupt instead that could have some side effects, though I cannot say which without having good knowledge of your application. Generally I would say that if that is something you consider, then perhaps you could just as well remove the scheduler all together and process all interrupts immediately. But again, I do not know your app enough to say anything.

Related