Hi,
I've stumbled upon a problem that I haven't been able to solve for an extended period of time. I've found several other posts on this forum describing something vaguely similar but none of the solutions apply in our case.
It's a bug with our firmware that I haven't been able to reproduce under test conditions.
We have a custom board based on the nRF52832 with the S132 SoftDevice.
Every once in a while it will die with error code #4 (NRF_ERROR_NO_MEM) during a call to app_sched_evt_put().
We track errors in the field using RAM retention, and the reports show that this error occurs all over the place: almost every line in our code that calls app_sched_evt_put() has produced it.
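To make the tracking approach concrete, here is a rough sketch of the kind of retained record I mean. It is an illustration rather than our actual firmware: the names (retained_error_t, retained_error_store, retained_error_take), the magic value and the file_id field are all made up for this post, and on the target the variable would have to live in a RAM region that the startup code does not clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RETAINED_MAGIC 0x52455431u /* hypothetical "record is valid" marker */

typedef struct {
    uint32_t magic;     /* RETAINED_MAGIC when the record holds data */
    uint32_t err_code;  /* e.g. 4 for NRF_ERROR_NO_MEM */
    uint32_t line;      /* __LINE__ of the failing call site */
    uint32_t file_id;   /* hypothetical per-file identifier */
    uint32_t checksum;  /* XOR of the four fields above */
} retained_error_t;

static_assert(sizeof(retained_error_t) == 5u * sizeof(uint32_t),
              "record layout must stay free of padding");

/* On target this would sit in a section the startup code does not clear
 * (for example a .noinit section). A plain static is used here so the
 * sketch also builds on a PC. */
static retained_error_t m_retained;

static uint32_t record_checksum(const retained_error_t *r)
{
    return r->magic ^ r->err_code ^ r->line ^ r->file_id;
}

/* Called from the error handler just before the reset. */
void retained_error_store(uint32_t err_code, uint32_t line, uint32_t file_id)
{
    m_retained.magic    = RETAINED_MAGIC;
    m_retained.err_code = err_code;
    m_retained.line     = line;
    m_retained.file_id  = file_id;
    m_retained.checksum = record_checksum(&m_retained);
}

/* Called once after boot. Copies the record to *out and returns true if a
 * valid one survived the reset; the record is invalidated either way. */
bool retained_error_take(retained_error_t *out)
{
    bool valid = (m_retained.magic == RETAINED_MAGIC) &&
                 (m_retained.checksum == record_checksum(&m_retained));
    if (valid) {
        *out = m_retained;
    }
    m_retained.magic = 0u; /* report each record only once */
    return valid;
}
```

The magic value plus checksum is there so that random RAM contents after a cold power-up are not mistaken for a real report.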
Running the scheduler with the profiler enabled shows that under normal operating conditions the maximum queue utilization reaches 5 or 6 scheduled events at most.
I highly doubt that the problem is the OP_QUEUE size as we have it defined at 40 which to me seems like it should be more than enough under normal circumstances.
My suspicion is that the problem lies elsewhere and the queue gets filled up with events queued in the timer & interrupt context and there is something preventing the code execution from reaching app_sched_execute() in the main while loop.
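One way I could test this suspicion is to measure how long the main loop goes without reaching app_sched_execute(). The sketch below is only an idea under assumptions: loop_monitor_t and its functions are hypothetical names, and TICK_MASK assumes the timestamps come from a 24-bit free-running counter such as the RTC, so adjust it if your tick source is wider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_MASK 0x00FFFFFFu /* assumes a 24-bit free-running tick counter */

static_assert((TICK_MASK & (TICK_MASK + 1u)) == 0u,
              "TICK_MASK must be of the form 2^n - 1");

/* On target these fields would be declared volatile, because they are
 * written in the main loop and read from interrupt context. */
typedef struct {
    uint32_t last_execute; /* tick of the most recent main-loop pass */
    uint32_t max_gap;      /* longest gap seen between two passes */
} loop_monitor_t;

/* Wrap-safe difference between two counter readings. */
uint32_t ticks_elapsed(uint32_t from, uint32_t to)
{
    return (to - from) & TICK_MASK;
}

/* Call in the main loop right before the scheduler is run. */
void loop_monitor_mark(loop_monitor_t *m, uint32_t now)
{
    uint32_t gap = ticks_elapsed(m->last_execute, now);
    if (gap > m->max_gap) {
        m->max_gap = gap;
    }
    m->last_execute = now;
}

/* Call from a periodic timer interrupt, or at the point where queuing an
 * event fails, to see whether the main loop has been starved. */
bool loop_monitor_stalled(const loop_monitor_t *m, uint32_t now, uint32_t limit)
{
    return ticks_elapsed(m->last_execute, now) > limit;
}
```

If max_gap (or the stalled flag) is stored alongside the NO_MEM report, it should show whether the queue overflowed because the main loop was blocked or because of a genuine burst of events.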
There's one particular behavior that I've observed which occurs after 6 (!) seconds of a particular event not being executed.
The way I've been trying to debug this is to significantly lower the watchdog timeout, from 10 seconds (on our production devices) to 2 seconds (on our debug devices), in the hope of catching the offending piece of code, which seems to hang for 6 (!) seconds on certain occasions. My hope is that this triggers a watchdog reset instead of a soft reset. I've implemented a RAM-retention-based logging system that reports the PC that was executing when the watchdog interrupt occurred, but so far I haven't been able to zero in on the source of the bug.
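For reference, the PC capture works roughly along these lines. This is a simplified sketch, not our exact code: wdt_snapshot_t and wdt_snapshot_capture are hypothetical names, and it assumes the watchdog interrupt handler is entered through a small assembly shim that passes in whichever stack pointer (MSP or PSP) was active when the exception was taken. On Cortex-M the hardware pushes R0-R3, R12, LR, PC and xPSR in that order, so the interrupted PC is the seventh word of the frame.

```c
#include <assert.h>
#include <stdint.h>

/* Word offsets inside the basic Cortex-M exception stack frame. */
enum {
    FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
    FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR,
    FRAME_WORDS
};

static_assert(FRAME_WORDS == 8, "basic exception frame is 8 words");

typedef struct {
    uint32_t pc; /* instruction that was interrupted */
    uint32_t lr; /* return address of the interrupted function */
} wdt_snapshot_t;

/* stacked_sp must point at the frame pushed on exception entry. */
void wdt_snapshot_capture(wdt_snapshot_t *snap, const uint32_t *stacked_sp)
{
    snap->pc = stacked_sp[FRAME_PC];
    snap->lr = stacked_sp[FRAME_LR];
}
```

Saving the stacked LR as well as the PC gives at least one level of call context. One caveat I'm aware of: the snapshot only shows what the watchdog interrupt preempted, so if the hang is inside an interrupt of equal or higher priority, the handler may never run before the reset.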
I'm mostly interested in strategy suggestions for trying to debug something like this. Or perhaps some tools that can speed up this process.
I would try tackling the problem with a debugger and disassembly, but sometimes it can take days for the problem to manifest.