This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

bug: one shot app_timer misfires at 256 seconds

I have a oneshot app_timer that I'm using as a software watchdog timer. Under normal operation is never supposed to go off. I'm loading the timer with a 15 second timeout and I reload it about every 2 seconds.

After 256 seconds the event handler is incorrectly getting called. Looking at a host time stamped UART log I see a reload of the timer ~2s before the timer expires.

To reload the timer I'm doing:

errCode = app_timer_stop(timeoutTimerID);
APP_ERROR_CHECK(errCode);

errCode = app_timer_start(timeoutTimerID, 15 * 32768, NULL);
APP_ERROR_CHECK(errCode);

I'm using a prescaler of 0, so the RTC timer rolls over every 512 second (2**24 / 32768). The problem isn't that the timer is rolling over, but something with how the app_timer code is handling expired timer detection.

If I recompile with a prescaler of 1, then the timer does not misfire until 512 seconds.

The code in question has 3 timers, a dummy repeating timer that expires after 2 ** 24 - 50, a one shot timer for blinking LEDs and my software watchdog timer.

Obviously increasing the prescaler is not a solution to this problem, it just delays the problem.


Additional info 2015/09/04:

So I tried a test similar to Aryan's test below. Instead of reloading the timer using another timer I used the advertising timeout (BLE_GAP_TIMEOUT_SRC_ADVERTISING) to trigger the start then stop of my app_timer based 'software watchdog'.

With the advertising timing out every 10s and stop/starting the app_timer with a 40s reload I did not see a spurious timeout at 256s (let it run for 2000s, multiple times).

However with the advertising timeout every 10s and reloading the app_timer with a 15s reload, it would reliably fail after 256s.

161813.275: RELOAD 491520 @ 6235427 190/9507
161823.275: RELOAD 491520 @ 6563174 200/9574
161833.276: RELOAD 491520 @ 6890921 210/9641
161843.277: RELOAD 491520 @ 7218668 220/9708
161853.278: RELOAD 491520 @ 7546415 230/9775
161903.279: RELOAD 491520 @ 7874162 240/9842
161913.280: RELOAD 491520 @ 8201909 250/9909
161923.281: now = 260/9979 last 260/9976                 <= Timer timed out
161923.282: TIMEOUT type 5 @ 260.3s, resetting

The "RELOAD" line is not printed until after the app_timer_stop()/app_timer_start(). The 'last' reload time is sampled right before the print, so the timeout is occurring right after the app_timer_start() is called.

The log message is in the form:

<host HHMMSS.ms>: RELOAD <ticks> @ <app_timer_cnt_get()> <seconds/32khz residual ticks>

This is very repeatable with the 10s reload, 15s interval.

If I just bump the prescaler from 0 to 1 (div 2) I get:

164113.082: RELOAD 245760 @ 7706171 470/11382
164123.082: RELOAD 245760 @ 7870045 480/11450
164133.084: RELOAD 245760 @ 8033919 490/11518
164143.084: RELOAD 245760 @ 8197793 500/11586
164153.085: RELOAD 245760 @ 8361667 510/11654
164203.086: now = 520/11728 last 520/11722
164203.088: TIMEOUT type 5 @ 520.3s, resetting

This app_timer based watchdog timer code is inside of my custom DFU bootloader. Under normal operation it never takes 256 seconds to upgrade the firmware. However, I recently added the ability to use the DFU interface to upgrade a 2MB external SPI flash. Uploading 2MB over BLE takes a bloody long time, significantly longer then 256 seconds.


Even more info:

Set a breakpoint before after the timer loop decides that the timer should expire:

Breakpoint 6, timer_timeouts_check () at sdk/components/libraries/timer/app_timer.c:402
402	            ticks_elapsed -= p_timer->ticks_to_expire;
(gdb) print *p_timer
$4 = {state = STATE_ALLOCATED, mode = APP_TIMER_MODE_SINGLE_SHOT, is_running = true, ticks_to_expire = 0, 
  ticks_at_start = 0, ticks_first_interval = 0, ticks_periodic_interval = 0, 
  p_timeout_handler = 0x3b271 <dfuTimerHandler>, p_context = 0x0, next = 0}
(gdb) print ticks_elapsed 
$5 = 8529712
(gdb) print ticks_expired 
$6 = 0

I don't understand what is going on inside the timer code.

I'm using the RTC1 for tracking real time as well, rather then modify the app_timer code to not turn off the RTC, I just set up a dummy timer with a very large timeout.

I was using a timeout of ((1 << RTC_TICKS_BITS) - APP_TIMER_MIN_TIMEOUT_TICKS * 10).

Thinking this was somehow related (not sure why one timer would effect the behavior of another), I tried reducing this down to (1 << (RTC_TICKS_BITS-1)) and I still saw this spurious. However when I set it down to (1 << (RTC_TICKS_BITS-3)) the problem seems to have gone away or I just moved the timing around enough to not trigger the bug.

My problem now is I don't trust the app_timer code. For something as critical as timer handling, the app_timer code is far to complex for its own good. The app_timer really needs a rewrite by someone that has fully accepted KISS.

  • Where's that quote from? I just searched the forums and can't find it, and just grepped 3 SDKs and can't find it there either. You'd think if there's a known race condition they'd have put the delay in the code.

  • This is from a support case a co-worker was dealing with Nordic directly on. I should note we are using SDK 6.1 not SDK8.0. I'm not sure if the race-condition is on-going thru SDK8 or if it was fixed another way. In case it persisted I wanted him to try it out. Would be good know, I'm migrating to SDK8 soon ;)

  • That's why I couldn't find it then :). Diffing 6.x and SDK9 the app_timer code is almost entirely unchanged so if it had a bug then, it has a bug now.

    I have some difficulty believing this really is a race condition, despite the quote. It shouldn't matter how often you bang the timeout code, if there's nothing to timeout, there's nothing to timeout. All the timeouts are done in the IRQ handler so even reschedules during timeouts aren't re-entrant and the queue updating interrupt has the same priority as the timeout one, so they are mutually exclusive. The only way something times out is if the first ticks count on the list is smaller than the elapsed ticks. In Clem's case 'first ticks' should always be quite large, which makes me wonder if elapsed ticks is being incorrectly calculated at exact half-distance on the timer .. but I don't see that either.

    We need some repro code

  • lots of oddities in that debug. ticks_elapsed, it's huge, it should be tiny. ticks_to_expire shouldn't be zero. For ticks_elapsed to be that large it would seem that m_ticks_latest would have to be zero but the counter not reset.

    the timer code isn't that complicated, it works out how many ticks to expiry of a timer you're inserting and keeps a list, then knocks ticks off the first entry in the list until it expires. But having dynamic lists of timers which can be single-shot, or repeating and added and removed at random times is not trivial.

    You have 3 choices really. Keep debugging it until you do understand what's going on, send the code to Aryan who's offered to debug it for you, or write your own timer implementation which is simpler and only covers the cases you need, not generic timers.

    You have a totally reproducible test case, finding the issue now is just debugging.

  • There does seem to be bad interaction between timers with a very long timeout and timers that stop/start, resulting in miss-fires.

    For now I have reduced the timeout on my very long keepalive timer and the problem has gone away. The keepalive timer is only needed to prevent app_timer from ever stopping the RTC and breaking my wall clock time tracking.

    I'm not saying that app_timer is complicated, I'm just saying it is significantly more complicated then it really needs to be for what it is doing. I did site down and write a generic priority queue timer routines, with single shot and repeated timers, timeouts >2**24 ticks and real time tracking (for use with CTS). The problem is I don't have the time/energy/desire to go through the effort to test it enough to actually use in production code. Maybe if I end up doing a project with the nrf52, I'll go through the effort to test the code.

    Sadly, I don't think my test case would be of any use to Aryan because it is fairly tightly coupled to my hardware and build environment. To be of any use I'd have to cull it down into something minimal that shows off the problem. However, now that I think I know the origin of the problem (the long timeout timer), Aryan might be able to modify his test case to get it to show the problem.

Related