bug: one shot app_timer misfires at 256 seconds

Question

I have a oneshot app_timer that I'm using as a software watchdog timer. Under normal operation is never supposed to go off. I'm loading the timer with a 15 second timeout and I reload it about every 2 seconds.

After 256 seconds the event handler is incorrectly getting called. Looking at a host time stamped UART log I see a reload of the timer ~2s before the timer expires.

To reload the timer I'm doing:

errCode = app_timer_stop(timeoutTimerID);
APP_ERROR_CHECK(errCode);

errCode = app_timer_start(timeoutTimerID, 15 * 32768, NULL);
APP_ERROR_CHECK(errCode);

I'm using a prescaler of 0, so the RTC timer rolls over every 512 second (2**24 / 32768). The problem isn't that the timer is rolling over, but something with how the app_timer code is handling expired timer detection.

If I recompile with a prescaler of 1, then the timer does not misfire until 512 seconds.

The code in question has 3 timers, a dummy repeating timer that expires after 2 ** 24 - 50, a one shot timer for blinking LEDs and my software watchdog timer.

Obviously increasing the prescaler is not a solution to this problem, it just delays the problem.

Additional info 2015/09/04:

So I tried a test similar to Aryan's test below. Instead of reloading the timer using another timer I used the advertising timeout (BLE_GAP_TIMEOUT_SRC_ADVERTISING) to trigger the start then stop of my app_timer based 'software watchdog'.

With the advertising timing out every 10s and stop/starting the app_timer with a 40s reload I did not see a spurious timeout at 256s (let it run for 2000s, multiple times).

However with the advertising timeout every 10s and reloading the app_timer with a 15s reload, it would reliably fail after 256s.

161813.275: RELOAD 491520 @ 6235427 190/9507
161823.275: RELOAD 491520 @ 6563174 200/9574
161833.276: RELOAD 491520 @ 6890921 210/9641
161843.277: RELOAD 491520 @ 7218668 220/9708
161853.278: RELOAD 491520 @ 7546415 230/9775
161903.279: RELOAD 491520 @ 7874162 240/9842
161913.280: RELOAD 491520 @ 8201909 250/9909
161923.281: now = 260/9979 last 260/9976                 <= Timer timed out
161923.282: TIMEOUT type 5 @ 260.3s, resetting

The "RELOAD" line is not printed until after the app_timer_stop()/app_timer_start(). The 'last' reload time is sampled right before the print, so the timeout is occurring right after the app_timer_start() is called.

The log message is in the form:

<host HHMMSS.ms>: RELOAD <ticks> @ <app_timer_cnt_get()> <seconds/32khz residual ticks>

This is very repeatable with the 10s reload, 15s interval.

If I just bump the prescaler from 0 to 1 (div 2) I get:

164113.082: RELOAD 245760 @ 7706171 470/11382
164123.082: RELOAD 245760 @ 7870045 480/11450
164133.084: RELOAD 245760 @ 8033919 490/11518
164143.084: RELOAD 245760 @ 8197793 500/11586
164153.085: RELOAD 245760 @ 8361667 510/11654
164203.086: now = 520/11728 last 520/11722
164203.088: TIMEOUT type 5 @ 520.3s, resetting

This app_timer based watchdog timer code is inside of my custom DFU bootloader. Under normal operation it never takes 256 seconds to upgrade the firmware. However, I recently added the ability to use the DFU interface to upgrade a 2MB external SPI flash. Uploading 2MB over BLE takes a bloody long time, significantly longer then 256 seconds.

Even more info:

Set a breakpoint before after the timer loop decides that the timer should expire:

Breakpoint 6, timer_timeouts_check () at sdk/components/libraries/timer/app_timer.c:402
402	            ticks_elapsed -= p_timer->ticks_to_expire;
(gdb) print *p_timer
$4 = {state = STATE_ALLOCATED, mode = APP_TIMER_MODE_SINGLE_SHOT, is_running = true, ticks_to_expire = 0, 
  ticks_at_start = 0, ticks_first_interval = 0, ticks_periodic_interval = 0, 
  p_timeout_handler = 0x3b271 <dfuTimerHandler>, p_context = 0x0, next = 0}
(gdb) print ticks_elapsed 
$5 = 8529712
(gdb) print ticks_expired 
$6 = 0

I don't understand what is going on inside the timer code.

I'm using the RTC1 for tracking real time as well, rather then modify the app_timer code to not turn off the RTC, I just set up a dummy timer with a very large timeout.

I was using a timeout of ((1 << RTC_TICKS_BITS) - APP_TIMER_MIN_TIMEOUT_TICKS * 10).

Thinking this was somehow related (not sure why one timer would effect the behavior of another), I tried reducing this down to (1 << (RTC_TICKS_BITS-1)) and I still saw this spurious. However when I set it down to (1 << (RTC_TICKS_BITS-3)) the problem seems to have gone away or I just moved the timing around enough to not trigger the bug.

My problem now is I don't trust the app_timer code. For something as critical as timer handling, the app_timer code is far to complex for its own good. The app_timer really needs a rewrite by someone that has fully accepted KISS.

bug: one shot app_timer misfires at 256 seconds

Top Replies