This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

bug: one shot app_timer misfires at 256 seconds

I have a oneshot app_timer that I'm using as a software watchdog timer. Under normal operation is never supposed to go off. I'm loading the timer with a 15 second timeout and I reload it about every 2 seconds.

After 256 seconds the event handler is incorrectly getting called. Looking at a host time stamped UART log I see a reload of the timer ~2s before the timer expires.

To reload the timer I'm doing:

errCode = app_timer_stop(timeoutTimerID);
APP_ERROR_CHECK(errCode);

errCode = app_timer_start(timeoutTimerID, 15 * 32768, NULL);
APP_ERROR_CHECK(errCode);

I'm using a prescaler of 0, so the RTC timer rolls over every 512 second (2**24 / 32768). The problem isn't that the timer is rolling over, but something with how the app_timer code is handling expired timer detection.

If I recompile with a prescaler of 1, then the timer does not misfire until 512 seconds.

The code in question has 3 timers, a dummy repeating timer that expires after 2 ** 24 - 50, a one shot timer for blinking LEDs and my software watchdog timer.

Obviously increasing the prescaler is not a solution to this problem, it just delays the problem.


Additional info 2015/09/04:

So I tried a test similar to Aryan's test below. Instead of reloading the timer using another timer I used the advertising timeout (BLE_GAP_TIMEOUT_SRC_ADVERTISING) to trigger the start then stop of my app_timer based 'software watchdog'.

With the advertising timing out every 10s and stop/starting the app_timer with a 40s reload I did not see a spurious timeout at 256s (let it run for 2000s, multiple times).

However with the advertising timeout every 10s and reloading the app_timer with a 15s reload, it would reliably fail after 256s.

161813.275: RELOAD 491520 @ 6235427 190/9507
161823.275: RELOAD 491520 @ 6563174 200/9574
161833.276: RELOAD 491520 @ 6890921 210/9641
161843.277: RELOAD 491520 @ 7218668 220/9708
161853.278: RELOAD 491520 @ 7546415 230/9775
161903.279: RELOAD 491520 @ 7874162 240/9842
161913.280: RELOAD 491520 @ 8201909 250/9909
161923.281: now = 260/9979 last 260/9976                 <= Timer timed out
161923.282: TIMEOUT type 5 @ 260.3s, resetting

The "RELOAD" line is not printed until after the app_timer_stop()/app_timer_start(). The 'last' reload time is sampled right before the print, so the timeout is occurring right after the app_timer_start() is called.

The log message is in the form:

<host HHMMSS.ms>: RELOAD <ticks> @ <app_timer_cnt_get()> <seconds/32khz residual ticks>

This is very repeatable with the 10s reload, 15s interval.

If I just bump the prescaler from 0 to 1 (div 2) I get:

164113.082: RELOAD 245760 @ 7706171 470/11382
164123.082: RELOAD 245760 @ 7870045 480/11450
164133.084: RELOAD 245760 @ 8033919 490/11518
164143.084: RELOAD 245760 @ 8197793 500/11586
164153.085: RELOAD 245760 @ 8361667 510/11654
164203.086: now = 520/11728 last 520/11722
164203.088: TIMEOUT type 5 @ 520.3s, resetting

This app_timer based watchdog timer code is inside of my custom DFU bootloader. Under normal operation it never takes 256 seconds to upgrade the firmware. However, I recently added the ability to use the DFU interface to upgrade a 2MB external SPI flash. Uploading 2MB over BLE takes a bloody long time, significantly longer then 256 seconds.


Even more info:

Set a breakpoint before after the timer loop decides that the timer should expire:

Breakpoint 6, timer_timeouts_check () at sdk/components/libraries/timer/app_timer.c:402
402	            ticks_elapsed -= p_timer->ticks_to_expire;
(gdb) print *p_timer
$4 = {state = STATE_ALLOCATED, mode = APP_TIMER_MODE_SINGLE_SHOT, is_running = true, ticks_to_expire = 0, 
  ticks_at_start = 0, ticks_first_interval = 0, ticks_periodic_interval = 0, 
  p_timeout_handler = 0x3b271 <dfuTimerHandler>, p_context = 0x0, next = 0}
(gdb) print ticks_elapsed 
$5 = 8529712
(gdb) print ticks_expired 
$6 = 0

I don't understand what is going on inside the timer code.

I'm using the RTC1 for tracking real time as well, rather then modify the app_timer code to not turn off the RTC, I just set up a dummy timer with a very large timeout.

I was using a timeout of ((1 << RTC_TICKS_BITS) - APP_TIMER_MIN_TIMEOUT_TICKS * 10).

Thinking this was somehow related (not sure why one timer would effect the behavior of another), I tried reducing this down to (1 << (RTC_TICKS_BITS-1)) and I still saw this spurious. However when I set it down to (1 << (RTC_TICKS_BITS-3)) the problem seems to have gone away or I just moved the timing around enough to not trigger the bug.

My problem now is I don't trust the app_timer code. For something as critical as timer handling, the app_timer code is far to complex for its own good. The app_timer really needs a rewrite by someone that has fully accepted KISS.

Parents
  • No risk in that solution - if the compare triggers before the register set to far distant future and you also pend it manually, it doesn't matter, it's pended and the RTC1 IRQ hander will fire once. The case your nice diagram shows you need to guard against is having the compare interrupt fire during the RTC1 IRQ handler which causes it to re-raise the exception and run the RTC1 handler twice. Setting the counter far ahead before the IRQ routine starts avoids that.

    Both the NVIC cases are the same and are documented in the ARMV6-M manual (B1.5.4). If there is more than one pending of the same priority, the lower numbered one takes precedence. The NVIC has no way of telling which exception pended first, it only knows what's currently pending, so if it exits the RTC1 handler and finds both SWI0 and RTC1 pended, it will do RTC1 again.

Reply
  • No risk in that solution - if the compare triggers before the register set to far distant future and you also pend it manually, it doesn't matter, it's pended and the RTC1 IRQ hander will fire once. The case your nice diagram shows you need to guard against is having the compare interrupt fire during the RTC1 IRQ handler which causes it to re-raise the exception and run the RTC1 handler twice. Setting the counter far ahead before the IRQ routine starts avoids that.

    Both the NVIC cases are the same and are documented in the ARMV6-M manual (B1.5.4). If there is more than one pending of the same priority, the lower numbered one takes precedence. The NVIC has no way of telling which exception pended first, it only knows what's currently pending, so if it exits the RTC1 handler and finds both SWI0 and RTC1 pended, it will do RTC1 again.

Children
No Data
Related