This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

bug: one shot app_timer misfires at 256 seconds

I have a oneshot app_timer that I'm using as a software watchdog timer. Under normal operation is never supposed to go off. I'm loading the timer with a 15 second timeout and I reload it about every 2 seconds.

After 256 seconds the event handler is incorrectly getting called. Looking at a host time stamped UART log I see a reload of the timer ~2s before the timer expires.

To reload the timer I'm doing:

errCode = app_timer_stop(timeoutTimerID);
APP_ERROR_CHECK(errCode);

errCode = app_timer_start(timeoutTimerID, 15 * 32768, NULL);
APP_ERROR_CHECK(errCode);

I'm using a prescaler of 0, so the RTC timer rolls over every 512 second (2**24 / 32768). The problem isn't that the timer is rolling over, but something with how the app_timer code is handling expired timer detection.

If I recompile with a prescaler of 1, then the timer does not misfire until 512 seconds.

The code in question has 3 timers, a dummy repeating timer that expires after 2 ** 24 - 50, a one shot timer for blinking LEDs and my software watchdog timer.

Obviously increasing the prescaler is not a solution to this problem, it just delays the problem.


Additional info 2015/09/04:

So I tried a test similar to Aryan's test below. Instead of reloading the timer using another timer I used the advertising timeout (BLE_GAP_TIMEOUT_SRC_ADVERTISING) to trigger the start then stop of my app_timer based 'software watchdog'.

With the advertising timing out every 10s and stop/starting the app_timer with a 40s reload I did not see a spurious timeout at 256s (let it run for 2000s, multiple times).

However with the advertising timeout every 10s and reloading the app_timer with a 15s reload, it would reliably fail after 256s.

161813.275: RELOAD 491520 @ 6235427 190/9507
161823.275: RELOAD 491520 @ 6563174 200/9574
161833.276: RELOAD 491520 @ 6890921 210/9641
161843.277: RELOAD 491520 @ 7218668 220/9708
161853.278: RELOAD 491520 @ 7546415 230/9775
161903.279: RELOAD 491520 @ 7874162 240/9842
161913.280: RELOAD 491520 @ 8201909 250/9909
161923.281: now = 260/9979 last 260/9976                 <= Timer timed out
161923.282: TIMEOUT type 5 @ 260.3s, resetting

The "RELOAD" line is not printed until after the app_timer_stop()/app_timer_start(). The 'last' reload time is sampled right before the print, so the timeout is occurring right after the app_timer_start() is called.

The log message is in the form:

<host HHMMSS.ms>: RELOAD <ticks> @ <app_timer_cnt_get()> <seconds/32khz residual ticks>

This is very repeatable with the 10s reload, 15s interval.

If I just bump the prescaler from 0 to 1 (div 2) I get:

164113.082: RELOAD 245760 @ 7706171 470/11382
164123.082: RELOAD 245760 @ 7870045 480/11450
164133.084: RELOAD 245760 @ 8033919 490/11518
164143.084: RELOAD 245760 @ 8197793 500/11586
164153.085: RELOAD 245760 @ 8361667 510/11654
164203.086: now = 520/11728 last 520/11722
164203.088: TIMEOUT type 5 @ 520.3s, resetting

This app_timer based watchdog timer code is inside of my custom DFU bootloader. Under normal operation it never takes 256 seconds to upgrade the firmware. However, I recently added the ability to use the DFU interface to upgrade a 2MB external SPI flash. Uploading 2MB over BLE takes a bloody long time, significantly longer then 256 seconds.


Even more info:

Set a breakpoint before after the timer loop decides that the timer should expire:

Breakpoint 6, timer_timeouts_check () at sdk/components/libraries/timer/app_timer.c:402
402	            ticks_elapsed -= p_timer->ticks_to_expire;
(gdb) print *p_timer
$4 = {state = STATE_ALLOCATED, mode = APP_TIMER_MODE_SINGLE_SHOT, is_running = true, ticks_to_expire = 0, 
  ticks_at_start = 0, ticks_first_interval = 0, ticks_periodic_interval = 0, 
  p_timeout_handler = 0x3b271 <dfuTimerHandler>, p_context = 0x0, next = 0}
(gdb) print ticks_elapsed 
$5 = 8529712
(gdb) print ticks_expired 
$6 = 0

I don't understand what is going on inside the timer code.

I'm using the RTC1 for tracking real time as well, rather then modify the app_timer code to not turn off the RTC, I just set up a dummy timer with a very large timeout.

I was using a timeout of ((1 << RTC_TICKS_BITS) - APP_TIMER_MIN_TIMEOUT_TICKS * 10).

Thinking this was somehow related (not sure why one timer would effect the behavior of another), I tried reducing this down to (1 << (RTC_TICKS_BITS-1)) and I still saw this spurious. However when I set it down to (1 << (RTC_TICKS_BITS-3)) the problem seems to have gone away or I just moved the timing around enough to not trigger the bug.

My problem now is I don't trust the app_timer code. For something as critical as timer handling, the app_timer code is far to complex for its own good. The app_timer really needs a rewrite by someone that has fully accepted KISS.

  • Dave, I am starting to wonder if the case you mentioning is related to this. The problem was that the handler was called twice, Just one after the other. that is a very clear corner case that we have identified and the proposed solution of delaying 100us is only done when that corner case is triggered. In normal conditions that delay is not added.

    The quick fix that seems to work is to add a delay of 100us after timer_timeouts_check_sched() in compare_reg_update() to ensure that since the the interrupt is already pending, make sure that any potential CC event also occur before leaving SWI0_IRQHandler(). 100us is used since RTC_COMPARE_OFFSET_MIN is set to 3 (~90us).

    Problem scenario
    App_timer_start --> schedules SWI0_IRQHandler()
    SWI0_IRQHandler() --> sets the compare register
    Schedules RTC1_IRQHandler() to run if compare event is going to happen within 2 ticks. When this happens the compare event may not be triggered by the RTC.
    SWI0_IRQHandler() finishes.
    RTC1_IRQHandler() – > Scheduled one run (now during execution of RTC1_IRQHandler() the actual compare event may occur, thereby a second run of the RTC1_IRQHandler() will be pending)
    Executes the keys_send_timeout_handler(), SWI0_IRQHandler() is scheduled to update timer list
    RTC1_IRQHandler () finishes.
    RTC1_IRQHandler() – > runs again (since it occured before SWI0_IRQHandler() was scheduled)
    Executes the keys_send_timeout_handler(), SWI0_IRQHandler() is set to scheduled to update timer list (already scheduled, so no effect)
    RTC1_IRQHandler () finishes.
    SWI0_IRQHandler() --> this time removes keys_send_timeout_handler() from timer_list
    SWI0_IRQHandler() finishes.
    
    Solution Scenario
    App_timer_start --> schedules SWI0_IRQHandler()
    SWI0_IRQHandler() --> sets the compare register
    Schedules RTC1_IRQHandler() to run if compare event is going to happen within 2 ticks. When this happens the compare event may not be triggered by the RTC.
    Add a delay of 3 ticks (100us) to make sure compare event occurs. Compare event occur, however since RTC1_IRQHandler is scheduled to occur anyways, it will only run once.
    SWI0_IRQHandler() finishes.
    RTC1_IRQHandler() – > Scheduled one run
    Executes the keys_send_timeout_handler(),, SWI0_IRQHandler() is set to scheduled to update timer list
    RTC1_IRQHandler () finishes.
    SWI0_IRQHandler() --> this time removes keys_send_timeout_handler() from timer_list
    SWI0_IRQHandler() finishes
    

    This will be fixed in the next SDK release, and I was told that app_timer will come with many improvements in next SDK release (SDK9+).

    @Clem, can you try this solution and see if it fixes the problem. I am sorry to know that you lost trust on this library, but it is being used by many customers in many products. Sometimes you manage to trigger that corner case in your app and we are here you to help you with those.

  • I shall stop having difficulty believing it's a bug :)

    Possibly better fix might be, after timer_timeouts_check_sched() is called to force pend the interrupt, set the counter to ( pre_counter_value - 1 ) & 0xffffff, basically put it way in the distant future. That should ensure the counter interrupt, if it's not already fired, doesn't fire during the RTC1 handler without needing the delay.

    I guess the real problem there is that despite SWI0 and RTC1 having the same priority, RTC1 actually has an effective higher priority if both are scheduled, the numerically lower one, RTC1, is the one the NVIC chooses.

  • RK, are you now talking about the case when two interrupts of same priority occur exactly at the same time (same MPU cycle). In that case I am not sure how NVIC picks (selects) from the pending interrupt. Because I could not find convincing documentation for it, I thought that it will scan from lsb it higher (for being little ending).

    If you are talking about NVIC selecting from two same priority pending interrupts , i think it should have tail chaining and should select the one that has happened first.

    About your proposed solution, There is still a small risk that the compare event might get triggered just one or two instructions (overhead of if condition jumps and register loads) before setting it to ( pre_counter_value - 1 ) & 0xffffff and then set the pending bit again.

    Anyways the ticket created for this bug is now resolved, so this corner case have been handled somehow, i hope that it has no corner cases.

    I am still waiting for Clem's answer to see if this is the bug related to his problem.

  • No risk in that solution - if the compare triggers before the register set to far distant future and you also pend it manually, it doesn't matter, it's pended and the RTC1 IRQ hander will fire once. The case your nice diagram shows you need to guard against is having the compare interrupt fire during the RTC1 IRQ handler which causes it to re-raise the exception and run the RTC1 handler twice. Setting the counter far ahead before the IRQ routine starts avoids that.

    Both the NVIC cases are the same and are documented in the ARMV6-M manual (B1.5.4). If there is more than one pending of the same priority, the lower numbered one takes precedence. The NVIC has no way of telling which exception pended first, it only knows what's currently pending, so if it exits the RTC1 handler and finds both SWI0 and RTC1 pended, it will do RTC1 again.

  • Yes, that is true. your propose way, we will not need the delay and seems cleaner too.

Related