This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

bug: one shot app_timer misfires at 256 seconds

I have a oneshot app_timer that I'm using as a software watchdog timer. Under normal operation is never supposed to go off. I'm loading the timer with a 15 second timeout and I reload it about every 2 seconds.

After 256 seconds the event handler is incorrectly getting called. Looking at a host time stamped UART log I see a reload of the timer ~2s before the timer expires.

To reload the timer I'm doing:

errCode = app_timer_stop(timeoutTimerID);
APP_ERROR_CHECK(errCode);

errCode = app_timer_start(timeoutTimerID, 15 * 32768, NULL);
APP_ERROR_CHECK(errCode);

I'm using a prescaler of 0, so the RTC timer rolls over every 512 second (2**24 / 32768). The problem isn't that the timer is rolling over, but something with how the app_timer code is handling expired timer detection.

If I recompile with a prescaler of 1, then the timer does not misfire until 512 seconds.

The code in question has 3 timers, a dummy repeating timer that expires after 2 ** 24 - 50, a one shot timer for blinking LEDs and my software watchdog timer.

Obviously increasing the prescaler is not a solution to this problem, it just delays the problem.


Additional info 2015/09/04:

So I tried a test similar to Aryan's test below. Instead of reloading the timer using another timer I used the advertising timeout (BLE_GAP_TIMEOUT_SRC_ADVERTISING) to trigger the start then stop of my app_timer based 'software watchdog'.

With the advertising timing out every 10s and stop/starting the app_timer with a 40s reload I did not see a spurious timeout at 256s (let it run for 2000s, multiple times).

However with the advertising timeout every 10s and reloading the app_timer with a 15s reload, it would reliably fail after 256s.

161813.275: RELOAD 491520 @ 6235427 190/9507
161823.275: RELOAD 491520 @ 6563174 200/9574
161833.276: RELOAD 491520 @ 6890921 210/9641
161843.277: RELOAD 491520 @ 7218668 220/9708
161853.278: RELOAD 491520 @ 7546415 230/9775
161903.279: RELOAD 491520 @ 7874162 240/9842
161913.280: RELOAD 491520 @ 8201909 250/9909
161923.281: now = 260/9979 last 260/9976                 <= Timer timed out
161923.282: TIMEOUT type 5 @ 260.3s, resetting

The "RELOAD" line is not printed until after the app_timer_stop()/app_timer_start(). The 'last' reload time is sampled right before the print, so the timeout is occurring right after the app_timer_start() is called.

The log message is in the form:

<host HHMMSS.ms>: RELOAD <ticks> @ <app_timer_cnt_get()> <seconds/32khz residual ticks>

This is very repeatable with the 10s reload, 15s interval.

If I just bump the prescaler from 0 to 1 (div 2) I get:

164113.082: RELOAD 245760 @ 7706171 470/11382
164123.082: RELOAD 245760 @ 7870045 480/11450
164133.084: RELOAD 245760 @ 8033919 490/11518
164143.084: RELOAD 245760 @ 8197793 500/11586
164153.085: RELOAD 245760 @ 8361667 510/11654
164203.086: now = 520/11728 last 520/11722
164203.088: TIMEOUT type 5 @ 520.3s, resetting

This app_timer based watchdog timer code is inside of my custom DFU bootloader. Under normal operation it never takes 256 seconds to upgrade the firmware. However, I recently added the ability to use the DFU interface to upgrade a 2MB external SPI flash. Uploading 2MB over BLE takes a bloody long time, significantly longer then 256 seconds.


Even more info:

Set a breakpoint before after the timer loop decides that the timer should expire:

Breakpoint 6, timer_timeouts_check () at sdk/components/libraries/timer/app_timer.c:402
402	            ticks_elapsed -= p_timer->ticks_to_expire;
(gdb) print *p_timer
$4 = {state = STATE_ALLOCATED, mode = APP_TIMER_MODE_SINGLE_SHOT, is_running = true, ticks_to_expire = 0, 
  ticks_at_start = 0, ticks_first_interval = 0, ticks_periodic_interval = 0, 
  p_timeout_handler = 0x3b271 <dfuTimerHandler>, p_context = 0x0, next = 0}
(gdb) print ticks_elapsed 
$5 = 8529712
(gdb) print ticks_expired 
$6 = 0

I don't understand what is going on inside the timer code.

I'm using the RTC1 for tracking real time as well, rather then modify the app_timer code to not turn off the RTC, I just set up a dummy timer with a very large timeout.

I was using a timeout of ((1 << RTC_TICKS_BITS) - APP_TIMER_MIN_TIMEOUT_TICKS * 10).

Thinking this was somehow related (not sure why one timer would effect the behavior of another), I tried reducing this down to (1 << (RTC_TICKS_BITS-1)) and I still saw this spurious. However when I set it down to (1 << (RTC_TICKS_BITS-3)) the problem seems to have gone away or I just moved the timing around enough to not trigger the bug.

My problem now is I don't trust the app_timer code. For something as critical as timer handling, the app_timer code is far to complex for its own good. The app_timer really needs a rewrite by someone that has fully accepted KISS.

  • Tested your theory with this code

    static void reset_timer_timeout_handler(void * p_context)
    {
        static volatile uint32_t errCode = 0;
        errCode++;
        
        errCode = app_timer_stop(m_app_never_expires_timer_id);
        APP_ERROR_CHECK(errCode);
    
        errCode = app_timer_start(m_app_never_expires_timer_id, 15 * 32768, NULL);
        APP_ERROR_CHECK(errCode);
    }
    
    static void timer_timeout_handler(void * p_context)
    {
        static volatile uint32_t counter = 0;
        counter++;
    }
    
    /**@brief Function for the Timer initialization.
     *
     * @details Initializes the timer module. This creates and starts application timers.
     */
    static void timers_init(void)
    {
    
        // Initialize timer module.
        APP_TIMER_INIT(APP_TIMER_PRESCALER, APP_TIMER_MAX_TIMERS, APP_TIMER_OP_QUEUE_SIZE, false);
    
        // Create timers.
    
    
        uint32_t err_code;
        err_code = app_timer_create(&m_app_reset_timer_id, APP_TIMER_MODE_REPEATED, reset_timer_timeout_handler);
        APP_ERROR_CHECK(err_code);
        
        err_code = app_timer_create(&m_app_never_expires_timer_id, APP_TIMER_MODE_SINGLE_SHOT, timer_timeout_handler);
        APP_ERROR_CHECK(err_code);
        
        err_code = app_timer_start(m_app_reset_timer_id, 2 * 32768, NULL);
        APP_ERROR_CHECK(err_code);
     
        err_code = app_timer_start(m_app_never_expires_timer_id, 15 * 32768, NULL);
        APP_ERROR_CHECK(err_code);    
    }
    

    The timer callback timer_timeout_handler which should never expire, is never expired. I tested it with and without BLE activity. Works OK.

    It is possible that whatever your code is doing "POSSIBLY" triggering some corner case in app_timer.

    If you want me to debug you code, i can gladly help you if you give me your project in pvt message.

  • in app_timer.c try adding the following delay code after the timer_timeouts_check_sched(); call inside compare_reg_update(app_timer_id_t timer_id_head_old)()

    nrf_delay_us(100);

    I found a corner case where all my app_timers would immediately expire. Same as you is was a corner case that seemed to occur at random. But the above code seems to have solved the issue.

  • have you tried putting a breakpoint at the timeout then debugging back into the app timer code to see what's going on? If you're correct that it's when you re-start the timer you should be able to trace the instant timeout back to the point it's added and see why it's decided to expire it.

    What I'd be looking for is somewhere the next expiration counter is being calculated and then compared, modulus the counter wraparound, with the current value to see if a timer which ought to be in the future is being flagged as in the past. I could just about imagine that there's a case where doing modulus with 0xffffff (the max counter value) and then comparing might give you the wrong answer at exactly 1/2 counter value. However I'd also think that edge case was a lot harder to hit, if it exists at all, than you're finding.

  • If there actually is an edge case there wouldn't it be better to find and fix it than just work around it with a random delay? That doesn't really solve the issue, it just shuffles it out of the way and may just move it elsewhere.

    That's a good place to start looking however, if doing that stops Clem's timers expiring, he has a place to go debugging.

  • This is a fix to the app_timer module. It isn't just a random delay. According to Nordic this delay: "The issue could occur due to a race condition between the pending software interrupt to update the timer and the actual timer interrupt, but the delay will ensure that doesn't occur"

Related