Backoff mechanism for reliable mesh messages

Question

Hi, 
 I have a question about the waiting time before a reliable message is retried, which repeats until an ACK is received or the timeout (minimum 30 s) is reached. According to the standard, this is "application specific". From what I measure using the Nordic Mesh SDK stack, the first two retries each come after 280 ms, and every later retry comes after double the previous interval. With a timeout of 30 s, this gives a maximum of 8 tries, i.e. 8 packets being sent:

0.00 (first) 
 after 280.37 ms 
 after 279.97 ms 
 after 559.98 ms 
 after 1119.96 ms 
 after 2239.94 ms 
 after 4479.87 ms 
 after 8959.84 ms

Is this correct, and is this something I cannot change as a developer? 
 Thanks in advance. 
 Kind regards, 
 Mathias

Mathias · Accepted Answer

Hi,

I recently found out that the standard defines this application-specific timeout in the access layer, so I looked there for this functionality and found these constants:

/** Penalty in microseconds for each hop for a reliable message. */
#define ACCESS_RELIABLE_HOP_PENALTY (MS_TO_US(BEARER_ADV_INT_DEFAULT_MS))
/**
* Base interval in microseconds for a reliable message.
* I.e., the interval given TTL=0 and an unsegmented message.
*/
#define ACCESS_RELIABLE_INTERVAL_DEFAULT (MS_TO_US(BEARER_ADV_INT_DEFAULT_MS) * 10)

/** Back-off factor used to increase the interval for each retry. */
#define ACCESS_RELIABLE_BACK_OFF_FACTOR (2)

The first two explain the 280 ms interval, which comes from ACCESS_RELIABLE_INTERVAL_DEFAULT + TTL * ACCESS_RELIABLE_HOP_PENALTY, where BEARER_ADV_INT_DEFAULT_MS is 20 ms and TTL is 4 in my case (default values), so this gives 200 ms + 80 ms = 280 ms. Because ACCESS_RELIABLE_BACK_OFF_FACTOR is 2, the interval is doubled (*2) for each following retry, which is also what I see in my measurements.
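To make the arithmetic easy to check, here is a minimal sketch of the base interval for an unsegmented message. The macro names and the helper first_retry_interval_us are made up for this example and are not part of the SDK; the 20 ms value is the default mentioned above.

```c
#include <stdint.h>

/* Values as quoted in this post; assumed here, not pulled from the SDK headers. */
#define EXAMPLE_ADV_INT_DEFAULT_MS   20u
#define EXAMPLE_HOP_PENALTY_US       (EXAMPLE_ADV_INT_DEFAULT_MS * 1000u)
#define EXAMPLE_INTERVAL_DEFAULT_US  (EXAMPLE_ADV_INT_DEFAULT_MS * 1000u * 10u)

/* Hypothetical helper: base retry interval in microseconds for an
 * unsegmented message, i.e. default interval plus one hop penalty per TTL. */
static uint32_t first_retry_interval_us(uint8_t ttl)
{
    return (uint32_t)ttl * EXAMPLE_HOP_PENALTY_US + EXAMPLE_INTERVAL_DEFAULT_US;
}
```

With TTL 4 this returns 280000 us, and with TTL 0 it returns 200000 us. Segmented messages get an extra penalty that is left out here.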

I also found these methods in access_reliable.c which explain the rest:

static uint32_t calculate_interval(const access_reliable_t * p_message)
{
    uint8_t ttl;
    /* The model handle should already been checked by the TX attempt. */
    NRF_MESH_ERROR_CHECK(access_model_publish_ttl_get(p_message->model_handle, &ttl));

    uint16_t length = access_utils_opcode_size_get(p_message->message.opcode) + p_message->message.length;

    uint32_t interval = (ttl * ACCESS_RELIABLE_HOP_PENALTY) + ACCESS_RELIABLE_INTERVAL_DEFAULT;
    if (NRF_MESH_UNSEG_PAYLOAD_SIZE_MAX < length)
    {
        interval += ((length + (NRF_MESH_SEG_SIZE - 1))/ NRF_MESH_SEG_SIZE) * ACCESS_RELIABLE_SEGMENT_COUNT_PENALTY;
    }
    return interval;
}

static void reliable_timer_cb(timestamp_t timestamp, void * p_context)
{
    NRF_MESH_ASSERT(0 < m_reliable.active_count);
    //__LOG(LOG_SRC_APP, LOG_LEVEL_INFO, "RELIABLE TIMER CALLBACK\n");

    timestamp += ACCESS_RELIABLE_TIMEOUT_MARGIN; /* TODO: Divide by two? */
    m_reliable.next_timeout_index = ACCESS_RELIABLE_INDEX_INVALID;

    for (uint32_t i = 0; i < ACCESS_RELIABLE_TRANSFER_COUNT; ++i)
    {
        if (!m_reliable.pool[i].in_use)
        {
            continue;
        }
        else if (TIMER_OLDER_THAN(m_reliable.pool[i].params.timeout, timestamp))
        {
            /* Remove first, in case a crazy user tries to reschedule it in the callback. */
            m_reliable.pool[i].in_use = false;
            m_reliable.active_count--;

            void * p_args;
            NRF_MESH_ERROR_CHECK(access_model_p_args_get(m_reliable.pool[i].params.model_handle, &p_args));
            m_reliable.pool[i].params.status_cb(m_reliable.pool[i].params.model_handle, p_args, ACCESS_RELIABLE_TRANSFER_TIMEOUT);
        }
        else if (TIMER_OLDER_THAN(m_reliable.pool[i].next_timeout, timestamp))
        {
            uint32_t status = access_model_publish(m_reliable.pool[i].params.model_handle, &m_reliable.pool[i].params.message);
            m_reliable.next_timeout_index = i;
            if (NRF_SUCCESS == status)
            {
                m_reliable.pool[i].next_timeout += m_reliable.pool[i].interval;
                m_reliable.pool[i].interval *= ACCESS_RELIABLE_BACK_OFF_FACTOR;
            }
            else if (NRF_ERROR_NO_MEM == status)
            {
                /* If there is no more memory available, we might as well cancel the rest and set
                 * the timer to fire in ACCESS_RELIABLE_RETRY_DELAY. */
                m_reliable.pool[i].next_timeout += ACCESS_RELIABLE_RETRY_DELAY;
                break;
            }
            else
            {
                /* This should have been caught by the first publish() call. */
                NRF_MESH_ASSERT(false);
            }

            if (TIMER_OLDER_THAN(m_reliable.pool[i].params.timeout, m_reliable.pool[i].next_timeout))
            {
                /* Shift timeout forward. */
                m_reliable.pool[i].next_timeout = m_reliable.pool[i].params.timeout;
            }
        }
        else if (ACCESS_RELIABLE_INDEX_INVALID == m_reliable.next_timeout_index ||
                 TIMER_OLDER_THAN(m_reliable.pool[i].next_timeout,
                                  m_reliable.pool[m_reliable.next_timeout_index].next_timeout))
        {
            /* Keep track of the next firing timeout. */
            m_reliable.next_timeout_index = i;
        }
    }

    /* Setting the interval > 0 will reschedule the timer. */
    if (m_reliable.active_count > 0)
    {
        NRF_MESH_ASSERT(m_reliable.next_timeout_index < ACCESS_RELIABLE_TRANSFER_COUNT);
        timestamp -= ACCESS_RELIABLE_TIMEOUT_MARGIN;
        m_reliable.timer.interval = TIMER_DIFF(m_reliable.pool[m_reliable.next_timeout_index].next_timeout, timestamp);
    }
    else
    {
        m_reliable.timer.interval = 0;
    }
}
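The callback also shows why the first two gaps in my measurements are the same: on each retry, next_timeout is advanced by the current interval first, and only then is the interval multiplied by the back-off factor. Below is a simplified model of that schedule, not SDK code. The function simulate_reliable_schedule is hypothetical. It assumes the first retry is scheduled one interval after the initial send (that part is not in the code quoted above, but it matches what I measured), and it ignores ACCESS_RELIABLE_TIMEOUT_MARGIN, the NRF_ERROR_NO_MEM retry delay and timer jitter.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified model of the retry schedule (illustration only).
 * Writes the send time in microseconds of each packet into tx_times_us,
 * with the initial send at time 0, and returns the number of packets
 * sent before timeout_us is reached.
 */
static size_t simulate_reliable_schedule(uint64_t interval_us,
                                         uint32_t backoff_factor,
                                         uint64_t timeout_us,
                                         uint64_t *tx_times_us,
                                         size_t max_entries)
{
    size_t count = 0;
    /* Assumption: the first retry fires one interval after the initial send. */
    uint64_t next_timeout_us = interval_us;

    if (max_entries == 0)
    {
        return 0;
    }
    tx_times_us[count++] = 0; /* initial publish */

    /* Once next_timeout reaches the transfer timeout, the timeout branch
     * of the callback wins and no further packet is sent. */
    while (count < max_entries && next_timeout_us < timeout_us)
    {
        tx_times_us[count++] = next_timeout_us; /* retry is published */
        next_timeout_us += interval_us;         /* advance by the current interval... */
        interval_us *= backoff_factor;          /* ...then grow it for the next round */
    }
    return count;
}
```

Called with an interval of 280000 us, a factor of 2 and a timeout of 30000000 us, this model gives 8 packets at 0, 280, 560, 1120, 2240, 4480, 8960 and 17920 ms. Those are gaps of 280, 280, 560, 1120, 2240, 4480 and 8960 ms, the same pattern as in my measurements. The next retry would fall at 35840 ms, which is past the 30 s timeout.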

I don't fully understand why the transport layer also needs to define timeout configuration but I can see.
