Backoff mechanism for reliable mesh messages

Hi,

I have a question about the waiting time before a reliable message is retried, which repeats until an ACK is received or the timeout (minimum 30 s) is reached. According to the standard, this is "application specific". From what I measure with the Nordic Mesh SDK stack, the first two retries each come after 280 ms, and every later retry comes after double the previous interval. For a timeout of 30 s, this gives a maximum of 8 tries, i.e. 8 packets being sent:

0.00 ms (first), then after 280.37 ms, 279.97 ms, 559.98 ms, 1119.96 ms, 2239.94 ms, 4479.87 ms and 8959.84 ms

Is this correct, and is this something I cannot change as a developer?

Thanks in advance.

Kind regards,

Mathias

  • Hi,

    The timeout is based on TTL, and the exact formulas can be read in mesh/core/src/transport.c:

    static inline uint32_t rx_ack_timer_delay_get(uint8_t ttl)
    {
        return m_trs_config.rx_ack_base_timeout + m_trs_config.rx_ack_per_hop_addition * ttl;
    }
    
    static inline uint32_t tx_retry_timer_delay_get(uint8_t ttl)
    {
        return m_trs_config.tx_retry_base_timeout + m_trs_config.tx_retry_per_hop_addition * ttl;
    }

    The other parameters used are set to default values provided in mesh/core/include/transport.h:

    /** Default base RX acknowledgement timeout. */
    #define TRANSPORT_SAR_RX_ACK_BASE_TIMEOUT_DEFAULT_US MS_TO_US(150)
    
    /** Default per hop RX acknowledgement timeout addition. */
    #define TRANSPORT_SAR_RX_ACK_PER_HOP_ADDITION_DEFAULT_US MS_TO_US(50)
    
    /** Default base TX retry timeout. */
    #define TRANSPORT_SAR_TX_RETRY_BASE_TIMEOUT_DEFAULT_US MS_TO_US(500)
    
    /** Default per hop TX retry timeout addition. */
    #define TRANSPORT_SAR_TX_RETRY_PER_HOP_ADDITION_DEFAULT_US MS_TO_US(50)

    These values can also be changed at runtime, using nrf_mesh_opt_set().
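
    As a worked example of the two formulas with the default values quoted above, here is a minimal standalone sketch. It does not use the SDK: the MS_TO_US macro is redefined locally, the helper names rx_ack_delay_us() and tx_retry_delay_us() are made up for illustration, and TTL = 4 is only an example value.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the SDK macro (assumption: plain ms -> us conversion). */
#define MS_TO_US(ms) ((uint32_t)(ms) * 1000u)

/* Default values as quoted from mesh/core/include/transport.h. */
#define RX_ACK_BASE_US      MS_TO_US(150)
#define RX_ACK_PER_HOP_US   MS_TO_US(50)
#define TX_RETRY_BASE_US    MS_TO_US(500)
#define TX_RETRY_PER_HOP_US MS_TO_US(50)

/* Hypothetical helper mirroring rx_ack_timer_delay_get() with the defaults. */
static uint32_t rx_ack_delay_us(uint8_t ttl)
{
    assert(ttl <= 127); /* Mesh TTL is a 7-bit field. */
    return RX_ACK_BASE_US + RX_ACK_PER_HOP_US * ttl;
}

/* Hypothetical helper mirroring tx_retry_timer_delay_get() with the defaults. */
static uint32_t tx_retry_delay_us(uint8_t ttl)
{
    assert(ttl <= 127);
    return TX_RETRY_BASE_US + TX_RETRY_PER_HOP_US * ttl;
}
```

    With TTL = 4 this gives 150 ms + 4 * 50 ms = 350 ms for the RX acknowledgement delay and 500 ms + 4 * 50 ms = 700 ms for the TX retry delay.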

    Regards,
    Terje

  • Okay, thank you very much for the explanation.

    Kind regards, Mathias

  • Hi,

    I recently found out that the standard defines this application-specific timeout at the access layer, so I looked there for this functionality and found these constants:


    /** Penalty in microseconds for each hop for a reliable message. */
    #define ACCESS_RELIABLE_HOP_PENALTY (MS_TO_US(BEARER_ADV_INT_DEFAULT_MS))
    /**
     * Base interval in microseconds for a reliable message.
     * I.e., the interval given TTL=0 and an unsegmented message.
     */
    #define ACCESS_RELIABLE_INTERVAL_DEFAULT (MS_TO_US(BEARER_ADV_INT_DEFAULT_MS) * 10)

    /** Back-off factor used to increase the interval for each retry. */
    #define ACCESS_RELIABLE_BACK_OFF_FACTOR (2)

    The first two explain the 280 ms interval, which comes from ACCESS_RELIABLE_INTERVAL_DEFAULT + TTL * ACCESS_RELIABLE_HOP_PENALTY, where BEARER_ADV_INT_DEFAULT_MS is 20 ms and TTL is 4 in my case (default values). This gives 200 ms + 80 ms = 280 ms. Because ACCESS_RELIABLE_BACK_OFF_FACTOR is 2, the interval doubles with each following retry, which also matches my measurements.

    I also found these functions in access_reliable.c, which explain the rest:

    static uint32_t calculate_interval(const access_reliable_t * p_message)
    {
        uint8_t ttl;
        /* The model handle should already been checked by the TX attempt. */
        NRF_MESH_ERROR_CHECK(access_model_publish_ttl_get(p_message->model_handle, &ttl));

        uint16_t length = access_utils_opcode_size_get(p_message->message.opcode) + p_message->message.length;

        uint32_t interval = (ttl * ACCESS_RELIABLE_HOP_PENALTY) + ACCESS_RELIABLE_INTERVAL_DEFAULT;
        if (NRF_MESH_UNSEG_PAYLOAD_SIZE_MAX < length)
        {
            interval += ((length + (NRF_MESH_SEG_SIZE - 1))/ NRF_MESH_SEG_SIZE) * ACCESS_RELIABLE_SEGMENT_COUNT_PENALTY;
        }
        return interval;
    }

    static void reliable_timer_cb(timestamp_t timestamp, void * p_context)
    {
        NRF_MESH_ASSERT(0 < m_reliable.active_count);
        //__LOG(LOG_SRC_APP, LOG_LEVEL_INFO,  "RELIABLE TIMER CALLBACK\n");

        timestamp += ACCESS_RELIABLE_TIMEOUT_MARGIN; /* TODO: Divide by two? */
        m_reliable.next_timeout_index = ACCESS_RELIABLE_INDEX_INVALID;

        for (uint32_t i = 0; i < ACCESS_RELIABLE_TRANSFER_COUNT; ++i)
        {
            if (!m_reliable.pool[i].in_use)
            {
                continue;
            }
            else if (TIMER_OLDER_THAN(m_reliable.pool[i].params.timeout, timestamp))
            {
                /* Remove first, in case a crazy user tries to reschedule it in the callback. */
                m_reliable.pool[i].in_use = false;
                m_reliable.active_count--;

                void * p_args;
                NRF_MESH_ERROR_CHECK(access_model_p_args_get(m_reliable.pool[i].params.model_handle, &p_args));
                m_reliable.pool[i].params.status_cb(m_reliable.pool[i].params.model_handle, p_args, ACCESS_RELIABLE_TRANSFER_TIMEOUT);
            }
            else if (TIMER_OLDER_THAN(m_reliable.pool[i].next_timeout, timestamp))
            {
                uint32_t status = access_model_publish(m_reliable.pool[i].params.model_handle, &m_reliable.pool[i].params.message);
                m_reliable.next_timeout_index = i;
                if (NRF_SUCCESS == status)
                {
                    m_reliable.pool[i].next_timeout += m_reliable.pool[i].interval;
                    m_reliable.pool[i].interval *= ACCESS_RELIABLE_BACK_OFF_FACTOR;
                }
                else if (NRF_ERROR_NO_MEM == status)
                {
                    /* If there is no more memory available, we might as well cancel the rest and set
                     * the timer to fire in ACCESS_RELIABLE_RETRY_DELAY. */
                    m_reliable.pool[i].next_timeout += ACCESS_RELIABLE_RETRY_DELAY;
                    break;
                }
                else
                {
                    /* This should have been caught by the first publish() call. */
                    NRF_MESH_ASSERT(false);
                }

                if (TIMER_OLDER_THAN(m_reliable.pool[i].params.timeout, m_reliable.pool[i].next_timeout))
                {
                    /* Shift timeout forward. */
                    m_reliable.pool[i].next_timeout = m_reliable.pool[i].params.timeout;
                }
            }
            else if (ACCESS_RELIABLE_INDEX_INVALID == m_reliable.next_timeout_index ||
                     TIMER_OLDER_THAN(m_reliable.pool[i].next_timeout,
                                      m_reliable.pool[m_reliable.next_timeout_index].next_timeout))
            {
                /* Keep track of the next firing timeout. */
                m_reliable.next_timeout_index = i;
            }
        }

        /* Setting the interval > 0 will reschedule the timer. */
        if (m_reliable.active_count > 0)
        {
            NRF_MESH_ASSERT(m_reliable.next_timeout_index < ACCESS_RELIABLE_TRANSFER_COUNT);
            timestamp -= ACCESS_RELIABLE_TIMEOUT_MARGIN;
            m_reliable.timer.interval = TIMER_DIFF(m_reliable.pool[m_reliable.next_timeout_index].next_timeout, timestamp);
        }
        else
        {
            m_reliable.timer.interval = 0;
        }
    }
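
    To check this against the measured gaps, the schedule can be reproduced with a small standalone simulation. This is only a sketch under stated assumptions: it works in milliseconds, ignores the segment penalty and ACCESS_RELIABLE_TIMEOUT_MARGIN, and assumes the first retry is scheduled one interval after the initial publish (that part of the code is not shown in this thread). The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BEARER_ADV_INT_DEFAULT_MS 20u
#define HOP_PENALTY_MS      (BEARER_ADV_INT_DEFAULT_MS)       /* per-hop penalty */
#define INTERVAL_DEFAULT_MS (BEARER_ADV_INT_DEFAULT_MS * 10u) /* base interval   */
#define BACK_OFF_FACTOR     2u

/* Hypothetical helper: initial interval for an unsegmented message. */
static uint32_t reliable_interval_ms(uint8_t ttl)
{
    return (uint32_t)ttl * HOP_PENALTY_MS + INTERVAL_DEFAULT_MS;
}

/* Hypothetical helper: fills p_times with the send times (ms after the first
 * publish) of every packet sent before timeout_ms. Returns the packet count. */
static size_t reliable_tx_times_ms(uint8_t ttl, uint32_t timeout_ms,
                                   uint32_t *p_times, size_t max_count)
{
    assert(p_times != NULL && max_count > 0);

    uint32_t interval = reliable_interval_ms(ttl);
    uint32_t next     = interval; /* Assumed: first retry one interval after start. */
    size_t   count    = 0;

    p_times[count++] = 0; /* Initial publish. */
    while (next < timeout_ms && count < max_count)
    {
        p_times[count++] = next;
        /* Same order as in reliable_timer_cb(): advance by the current
         * interval first, then apply the back-off factor. */
        next     += interval;
        interval *= BACK_OFF_FACTOR;
    }
    return count;
}
```

    For TTL = 4 and a 30 s timeout this yields 8 packets, with gaps of 280, 280, 560, 1120, 2240, 4480 and 8960 ms. The two equal 280 ms gaps follow from the interval being doubled only after it has been added to next_timeout. The ninth packet would be due at 35840 ms, past the timeout, so the timeout callback fires instead.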

    I don't fully understand why the transport layer also needs to define timeout configuration but I can see.
