This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Mesh Fails to publish and relay packets until RTC timer wraps around

Using Mesk v3.1 on an 52840 chip. 

We've noticed that after running our app for several hours or days that our models become unable to send messages for a period of ~8.5 minutes. Once the first publication fails (error code 0x4) all further publications (for all models) fail with the same code until the RTC1 timer wraps around. We are only sending unsegmented packets. 
The value of the RTC timer in the cases we have observed (via logging over RTT) is around 0x90000, so somewhere after the half-way point but no where near the overflow.
We suspected the timer_scheduler module is failing to be triggered and that NO messages were being sent out (originator or relay) so we modified the SDK slightly to track the number of packet complete callbacks being hit to verify this (diff below):

-- a/Dependencies/nRF5_Mesh_SDK/3.1.0/src/mesh/core/src/core_tx_adv.c
+++ b/Dependencies/nRF5_Mesh_SDK/3.1.0/src/mesh/core/src/core_tx_adv.c
@@ -77,11 +77,30 @@ static core_tx_bearer_t m_bearer;
 * Static functions
 *****************************************************************************/
 
+volatile uint32_t relayCounter = 0;^M
+volatile uint32_t originCounter = 0;^M
+^M
+static void txRelayCb(advertiser_t * p_adv,^M
+                                     nrf_mesh_tx_token_t token,^M
+                                     timestamp_t timestamp)^M
+{^M
+    const core_tx_role_t role = (core_tx_role_t) (PARENT_BY_FIELD_GET(adv_bearer_role_t, advertiser, p_adv) - &m_bearer_roles[0]);^M
+^M
+    if (CORE_TX_ROLE_RELAY == role) {^M
+        relayCounter++;^M
+    }^M
+}^M
+^M
 static void adv_tx_complete_callback(advertiser_t * p_adv,
                                      nrf_mesh_tx_token_t token,
                                      timestamp_t timestamp)
 {
-    core_tx_role_t role = (core_tx_role_t) (PARENT_BY_FIELD_GET(adv_bearer_role_t, advertiser, p_adv) - &m_bearer_roles[0]);
+    const core_tx_role_t role = (core_tx_role_t) (PARENT_BY_FIELD_GET(adv_bearer_role_t, advertiser, p_adv) - &m_bearer_roles[0]);^M
+^M
+    if (CORE_TX_ROLE_ORIGINATOR == role) {^M
+        originCounter++;^M
+    }^M
+^M
     core_tx_complete(&m_bearer, role, timestamp, token);
 }
 
@@ -312,7 +331,7 @@ void core_tx_adv_init(void)
 #if MESH_FEATURE_RELAY_ENABLED
     m_bearer_roles[CORE_TX_ROLE_RELAY].adv_tx_count = CORE_TX_REPEAT_RELAY_DEFAULT;
     advertiser_instance_init(&m_bearer_roles[CORE_TX_ROLE_RELAY].advertiser,
-                             NULL,
+                             txRelayCb,^M
                              m_relay_adv_packet_buffer,
                              sizeof(m_relay_adv_packet_buffer));
     advertiser_enable(&m_bearer_roles[CORE_TX_ROLE_RELAY].advertiser);

During normal operation these counters increment as you would expect but during the 8.5 minutes of failures, these counters do not move. 

We are using app_timer_mesh and have some app_timer instances in our code running in repeated mode only. During the failure period, those timers continue to operate normally and we continue to receive packets from other devices. we just can't respond.

Below are snippets of the logs showing the RTC values of when the errors start and when normal operation resumes

[INFO] 8996078> access.c: 1d5f:c6                                                                                                                                                    
[DEBUG] 8996096> svmodel.c: Pub Response to 0x6: 
[DEBUG] 9001746> svmodel.c: Query Rx {0x21}:                                                                                         
[INFO] 9002632> access.c: 1d5f:c6                                                                                                                                                    
[ERROR] 9002639> svmodel.c: Fail(0x4) Pub Response to 0x21:                    
[DEBUG] 9004635> svmodel.c: Query Rx {0x4}: treeId:0x6cc300ca, depth:4, orphaned:0                                                                                                
[INFO] 9005909> access.c: 1d5f:c6                                                                                                                                                    
[ERROR] 9005916> svmodel.c: Fail(0x4) Pub Response to 0x4: 
...
continues like this
...
[DEBUG] 8951221> svmodel.c: poll Enter: ACTIVE/1
[INFO] 8951228> access.c: 1d5f:c5
[ERROR] 8951237> svmodel.c: Fail(0x4) Pub Query to 0xc003: 
[DEBUG] 8951245> svmodel.c: poll Exit: ACTIVE/1
[INFO] 8984000> access.c: 1d5f:c5
[DEBUG] 8984021> svmodel.c: Pub Query to 0xc003:
...
no more errors past this point.

        

I've reviewed the code delta's pertaining to the timer_scheduler, timer, app_timer_mesh, core_tx_adv and advertiser between v3.1, v3.2 and v4 and haven't seen any significant deltas.
Any ideas on what this could be?

  • Hi, 

    We will have to check with the R&D team if they can find any issue related to this. 

    Could you give some more information regarding your model ? Do you send a lot of data ? What's the data rate ? 


    Would the same issue can be observed on the light switch example ? Have you tried to test that ? 

  • The models publish periodically at a random interval between 1-3 seconds and in response to certain requests.

    On average we typically see 4-5  publishes in total per second (all unsegmented with no retransmission configured at the access layer).

    When this issue occurs it does not seem correlated to the number of messages sent or the packet transmission rate within the last couple seconds. In some cases we have only sent 1 or 2 messages within the last second, in other cases we have sent 5 messages. 

    Here is a snippet of the last couple seconds before the issue starts. timestamps are in RTC ticks.

    [05-03-2020 17:09:29][122e][INFO] 3385545> access.c: 1d5f:c5                                                                                                                         
    [05-03-2020 17:09:30][122f][INFO] 3388825> access.c: 1d5f:c6                                                                                                                         
    [05-03-2020 17:09:30][1230][INFO] 3418319> access.c: 1d5f:c5                                                                                                                         
    [05-03-2020 17:09:31][1231][INFO] 3418749> access.c: 1d5f:c4                                                                                                                         
    [05-03-2020 17:09:31][1232][INFO] 3421592> access.c: 1d5f:c6                                                                                                                         
    [05-03-2020 17:09:31][1233][INFO] 3451085> access.c: 1d5f:c5                                                                                                                         
    [05-03-2020 17:09:32][1234][INFO] 3454362> access.c: 1d5f:c6                                                                                                                         
    [05-03-2020 17:09:32][1235][INFO] 3483855> access.c: 1d5f:c5                                                                                                                         
    [05-03-2020 17:09:33][1236][INFO] 3487138> access.c: 1d5f:c6                                                                                                                         
    [05-03-2020 17:09:33][1237][INFO] 3511124> access.c: 1d5f:c1                                                                                                                         
    [05-03-2020 17:09:34][1238][INFO] 3511145> access.c: 1d5f:c2                                                                                                                         
    [05-03-2020 17:09:34][1239][INFO] 3511166> access.c: 1d5f:c3                                                                                                                         
    [05-03-2020 17:09:34][123a][INFO] 3516625> access.c: 1d5f:c5                                                                                                                         
    [05-03-2020 17:09:34][123b][INFO] 3519902> access.c: 1d5f:c6                                                                                                                    
    [05-03-2020 17:09:34][123e][INFO] 3547080> access.c: 1d5f:c1                                                                                                                         
    [05-03-2020 17:09:35][123f][INFO] 3549395> access.c: 1d5f:c5                                                                                                                         
    [05-03-2020 17:09:35][1240][ERROR] 3549405> core_tx_adv.c: packet_alloc no mem cnt 1 tevt @ 0x200086d4 ts 0xbd7eb362                                                                 

    The same log but with the timestamp post-processed and in milliseconds showing the delta between logs

     [05-03-2020 17:09:29][122e][INFO] 103318.634[+899.902] access.c: 1d5f:c5                                                                                                            
     [05-03-2020 17:09:30][122f][INFO] 103418.732[+100.098] access.c: 1d5f:c6                                                                                                            
     [05-03-2020 17:09:30][1230][INFO] 104318.817[+900.085] access.c: 1d5f:c5                                                                                                            
     [05-03-2020 17:09:31][1231][INFO] 104331.940[+13.123] access.c: 1d5f:c4                                                                                                             
     [05-03-2020 17:09:31][1232][INFO] 104418.701[+86.761] access.c: 1d5f:c6                                                                                                             
     [05-03-2020 17:09:31][1233][INFO] 105318.756[+900.055] access.c: 1d5f:c5                                                                                                            
     [05-03-2020 17:09:32][1234][INFO] 105418.762[+100.006] access.c: 1d5f:c6                                                                                                            
     [05-03-2020 17:09:32][1235][INFO] 106318.817[+900.055] access.c: 1d5f:c5                                                                                                            
     [05-03-2020 17:09:33][1236][INFO] 106419.006[+100.189] access.c: 1d5f:c6                                                                                                            
     [05-03-2020 17:09:33][1237][INFO] 107151.001[+731.995] access.c: 1d5f:c1                                                                                                            
     [05-03-2020 17:09:34][1238][INFO] 107151.642[+0.641] access.c: 1d5f:c2                                                                                                              
     [05-03-2020 17:09:34][1239][INFO] 107152.283[+0.641] access.c: 1d5f:c3                                                                                                              
     [05-03-2020 17:09:34][123a][INFO] 107318.878[+166.595] access.c: 1d5f:c5                                                                                                            
     [05-03-2020 17:09:34][123b][INFO] 107418.884[+100.006] access.c: 1d5f:c6 
     [05-03-2020 17:09:34][123e][INFO] 108248.291[+746.948] access.c: 1d5f:c1                                                                                                            
     [05-03-2020 17:09:35][123f][INFO] 108318.939[+70.648] access.c: 1d5f:c5
     [05-03-2020 17:09:35][1240][ERROR] 108319.244[+0.305] core_tx_adv.c: packet_alloc no mem cnt 1 tevt @ 0x200086d4 ts 0xbd7eb362                                                      

    Here is a log from another device with a slower packet rate prior to the event (timestamps post-processed to milliseconds)

     [05-03-2020 22:42:04][9e59][INFO] 97135.223[+162.994] access.c: 1d5f:c6
     [05-03-2020 22:42:05][9e5a][INFO] 97535.095[+399.872] access.c: 1d5f:c5
     [05-03-2020 22:42:05][9e5b][INFO] 98035.126[+500.031] access.c: 1d5f:c6
     [05-03-2020 22:42:05][9e5c][INFO] 98135.193[+100.067] access.c: 1d5f:c6
     [05-03-2020 22:42:06][9e5d][INFO] 98535.156[+399.963] access.c: 1d5f:c5
     [05-03-2020 22:42:06][9e5e][INFO] 99035.187[+500.031] access.c: 1d5f:c6
     [05-03-2020 22:42:06][9e5f][INFO] 99135.193[+100.006] access.c: 1d5f:c6
     [05-03-2020 22:42:07][9e60][INFO] 99535.217[+400.024] access.c: 1d5f:c5
     [05-03-2020 22:42:07][9e61][INFO] 100035.248[+500.031] access.c: 1d5f:c6
     [05-03-2020 22:42:08][9e62][INFO] 100135.284[+100.037] access.c: 1d5f:c6
     [05-03-2020 22:42:08][9e63][INFO] 100535.278[+399.994] access.c: 1d5f:c5
     [05-03-2020 22:42:08][9e64][INFO] 101035.309[+500.031] access.c: 1d5f:c6
     [05-03-2020 22:42:08][9e65][INFO] 101135.315[+100.006] access.c: 1d5f:c6
     [05-03-2020 22:42:09][9e66][ERROR] 101135.559[+0.244] core_tx_adv.c: packet_alloc no mem cnt 1 tevt @ 0x200086d4 ts 0x633a29ef

    We suspect its the timer_start() routine in core/src/timer.c writing a stale value into the CC1 register of RTC1. We're trying some more experiments to test this.

  • In case anyone else ever runs into this issue here is the patch we applied that solved the problem for us after 96 hours of testing.

    Shouldn't need the recovery mechanism we put in as the modifications to the timer_start and ovflw_handle appear to have stopped the issue from occurring but we left it in as good measure and because we confirmed it actually recovered publication. fixes-timer-not-firing.patch

Related