This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Mesh Fails to publish and relay packets until RTC timer wraps around

Using Mesk v3.1 on an 52840 chip. 

We've noticed that after running our app for several hours or days that our models become unable to send messages for a period of ~8.5 minutes. Once the first publication fails (error code 0x4) all further publications (for all models) fail with the same code until the RTC1 timer wraps around. We are only sending unsegmented packets. 
The value of the RTC timer in the cases we have observed (via logging over RTT) is around 0x90000, so somewhere after the half-way point but no where near the overflow.
We suspected the timer_scheduler module is failing to be triggered and that NO messages were being sent out (originator or relay) so we modified the SDK slightly to track the number of packet complete callbacks being hit to verify this (diff below):

-- a/Dependencies/nRF5_Mesh_SDK/3.1.0/src/mesh/core/src/core_tx_adv.c
+++ b/Dependencies/nRF5_Mesh_SDK/3.1.0/src/mesh/core/src/core_tx_adv.c
@@ -77,11 +77,30 @@ static core_tx_bearer_t m_bearer;
 * Static functions
 *****************************************************************************/
 
+volatile uint32_t relayCounter = 0;^M
+volatile uint32_t originCounter = 0;^M
+^M
+static void txRelayCb(advertiser_t * p_adv,^M
+                                     nrf_mesh_tx_token_t token,^M
+                                     timestamp_t timestamp)^M
+{^M
+    const core_tx_role_t role = (core_tx_role_t) (PARENT_BY_FIELD_GET(adv_bearer_role_t, advertiser, p_adv) - &m_bearer_roles[0]);^M
+^M
+    if (CORE_TX_ROLE_RELAY == role) {^M
+        relayCounter++;^M
+    }^M
+}^M
+^M
 static void adv_tx_complete_callback(advertiser_t * p_adv,
                                      nrf_mesh_tx_token_t token,
                                      timestamp_t timestamp)
 {
-    core_tx_role_t role = (core_tx_role_t) (PARENT_BY_FIELD_GET(adv_bearer_role_t, advertiser, p_adv) - &m_bearer_roles[0]);
+    const core_tx_role_t role = (core_tx_role_t) (PARENT_BY_FIELD_GET(adv_bearer_role_t, advertiser, p_adv) - &m_bearer_roles[0]);^M
+^M
+    if (CORE_TX_ROLE_ORIGINATOR == role) {^M
+        originCounter++;^M
+    }^M
+^M
     core_tx_complete(&m_bearer, role, timestamp, token);
 }
 
@@ -312,7 +331,7 @@ void core_tx_adv_init(void)
 #if MESH_FEATURE_RELAY_ENABLED
     m_bearer_roles[CORE_TX_ROLE_RELAY].adv_tx_count = CORE_TX_REPEAT_RELAY_DEFAULT;
     advertiser_instance_init(&m_bearer_roles[CORE_TX_ROLE_RELAY].advertiser,
-                             NULL,
+                             txRelayCb,^M
                              m_relay_adv_packet_buffer,
                              sizeof(m_relay_adv_packet_buffer));
     advertiser_enable(&m_bearer_roles[CORE_TX_ROLE_RELAY].advertiser);

During normal operation these counters increment as you would expect but during the 8.5 minutes of failures, these counters do not move. 

We are using app_timer_mesh and have some app_timer instances in our code running in repeated mode only. During the failure period, those timers continue to operate normally and we continue to receive packets from other devices. we just can't respond.

Below are snippets of the logs showing the RTC values of when the errors start and when normal operation resumes

[INFO] 8996078> access.c: 1d5f:c6                                                                                                                                                    
[DEBUG] 8996096> svmodel.c: Pub Response to 0x6: 
[DEBUG] 9001746> svmodel.c: Query Rx {0x21}:                                                                                         
[INFO] 9002632> access.c: 1d5f:c6                                                                                                                                                    
[ERROR] 9002639> svmodel.c: Fail(0x4) Pub Response to 0x21:                    
[DEBUG] 9004635> svmodel.c: Query Rx {0x4}: treeId:0x6cc300ca, depth:4, orphaned:0                                                                                                
[INFO] 9005909> access.c: 1d5f:c6                                                                                                                                                    
[ERROR] 9005916> svmodel.c: Fail(0x4) Pub Response to 0x4: 
...
continues like this
...
[DEBUG] 8951221> svmodel.c: poll Enter: ACTIVE/1
[INFO] 8951228> access.c: 1d5f:c5
[ERROR] 8951237> svmodel.c: Fail(0x4) Pub Query to 0xc003: 
[DEBUG] 8951245> svmodel.c: poll Exit: ACTIVE/1
[INFO] 8984000> access.c: 1d5f:c5
[DEBUG] 8984021> svmodel.c: Pub Query to 0xc003:
...
no more errors past this point.

        

I've reviewed the code delta's pertaining to the timer_scheduler, timer, app_timer_mesh, core_tx_adv and advertiser between v3.1, v3.2 and v4 and haven't seen any significant deltas.
Any ideas on what this could be?

Parents Reply Children
No Data
Related