
NRF_ERROR_NO_MEM

After updating from THREAD SDK 2.0 to SDK 3.1 I am getting NRF_ERROR_NO_MEM in nrf_sdh.c.

The project is based on the multiprotocol BLE Thread dynamic example.

The code runs for hours. Then, after a reboot, it fails during initialization when calling the otThreadSetEnabled function, and I get an NRF_ERROR_NO_MEM error reported from line 391 in nrf_sdh.c.

Looking in the app_sched_event_put function, it seems that the event_index value is never changed from its default value.

I have increased the SCHED_QUEUE_SIZE define from 32 to 48 but the issue is still happening.

The BLE DFU is enabled in this code.

Once this happens the radio is bricked.

What would cause the app queue to be full so early in startup?

  • Hi Jay,

    I am not completely sure what the issue could be here so I consulted with our Thread team. They would like to know the following:

    1. Is the reboot triggered by the user? What exactly do you mean by reboot? Why does a reboot happen after some hours?

    2. What do you do to recover the board after reboot? Which steps do you take to make it run for some hours again?

    3. Maybe the scheduler queue is full? Could you try to find out what is in it?

    Best regards,

    Marjeris

  • The reboot is not caused by the user.

    We have a network of 500 Thread nodes as test units here in our office. After a DFU upgrade from the older code to the new code built on SDK 3.1, the units usually run for a few hours, sometimes overnight, before we lose connectivity to them. Not all units do this. We have I2C connectivity to all units for out-of-band monitoring, and once connectivity is lost we can check several status LEDs on the modules. From these we can tell that the unit is constantly rebooting itself.

    I have connected the debugger to a unit and watched the boot. That is how I know the unit is failing during boot when calling the OpenThread enable function. The logger reports the NRF_ERROR_NO_MEM error when trying to access the scheduler.

    Once a unit is in this state the unit is bricked and has to be erased and reprogrammed.

    We are testing this with networks of different sizes to see if it depends on node count. I will let you know the results.

  • Hi Jay,

    I passed this information to the Thread team but I am still waiting for their feedback. Let me know if you find any additional information after your testing is done.

    BR,

    Marjeris

  • If we have 71 or fewer nodes in the network, this does not happen. We tried two different networks of 64 nodes that ran for a weekend with no problems. We then increased the networks to 80 nodes and ran them overnight. In each case 9 nodes crashed. The networks have since run for several days without any more failures.

  • Hi Jay,

    Where are you calling the app_sched_event_put function in your code?

    What was the scheduler queue size when you tested with 64 nodes? Did you try increasing SCHED_QUEUE_SIZE with 90 nodes?

    Best regards,

    Marjeris

  • In my code I never call app_sched_event_put directly. It is called from functions that are part of either the Nordic SDK or OPENTHREAD. The code has TWIS, BLUETOOTH and OPENTHREAD running. Before updating to SDK 3.1 we had a network of 256 nodes running this same code for weeks. After nodes with the SDK 3.1 code crashed, I connected a debugger to a crashed node. It calls the app_sched_event_put function when the otThreadSetEnabled function is called, and it never returns from trying to start OPENTHREAD during startup. I have increased SCHED_QUEUE_SIZE from its pre-SDK 3.1 value. It still happens.

    The real question is what is happening to cause the crashing that bricks the radio in the first place. 

  • Hi Jay,

    I have forwarded this information to our Thread team, but unfortunately they are quite busy at the moment, so please be patient. I am out of ideas about what the problem could be here, so I will update this ticket when I get an answer from them.

    Best regards,

    Marjeris

  • Hi Jay,

    Sorry for the late answer. After finally discussing this case with the Thread team, we think it would be good to take a look at your code to move this issue forward. Could you share your main.c, sdk config file and linker script with us? I can also make this case private before you share these files, if you want.

    I know you have already done some debugging at your end, but have you tried setting a breakpoint where NRF_ERROR_NO_MEM is returned in app_scheduler.c (line 211) and printing the following variables?

    static event_header_t * m_queue_event_headers;  /**< Array for holding the queue event headers. */
    static uint8_t        * m_queue_event_data;     /**< Array for holding the queue event data. */
    static volatile uint8_t m_queue_start_index;    /**< Index of queue entry at the start of the queue. */
    static volatile uint8_t m_queue_end_index;      /**< Index of queue entry at the end of the queue. */
    static uint16_t         m_queue_event_size;     /**< Maximum event size in queue. */
    static uint16_t         m_queue_size;           /**< Number of queue entries. */

    These variables should be initialized to zero, but one of our theories is that something is causing them to be corrupted. Maybe the linker script is outdated for SDK 3.1? Are you using the same linker scripts as for SDK 2.0?

    If you haven't done so yet, you should also turn off optimization and set the -debug flag so that you get more information when debugging. It could be really helpful.

    You also mention having LEDs on the devices. We wonder if you are using bsp_thread.c or another bsp module for this? The bsp_thread.c module has a state_changed_callback() handler which is called every time a device role changes. It then calls the app timer, which uses the app scheduler (see the code in bsp_thread_ping_indication_set()), so another theory is that the problem may be related to this as well.

    Best regards,

    Marjeris
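
The "queue full" and "corrupted variables" theories above can be explored with a small host-side model. This is only a sketch, not the SDK source: it assumes the scheduler queue is a ring buffer driven by a start index, an end index and a size, as the variable names listed above suggest, and that a put fails when advancing the end index would hit the start index. All `sketch_` names and the two return values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical return codes for this model, not taken from the SDK headers. */
#define SKETCH_SUCCESS      0u
#define SKETCH_ERROR_NO_MEM 4u

/* Assumed queue state, mirroring the index/size variables listed above. */
typedef struct {
    uint8_t  start_index; /* oldest queued entry */
    uint8_t  end_index;   /* next free slot */
    uint16_t size;        /* usable entries; storage is assumed to hold size + 1 slots */
} sketch_queue_t;

/* Advance an index, wrapping to 0 after the last slot. */
static uint8_t sketch_next_index(const sketch_queue_t *q, uint8_t index)
{
    return (index < q->size) ? (uint8_t)(index + 1) : 0;
}

/* The queue counts as full when the slot after end_index is start_index. */
static int sketch_queue_full(const sketch_queue_t *q)
{
    return sketch_next_index(q, q->end_index) == q->start_index;
}

/* Reserve one slot. The sentinel stays untouched when the queue is full. */
static uint32_t sketch_put(sketch_queue_t *q)
{
    uint16_t event_index = 0xFFFF; /* default value, meaning "no slot reserved" */

    if (!sketch_queue_full(q)) {
        event_index  = q->end_index;
        q->end_index = sketch_next_index(q, q->end_index);
    }
    return (event_index != 0xFFFF) ? SKETCH_SUCCESS : SKETCH_ERROR_NO_MEM;
}

/* Put events until the queue rejects one; return how many were accepted. */
static int sketch_fill(sketch_queue_t *q)
{
    int accepted = 0;
    while (sketch_put(q) == SKETCH_SUCCESS) {
        accepted++;
    }
    return accepted;
}

/* Three situations in which this model reports "no memory". */
static void sketch_self_check(void)
{
    sketch_queue_t normal    = { 0, 0, 4 }; /* healthy queue with 4 entries */
    sketch_queue_t zeroed    = { 0, 0, 0 }; /* all-zero state, size never set */
    sketch_queue_t corrupted = { 1, 0, 4 }; /* start index overwritten */

    assert(sketch_fill(&normal) == 4);                      /* genuinely full after 4 puts */
    assert(sketch_put(&zeroed) == SKETCH_ERROR_NO_MEM);     /* rejects the first put */
    assert(sketch_put(&corrupted) == SKETCH_ERROR_NO_MEM);  /* looks full while holding nothing */
}
```

In this model a larger queue size only helps in the first case. If the state is still all zero or an index has been overwritten, the very first put fails regardless of the configured size, which would be consistent with the observation that raising SCHED_QUEUE_SIZE changed nothing. Whether the real scheduler behaves the same way has to be confirmed by inspecting the actual variables in the debugger.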

  • You can close this ticket.

    We went back to using OPENTHREAD built from git instead of the pre-compiled libraries from the SDK.

    Problem solved.
