nrf_cloud_coap_connect() possible memory leak

Hello

I am using a custom board, based on nRF9160. I'm using SDK 2.7.0 and modem firmware 1.3.6.
After a bug in production where a device stopped sending data, I tracked the issue down to the function nrf_cloud_coap_connect().
In my specific case, after 55 successful calls, the function starts returning -12 (ENOMEM, "Not enough space").
Any call after that returns the same error.
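
For reference, the return value is a negated errno code. A minimal host-side sketch for decoding such a value is shown below; it assumes the host's errno numbering matches the one used on the device for this code (true for 12 on common desktop platforms), and the helper name describe_errno is made up for illustration. The description text comes from the host's C library, so its wording may differ from the "Not enough space" comment in the device headers.

```python
import errno
import os


def describe_errno(ret: int) -> str:
    """Map a (possibly negated) errno-style return code to its symbolic name."""
    code = -ret if ret < 0 else ret
    # errno.errorcode maps numeric codes to names such as "ENOMEM"
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the host C library's description of the code
    return f"{name} ({code}): {os.strerror(code)}"


print(describe_errno(-12))
```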

I guess the number of calls before the failure depends on the firmware. The compiler reports that my project uses ~33% of the available RAM, or 70 kB.

The issue was seen a few times in production (deployed devices), and I managed to reproduce it with the following piece of code:

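// General init

// Connect to LTE-M network

int err;

for (int i = 0; i < 100; i++)
{
    DEBUG_PRINT("Attempt %d\r\n", i);

    // Ensure nothing is left open from the previous iteration
    nrf_cloud_coap_disconnect();

    // Create a new connection
    err = nrf_cloud_coap_connect(NULL);
    if (err)
    {
        // Starts returning -12 (ENOMEM) after 55 successful calls
        DEBUG_PRINT("nrf_cloud_coap_connect() failed: %d\r\n", err);
    }
}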


This is not due to a network issue, as coverage is good in my office and the issue is fully repeatable.

When stepping through with the debugger, the error shows up in nrf_cloud_coap_transport.c.
At line 319, nrf_cloud_coap_connect() calls nrf_cloud_coap_transport_authenticate(), which fails (-1) at line 805.

Is this a known issue? How can I solve it?
I was expecting that calling nrf_cloud_coap_disconnect() would be enough to release any open thread, variable, or whatever else, but it seems something is eating memory at every iteration.
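
To illustrate the pattern I suspect, here is a toy host-side model in Python. It is not the nRF Cloud library: the class LeakyTransport, its slot counter, and the capacity of 55 are all hypothetical, picked only to mirror the 55 successful calls I observed. The real resource being exhausted, and its size, are unknown to me.

```python
class LeakyTransport:
    """Toy stand-in for a transport whose disconnect() does not release a resource."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # hypothetical number of available resource slots
        self.in_use = 0           # slots currently held

    def connect(self) -> int:
        if self.in_use >= self.capacity:
            return -12            # -ENOMEM, as in the report
        self.in_use += 1
        return 0

    def disconnect(self) -> None:
        # Deliberately does NOT decrement in_use: this is the modelled leak.
        pass


def first_failing_attempt(transport, attempts: int = 100):
    """Run the disconnect/connect loop; return the index of the first failure, or None."""
    for i in range(attempts):
        transport.disconnect()
        if transport.connect() != 0:
            return i
    return None


print(first_failing_attempt(LeakyTransport(capacity=55)))
```

In this model, attempts 0 to 54 succeed and attempt 55 is the first to fail, after which every further call fails too, which is the same shape as what I see on the device.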

Thanks




  • Hi Vincent,

    Thanks for reporting this issue. I checked the release notes of NCS and MFW, but there are no related bug fixes mentioned.

    I see that the Cellular: Modem Shell sample provides a similar way to connect to and disconnect from nRF Cloud. I will set aside some time tomorrow to try it on the latest NCS 2.9.0 and MFW 1.3.7 first. You can also run a test with it to see whether the issue is reproducible with this official sample.

    Best regards,

    Charlie

  • Hi Vincent,

    Sorry for the late reply. I ran a test with the NCS 2.9.0 Cellular: Modem Shell sample and MFW 1.3.7 on an nRF9160 DK.
    The test repeated the nRF Cloud connect and disconnect 100 times, and no error occurred.

    The test script and test log are included below. The nrf_cloud_coap_connect() function should not have a memory leak problem.

    import serial.tools.list_ports
    import serial
    import time
    from datetime import datetime
    
    class CloudConnectionTester:
        def __init__(self, port=None):
            if not port:
                port = self._select_com_port()
            self.ser = serial.Serial(port, baudrate=115200, timeout=10)
            self.stats_reset()
    
        def stats_reset(self):
            self.total_cycles = 0
            self.successful_connections = 0
            self.failed_connections = 0
            self.successful_disconnections = 0
    
        def _select_com_port(self):
            while True:
                print("\nAvailable COM ports:")
                ports = serial.tools.list_ports.comports()
                for i, p in enumerate(ports):
                    print(f"  [{i+1}] {p.device} - {p.description}")
                    
                try:
                    selection = int(input("Enter port number or 0 to refresh: "))
                    if selection == 0:
                        continue
                    return ports[selection-1].device
                except (ValueError, IndexError):
                    print("Invalid selection! Please try again.")
    
        def _execute_command(self, command, success_indicator):
            try:
                print(f"\n>>> Sending command: {command}")
                self.ser.write(f"{command}\r\n".encode())
                start_time = time.time()
                response_buffer = ""
                
                while time.time() - start_time < 30:  # 30s timeout
                    line = self.ser.readline().decode().strip()
                    if line:
                        timestamp = datetime.now().strftime('%H:%M:%S.%f')[:-3]
                        print(f"[{timestamp}] {line}")
                        response_buffer += line + "\n"
                        
                        if "failed" in line.lower() or "rejected" in line.lower():
                            return False
                        if success_indicator in line:
                            return True
            
                return False
                
            except Exception as e:
                print(f"Serial communication error: {str(e)}")
                return False
    
        def connection_cycle(self):
            self.total_cycles += 1
            success = False
            
            # Connection phase
            connected = self._execute_command(
                "cloud connect", 
                "nrf_cloud_coap_transport: DTLS CID is active"
            )
            
            if connected:
                self.successful_connections += 1
                print("\033[92mConnection successful\033[0m")  # Green text
                time.sleep(5)  # Maintain connection briefly
                
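                # Disconnection phase
                # Note: the empty success indicator matches any non-empty line, so the
                # first response line without "failed"/"rejected" counts as success.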
                disconnected = self._execute_command(
                    "cloud disconnect", 
                    ""
                )
                
                if disconnected:
                    self.successful_disconnections += 1
                    print("\033[92mDisconnection successful\033[0m")
                else:
                    print("\033[91mDisconnection failed!\033[0m")  # Red text
                
                return True
                
            else:
                self.failed_connections += 1
                print("\033[91mConnection failed!\033[0m")
                return False
    
        def generate_report(self):
            success_rate = (self.successful_connections/self.total_cycles)*100 if self.total_cycles else 0
            
            report = """
    ==========================================
           Connection Test Summary       
    ==========================================
    Total cycles:          {}
    Successful connections: {} ({}%)
    Full success cycles:    {} ({}%)
    Failed connections:     {}
    Disconnect failures:    {}
    ==========================================
    """.format(
        self.total_cycles,
        self.successful_connections,
        round(success_rate),
        self.successful_disconnections,
        round((self.successful_disconnections/self.total_cycles)*100) if self.total_cycles else 0,
        self.failed_connections,
        self.successful_connections - self.successful_disconnections,
    )
                
            print(report)
            
    
    if __name__ == "__main__":
        tester = CloudConnectionTester()
        
        try:
            while True:
                cycles = int(input("Number of test cycles (0 to quit): "))
                if cycles <= 0:
                    break
                
                interval = int(input("Interval between cycles (seconds): "))
                
                for i in range(cycles):
                    result = tester.connection_cycle()
                    tester.generate_report()
                    
                    # Only wait if continuing tests and not last cycle
                    if i < cycles-1 and interval > 0:  
                        print(f"\nWaiting {interval} seconds...")
                        time.sleep(interval)
                        
        except KeyboardInterrupt:
            print("\nTest interrupted by user!")
        
        finally:
            tester.generate_report()
    

    2703.test.log

    Best regards,

    Charlie

  • I ran a similar test with modem firmware 1.3.6 and SDK 2.8.0, and it seems to work fine.

    Looks like there was something wrong with SDK 2.7.0.

    Thanks,
