nRF9160 can't connect to working server

Hello,

I've been developing code for the nRF9160 for a while. In the beginning everything was straightforward and worked just fine, and the code was also pretty simple. Now, about 70% of the time when I try to connect to a server (which I know is working, and the network is also fine), the modem hangs and can't connect.

Over the past months I noticed that the modem very rarely gets into some weird state where it can no longer connect to the network. I found that other DevZone members have noticed the same issue, but that was acceptable since it happened very rarely. During the past week of extensive testing, however, I noticed that `connect()` from socket.h no longer connects to the server at all. I pinned the hang down to `retval = nrf_connect(sd, (struct nrf_sockaddr *)&ipv4, sizeof(struct nrf_sockaddr_in));` in nrf91_sockets.c, lines 476 and 477. I can't step any deeper into this function because Visual Studio can't find the `nrf_connect()` function.
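Since the call into `nrf_connect()` blocks for about 2 minutes before failing, one way to keep the rest of the application responsive is to bound the wait yourself. The sketch below shows the standard POSIX pattern (non-blocking `connect()`, then `poll()` for writability and read `SO_ERROR`) on a plain host socket so it can be run anywhere. Zephyr's socket API offers `fcntl()` and `poll()` as well, but whether the nRF9160's DTLS socket supports a non-blocking connect with a given modem firmware is an assumption to verify in the nrfxlib socket documentation. `connect_with_timeout()` and `connect_ipv4_with_timeout()` are hypothetical helper names, and this only bounds the wait; it does not explain why the connect stalls.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Connect, but never block longer than timeout_ms.
 * Returns 0 on success, or -1 with errno set (ETIMEDOUT if the deadline
 * expires). After a timeout the socket state is undefined: close it and
 * create a new one before retrying. */
int connect_with_timeout(int fd, const struct sockaddr *addr,
                         socklen_t addrlen, int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    /* Non-blocking mode makes connect() return at once with EINPROGRESS. */
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        return -1;
    }

    int ret = connect(fd, addr, addrlen);
    if (ret < 0 && errno == EINPROGRESS) {
        /* Wait until the socket becomes writable or the deadline passes. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int n = poll(&pfd, 1, timeout_ms);
        if (n == 0) {
            errno = ETIMEDOUT;
            ret = -1;
        } else if (n > 0) {
            /* SO_ERROR holds the final result of the connection attempt. */
            int so_error = 0;
            socklen_t len = sizeof(so_error);
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) == 0) {
                errno = so_error;
                ret = (so_error == 0) ? 0 : -1;
            }
        }
        /* n < 0: poll() failed and already set errno; ret stays -1. */
    }

    /* Restore the caller's blocking mode without clobbering errno. */
    int saved_errno = errno;
    (void)fcntl(fd, F_SETFL, flags);
    errno = saved_errno;
    return ret;
}

/* Convenience wrapper for an IPv4 server given as a dotted-quad string. */
int connect_ipv4_with_timeout(int fd, const char *ip, uint16_t port,
                              int timeout_ms)
{
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }
    return connect_with_timeout(fd, (const struct sockaddr *)&sa,
                                sizeof(sa), timeout_ms);
}
```

For a UDP socket `connect()` only records the peer address, so the deadline matters mostly for sockets where connecting involves a handshake (TCP, or DTLS on the modem).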

I don't see any logs on the server side, so I believe the issue is somewhere between my code and the modem. Anyway, it is getting really annoying and I'm not sure how to resolve it. After a 2-minute timeout, the `connect()` call fails with: "Failed to connect socket, error: 114". Error 114 corresponds to ENETUNREACH (network unreachable), but I know for sure that the network is available. I strongly believe the issue is not the network but the modem itself, because "LTE cell changed: Cell ID: 2808860, Tracking area: 41120" notifications come in pretty often, and just before calling `connect()` the signal values are -98 ≤ RSRP < -97 dBm, -14 ≤ RSRQ < -13.5 dB and 13 dB ≤ SNR < 14 dB. Are these values good enough to establish a UDP connection?
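The ranges quoted above look like bucketed report indices rather than exact measurements (for example from `AT+CESQ` or the nRF91 `%CESQ` notification). Assuming that is where they come from, the sketch below converts an index back to its range: RSRP and RSRQ follow the 3GPP TS 36.133 reporting tables, and the SNR mapping is the one documented for `%CESQ`, which should be double-checked against the nRF91 AT command reference for your modem firmware. The function names are hypothetical.

```c
/* Each function writes the [lo, hi) range for a reported index and returns 0,
 * or returns -1 for the open-ended or "not known" indices (e.g. 0, the top
 * index, or 255), which have no two-sided range. */

/* RSRP, TS 36.133: index n in 1..96 means -141+n <= RSRP < -140+n dBm. */
int rsrp_range(int idx, double *lo, double *hi)
{
    if (idx < 1 || idx > 96) {
        return -1;
    }
    *lo = -141.0 + idx;
    *hi = *lo + 1.0;
    return 0;
}

/* RSRQ, TS 36.133: index n in 1..33 means -20+0.5n <= RSRQ < -19.5+0.5n dB. */
int rsrq_range(int idx, double *lo, double *hi)
{
    if (idx < 1 || idx > 33) {
        return -1;
    }
    *lo = -20.0 + 0.5 * idx;
    *hi = *lo + 0.5;
    return 0;
}

/* SNR, nRF91 %CESQ (assumed mapping): index n in 1..48 means
 * -25+n <= SNR < -24+n dB. */
int snr_range(int idx, double *lo, double *hi)
{
    if (idx < 1 || idx > 48) {
        return -1;
    }
    *lo = -25.0 + idx;
    *hi = *lo + 1.0;
    return 0;
}
```

With these mappings, the values in the post correspond to RSRP index 43, RSRQ index 12 and SNR index 38.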

A bit more context about the project: at startup the nRF9160 creates a CoAP + DTLS socket and connects to our server. Afterwards it just waits for input on UART1 (from a different MCU), and when data is received, the LTE module encodes it and sends it to the server. There are of course some additional peripherals running, such as timers and a watchdog, but that's basically it.
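To make "encodes the data" concrete, here is a hand-rolled sketch of a CoAP POST as defined in RFC 7252: a 4-byte header, the token, a single Uri-Path option, and the payload after a 0xFF marker. In an NCS project you would normally use Zephyr's CoAP library (`coap_packet_init()` and related functions) instead, and DTLS is handled by the socket, not here. The `data` resource name and `coap_encode_post()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CoAP message types (RFC 7252, section 3). */
enum coap_type { COAP_CON = 0, COAP_NON = 1, COAP_ACK = 2, COAP_RST = 3 };

#define COAP_VERSION        1
#define COAP_CODE_POST      0x02 /* code 0.02 */
#define COAP_OPT_URI_PATH   11
#define COAP_PAYLOAD_MARKER 0xFF

/* Encode a POST with a single Uri-Path segment into buf.
 * Returns the encoded length, or -1 if the arguments are invalid or buf is
 * too small. */
int coap_encode_post(uint8_t *buf, size_t cap, enum coap_type type,
                     uint16_t msg_id, const uint8_t *token, size_t tkl,
                     const char *uri_path,
                     const uint8_t *payload, size_t plen)
{
    size_t path_len = strlen(uri_path);
    if (tkl > 8 || path_len > 268) {
        return -1; /* token is at most 8 bytes; one extended length byte max */
    }
    size_t need = 4 + tkl + (path_len < 13 ? 1 : 2) + path_len
                  + (plen > 0 ? 1 + plen : 0);
    if (need > cap) {
        return -1;
    }

    size_t i = 0;
    /* Header: Ver (2 bits) | Type (2 bits) | Token length (4 bits). */
    buf[i++] = (uint8_t)((COAP_VERSION << 6) | ((unsigned)type << 4) | tkl);
    buf[i++] = COAP_CODE_POST;
    buf[i++] = (uint8_t)(msg_id >> 8);   /* Message ID, network byte order */
    buf[i++] = (uint8_t)(msg_id & 0xFF);
    if (tkl > 0) {
        memcpy(&buf[i], token, tkl);
        i += tkl;
    }

    /* First option: delta = 11 - 0, which fits in the 4-bit delta field. */
    if (path_len < 13) {
        buf[i++] = (uint8_t)((COAP_OPT_URI_PATH << 4) | path_len);
    } else {
        buf[i++] = (uint8_t)((COAP_OPT_URI_PATH << 4) | 13);
        buf[i++] = (uint8_t)(path_len - 13); /* extended length */
    }
    memcpy(&buf[i], uri_path, path_len);
    i += path_len;

    if (plen > 0) {
        buf[i++] = COAP_PAYLOAD_MARKER;
        memcpy(&buf[i], payload, plen);
        i += plen;
    }
    return (int)i;
}
```

The message ID and token would normally be generated per request so that CON retransmissions and ACKs can be matched.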

SDK and toolchain version 2.6.0, modem firmware version mfw_nrf9160_1.3.6

Edit: The longer I think about this issue, the more I believe it could be a network problem. I started working from a new location 2 weeks ago, and one week after moving I resumed stability testing on our nRF9160-based device. That is also when I started noticing the connection issues. I didn't want to believe it was network related, since I am still in a city center with good 4G coverage and the LTE-M coverage map shows the device should be within the coverage zone, but it increasingly looks like it really is a signal issue.
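If the suspicion is coverage at the new location, one way to turn the frequent "LTE cell changed" messages into a number is to count serving-cell changes over a test run and compare locations. The sketch below parses the raw `+CEREG` notification, whose TAC and cell ID fields are hex strings per 3GPP TS 27.007 (tracking area 41120 is `A0A0` and cell ID 2808860 is `002ADC1C`). It assumes those notifications are visible to the application, for example through the NCS AT monitor library; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

struct cell_tracker {
    uint32_t cell_id;
    uint32_t tac;
    unsigned changes; /* how many times the cell ID or TAC changed */
    int valid;        /* set once the first cell has been seen */
};

/* Parse '+CEREG: <stat>,"<tac>","<ci>"...'. Returns 0 and fills the outputs,
 * or -1 if the line carries no TAC/cell ID. */
int cereg_parse(const char *line, int *stat, uint32_t *tac, uint32_t *ci)
{
    int s;
    unsigned int t, c;
    if (sscanf(line, "+CEREG: %d,\"%x\",\"%x\"", &s, &t, &c) != 3) {
        return -1;
    }
    *stat = s;
    *tac = t;
    *ci = c;
    return 0;
}

/* Feed one notification line; returns 1 if the serving cell changed. */
int cell_tracker_update(struct cell_tracker *tr, const char *line)
{
    int stat;
    uint32_t tac, ci;
    if (cereg_parse(line, &stat, &tac, &ci) != 0) {
        return 0;
    }
    int changed = tr->valid && (ci != tr->cell_id || tac != tr->tac);
    if (changed) {
        tr->changes++;
    }
    tr->cell_id = ci;
    tr->tac = tac;
    tr->valid = 1;
    return changed;
}
```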

  • > I don't see any logs on server side ... our server

    If you don't provide information about the server (e.g. which implementation and version), it will be hard to find the cause.

    If you don't get logs on the server side but have access to the VM/PC the server is running on, an IP capture on the server side may help to see more. Alternatively, a modem trace may also reveal more.

    For IP captures, you may follow the instructions at this link to Californium's Wiki.

    For modem traces, you will find the information here in this forum.

  • Well, the problem is that I don't have direct access to these server logs. I asked the system administrator to check whether any messages come in when this issue happens, but he didn't see anything out of the ordinary. I'll try using a modem trace.

  • And do you know which implementation is used on the server side?

  • I only know that it runs on AWS IoT Core.

    But I managed to find the issue. For some reason, RRC is not in connected mode when the device tries to create a socket. I noticed this just this morning, when the messages were again not delivered to the server. It looks like a weird bug where `connect()` and `send()` don't change the RRC mode (RRC stays idle), although the socket opening and CoAP packet creation logic has not changed (I only added some logging and double checks). As a template I used this project and just added what I needed for logging. I reverted the code and now everything runs smoothly.

    Anyway, thanks for your input, much appreciated! Now I need to pinpoint exactly what causes this so I can fix it in code. If by any chance you have seen something similar and have an idea which peripheral or setting might cause it, let me know!
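Since the root cause turned out to be RRC staying idle after `connect()`/`send()`, it may help to detect that condition at runtime while narrowing down which change triggers it. RRC mode changes are reported by the `+CSCON` notification (enabled with `AT+CSCON=1`, 3GPP TS 27.007; mode 0 is idle and 1 is connected), and the NCS LTE link controller exposes the same information as `LTE_LC_EVT_RRC_UPDATE` events. The sketch below parses the raw lines so it runs on a host; the tracker and its functions are hypothetical names, and timestamps are passed in explicitly (e.g. from `k_uptime_get()` on the device).

```c
#include <stdint.h>
#include <stdio.h>

enum rrc_mode { RRC_UNKNOWN = -1, RRC_IDLE = 0, RRC_CONNECTED = 1 };

struct rrc_tracker {
    enum rrc_mode mode;
    int64_t last_connected_ms; /* last idle->connected transition, -1 if none */
};

void rrc_tracker_init(struct rrc_tracker *tr)
{
    tr->mode = RRC_UNKNOWN;
    tr->last_connected_ms = -1;
}

/* Parse "+CSCON: <mode>[,...]"; returns 0 or 1, or -1 for any other line. */
int cscon_parse(const char *line)
{
    int mode;
    if (sscanf(line, "+CSCON: %d", &mode) != 1 || (mode != 0 && mode != 1)) {
        return -1;
    }
    return mode;
}

/* Feed one notification line received at now_ms. */
void rrc_tracker_update(struct rrc_tracker *tr, const char *line, int64_t now_ms)
{
    int mode = cscon_parse(line);
    if (mode < 0) {
        return;
    }
    if (mode == RRC_CONNECTED && tr->mode != RRC_CONNECTED) {
        tr->last_connected_ms = now_ms;
    }
    tr->mode = (enum rrc_mode)mode;
}

/* Returns 1 if RRC is connected now or entered connected mode at or after
 * since_ms (e.g. the time of a send()); 0 means RRC stayed idle. */
int rrc_seen_connected_since(const struct rrc_tracker *tr, int64_t since_ms)
{
    return tr->mode == RRC_CONNECTED || tr->last_connected_ms >= since_ms;
}
```

Checking `rrc_seen_connected_since()` a few seconds after each send, and logging when it returns 0, would flag exactly the "send without RRC connection" case described above.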
