Doc for reliable connection missing

Hi,

while browsing through the Q&A I came across this thread:

 How to recover a nRF9160 from sporadic ENETDOWN 

The conclusion is: if sendto() on a UDP socket fails with errno set to ENETDOWN, the modem has disconnected and I am supposed to close and reopen
the socket. Fine.
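
A minimal sketch of that close-and-reopen step, assuming POSIX-style socket names are available in the build and the socket is a UDP socket. The helper names `needs_reopen` and `send_or_reopen` are hypothetical and not part of any SDK; only the ENETDOWN rule itself comes from the linked thread:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: per the linked thread, ENETDOWN means the socket
 * is dead and has to be closed and reopened. */
int needs_reopen(int err)
{
    return err == ENETDOWN;
}

/* Hypothetical helper: send one datagram. If sendto() fails with ENETDOWN,
 * close the stale socket and open a fresh one in *fd.
 * Returns the number of bytes sent, or -1 with errno set. After ENETDOWN the
 * datagram was NOT sent; the caller decides when to retry. */
ssize_t send_or_reopen(int *fd, const void *buf, size_t len,
                       const struct sockaddr *dst, socklen_t dst_len)
{
    ssize_t sent = sendto(*fd, buf, len, 0, dst, dst_len);

    if (sent >= 0 || !needs_reopen(errno)) {
        return sent; /* success, or an error this sketch does not handle */
    }

    close(*fd);
    /* Assumption: a plain datagram socket of the destination's family. */
    *fd = socket(dst->sa_family, SOCK_DGRAM, 0);
    errno = ENETDOWN; /* socket() may have changed errno; report the cause */
    return -1;
}
```

When to retry after a reopen (immediately, after a delay, or only once the modem reports registration again) is left to the application.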

But honestly, without randomly browsing through the Q&A, how should I have known this? Is there any documentation on how to
"just keep the network up and be able to send data"? Without finding this thread I would have guessed that, OK, the network is down, it
will reconnect, and my socket will work again.

Are there other error cases that do not recover by themselves and require action by the application?
I am aware that the action is application dependent (for example, how long to wait before trying to reconnect so as not to drain the battery).
And yes, I am supposed to do testing. But handling rare cases is hard to test, and trying to find out the "black box" behavior by testing
is also hard.
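
As an illustration of the "how long to wait" trade-off, here is a sketch of a capped exponential backoff for reconnect attempts. The function name and both constants are assumptions chosen for the example, not values recommended anywhere in this thread:

```c
#include <stdint.h>

#define RETRY_BASE_S 5u    /* assumed first delay, in seconds */
#define RETRY_MAX_S  3600u /* assumed upper bound, in seconds */

/* Hypothetical helper: delay before reconnect attempt number `attempt`
 * (0-based). Doubles on every failed attempt and is capped at RETRY_MAX_S. */
uint32_t reconnect_delay_s(unsigned attempt)
{
    uint32_t delay = RETRY_BASE_S;

    /* Stop doubling once the cap is reached, so the value cannot overflow. */
    while (attempt-- > 0 && delay < RETRY_MAX_S) {
        delay *= 2;
    }
    return delay < RETRY_MAX_S ? delay : RETRY_MAX_S;
}
```

A battery-powered device would typically reset the attempt counter after the first successful exchange.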

A writeup of what can happen and how the application should react would be really nice. Or is there already an overview?

Spending hours browsing through Q&A topics to find out whether there is another case I need to act on is not really
a solution.

Regards,

Clemens

  • > select() works when having an offloaded network socket and a socketpair socket at the same time in the fd_set?

    Sorry, that was misleading. I have only one socket, the UDP network socket. I select for data available to read and for exceptions (socket errors). In case of network loss, my observation is that you don't receive a response, but you do get the "exception", and you can then read the socket error with

    getsockopt(fd, SOL_SOCKET, SO_ERROR, &error, &len)

    In my case, I only wait to receive data for a very short time after sending the request. So, within the "message exchange", the application isn't able to push new data to send. Once I have received the response, I wait for another "event" (mutex), and that triggers the message exchange again.

    So, the basic improvement is to select for both received data and socket errors.
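
    The select-based wait described above could look roughly like the sketch below. The function name `wait_for_response` and its return convention are assumptions for illustration. That a lost network shows up in the exception set is the observation reported in this reply; other socket stacks may signal a pending error differently.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: wait until `fd` has data to read or reports an
 * exception. Returns 1 if data is readable, 0 on timeout, or a negative
 * errno value on failure. */
int wait_for_response(int fd, int timeout_ms)
{
    fd_set readfds, exceptfds;
    struct timeval tv = {
        .tv_sec = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };

    FD_ZERO(&readfds);
    FD_ZERO(&exceptfds);
    FD_SET(fd, &readfds);
    FD_SET(fd, &exceptfds);

    int ready = select(fd + 1, &readfds, NULL, &exceptfds, &tv);
    if (ready < 0) {
        return -errno; /* select() itself failed */
    }
    if (ready == 0) {
        return 0; /* timeout: no response and no error */
    }

    if (FD_ISSET(fd, &exceptfds)) {
        int error = 0;
        socklen_t len = sizeof(error);

        /* Fetch the pending socket error, as in the getsockopt() call above. */
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &error, &len) < 0) {
            return -errno;
        }
        return error != 0 ? -error : -EIO; /* -EIO: exception without a code */
    }
    return 1; /* data is ready; recvfrom() should not block */
}
```

    A caller would invoke recvfrom() only on a return value of 1 and treat a negative value as the trigger for closing and reopening the socket.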

    > (network: one of the three german: sim is from 1nce thus roaming across all).

    With the 1NCE website you can trigger a "disconnect" in order to test the behavior.

    About your logs:

    I call "recvfrom" only if select reports data to read, so in my case it usually doesn't return errors. Before I selected for errors, I only got the errors when retrying to send the pending request. Now I detect the error earlier.

  • Hi,

    ClemensG said:
    Well, still I don't know what poll() would do if the socket would become unavailable due to my initial post. Can't find documentation.

    If the network connection is lost, or the modem is turned off, poll() should return with POLLERR on the sockets that have been closed by the modem.

    You can also send() on a socket that you are poll()ing on, in a different thread. So you don't really have to synchronize the sending and the receiving thread, other than the state of the connection.

    You can see one example of how this can be done in the mqtt_simple sample.
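
    The poll() behavior described above can be sketched as follows. The enum, the function name `poll_and_recv`, and the choice to treat POLLHUP and POLLNVAL the same as POLLERR are assumptions for illustration; this is not code taken from the mqtt_simple sample.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

enum sock_event {
    SOCK_EVT_TIMEOUT, /* nothing happened within the timeout */
    SOCK_EVT_DATA,    /* data was received into the buffer */
    SOCK_EVT_CLOSED,  /* socket reported an error or hang-up: reopen it */
    SOCK_EVT_FAILED,  /* poll() or recv() itself failed, see errno */
};

/* Hypothetical helper: poll one socket and receive only when data is ready.
 * On SOCK_EVT_DATA, *received holds the number of bytes read. */
enum sock_event poll_and_recv(int fd, void *buf, size_t len,
                              int timeout_ms, ssize_t *received)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0) {
        return SOCK_EVT_FAILED;
    }
    if (ready == 0) {
        return SOCK_EVT_TIMEOUT;
    }

    /* Check the error flags first: they can be set together with POLLIN. */
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
        return SOCK_EVT_CLOSED;
    }

    *received = recv(fd, buf, len, 0);
    return *received < 0 ? SOCK_EVT_FAILED : SOCK_EVT_DATA;
}
```

    A receive thread can loop on this helper and, on SOCK_EVT_CLOSED, update the shared connection state so that the sending thread stops using the old descriptor.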

    But, I agree that this isn't documented well enough, and I will bring your feedback to our developers.

    Best regards,

    Didrik

  • Thanks for clarifying that poll works for a modem socket. I have already adapted my code as planned.

    I guess the pain for ClemensG comes with the socketpair, but I don't use that.

  • Achim, first of all, thank you very much for pointing out a few things; it really helped.

    Still, my intention here was to give Nordic a hint that it's really hard to find out these things, and I feel there is a lack of documentation here.

    "When the modem reregisters with a new network, which can happen when a cell from another network provider is selected (with a roaming-enabled SIM card), all sockets are closed by the modem and need to be reopened by the application.

    A closed socket will be signaled to the application by sendto() returning <0 and setting errno to ENETDOWN"

    I am missing the document that states these sentences, and possibly others like them, to tell users of the SDK some basic "best practices", especially when they are not obvious or differ from other systems (when programming under Linux, this case would most likely not be an issue). Well, I was lucky: I am in a city with many cells around, so this case happened during testing at the bench. Otherwise, I guess it might have shown up very late in the project.

    Maybe these things seem very obvious if you're a long-term user of the nRF9160, but they are hard to figure out when you are just getting into it. So my comments are more feedback for Nordic on where new users struggle.

    The same holds true for using poll() on network sockets and socketpairs at the same time. It's okay that you can send and receive on a socket in different threads, but in the POSIX world most people would see that as bad design, would most likely start with the socketpair, and would fail at first.

    I also guess there are more things "hidden" in the modem/SDK state machine that would be useful to know, along with some recommendations on how to act (when to reset the modem, when to reopen a socket, what is allowed and what is not, ...). Just the things that are known by design. Of course there will be bugs as well.

    All in all: Nordic, this is just feedback from a new user who is hoping for a "Best practices" document on the software side. Having to look through all the samples and guess what is done, why it is done, and what is relevant is really hard work that could be greatly simplified.

    I'm pretty sure many others have stumbled across this and had to find it out by themselves.

  • Over the last months I was already considering asking for something similar, such as a FAQ with the outcome and current recommendation for questions asked here in the forum. I guess that FAQ would require maintenance, since some of the topics are evolving and the "best practice" may change over time.

    Candidates:

    - "rock solid" communication, sockets, network loss

    - tracing in the wild and/or for long term

    - RAI: CP/AS? TCP/UDP?

    - Energy consumption general

    -- Energy consumption nRF9160-DK specific (e.g. UART)

    -- Energy consumption Thingy:91 specific (e.g. UART, 3.3V for version < 1.6)

    Just some for a list.
