Doc for reliable connection missing

Hi,

while browsing through the Q&A I came across this thread:

How to recover a nRF9160 from sporadic ENETDOWN

The conclusion is: if sendto() on a UDP socket fails with errno ENETDOWN, the modem has disconnected and I am supposed to close and reopen
the socket. Fine.
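A minimal sketch of that close-and-reopen step, assuming plain POSIX socket calls (on the nRF9160 the equivalent Zephyr socket API would be used). The helper name udp_send_reopen() is hypothetical and not part of any SDK, and retrying exactly once is an arbitrary choice for illustration:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one datagram. If the stack reports ENETDOWN,
 * close the stale socket, open a fresh one and retry exactly once.
 * *fd is updated to the new descriptor (negative if reopening failed).
 * Returns the number of bytes sent, or -1 with errno set.
 */
ssize_t udp_send_reopen(int *fd, const void *buf, size_t len,
                        const struct sockaddr *dst, socklen_t dst_len)
{
    ssize_t ret = sendto(*fd, buf, len, 0, dst, dst_len);

    if (ret >= 0 || errno != ENETDOWN) {
        return ret; /* success, or an error this sketch does not handle */
    }

    /* ENETDOWN: the old socket will not recover, so replace it. */
    close(*fd);
    *fd = socket(dst->sa_family, SOCK_DGRAM, 0);
    if (*fd < 0) {
        return -1; /* caller should back off and try again later */
    }

    return sendto(*fd, buf, len, 0, dst, dst_len);
}
```

If the second sendto() fails as well, the network is probably still down, and how long to wait before the next attempt is an application decision (battery budget).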

But honestly, without randomly browsing through the Q&A, how should I have known this? Is there any documentation on how to
"just keep the network up and be able to send data"? Without finding this thread I would have guessed that, ok, the network is down, it
will reconnect and my socket will work again.

Are there other errors that do not recover by themselves and require action by the application?
I am aware that the action is application dependent (how long to wait before trying to reconnect so as not to drain the battery, and so on).
And yes, I am supposed to do testing. But handling rare cases is hard to test, and trying to find out the "black box" behavior by testing
is also hard.

A writeup of what can happen and how the application should react would be really nice. Or is there already an overview?

Spending hours in the Q&A browsing through topics to find out if there is another case I need to act on is not really
a solution.

Regards,

Clemens

0 Achim Kraus over 2 years ago

So true, such documentation would be very helpful!

I also tried to investigate this topic, see

how to recover a nrf9160 from sporadic enetdown

nrf9160 sent udp message do not longer reach destination after 4 weeks

and the really nasty one

mfw_nrf9160_1.3.2.zip behavior change on lost network

(That means, even if you find a "good enough" solution for version x.y, you need to re-verify that on an update. OK, for now, that will not be the only function to be verified on updates.)

Others are also struggling

lte-m reconnect after los

What seems to be common:

Nordic prefers to have traces, but tracing with a PC in the wild or over the long term isn't really possible.

So the pain has many aspects:

- the missing information from Nordic

- hard to find out by yourself

- hard to fix issues, if traces are not possible for users.

I have just redesigned my own approach (for mfw 1.3.2); I will see whether it is more stable when there is trouble on the network side.

(Just in case you're interested: in the meantime I check for SOCKET_ERRORS. If I find one, I check/wait if/until the modem is "Ready" (RRC connected, registered, PDN active), then reopen the socket and resend the last message.)
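The approach described above can be sketched as a small decision function. This is only an illustration under assumptions: struct modem_state, modem_ready() and next_recovery_action() are hypothetical names, and on the nRF9160 the three flags would presumably be updated from modem event callbacks (for example the lte_lc event handler), which is not shown here.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the modem state, updated from event callbacks. */
struct modem_state {
    bool rrc_connected; /* RRC mode is "Connected" */
    bool registered;    /* registered, home or roaming */
    bool pdn_active;    /* default PDN context is active */
};

enum recovery_action {
    RECOVERY_NONE,          /* no socket error pending */
    RECOVERY_WAIT_MODEM,    /* socket error seen, modem not ready yet */
    RECOVERY_REOPEN_RESEND, /* modem ready: reopen socket, resend last message */
};

/* "Ready" in the sense used above: all three conditions hold. */
bool modem_ready(const struct modem_state *m)
{
    return m->rrc_connected && m->registered && m->pdn_active;
}

/* Decide what to do after a send/receive attempt. */
enum recovery_action next_recovery_action(bool socket_error,
                                          const struct modem_state *m)
{
    if (!socket_error) {
        return RECOVERY_NONE;
    }
    return modem_ready(m) ? RECOVERY_REOPEN_RESEND : RECOVERY_WAIT_MODEM;
}
```

Keeping the decision in one pure function means the event handlers only set flags, which may help with the concern about handlers running in different contexts.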

0 ClemensG over 2 years ago in reply to Achim Kraus

Hi Achim, hi Nordic people

yes, I already came across your (good, valid) topics in the Q&A. And many thanks for the roundup.

I am trying to get one of the most simple, typical cases (I guess) up and running:

1 connect to the network
2 sleep
3 take some measurements and send them via UDP
4 wait for UDP acknowledgement packets
5 (let the modem enter PSM)
6 go to 2
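Steps 3 and 4 of the list above could look roughly like this when a fresh socket is used for every upload. This is a sketch with plain POSIX calls, not Nordic's recommended pattern; upload_once() is a hypothetical helper, and it treats any received datagram as the acknowledgement, whereas a real protocol would check a sequence number.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a fresh UDP socket, send one payload, wait up to
 * ack_timeout_ms for any reply, then close the socket again.
 * Returns 0 if a reply arrived, 1 if none arrived in time, -errno on error.
 */
int upload_once(const char *ip, uint16_t port, const void *payload,
                size_t len, int ack_timeout_ms)
{
    struct sockaddr_in dst;
    struct timeval tv;
    char ack[64];
    int fd, result;

    if (ack_timeout_ms <= 0) {
        return -EINVAL; /* a zero SO_RCVTIMEO would mean "block forever" */
    }

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1) {
        return -EINVAL;
    }

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -errno;
    }

    tv.tv_sec = ack_timeout_ms / 1000;
    tv.tv_usec = (ack_timeout_ms % 1000) * 1000;

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0 ||
        sendto(fd, payload, len, 0, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        result = -errno;
    } else if (recv(fd, ack, sizeof(ack), 0) >= 0) {
        result = 0; /* acknowledged */
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
        result = 1; /* no ack in time: keep the data for the next cycle */
    } else {
        result = -errno; /* e.g. ENETDOWN; the socket is closed below anyway */
    }

    close(fd);
    return result;
}
```

Because the socket never outlives one cycle, a stale socket after a network change cannot carry over into the next upload.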

While reading the docs you get the impression that this is easy to do: call lte_lc_connect(), the modem takes care of staying
connected to the network, and I can do the usual socket programming.

Not the case. Somehow things get more and more complicated. Of course I am aware that the modem has its own state machine and that I have to synchronize with it somehow. The problem is rather that there is no documentation on how this should be done. Yes, it might be hidden in the AT command reference (but then there is no documentation on, e.g., what lte_lc_xxx is actually doing or taking care of), in the official 3GPP documentation, and in POSIX (it might be stated there that ENETDOWN requires you to reopen the socket). But while everything might be technically correct: please, Nordic, provide an overview.

You start subscribing to more and more modem state messages and event handlers, and my fear of race conditions and deadlocks grows with every line I write. To avoid exactly this I chose the nRF9160 instead of other modules with serial AT commands. I thought: hey, there is an SDK with an OS and they sort these things out for me. But somehow all the states to be handled are now spread across return codes and handler events (and I always have to look up in which context these handlers are called: own task, system workqueue, or ISR).

I would really love to see a few pages written by Nordic explaining their idea of how users should use that system/SDK/OS, and which errors and/or states I am supposed to act on and how. And I am just talking about regular functionality, not bugs (I fully understand that those are also at the party).

Maybe a sample showing how to do this. No, not the Asset Tracker Monster. That is way too extensive to serve as a basic example. And if someone argues that this is the way it needs to be done, then something's wrong with the basic API.

Just another example:

I already changed step 4 to just wait with some timeout, as doing it the POSIX way does not seem to be possible:
have one thread poll() on the network socket (wake up on receiving a packet) and on a socketpair (wake up when another thread tells me there is new data to be sent).
That is not possible, as poll() on network sockets uses the nrf lib, which does not know anything about Zephyr's socket pairs. Okay, I can understand this technically, but this missing capability is not documented. And I still don't know what poll() would do if the socket became unavailable as in my initial post. I can't find documentation.
For me it's ugly but fine to just poll for a few seconds, so that should work. But what if someone wants to receive packets in between (eDRX)? Have an extra task for receiving? What to do on ENETDOWN?
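A sketch of the "poll one socket with a timeout" workaround, using the standard POSIX poll() call. wait_readable() is a hypothetical helper. How the offloaded nRF9160 socket reports a lost network through revents is not documented in this thread, so mapping POLLERR/POLLHUP to "reopen the socket" is an assumption.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if data is ready, 0 on timeout, -errno if poll() itself failed,
 * -EBADF if the descriptor is not valid (POLLNVAL), -EIO on POLLERR/POLLHUP.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0) {
        return -errno;
    }
    if (ret == 0) {
        return 0; /* timeout: nothing received */
    }
    if (pfd.revents & POLLNVAL) {
        return -EBADF; /* descriptor not usable with poll() */
    }
    if (pfd.revents & (POLLERR | POLLHUP)) {
        return -EIO; /* assumption: treat as "close and reopen the socket" */
    }
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

Checking revents explicitly makes a POLLNVAL result visible instead of silently looking like "no data".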

For now I open my socket to tx and rx, then close the socket (which should get rid of ENETDOWN). My protocol layer takes care of resending the data the next time if no ack was received. Still, just yesterday I found the modem somehow stuck (while consuming non-idle current; sadly no logs). Maybe I need to subscribe to other result codes... (see above).
Or somehow detect that and reset the system (and then have to deal with getting my stale data packet into the new session...).
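The "resend next time if no ack was received" part can be sketched as a small store of unacknowledged packets. This is not the actual protocol layer from the post; all names (pending_queue, pending_add(), pending_ack(), pending_count()) and the sizes are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PENDING_SLOTS 4
#define PAYLOAD_MAX   256

/* Hypothetical store of packets that were sent but not yet acknowledged. */
struct pending_packet {
    bool in_use;
    uint16_t seq_no;
    size_t len;
    uint8_t data[PAYLOAD_MAX];
};

struct pending_queue {
    struct pending_packet slot[PENDING_SLOTS];
};

/* Keep a copy of the packet; returns the slot index, or -1 if it does not fit. */
int pending_add(struct pending_queue *q, uint16_t seq_no,
                const void *data, size_t len)
{
    if (len > PAYLOAD_MAX) {
        return -1;
    }
    for (int i = 0; i < PENDING_SLOTS; i++) {
        struct pending_packet *p = &q->slot[i];

        if (!p->in_use) {
            p->in_use = true;
            p->seq_no = seq_no;
            p->len = len;
            memcpy(p->data, data, len);
            return i;
        }
    }
    return -1; /* queue full: the caller decides what to drop */
}

/* Release the packet with this sequence number; true if it was pending. */
bool pending_ack(struct pending_queue *q, uint16_t seq_no)
{
    for (int i = 0; i < PENDING_SLOTS; i++) {
        if (q->slot[i].in_use && q->slot[i].seq_no == seq_no) {
            q->slot[i].in_use = false;
            return true;
        }
    }
    return false;
}

/* Number of packets that still have to be sent (again) in the next cycle. */
size_t pending_count(const struct pending_queue *q)
{
    size_t n = 0;

    for (int i = 0; i < PENDING_SLOTS; i++) {
        if (q->slot[i].in_use) {
            n++;
        }
    }
    return n;
}
```

With such a store, a failed receive needs no special handling: the packet simply stays pending until an ack for its sequence number arrives in a later cycle.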

Somehow it feels like I do not get the idea of how to use the SDK, and that I am the only one. Am I missing something? Is there some documentation on how it is supposed to be done that everyone else has read?

Regards

Clemens

0 Achim Kraus over 2 years ago in reply to ClemensG

> and I am the only one.

Surely not. I guess many just reboot more frequently.

And also many don't want to spend the time to prepare the traces and go through the "questionnaires".

> Asset Tracker Monster

If I understand lte-m reconnect after los right, then the monster is also not "rock solid" ;-).

0 Achim Kraus over 2 years ago in reply to ClemensG

> Not possible as poll() on network sockets is using the nrf lib

Oh, so I have to stay with "select". I planned to switch to "poll", but with that, I will save the time.

By the way, select for "exceptions" works, so that's the way I now detect "network errors".

If you're interested, I can post my logging output from sending data (UDP) while I reset my "connection" on the SIM card provider's web page.

But I would also prefer clear documentation. All that "trial-and-error" isn't the way I like it.

0 ClemensG over 2 years ago in reply to Achim Kraus

Really? select() works when having an offloaded network socket and a socketpair socket at the same time in the fd_set?

Because poll() returns ZSOCK_POLLNVAL in the revents for the socketpair socket. Works with only the socketpair socket or only the networking socket.

I have not exactly tried what it does with exceptions. For now I'll just call recvfrom and if it fails, it fails. I just treat it as "ack not received", i.e. "nothing received", and the data will be resent at the next upload (always with a fresh socket). This works as I am not waiting for data in between. Just PSM, no eDRX.

Still, handling the ENETDOWN thing (closing and reopening the socket) is not even covered in the nrf9160 udp sample. It should be covered there.

BTW just captured that case again.

ncs 2.1.2, modem FW 1.3.3, LTE-M (network: one of the three German ones; the SIM is from 1nce, thus roaming across all).

So what happens is (this is using nearly the original example lte_handler() with some application debug output):
1) I am in PSM and write data to the socket
2) the modem signals a cell update
3) it goes into RRC connected
4) recvfrom returns ENETDOWN (old fw, recvfrom is retried every 100 ms here)
5) later the modem reregisters.

          [04:50:43.418,670] <dbg> tbws: main: Time to upload
          [04:50:43.419,372] <dbg> tbws: main: encoded measurement payload: 161
          [04:50:43.419,433] <dbg> llp: llp_enqueue_packet: storing packet seq_no 61 @idx 0
          [04:50:43.419,494] <dbg> tbws: main: Transmitting
(1) -> [04:50:43.428,253] <dbg> llp: tx_packets: sending package seq_no 61 (ret: 0) @idx 0
          [04:50:43.435,028] <dbg> llp: llp_transmit: 1 measurement packets sent
          [04:50:43.435,058] <dbg> llp: llp_transmit: sending generic packets
(2)-> [04:50:49.427,978] <inf> tbws: Cell ID: 18028801, Tracking area: 47300
(3)-> [04:50:49.943,603] <inf> tbws: RRC mode: Connected
(4)->   [04:50:50.052,154] <err> llp: recvfrom error: 115
          [04:50:50.152,374] <err> llp: recvfrom error: 115
          [04:50:50.252,624] <err> llp: recvfrom error: 115

and later:

[04:50:52.758,880] <err> llp: recvfrom error: 115
[04:50:52.859,130] <err> llp: recvfrom error: 115
[04:50:52.893,737] <inf> tbws: Network registration status: Connected - roaming

[04:50:52.894,561] <inf> tbws: PSM parameter update: TAU: 90000, Active time: 16
[04:50:52.959,381] <err> llp: recvfrom error: 115

I guess what happens here is that the modem finds a better cell, but in a different network, and
has to reregister. This seems to force me to close and reopen the socket.

Checked the logs: yes, it is changing from nmc:1 (telekom) to nmc: 2 (vodafone)
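In the log above the receive loop keeps retrying on a socket that will not recover. One way to avoid that is to map the error to an action before retrying. This is only a sketch: the enum and classify_recv_error() are hypothetical names, and ENETDOWN is the only "reopen" case confirmed in this thread, so any further error codes would have to be added from real observations.

```c
#include <errno.h>

enum sock_action {
    SOCK_RETRY,   /* transient: try the same socket again */
    SOCK_REOPEN,  /* socket is stale: close it and open a new one */
    SOCK_GIVE_UP, /* unexpected: report to the application */
};

/* Map the errno of a failed recvfrom() to what the caller should do next. */
enum sock_action classify_recv_error(int err)
{
    switch (err) {
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
    case EINTR:
        return SOCK_RETRY; /* nothing received yet, or interrupted */
    case ENETDOWN:
        return SOCK_REOPEN; /* the case discussed in this thread */
    default:
        return SOCK_GIVE_UP;
    }
}
```

A receive loop would then stop on the first SOCK_REOPEN instead of logging the same error every 100 ms.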