AF_LTE socket and nrfxlib

After many hours of successful LTE connections and using the AF_LTE socket just fine (many creates, many send/receive, many closes), my application now hangs on a call to socket(AF_LTE,0,NPROTO_AT). Hung, as in the call to socket() does not return. I have no idea where to look to resolve this. Is the nrfxlib code available for review? I'll sign an NDA if I can get my eyes on it. I've spent way too much time trying to code around strange behavior in nrfxlib/nrf/zephyr.

This is running on nrf9160 DK, modem fw v1.1.1 and NCS v1.2.0

Mike

0 Didrik Rokhaug over 5 years ago

Hi.

I cannot give you access to the bsdlib source code. However, I can try to help you find the source of the problem.

Do you use the at_cmd library to send AT commands?

Do you see similar behavior on other types of sockets?

How many other sockets are you using (and are you using the lwm2m_carrier library or other libraries that might use sockets)?

Are you able to capture a modem trace that captures the problem?

Best regards,

Didrik

0 miked531 over 5 years ago in reply to Didrik Rokhaug

Hi,

I do not use at_cmd; I rolled my own before the at_* libraries were mature enough for my use.

I have seen various socket issues over the past several months and have several tickets in devzone. They have all(?) been resolved by now.

I should have no more than 3 sockets open at once (AF_LTE for monitoring the modem is always open and normally waiting on recv(); AF_LTE for commanding the modem, only occasionally; AF_INET for send/recv of UDP data once we are connected to the network). My occasional AF_LTE socket for commanding keeps getting fd=1 or 2 when created, so I don't think I am leaking them anywhere. I'm not using any additional libs that should be using sockets (and not using lwm2m_carrier).

This happens very infrequently and I cannot get modem traces as my application also uses the nrf52840 on the DK.

I'm curious what the socket call might be doing that would cause it to hang. I can deal with errors, but hanging threads are much more difficult. If you can't share the bsdlib source, can you give some insight into what may be happening?

0 Didrik Rokhaug over 5 years ago in reply to miked531

I let your updated sample run for close to 7 hours, but I am not able to see any hangs. Though maybe I just don't know what to look for. The log is attached.

I also modified your application to enable modem tracing. Could you try to run it, and see if you are able to capture a trace of the hangup?

socket_maybe_stuck.txt broken_w_trace.zip

0 miked531 over 5 years ago in reply to Didrik Rokhaug

I am now not able to create the hang even with my own code, so I have updated the code and attached it again as 'broken3.zip'. In short, I took out the AT+CFUN=1 and now spawn several threads to abuse the modem with AT commands. In the handful of times I have run it, it sometimes lasts a full minute before hanging. It starts getting timeouts, then all the output eventually just stops.

The other problem I see is that it sometimes just does a board reset and starts over with the "booting zephyr..." message, with no accompanying information, panic, etc.

It is not performing a real-world scenario, but is able to consistently produce a hang on the socket(AF_LTE) call.

broken3.zip

0 Didrik Rokhaug over 5 years ago in reply to miked531

With your new code, I am able to reproduce the error, though it takes a lot longer than 1 minute to see the error. On my latest run, the error came after 17.5 minutes.

In one of my traces, the modem reports that it is out of RAM used to communicate with the application. The trace also shows that you are not waiting for a response from the modem before you send a new command.

In a real application (i.e. not one made to provoke an error), you should take care both to not send a new command before the previous one has received a reply, and to read back any data (this also goes for IP or GNSS sockets) fast enough.
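As a rough illustration of "wait for the reply before sending the next command", here is a minimal sketch in portable C. It is not the bsdlib or at_cmd API: the `at_write_fn`/`at_read_fn` hooks, `at_transact`, and the `fake_modem` helper are all hypothetical names made up for this example. On the nRF9160 the hooks would presumably wrap send() and recv() on the AT socket. The check for the final result code (`OK`, `ERROR`, `+CME ERROR`, `+CMS ERROR`) is deliberately simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport hooks (assumption): on the nRF9160 these would
 * wrap send()/recv() on the AT socket. */
typedef int (*at_write_fn)(void *ctx, const char *data, size_t len);
typedef int (*at_read_fn)(void *ctx, char *buf, size_t cap);

static bool ends_with(const char *s, size_t len, const char *suffix)
{
    size_t n = strlen(suffix);
    return len >= n && memcmp(s + len - n, suffix, n) == 0;
}

/* Simplified check for a final result code: the response ends in "OK\r\n",
 * or it ends in "\r\n" and contains "ERROR" (covers +CME/+CMS ERROR). */
static bool at_response_is_final(const char *resp, size_t len)
{
    if (ends_with(resp, len, "OK\r\n")) {
        return true;
    }
    return ends_with(resp, len, "\r\n") && strstr(resp, "ERROR") != NULL;
}

/* Sends one command and blocks until the final result code has been read.
 * Returns the response length, or -1 on transport error or overflow. */
int at_transact(void *ctx, at_write_fn wr, at_read_fn rd,
                const char *cmd, char *resp, size_t cap)
{
    size_t used = 0;

    assert(cap > 0);
    resp[0] = '\0';
    if (wr(ctx, cmd, strlen(cmd)) < 0) {
        return -1;
    }
    while (!at_response_is_final(resp, used)) {
        if (used + 1 >= cap) {
            return -1;                 /* response does not fit */
        }
        int n = rd(ctx, resp + used, cap - 1 - used);
        if (n <= 0) {
            return -1;                 /* transport error or no more data */
        }
        used += (size_t)n;
        resp[used] = '\0';
    }
    return (int)used;
}

/* Scripted fake modem, only here to exercise the sketch: it hands the
 * response back in small chunks, the way a socket read may. */
struct fake_modem {
    const char *script;   /* full response to deliver */
    size_t pos;           /* bytes already delivered */
    size_t chunk;         /* maximum bytes per read */
    int writes;           /* number of commands received */
};

int fake_write(void *ctx, const char *data, size_t len)
{
    struct fake_modem *m = ctx;
    (void)data;
    m->writes++;
    return (int)len;
}

int fake_read(void *ctx, char *buf, size_t cap)
{
    struct fake_modem *m = ctx;
    size_t left = strlen(m->script) - m->pos;
    size_t n = left < m->chunk ? left : m->chunk;

    if (n > cap) {
        n = cap;
    }
    memcpy(buf, m->script + m->pos, n);
    m->pos += n;
    return (int)n;
}
```

The point of the design is that the caller cannot issue a second command until `at_transact` has drained the whole response to the first one, which also keeps the modem's shared buffers from filling up with unread data.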

However, as the modem does crash at the end, I have informed the modem team and asked if they can take a look at it.

0 miked531 over 5 years ago in reply to Didrik Rokhaug

I'm glad you were finally able to reproduce the error. I'm curious why it took much longer for you though. My tests are running on 9160dk rev 0.8.2.

Is there anything inherently wrong with the design of the code regarding how it accesses the modem? Your answer above suggests that perhaps a mutex would be needed when issuing AT commands.

0 Didrik Rokhaug over 5 years ago in reply to miked531

The response from the modem team was mostly as I expected:

You should wait for a response to the AT commands you send, and ensure that you read back the data sent from the modem so that you don't run out of memory used to communicate with the modem.

This is also probably what causes the application to hang. As there is no more memory that can be used to communicate with the modem, the application is not able to open new sockets or send new commands.

miked531 said:
Your answer above suggests that perhaps a mutex would be needed when issuing AT commands.

Yes, using a mutex to restrict the access to the modem to only one thread at a time could solve some of your problems. The main point though is that you should wait for a reply before you send a new AT command.

However, both of the traces show that the modem crashes at the end. Both the crashes are due to the same cause, for which the modem team already has a fix. Future releases of the modem firmware will have this fix, and will hopefully solve the problem you are seeing in your original application.

I suggest that you take another look at your AT command library, and try to limit it to just sending one command at a time.
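To make the "one command at a time" suggestion concrete, here is a minimal sketch using a POSIX mutex in plain C. Everything in it is an assumption for illustration: `at_cmd_locked`, `modem_exchange`, and `run_stress` are hypothetical names, and `modem_exchange` is a stand-in that only pretends to talk to a modem. In a Zephyr application, `k_mutex_lock()` and `k_mutex_unlock()` would presumably play the role of the pthread calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* One lock guards the whole command/response exchange, not just the write. */
static pthread_mutex_t at_lock = PTHREAD_MUTEX_INITIALIZER;

/* Bookkeeping for the stand-in modem; only touched while at_lock is held. */
static int in_flight;   /* commands currently being processed */
static int overlaps;    /* commands that arrived while another was in flight */
static int completed;   /* commands that received a reply */

/* Stand-in for the real exchange (hypothetical). A real implementation
 * would send cmd and read until the final OK/ERROR before returning. */
static int modem_exchange(const char *cmd, char *resp, size_t cap)
{
    (void)cmd;
    if (in_flight++ != 0) {
        overlaps++;
    }
    strncpy(resp, "OK\r\n", cap - 1);
    resp[cap - 1] = '\0';
    completed++;
    in_flight--;
    return 0;
}

/* Thread-safe entry point: a caller blocks here until the previous
 * command has been answered, so commands never overlap. */
int at_cmd_locked(const char *cmd, char *resp, size_t cap)
{
    assert(cap > 0);
    pthread_mutex_lock(&at_lock);
    int rc = modem_exchange(cmd, resp, cap);
    pthread_mutex_unlock(&at_lock);
    return rc;
}

#define N_THREADS 4
#define N_CMDS    1000

static void *worker(void *arg)
{
    char resp[32];

    (void)arg;
    for (int i = 0; i < N_CMDS; i++) {
        at_cmd_locked("AT+CFUN?\r\n", resp, sizeof resp);
    }
    return NULL;
}

/* Runs several threads against the shared modem stand-in and returns the
 * number of overlapping commands observed. */
int run_stress(void)
{
    pthread_t t[N_THREADS];

    for (int i = 0; i < N_THREADS; i++) {
        pthread_create(&t[i], NULL, worker, NULL);
    }
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(t[i], NULL);
    }
    return overlaps;
}
```

The lock is held across both the send and the wait for the reply. Locking only the send would still let a second thread push a new command to the modem before the first reply has been read.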

0 miked531 over 5 years ago in reply to Didrik Rokhaug

In my quick test that I just coded up, the mutex around the AT and response has apparently resolved the issue.

Do you have an estimate as to when the modem fix will be released?

0 Didrik Rokhaug over 5 years ago in reply to miked531

Great to hear that it seems to work now!

I cannot comment on future releases. For that, you should contact your Regional Sales Manager.

If you do not know how to contact your RSM, you can send me a private message with your location, and I will provide you with the contact information.

0 Didrik Rokhaug over 5 years ago in reply to Didrik Rokhaug

Hi.

A new version of the modem firmware (v1.2.0) was just released and has this bug fix.

The bug fix will also be present in future patch releases for the 1.0.x and 1.1.x versions.

Best regards,

Didrik

0 miked531 over 5 years ago in reply to Didrik Rokhaug

I ran my non-mutex test against modem 1.2.0 and the application produces the same results (an eventual hang) as with 1.1.1. While the mutex does allow my application to work correctly, it does not look like my initial error has been resolved fully in the new firmware.

0 Didrik Rokhaug over 5 years ago in reply to miked531

I re-ran your program and am also getting a modem crash.

However, to me, it does not look like it crashed for the same reason as with mfw v1.1.1.

I have asked the modem team to take a look at my modem trace to confirm.

But again, I would like to point out that the application is very abusive, and I would not be very surprised if the modem team replies that it is due to the application not waiting for a reply.

Regardless of the cause of the bug, I would recommend that you keep your mutex in place.