This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

AF_LTE socket and nrfxlib

After many hours of successful LTE connections and using the AF_LTE socket just fine (many creates, many send/receive, many closes), my application now hangs on a call to socket(AF_LTE,0,NPROTO_AT). Hung, as in the call to socket() does not return. I have no idea where to look to resolve this. Is the nrfxlib code available for review? I'll sign an NDA if I can get my eyes on it. I've spent way too much time trying to code around strange behavior in nrfxlib/nrf/zephyr.

This is running on nrf9160 DK, modem fw v1.1.1 and NCS v1.2.0

Mike

Parents
  • Hi.

    I can not give you access to the bsdlib source code. However, I can try to help you find the source of the problem.

    Do you use the at_cmd library to send AT commands?

    Do you see similar behavior on other types of sockets?

    How many other sockets are you using (and are you using the lwm2m_carrier library or other libraries that might use sockets)?

    Are you able to capture a modem trace that captures the problem?

    Best regards,

    Didrik

  • Hi,

    I do not use at_cmd, I rolled my own (prior to when the at_* libs were mature enough for my use)

    I have seen various socket issues over the past several months and have several tickets in devzone. They have all(?) been resolved by now.

    I should have no more than 3 sockets open at once (AF_LTE for monitoring the modem is always open and normally waiting on recv(); AF_LTE for commanding the modem, only occasionally; AF_INET for send/recv of UDP data once we are connected to the network). My occasional AF_LTE socket for commanding keeps getting fd=1 or 2 when created, so I don't think I am leaking them anyplace. I'm not using any additional libs that should be using sockets (and not using lwm2m_carrier).

    This happens very infrequently and I cannot get modem traces as my application also uses the nrf52840 on the DK.

    I'm curious what the socket call might be doing that would cause it to hang? I can deal with errors, but hanging threads is much more difficult. If you can't share the bsdlib source, can you give some insight to what may be happening?

  • miked531 said:
    I cannot get modem traces as my application also uses the nrf52840 on the DK.

     It should not be a problem that you are using the nRF52840 on the DK, as long as P1.12 is kept low, and you do not use UART1/SPI1/TWI1 on the nRF9160.

     

    miked531 said:
    This happens very infrequently

     Do you know if there is anything different with the ones where it hangs? Does it happen regularly or at random intervals?

    We are currently investigating a very similar problem that also leads to socket() hanging. However, in that case, we are trying to open a TLS socket, and it happens very consistently.

    One thing you could try is to use the previous version of bsdlib. You can do that by changing the revision of nrfxlib on line 72 in ncs/nrf/west.yml from 1.2.0 to 3e381d09f1d305e230435f5b6e4c9ef928b6a697, then running "west update".

  • Sorry for the delay, I have been on vacation.

    I would be happy to get the modem traces, but I do use UART1 in my app. I have captured traces before, but it involved programming the 52840. If you have different instructions, please point me to then and I'll try.

    I have only seen the socket() call hang on less than 5 occasions and I nothing looked different when it happened, though I didn't do an exhaustive check. I do encounter other issues when I don't receive a response from the modem and it happens more frequently (perhaps 1/20 times). That sequence is simply:

    1. socket(AF_LTE)
    2. send AT+CMEE=1
    3. setsockopt(SO_RECVTIMEO)
    4. recv()

    The timeout is 5 seconds and the recv() does not timeout, but rather hangs indefinitely. The only other modem-related activity at the time is that I have a separate thread running to monitor signal strength, waiting on a recv().

    I'll have to try the older nrfxlib, but cannot for several weeks.

  • miked531 said:
    I have captured traces before, but it involved programming the 52840

     I assume you are talking about point 2 in this guide: https://devzone.nordicsemi.com/nordic/nordic-blog/b/blog/posts/how-to-get-modem-trace-using-trace-collector-in-nrf-connect ?

    The reason why that point is there is that the board controller decides if the pins used by UART_1 (which is the UART peripheral used for the modem trace) are routed to the on-board debugger, which forwards them over USB, or to the arduino headers.

    The board controller does this by setting P1.12 high (or not). The default is that UART_1 is routed to the on-board debugger. The board controller will configure the pins used for bord control during startup, based on the values of a set of Kconfig options. Therefore, as long as you do not use those pins in your application, and you have not set CONFIG_BOARD_PCA10090_UART1_ARDUINO=y. Other than that one pin, it does not matter what is running on the board controller.

    Though technically, you don't even have to use UART_1, or the UART_1 pins. You can change those details in the ncs/nrf/lib/bsdlib/bsd_os.c file. You can find more information about what is happening in that file here: https://developer.nordicsemi.com/nRF_Connect_SDK/doc/1.2.0/nrfxlib/bsdlib/doc/ug_bsdlib_porting_os.html

    However, that would probably require a different way of collecting the trace on the computer side than the Trace Collector application.

Reply Children
  • Your explanation for tracing is a little over my head as I don't have a great understanding of the board. It was much easier for me to chop out the pieces of my application that demonstrate the problem. I have included the code that illustrates the issue, and it does it consistently in under 5 minutes. The code creates a thread that listens for status notifications from the modem, while the main thread repeatedly turns modem on and off with some sleeps in between. In my application the sleeps are much longer, but I sped them up to demonstrate the problem in a reasonable amount of time.

    The output shows the application hanging at the end:

    *** Booting Zephyr OS build v2.1.99-ncs1  ***
    [00:00:00.188] opened socket, fd 1
    [00:00:00.191] send[1]: AT+CMEE=1
    [00:00:00.194] read[1]: OK
    [00:00:00.197] send[1]: AT+CFUN?
    [00:00:00.200] read[1]: +CFUN: 0
    [00:00:00.203] read[1]: OK
    [00:00:00.206] send[1]: AT%XSYSTEMMODE?
    [00:00:00.216] read[1]: %XSYSTEMMODE: 1,0,0,0
    [00:00:00.220] read[1]: OK
    [00:00:00.222] send[1]: AT+CGMI
    [00:00:00.231] read[1]: Nordic Semiconductor ASA
    [00:00:00.236] read[1]: OK
    [00:00:00.238] send[1]: AT+CGMM
    [00:00:00.248] read[1]: nRF9160-SICA
    [00:00:00.251] read[1]: OK
    [00:00:00.253] send[1]: AT+CGMR
    [00:00:00.263] read[1]: mfw_nrf9160_1.1.1
    [00:00:00.266] read[1]: OK
    [00:00:00.269] send[1]: AT+CGSN
    [00:00:00.278] read[1]: 352656100038754
    [00:00:00.282] read[1]: OK
    [00:00:00.284] send[1]: AT+CIMI
    [00:00:00.293] read[1]: +CME ERROR: 0
    [00:00:00.297] send[1]: AT+CEREG=5
    [00:00:00.306] read[1]: OK
    [00:00:00.309] send[1]: AT%CESQ=1
    [00:00:00.318] read[1]: OK
    [00:00:00.321] send[1]: AT+CGEREP=1
    [00:00:00.330] read[1]: OK
    [00:00:00.333] send[1]: AT+CNEC=24
    [00:00:00.342] read[1]: OK
    [00:00:00.345] send[1]: AT+CIND=1,1,1
    [00:00:00.355] read[1]: OK
    [00:00:00.357] send[1]: AT%XSIM=1
    [00:00:00.366] read[1]: OK
    [00:00:01.188] poweron modem
    [00:00:01.197] opened socket, fd 2
    [00:00:01.200] send[2]: AT+CMEE=1
    [00:00:01.209] read[2]: OK
    [00:00:01.212] send[2]: AT+CEMODE=2
    [00:00:01.221] read[2]: OK
    [00:00:01.224] send[2]: AT+CFUN=1
    [00:00:01.264] read[2]: OK
    [00:00:01.267] wait for 15 seconds
    [00:00:05.135] read[1]: %XSIM: 1
    [00:00:05.855] read[1]: %CESQ: 46,2,0,0
    [00:00:05.859] read[1]: +CEREG: 2,"121F","0115760F",7,0,0,"11100000","11100000"
    [00:00:08.199] read[1]: +CGEV: ME PDN ACT 0,0
    [00:00:08.204] read[1]: +CIND: "service",1
    [00:00:08.208] read[1]: +CIND: "roam",1
    [00:00:16.270] poweroff modem
    [00:00:16.279] opened socket, fd 2
    [00:00:16.282] send[2]: AT+CMEE=1
    [00:00:16.292] read[2]: OK
    [00:00:16.294] send[2]: AT+CFUN=4
    [00:00:18.988] read[2]: OK
    [00:00:18.997] wait for 15 seconds
    [00:00:34.000] poweron modem
    [00:00:34.009] opened socket, fd 2
    [00:00:34.012] send[2]: AT+CMEE=1
    [00:00:34.022] read[2]: OK
    [00:00:34.024] send[2]: AT+CEMODE=2
    [00:00:34.034] read[2]: OK
    [00:00:34.036] send[2]: AT+CFUN=1
    [00:00:34.077] read[2]: OK
    [00:00:34.079] wait for 15 seconds
    [00:00:49.083] poweroff modem
    [00:00:49.092] opened socket, fd 2
    [00:00:49.095] send[2]: AT+CMEE=1
    [00:00:50.104] recv[2] timeout after 1000ms
    [00:00:50.108] send[2]: AT+CFUN=4
    [00:01:00.117] recv[2] timeout after 10000ms
    [00:01:10.042] recv[1] timeout after 30000ms
    [00:01:10.046] re-opening the modem, fd 1
    

    broken.zip

  • Thanks for the code.

    One problem with your code is that you (eventually) try to connect to the network 30 times in 30 minutes. That will trigger a network protection feature that will keep you from being able to connect to the network.

    That is not the problem seen in your log though.

    However, I am not able to reproduce the error in your log.

    What SIM card are you using?

    If you are not using the iBasis SIM card that came with the DK, could you try that (assuming you are able to connect to your local network with it)?

  • Understood about the 30 connects. Normal operation has delays in the range of hours and the shortened times are to illustrate the problem in less time.

    I am in US and using a SIM from Deutsche Telekom, but have seen the same issue with the iBasis SIM, though less often (based on a very non-scientific observation). The main difference I have seen between the two SIMs is that DT will always use AT&T while iBasis goes between AT&T and Verizon.

    I have updated the program to hopefully create the deadlock more frequently. I have had mixed results, so maybe it will happen quicker or maybe not. It can take up to 30 minutes for it to occur so be patient.

    broken2.zip

  • I let your updated sample run for close to 7 hours, but I am not able to see any hangs. Though maybe I just don't know what to look for. The log is attached.

    I also modified your application to enable modem tracing. Could you try to run it, and see if you are able to capture a trace of the hangup?

    socket_maybe_stuck.txtbroken_w_trace.zip

  • I am now not able to create the hang even with my own code, so I have updated the code and attached it again as 'broken3.zip'. In short, I took out the AT+CFUN=1 and spawn several threads to abuse the modem with AT commands. During the handful of times I have run it, it sometimes lasts for a full minute before hanging. It starts getting timeouts then all the output eventually just stops.

    The other problem I see is that it sometimes just does a board reset and starts over with the "booting zephyr..." message, with no accompanying information, panic, etc.

    It is not performing a real-world scenario, but is able to consistently produce a hang on the socket(AF_LTE) call.

    broken3.zip

Related