This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

UDP socket becomes invalid during long PSM sleep

Hi,

I've recently switched some prototype code from using MQTT to a custom UDP-based protocol. What I noticed is that if I have the unit sleep for longer periods of time, when it wakes up, the UDP socket behaves oddly. Generally the behavior is that the send() call will succeed (though no packet is received on the server), and then poll(POLLIN) (to wait for an ack packet from the server) will return POLLHUP, and subsequent send() calls return ENOTSOCK or ELOOP. Closing and re-creating the UDP socket appears to resolve this issue.

Our 9160 units are running on Verizon with a PSM cycle set to 190 minutes. The odd behavior happens if I sleep for 90 minutes between transmission cycles. If I sleep for say 15 minutes, the UDP socket works fine for multiple cycles.

I'd assume this behavior is specific to the hardware-offloaded sockets of the 9160, but would love some confirmation of the expected behavior/usage of sockets when in PSM sleep cycles.

Thanks,
Eric

Parents
  • Hi,

     

    Q1: What modem fw are you running on?

    Q2: Which version of ncs are you running?

     

    Kind regards,

    Håkon

  • mfw_nrf9160_1.0.1

    v1.14.99-ncs3-snapshot2-1266-g8711cfd5d348

  • Hi,

     

    Thank you, it seems that you are using quite a new firmware version, and the latest modem.

    I'm setting up a test bench on this matter, where I kept the socket open and initiated a UDP send/recv every 120 minutes. This worked at my end, using similar setup as you.

    However; I am connected to a different vendor, so I am unable to check if there's any in the network you're connected to that is blocking this for you. Some firewalls detect UDP send/recv cycles, and times these dynamically, but this is just one of the things that might be causing this behavior on your side. I assume that there's other routines running in your system, could RAM corruption of the socket_fd be a root cause, for instance?

     

    Could you try to do a modem trace that shows this issue?

    Here's a guide on how to do this: https://devzone.nordicsemi.com/nordic/nordic-blog/b/blog/posts/how-to-get-modem-trace-using-trace-collector-in-nrf-connect

    For the firmware side; Its essentially adding "CONFIG_BSD_LIBRARY_TRACE_ENABLED=y" to your prj.conf and sending AT command "AT%XMODEMTRACE=1,2" at boot up.

     

    Note: The modem trace should be several kilobytes at the start (and continue to grow, especially in the beginning), if the above process has worked.

     

    Kind regards,

    Håkon

  • Hi Håkon,

    Thanks so much for your assistance with this. In your setup, did you also have PSM mode on the radio enabled? 

    Firewalls should not be an issue due to the network setup I am using (no firewalls), but also I would not expect them to affect the socket errors I am seeing, as they would simply cause packets to be dropped.

    Regarding RAM corruption of the socket_fd, that is an interesting angle. I was thinking that it should be unrelated to the time interval used, but that is not quite true, as my application buffers data in the background at a different rate than the transmission cycle. I suppose it is possible that somehow more data being buffered could be causing corruption somehow, but still seems unlikely with the way I have things structured.

    I switched to a new pattern that still shows the issue:

    -PSM mode set  to 190 min and k_work queued for 90 min; CPU goes to sleep

    -After 90 min, CPU wakes up and runs k_work callback, which opens new UDP socket on ephemeral_port, call sendto() to remote IP addr, which succeeds, though server doesn't get packet)

    -Wait on poll() for response from server. This returns POLLHUP instead of timing out.

    -Sleep for 1000ms. Close socket, loop back to beginning for next attempt

    -Open new socket and call send(), which returns errno 114. Wait 1000ms, close socket, and loop back for next attempt

    -On 3rd attempt, send succeeds, poll succeeds, and recv() succeeds

    Note that I do have some [possibly unrelated] questions about whether I should be doing these blocking calls from within a k_work callback.

    Regarding the modem trace, this application is currently not running on a DK board but rather our own board. The instructions you pointed me to seem to imply it uses the nrf52's UART to collect the data from the 9160. I'd have to figure out how to make that work with our setup. Currently the only UART we have on a header on our PCB is UART0 for serial console.

  • Hi Eric,

     

    Eric Gross said:
    Currently the only UART we have on a header on our PCB is UART0 for serial console.

     As a quick test, you can rename the uart0 to uart1 (and vice-versa) in the dts file, so that your default output will be routed to another set of pins, while the current routed uart gpios will run on "uart1":

    https://github.com/NordicPlayground/fw-nrfconnect-zephyr/blob/master/boards/arm/nrf9160_pca10090/nrf9160_pca10090_common.dts#L83

    Remember to uncheck the "auto device/filter port" option, so that all COMs are visible:

     

    The downside of this is that your print outs will not work, but you can use RTT for that by adding these to your prj.conf:

    CONFIG_USE_SEGGER_RTT=y
    CONFIG_LOG=y

     

    Eric Gross said:
    Thanks so much for your assistance with this. In your setup, did you also have PSM mode on the radio enabled? 

     Ran it for around ~18 hours, with PSM set to 120 minutes, and sending around every 120 minutes as well. The socket wasn't lost.

    Based in this sample: https://github.com/Rallare/fw-nrfconnect-nrf/tree/nrf9160_samples/samples/nrf9160/udp_with_psm

    Where I needed to modify several things to change the PSM interval, as well as make bind/connect run only once + the socket_fd as a global.

     

    Eric Gross said:
    Note that I do have some [possibly unrelated] questions about whether I should be doing these blocking calls from within a k_work callback.

     That should be fine, as long as you're not calling socket calls from interrupts (uart callback, k_timer instance (but not work queues!), etc are interrupt context). If you're unsure of which functions are interrupts, you can call k_is_in_isr() to check.

      

    Eric Gross said:
    Open new socket and call send(), which returns errno 114.

     NRF_EALREADY        114 //< Operation already in progress

    It seems the socket is in a limbo state if you still get this errno back even after creating a new socket_fd. Could you try isolating the process of sending every 90 minutes, and see if the issue still occurs then?

    Kind regards,

    Håkon

  • Thanks again for all the great assistance! I have a lot of things to follow up with on my side based on your suggestions. I wasn't aware of the CONFIG_USE_SEGGER_RTT option, but that's pretty neat.

    Regarding the blocking calls in the k_work callback, I wasn't concerned about being in ISR context, but rather the rather vague warning from the Zephyr documentation about doing any blocking calls in the system workqueue. I don't have a separate workqueue currently defined.

    I'll be out-of-the-office for a bit before I can get back to trying your suggestions, so I'll give you an update once I have a chance.

    Eric

Reply
  • Thanks again for all the great assistance! I have a lot of things to follow up with on my side based on your suggestions. I wasn't aware of the CONFIG_USE_SEGGER_RTT option, but that's pretty neat.

    Regarding the blocking calls in the k_work callback, I wasn't concerned about being in ISR context, but rather the rather vague warning from the Zephyr documentation about doing any blocking calls in the system workqueue. I don't have a separate workqueue currently defined.

    I'll be out-of-the-office for a bit before I can get back to trying your suggestions, so I'll give you an update once I have a chance.

    Eric

Children
  • Eric Gross said:
    Regarding the blocking calls in the k_work callback, I wasn't concerned about being in ISR context, but rather the rather vague warning from the Zephyr documentation about doing any blocking calls in the system workqueue. I don't have a separate workqueue currently defined.

     That depends on what a blocking call is. A while(1) will essentially block other threads, with equal or lower priority, but a recv() (or other "infinite wait" socket calls) will yield in that sense, due to the bsd_os.c::bsd_os_timedwait() implementation which handles the timing on the sockets.

     

    Eric Gross said:
    I'll be out-of-the-office for a bit before I can get back to trying your suggestions, so I'll give you an update once I have a chance.

     Ok, just update me when you pick up the work again. Hope you have a nice weekend!

    Kind regards,

    Håkon

  • the rather vague warning from the Zephyr documentation about doing any blocking calls in the system workqueue

    Short version: many different parts of Zephyr make use of the system workqueue.  It's effectively a single thread that different drivers/libraries can make use of when they need to do work in a lower priority context than their normal thread or interrupt level.  The issue is that if any of the workqueue handler functions block, all of the other requests to have work done in the system workqueue are blocked behind it (because it's just one thread).

    So, for example, if your workqueue function blocks on a socket function, it could have unexpected results causing other drivers to stop working until your function unblocks.

    If you really like the workqueue API but know your handlers might need to block sometimes, you can declare additional workqueue threads and submit your blocking work to one of those to ensure the system workqueue remains unblocked.  The system workqueue is just a sort of default workqueue thread that is assumed to be always present.

  • Thanks, I think this was my next item to refactor. I had a bunch of different tasks at different rates and the workqueue submission with a delay made it easy to multiplex them in a simple manner compared to managing it myself with tracking time in a main() loop like some of the examples do. I'll switch to a dedicated workqueue though.

Related