This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

UDP socket becomes invalid during long PSM sleep

Hi,

I've recently switched some prototype code from using MQTT to a custom UDP-based protocol. What I noticed is that if I have the unit sleep for longer periods of time, when it wakes up, the UDP socket behaves oddly. Generally the behavior is that the send() call will succeed (though no packet is received on the server), and then poll(POLLIN) (to wait for an ack packet from the server) will return POLLHUP, and subsequent send() calls return ENOTSOCK or ELOOP. Closing and re-creating the UDP socket appears to resolve this issue.

Our 9160 units are running on Verizon with a PSM cycle set to 190 minutes. The odd behavior happens if I sleep for 90 minutes between transmission cycles. If I sleep for say 15 minutes, the UDP socket works fine for multiple cycles.

I'd assume this behavior is specific to the hardware-offloaded sockets of the 9160, but would love some confirmation of the expected behavior/usage of sockets when in PSM sleep cycles.

Thanks,
Eric

Parents
  • Hi,

     

    Q1: What modem fw are you running on?

    Q2: Which version of ncs are you running?

     

    Kind regards,

    Håkon

  • mfw_nrf9160_1.0.1

    v1.14.99-ncs3-snapshot2-1266-g8711cfd5d348

  • Hi Eric,

     

    Eric Gross said:
    Currently the only UART we have on a header on our PCB is UART0 for serial console.

     As a quick test, you can rename the uart0 to uart1 (and vice-versa) in the dts file, so that your default output will be routed to another set of pins, while the current routed uart gpios will run on "uart1":

    https://github.com/NordicPlayground/fw-nrfconnect-zephyr/blob/master/boards/arm/nrf9160_pca10090/nrf9160_pca10090_common.dts#L83

    Remember to uncheck the "auto device/filter port" option, so that all COMs are visible:

     

    The downside of this is that your print outs will not work, but you can use RTT for that by adding these to your prj.conf:

    CONFIG_USE_SEGGER_RTT=y
    CONFIG_LOG=y

     

    Eric Gross said:
    Thanks so much for your assistance with this. In your setup, did you also have PSM mode on the radio enabled? 

     Ran it for around ~18 hours, with PSM set to 120 minutes, and sending around every 120 minutes as well. The socket wasn't lost.

    Based in this sample: https://github.com/Rallare/fw-nrfconnect-nrf/tree/nrf9160_samples/samples/nrf9160/udp_with_psm

    Where I needed to modify several things to change the PSM interval, as well as make bind/connect run only once + the socket_fd as a global.

     

    Eric Gross said:
    Note that I do have some [possibly unrelated] questions about whether I should be doing these blocking calls from within a k_work callback.

     That should be fine, as long as you're not calling socket calls from interrupts (uart callback, k_timer instance (but not work queues!), etc are interrupt context). If you're unsure of which functions are interrupts, you can call k_is_in_isr() to check.

      

    Eric Gross said:
    Open new socket and call send(), which returns errno 114.

     NRF_EALREADY        114 //< Operation already in progress

    It seems the socket is in a limbo state if you still get this errno back even after creating a new socket_fd. Could you try isolating the process of sending every 90 minutes, and see if the issue still occurs then?

    Kind regards,

    Håkon

  • Thanks again for all the great assistance! I have a lot of things to follow up with on my side based on your suggestions. I wasn't aware of the CONFIG_USE_SEGGER_RTT option, but that's pretty neat.

    Regarding the blocking calls in the k_work callback, I wasn't concerned about being in ISR context, but rather the rather vague warning from the Zephyr documentation about doing any blocking calls in the system workqueue. I don't have a separate workqueue currently defined.

    I'll be out-of-the-office for a bit before I can get back to trying your suggestions, so I'll give you an update once I have a chance.

    Eric

  • Eric Gross said:
    Regarding the blocking calls in the k_work callback, I wasn't concerned about being in ISR context, but rather the rather vague warning from the Zephyr documentation about doing any blocking calls in the system workqueue. I don't have a separate workqueue currently defined.

     That depends on what a blocking call is. A while(1) will essentially block other threads, with equal or lower priority, but a recv() (or other "infinite wait" socket calls) will yield in that sense, due to the bsd_os.c::bsd_os_timedwait() implementation which handles the timing on the sockets.

     

    Eric Gross said:
    I'll be out-of-the-office for a bit before I can get back to trying your suggestions, so I'll give you an update once I have a chance.

     Ok, just update me when you pick up the work again. Hope you have a nice weekend!

    Kind regards,

    Håkon

  • the rather vague warning from the Zephyr documentation about doing any blocking calls in the system workqueue

    Short version: many different parts of Zephyr make use of the system workqueue.  It's effectively a single thread that different drivers/libraries can make use of when they need to do work in a lower priority context than their normal thread or interrupt level.  The issue is that if any of the workqueue handler functions block, all of the other requests to have work done in the system workqueue are blocked behind it (because it's just one thread).

    So, for example, if your workqueue function blocks on a socket function, it could have unexpected results causing other drivers to stop working until your function unblocks.

    If you really like the workqueue API but know your handlers might need to block sometimes, you can declare additional workqueue threads and submit your blocking work to one of those to ensure the system workqueue remains unblocked.  The system workqueue is just a sort of default workqueue thread that is assumed to be always present.

  • Thanks, I think this was my next item to refactor. I had a bunch of different tasks at different rates and the workqueue submission with a delay made it easy to multiplex them in a simple manner compared to managing it myself with tracking time in a main() loop like some of the examples do. I'll switch to a dedicated workqueue though.

Reply Children
No Data
Related