Bug report: Cannot use Zephyr's poll() function from two threads for the same socket concurrently

This is a bug report for nRF Connect SDK 3.1.1.

The zsock_poll() function, and hence also its poll() wrapper https://docs.nordicsemi.com/bundle/zephyr-apis-3.1.1/page/group_bsd_sockets.html#ga518361903c9fac3766164d38243872e3, is broken for TCP sockets on nRF9160 when the same socket is polled from two different threads at the same time. A typical use case is one reader thread waiting for POLLIN while a writer thread waits for POLLOUT. If one thread is currently blocked in poll() and another thread calls poll() on the same socket, the first poll operation never completes, even when its condition becomes satisfied. Bypassing Zephyr and using the nrf sockets directly (nrf_socket, nrf_connect, nrf_poll etc.) works as expected.

One of the root causes seems to be that the nrf91 socket offloading plugin for Zephyr uses the NRF_SO_POLLCB socket option (https://docs.nordicsemi.com/bundle/ncs-3.1.1/page/nrfxlib/nrf_modem/doc/sockets/socket_options_func.html) to set up a callback, invoked in interrupt context, that fires when the socket becomes ready. The callback then notifies and wakes up the thread waiting in poll. However, the documentation for NRF_SO_POLLCB is incomplete. Importantly, it does not mention that calling setsockopt with NRF_SO_POLLCB again on the same socket silently removes the previously registered callback. The nrf91 socket offloading code in Zephyr currently does not account for this undocumented behaviour, so it does not work with multiple pending polls. Additionally, the Type column in the documentation gives the type as struct nrf_pollcb. This is wrong; it should be struct nrf_modem_pollcb.
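To illustrate the suspected mechanism, here is a minimal plain-C model of a socket that holds a single callback slot. It is only a sketch of the behaviour described above: it does not use the nrf_modem API, and all names in it (model_socket, model_set_pollcb and so on) are made up for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, NOT the nrf_modem API. */
typedef void (*model_pollcb_t)(void *waiter);

struct model_socket {
    model_pollcb_t cb; /* single slot, standing in for one NRF_SO_POLLCB registration */
    void *waiter;      /* the pending poll that cb should wake */
};

struct model_waiter {
    bool woken;
};

static void model_wake(void *waiter)
{
    ((struct model_waiter *)waiter)->woken = true;
}

/* Registering again silently replaces the previous registration. */
static void model_set_pollcb(struct model_socket *s, model_pollcb_t cb, void *waiter)
{
    s->cb = cb;
    s->waiter = waiter;
}

/* The socket becomes ready: only the currently registered callback runs. */
static void model_socket_ready(struct model_socket *s)
{
    if (s->cb != NULL) {
        s->cb(s->waiter);
    }
}

/* Two pending polls on one socket; returns true if the first one is lost. */
static bool model_first_waiter_is_lost(void)
{
    struct model_socket sock = { NULL, NULL };
    struct model_waiter a = { false };
    struct model_waiter b = { false };

    model_set_pollcb(&sock, model_wake, &a); /* thread A enters poll() */
    model_set_pollcb(&sock, model_wake, &b); /* thread B enters poll(), A's slot is overwritten */
    model_socket_ready(&sock);

    return !a.woken && b.woken;
}
```

In this model, model_first_waiter_is_lost() returns true: waiter A is never notified because its registration was replaced before the socket became ready.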

Code in Zephyr preparing the poll: https://github.com/nrfconnect/sdk-nrf/blob/v3.1.1/lib/nrf_modem_lib/nrf9x_sockets.c#L904. The second bug is at line 901, where a "signal" object belonging to the socket is re-initialized (cleared). If a poll operation in another thread is ongoing and attached to this signal object (through the signal object's linked list "poll_events"), that attachment is overwritten.
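The effect of re-initializing a signal that already has a waiter attached can be shown with a toy model. This is a sketch under my own assumptions, not Zephyr's actual k_poll_signal implementation; the toy_* names are hypothetical and the waiter list is simplified to a singly linked list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT Zephyr's k_poll_signal. */
struct toy_waiter {
    struct toy_waiter *next;
    bool woken;
};

struct toy_signal {
    struct toy_waiter *waiters; /* stands in for the "poll_events" list */
    bool signaled;
};

/* Initializing clears the state and forgets every attached waiter. */
static void toy_signal_init(struct toy_signal *sig)
{
    sig->waiters = NULL;
    sig->signaled = false;
}

static void toy_signal_attach(struct toy_signal *sig, struct toy_waiter *w)
{
    w->woken = false;
    w->next = sig->waiters;
    sig->waiters = w;
}

/* Raising the signal wakes only the waiters still on the list. */
static void toy_signal_raise(struct toy_signal *sig)
{
    sig->signaled = true;
    for (struct toy_waiter *w = sig->waiters; w != NULL; w = w->next) {
        w->woken = true;
    }
}

/* Returns true if the re-initialization detached the first waiter. */
static bool toy_reinit_detaches_first_waiter(void)
{
    struct toy_signal sig;
    struct toy_waiter a;
    struct toy_waiter b;

    toy_signal_init(&sig);        /* thread A prepares its poll */
    toy_signal_attach(&sig, &a);
    toy_signal_init(&sig);        /* thread B prepares its poll on the same socket */
    toy_signal_attach(&sig, &b);
    toy_signal_raise(&sig);

    return !a.woken && b.woken;
}
```

A fix would presumably need to avoid re-initializing the shared object while any poll is still attached to it, but that is a design decision for the maintainers.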

Steps to reproduce:

1. Establish a TCP connection on nRF9160 using Zephyr's socket(), connect() methods.

2. Start two threads. Each thread should call poll() on the same socket with -1 as timeout (infinite). As an example, let both wait for POLLIN events. (Alternatively, wait for different events, e.g. POLLIN in one thread and POLLOUT in the other.)

3. Once both threads are blocked in the poll() call, send some data from the remote peer.

4. You will see that only the last thread that entered poll() will return (assuming both threads waited for POLLIN). The other will be stuck, even if more data is sent later.
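The steps above can be sketched with POSIX threads as follows. This is an illustration, not the exact code I ran: the helper names are made up, it assumes pthreads and the POSIX poll() name are available in the build, and it uses a finite timeout instead of -1 so that a stuck poller is reported rather than hanging forever. The host_reference_check() function is an addition that exercises the same helpers on a local socketpair, to show the behaviour I would expect from a conforming implementation.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_POLLERS 2
#define POLL_TIMEOUT_MS 5000 /* finite, unlike step 2, so a stuck poller is reported */

struct poller {
    pthread_t thread;
    int fd;
    int result;    /* return value of poll() */
    short revents; /* events reported by poll() */
};

static void *poller_thread(void *arg)
{
    struct poller *p = arg;
    struct pollfd pfd = { .fd = p->fd, .events = POLLIN };

    p->result = poll(&pfd, 1, POLL_TIMEOUT_MS);
    p->revents = pfd.revents;
    return NULL;
}

/* Step 2: start the threads, all polling the same socket for POLLIN. */
static void start_pollers(struct poller *p, int fd)
{
    for (int i = 0; i < NUM_POLLERS; i++) {
        p[i].fd = fd;
        p[i].result = 0;
        p[i].revents = 0;
        int err = pthread_create(&p[i].thread, NULL, poller_thread, &p[i]);
        assert(err == 0);
        (void)err;
    }
}

/* Step 4: wait for the threads and count how many saw POLLIN. */
static int join_and_count_woken(struct poller *p)
{
    int woken = 0;

    for (int i = 0; i < NUM_POLLERS; i++) {
        pthread_join(p[i].thread, NULL);
        if (p[i].result == 1 && (p[i].revents & POLLIN)) {
            woken++;
        }
    }
    return woken;
}

/* Host-only reference: a local socketpair stands in for the TCP connection. */
static int host_reference_check(void)
{
    int sv[2];
    struct poller p[NUM_POLLERS];
    char byte = 'x';

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }
    start_pollers(p, sv[0]);
    usleep(100 * 1000); /* give both threads time to block in poll() */
    ssize_t n = write(sv[1], &byte, 1); /* step 3: the "peer" sends data */
    int woken = join_and_count_woken(p);

    close(sv[0]);
    close(sv[1]);
    return (n == 1) ? woken : -1;
}
```

On the device, call start_pollers() with the connected TCP socket, have the remote peer send data within the timeout, then call join_and_count_woken(). With the bug described here I would expect it to return 1 instead of 2, while host_reference_check() on a Linux host should return 2.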
