We had some of our devices in the field randomly disconnecting and restarting at irregular intervals. After some debugging we found an issue with opening a small TLS/TCP socket alongside a relatively congested non-TLS TCP connection. All link operations seem to be within the modem specs and should complete without issues.
Problem description
In our current setup we open the following sockets:
- a non-TLS TCP socket for the raw data transfer; TLS is left off here because of the high throughput (10-18 Kbps) and relatively large packet sizes.
- a UDP socket which connects to several servers; it is not configured to use DTLS, and it opens and closes its connections sequentially (only one active at a time).
- a TLS/TCP control socket to configure the device; this is again scheduled so that it is never active at the same time as the UDP socket.
During high load on the raw data transfer socket, the connect call for the TLS control socket fails with -ENOMEM. When we disable TLS on the control channel, it is able to connect even at higher synthetic loads than we see in the field.
When the raw data transfer rate is low (< 2 Kbps), the device connects all sockets without issues, which seems to indicate it's not a hard modem limit.
What we have tried
- Ensured that the transmit buffer of the TLS calls is smaller than the 2 kB specified in the control interface
- Ensured the SHMEM heap is big enough (Max 4740/16384 allocated)
- Ensured the control heap is big enough (Max 28/1024 allocated)
- Ensured the work queue isn't blocked, so that modem communication can be received
- Ensured all sockets are closed when no longer in use
- Tried (and failed) to find documentation about an ENOMEM response on a connect call
- Debugged with Wireshark and modem traces to find any failure reasons or errors (can provide traces if needed)
- Disabled the UDP socket
Some solutions for this problem
One solution we see is to disable TLS on the control channel of the device; however, we would like to keep it enabled for obvious reasons.
A different solution is to detect a situation where the connect call would fail and temporarily block the raw data stream, leaving enough modem resources for the TLS handshake / setup on the control channel. For this we would need to know what causes the connect call to fail.
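To make this second option concrete, here is a minimal sketch of the reactive variant: try the connect, and only if it fails with -ENOMEM pause the raw stream, wait briefly, and retry. Everything in it is hypothetical glue (the `ctrl_connect_ops` hooks, `ctrl_connect_with_backoff`, and the `fake_link` stand-in are our own names, not SDK APIs), and it assumes the hook reports failures as a negative errno value. It is written as plain C so the logic can be checked on a host.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application hooks; none of these names come from the SDK. */
struct ctrl_connect_ops {
	int (*try_connect)(void *ctx);               /* 0 on success, negative errno on failure */
	void (*pause_raw_stream)(void *ctx);         /* stop feeding the raw data socket */
	void (*resume_raw_stream)(void *ctx);        /* start feeding it again */
	void (*wait_ms)(void *ctx, unsigned int ms); /* give in-flight data time to drain */
};

/* Try to connect the control socket; on -ENOMEM pause the raw stream and retry.
 * Returns 0 on success or the last negative errno value. */
int ctrl_connect_with_backoff(const struct ctrl_connect_ops *ops, void *ctx,
			      unsigned int max_attempts, unsigned int settle_ms)
{
	assert(ops != NULL && ops->try_connect != NULL);

	bool paused = false;
	int err = -EINVAL; /* returned as-is if max_attempts is 0 */

	for (unsigned int attempt = 0; attempt < max_attempts; attempt++) {
		err = ops->try_connect(ctx);
		if (err != -ENOMEM) {
			break; /* success, or a failure that pausing will not fix */
		}
		if (attempt + 1 == max_attempts) {
			break; /* out of retries, no point pausing again */
		}
		if (!paused && ops->pause_raw_stream != NULL) {
			ops->pause_raw_stream(ctx);
			paused = true;
		}
		if (ops->wait_ms != NULL) {
			ops->wait_ms(ctx, settle_ms);
		}
	}

	if (paused && ops->resume_raw_stream != NULL) {
		ops->resume_raw_stream(ctx);
	}
	return err;
}

/* Stand-in for the real link, used only to exercise the logic on a host. */
struct fake_link {
	bool raw_paused;
	int attempts;
	int pauses;
	int resumes;
};

static int fake_try_connect(void *ctx)
{
	struct fake_link *l = ctx;

	l->attempts++;
	/* Mimic the observed behaviour: connect fails while raw data is flowing. */
	return l->raw_paused ? 0 : -ENOMEM;
}

static void fake_pause(void *ctx)
{
	struct fake_link *l = ctx;

	l->raw_paused = true;
	l->pauses++;
}

static void fake_resume(void *ctx)
{
	struct fake_link *l = ctx;

	l->raw_paused = false;
	l->resumes++;
}

const struct ctrl_connect_ops fake_ops = {
	.try_connect = fake_try_connect,
	.pause_raw_stream = fake_pause,
	.resume_raw_stream = fake_resume,
	.wait_ms = NULL, /* no delay needed in the simulation */
};
```

On the device, `try_connect` would wrap the real connect call for the TLS control socket and translate its failure into a negative errno. This only reacts after a failure; predicting the failure ahead of time still requires knowing which resource runs out.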
The best solution for us would be to "reserve" the modem resources ahead of time and guarantee the raw data socket leaves enough space to connect the TLS socket when it is needed. We are unsure if this can be done with the current firmware.
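Without a modem-side reservation, the closest application-side approximation we can think of is to cap the send rate of the raw stream so that some headroom always remains. Below is a rough token-bucket sketch of that idea. The names (`raw_budget`, `raw_budget_init`, `raw_budget_take`) and the byte-based units are our own assumptions, and it is also an assumption, not something our measurements establish, that a rate cap actually leaves free whatever resource the TLS connect needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical send budget for the raw data socket (token bucket). */
struct raw_budget {
	uint32_t rate_bytes_per_s; /* long-term cap on raw throughput */
	uint32_t burst_bytes;      /* largest burst allowed at once */
	uint32_t tokens;           /* bytes that may be sent right now */
	uint64_t last_ms;          /* time up to which refill has been credited */
};

void raw_budget_init(struct raw_budget *b, uint32_t rate_bytes_per_s,
		     uint32_t burst_bytes, uint64_t now_ms)
{
	assert(b != NULL);
	b->rate_bytes_per_s = rate_bytes_per_s;
	b->burst_bytes = burst_bytes;
	b->tokens = burst_bytes; /* start with a full bucket */
	b->last_ms = now_ms;
}

/* Returns how many of the `want` bytes may be sent at time now_ms.
 * now_ms must come from a monotonic millisecond clock. */
size_t raw_budget_take(struct raw_budget *b, size_t want, uint64_t now_ms)
{
	assert(b != NULL);

	uint64_t elapsed_ms = now_ms - b->last_ms;
	uint64_t refill = elapsed_ms * b->rate_bytes_per_s / 1000u;

	if (refill > 0) {
		uint64_t total = (uint64_t)b->tokens + refill;

		if (total >= b->burst_bytes) {
			/* Bucket is full; discard the surplus. */
			b->tokens = b->burst_bytes;
			b->last_ms = now_ms;
		} else {
			/* Credit only the time that produced whole bytes. */
			b->tokens = (uint32_t)total;
			b->last_ms += refill * 1000u / b->rate_bytes_per_s;
		}
	}

	size_t grant = want < b->tokens ? want : (size_t)b->tokens;

	b->tokens -= (uint32_t)grant;
	return grant;
}
```

The raw data path would call `raw_budget_take()` before each send and hold back whatever is not granted; on Zephyr the timestamp could come from `k_uptime_get()`. The rate and burst values would have to be found experimentally, since we do not know where between 2 Kbps and our normal load the connect starts failing.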
Summary
There is a free-running non-TLS TCP connection which sometimes (depending on link congestion) seems to prevent a TLS/TCP connect call from completing successfully. Some UDP connections run asynchronously alongside it, but they don't seem to influence this behavior.
Any pointers to documentation about this behavior would be appreciated.
Software/firmware used
nRF Connect SDK 2.9
nRF9160 modem firmware (MFW) 1.3.7