
mqtt_client_tls_write hangs on writing to TLS socket send(client->transport.tls.sock....)

Hi,

We are using the modem firmware V1.2 and source code ncs v1.2.0.

Problem: we are using a TLS-connected socket to the Amazon cloud, and after about 12 minutes of repeating this cycle:

```
waiting 2 min
connecting to MQTT
sending a few packets
disconnecting from MQTT
```

the application hangs on a specific call that is outside my control:
File: mqtt_transport_socket_tls.c

Function: int mqtt_client_tls_write(struct mqtt_client *client, const u8_t *data, u32_t datalen)

Inside the function there is a loop containing the call below, whose source code is not available.
ret = send(client->transport.tls.sock, data + offset, datalen - offset, 0);

The last argument, flags, is set to 0 by Nordic, so if there is any problem sending, this call blocks forever.
Linux supports MSG_DONTWAIT and return codes for handling such situations, but here the call simply never returns.

We need a fix for this bug.
If the transmit fails, there needs to be a timeout so that we can recover instead of hanging, which kills the calling application.
If I can produce a trace or anything else needed for this, let me know. We are running our custom code, so it only works on our custom hardware.

/Johan

  • Hi!

    It looks like bsdlib 0.7.0 added support for a send timeout on TCP, including secure sockets. So upgrading to NCS tag 1.3.0, which uses bsdlib version 0.7.3, might solve this issue for you.

    Best regards,

    Heidi
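For reference, a minimal sketch of setting the send timeout mentioned above. It assumes the option is exposed under the POSIX names (`setsockopt`, `SO_SNDTIMEO`, `struct timeval`); whether NCS 1.3.0 uses exactly these names and headers has not been checked here. The helper name `set_send_timeout` is hypothetical. With the MQTT library the descriptor would be the `client->transport.tls.sock` from the question, set after the connection is established.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper, not part of NCS: limit how long a blocking send()
 * may wait before it returns. Returns 0 on success, -1 on error. */
static int set_send_timeout(int sock, long seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    /* A zero timeval (the default) means "never time out". */
    return setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```

Once the timeout is set, a stuck `send()` should return -1 instead of blocking indefinitely, so the return value has to be checked.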

  • All right, I will try that approach.

    But if I use a timeout, what is your recommended way of handling a failed send?

    Do you have a sample that handles errors like this correctly? I have no idea how to handle the error case your code produces if I'm using a timeout as a "break" operation where it would otherwise hang.


  • If the send-function fails, re-try sending. If it doesn't work after a few re-tries, you could for example close the socket and open a new one to try again. 

    Have a look at the Asset Tracker application error handler here.

    Thanks, Heidi.

    A question about the error codes from your send() function.

    Since you are advising me to start using the 1.3.0 release, which allows setting socket options such as the send timeout (Linux naming: SO_SNDTIMEO), I wonder whether you follow the rules outlined in the Linux documentation for this timeout.

    See:
    https://linux.die.net/man/7/socket

    About SO_SNDTIMEO, it states:
    "SO_RCVTIMEO and SO_SNDTIMEO: Specify the receiving or sending timeouts until reporting an error. The argument is a struct timeval. If an input or output function blocks for this period of time, and data has been sent or received, the return value of that function will be the amount of data transferred; if no data has been transferred and the timeout has been reached then -1 is returned with errno set to EAGAIN or EWOULDBLOCK, or EINPROGRESS (for connect(2)) just as if the socket was specified to be nonblocking. If the timeout is set to zero (the default) then the operation will never timeout. Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on."

    If you do, then when handling transmit errors I will need to handle partial sends: the send function returns how many bytes were successfully transmitted, which tells me how many are left in my transmit buffer, so I need to send only the remaining part of the buffer and not resend the entire buffer.

    If I mistakenly resend the entire buffer after the modem has already sent part of it successfully, the receiver will get corrupted data. That is an unacceptable situation.

    These questions are important. It is very unclear how you handle this, and how I should handle it to get 100% accuracy and control over what has been sent correctly to the connected network.


  • Hi!

    Yes, the SO_SNDTIMEO follows the guidelines you linked to.

    If the send function returns a positive value, you need to advance the offset into the send buffer by that many bytes. If the return value is negative, don't move the offset.

    I'm not sure if there is an example of this implementation in NCS, but I'm sure there is one on the internet somewhere. 
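A minimal sketch of the offset handling described above, assuming a POSIX-style `send()` that returns the number of bytes accepted or -1 with `errno` set. It follows the same `data + offset`, `datalen - offset` pattern as the call quoted from `mqtt_client_tls_write`. The helper name `send_all` is hypothetical and is not an NCS function.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: push len bytes through send(), resuming after
 * partial writes. Returns 0 when everything was accepted, -1 otherwise.
 * *sent always reports how many bytes were accepted before returning. */
static int send_all(int sock, const unsigned char *data, size_t len,
                    size_t *sent)
{
    size_t offset = 0;

    while (offset < len) {
        ssize_t ret = send(sock, data + offset, len - offset, 0);

        if (ret > 0) {
            offset += (size_t)ret; /* partial or full write: advance */
        } else if (ret < 0 && errno == EINTR) {
            continue;              /* interrupted, offset unchanged */
        } else {
            break;                 /* timeout or error, offset unchanged */
        }
    }

    *sent = offset;
    return (offset == len) ? 0 : -1;
}
```

On failure, `*sent` tells the caller how much of the buffer already went out on this connection. Any further attempt on the same socket should continue from `data + *sent` rather than from the start of the buffer.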
