Socket send function hangs

Hi,

I have a custom board with an nRF9160 that sends messages to an AWS server over MQTT. I managed to reproduce a bug where the TLS socket send function hangs (similar to https://devzone.nordicsemi.com/f/nordic-q-a/62249/hang-in-sendmsg-on-the-nrf9160). I was using v1.3 of NCS.

Steps to reproduce:

  • Connect to the LTE network.
  • Connect to AWS via MQTT.
  • When a connection is successfully established, send a disconnect message.
  • When successfully disconnected, reconnect and keep repeating this loop.
  • On the 21st connection/disconnection cycle, execution gets stuck in the mqtt_client_tls_write function, in the call
    send(client->transport.tls.sock, data + offset, datalen - offset, 0);

I switched to the latest master branch of NCS and changed the call to send(client->transport.tls.sock, data + offset, datalen - offset, MSG_DONTWAIT). With the same steps to reproduce, send no longer hangs, but it returns -1 and the errno is -EAGAIN.
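For reference, a non-blocking send that fails with EAGAIN is usually retried once the socket becomes writable again, with an upper bound on how long to wait. Below is a minimal sketch of that pattern using plain POSIX sockets. The function name send_all_with_timeout and the timeout_ms parameter are hypothetical and not part of NCS or the Zephyr MQTT library; on Zephyr the header names differ, and whether poll() reports POLLOUT reliably on the nRF9160 offloaded TLS socket is an assumption I have not verified.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send the whole buffer on a socket without ever
 * blocking inside send(). Returns 0 on success, -ETIMEDOUT if the socket
 * does not become writable within timeout_ms, or another negative errno
 * value on failure. Note that the timeout applies to each wait, not to
 * the transfer as a whole.
 */
int send_all_with_timeout(int sock, const uint8_t *data, size_t len,
                          int timeout_ms)
{
    size_t offset = 0;

    while (offset < len) {
        ssize_t ret = send(sock, data + offset, len - offset, MSG_DONTWAIT);

        if (ret >= 0) {
            /* Partial writes are normal; advance and keep going. */
            offset += (size_t)ret;
            continue;
        }
        if (errno == EINTR) {
            continue;
        }
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            return -errno; /* Hard error: caller should close the socket. */
        }

        /* Send buffer is full: wait until the socket is writable again. */
        struct pollfd pfd = { .fd = sock, .events = POLLOUT };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc == 0) {
            return -ETIMEDOUT;
        }
        if (rc < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -errno;
        }
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
            return -ECONNRESET;
        }
    }

    return 0;
}
```

With something like this, a timeout or hard error becomes a return value the caller can act on, for example by closing the socket and starting a fresh connection, rather than a call that never returns.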

How do I handle the socket timeout gracefully? After this timeout, if I try to reconnect to AWS I get 

aws_iot: getaddrinfo, error -10
aws_iot: client_broker_init, error: -10

and I have to restart the device to get it working again.

I also noticed an error case that doesn't seem to be handled in aws_iot.c. When the socket send function returns -1, the function client_connect in mqtt.c calls client_disconnect, which in turn calls disconnect_event_notify. This notifies the aws_iot client with:

evt.type = MQTT_EVT_CONNACK;
evt.result = -ECONNREFUSED;
This event is not handled properly in aws_iot.c. Handling it properly still causes the same error I mentioned before, though.
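To illustrate what I mean by handling it, here is a minimal sketch of an event handler that treats a CONNACK with a non-zero result as a failed connection. All of the types and names below (demo_evt, demo_action, demo_handle_evt) are simplified stand-ins made up for this example. They are not the Zephyr struct mqtt_evt or the actual aws_iot.c code; only the idea of checking the result field on a CONNACK comes from the case described above.

```c
#include <errno.h>

/* Stand-in event types; NOT the Zephyr MQTT definitions. */
enum demo_evt_type {
    DEMO_EVT_CONNACK,
    DEMO_EVT_DISCONNECT,
    DEMO_EVT_OTHER,
};

/* Stand-in event: a type plus a result code (0 or a negative errno). */
struct demo_evt {
    enum demo_evt_type type;
    int result;
};

/* What the application should do in response to an event. */
enum demo_action {
    DEMO_ACTION_NONE,
    DEMO_ACTION_MARK_CONNECTED,
    DEMO_ACTION_CLEANUP_AND_RETRY,
};

/* Hypothetical handler: decide on an action for each event. */
enum demo_action demo_handle_evt(const struct demo_evt *evt)
{
    switch (evt->type) {
    case DEMO_EVT_CONNACK:
        if (evt->result != 0) {
            /* For example -ECONNREFUSED: the connection never came up,
             * so release resources instead of marking it connected. */
            return DEMO_ACTION_CLEANUP_AND_RETRY;
        }
        return DEMO_ACTION_MARK_CONNECTED;
    case DEMO_EVT_DISCONNECT:
        return DEMO_ACTION_CLEANUP_AND_RETRY;
    default:
        return DEMO_ACTION_NONE;
    }
}
```

The point is only that the failed CONNACK takes the same cleanup path as a disconnect, so the client state does not end up half-connected.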
Any help will be appreciated. Thanks in advance!
Nikil
  • Hi,

    The trace shows a TCP reset coming from the AWS server after 375 seconds, but the timestamps in the modem trace may have been corrupted (the trace is high speed and high throughput, with no UART flow control). If you provide another trace showing the issue, I can check whether it shows the same behavior.

    You're testing with mfw v1.2.0, right?

    Kind regards,

    Håkon
