
Socket send function hangs

Hi,

I have a custom board with an nRF9160 and I'm using it to send messages to an AWS server via MQTT. I managed to reproduce a bug where the TLS socket send function hangs (similar to https://devzone.nordicsemi.com/f/nordic-q-a/62249/hang-in-sendmsg-on-the-nrf9160). I was using v1.3 of NCS.

Steps to reproduce - 

  • Connect to the LTE network.
  • Connect to AWS via MQTT
  • When a successful connection is established, send a disconnect message
  • When successfully disconnected, reconnect and keep repeating this loop
  • On the 21st connection/disconnection cycle the function gets stuck in the mqtt_client_tls_write function while calling the 
    send(client->transport.tls.sock, data + offset, datalen - offset, 0);
    function

I switched to the latest master branch of NCS and changed the call to send(client->transport.tls.sock, data + offset, datalen - offset, MSG_DONTWAIT). Now, when I follow the steps to reproduce the bug, send no longer hangs but returns -1 with errno set to EAGAIN.
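As a rough illustration of handling EAGAIN from a non-blocking send, here is a minimal sketch that retries the send but waits for the socket to become writable for a bounded time only. The helper name send_all_with_timeout and the timeout_ms parameter are hypothetical, it assumes the POSIX-style send and poll calls are available, and it has not been tested against the modem bug described here.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper, not part of the MQTT library.
 * Sends the whole buffer using non-blocking send() calls.
 * Returns 0 on success, -ETIMEDOUT if the socket stays unwritable for
 * timeout_ms during any single wait, or another negative errno value. */
static int send_all_with_timeout(int sock, const uint8_t *data, size_t len,
                                 int timeout_ms)
{
    size_t offset = 0;

    assert(data != NULL);

    while (offset < len) {
        ssize_t ret = send(sock, data + offset, len - offset, MSG_DONTWAIT);

        if (ret >= 0) {
            offset += (size_t)ret;
            continue;
        }
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            return -errno; /* hard failure, let the caller close the socket */
        }

        /* Send buffer is full: wait for space, but never indefinitely. */
        struct pollfd pfd = { .fd = sock, .events = POLLOUT };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc == 0) {
            return -ETIMEDOUT;
        }
        if (rc < 0) {
            return -errno;
        }
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
            return -EIO;
        }
    }

    return 0;
}
```

On -ETIMEDOUT the caller would typically close the socket and schedule a reconnect rather than keep retrying on the same socket.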

How do I handle the socket timeout gracefully? After this timeout, if I try to reconnect to AWS I get 

aws_iot: getaddrinfo, error -10
aws_iot: client_broker_init, error: -10

and I have to restart the device to get it working again.

I also noticed an error case which doesn't seem to be handled in aws_iot.c. When the socket send function returns -1, then in file mqtt.c, the function client_connect calls client_disconnect which in turn calls disconnect_event_notify function. This notifies the aws_iot client with 

evt.type = MQTT_EVT_CONNACK;
evt.result = -ECONNREFUSED;
which is not handled properly in aws_iot.c. Handling this properly still causes the same error I mentioned before though.
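To make the missing case concrete, here is a minimal sketch of an event handler that checks the result field of a CONNACK event before treating the client as connected. The types and names below (sketch_evt, sketch_action, handle_evt) are stand-ins invented for this example so that it compiles on its own; they are not the definitions from mqtt.h or aws_iot.c. As noted above, handling the event does not resolve the underlying hang.

```c
#include <errno.h>

/* Stand-in types for this sketch only (hypothetical names). */
enum sketch_evt_type { SKETCH_EVT_CONNACK, SKETCH_EVT_DISCONNECT, SKETCH_EVT_OTHER };

struct sketch_evt {
    enum sketch_evt_type type;
    int result; /* 0 on success, negative errno value on failure */
};

enum sketch_action { ACTION_NONE, ACTION_CONNECTED, ACTION_RETRY_LATER };

/* Decide what the application should do for a given event. */
static enum sketch_action handle_evt(const struct sketch_evt *evt)
{
    switch (evt->type) {
    case SKETCH_EVT_CONNACK:
        if (evt->result != 0) {
            /* Connect failed (for example -ECONNREFUSED):
             * do not mark the client as connected. */
            return ACTION_RETRY_LATER;
        }
        return ACTION_CONNECTED;
    case SKETCH_EVT_DISCONNECT:
        return ACTION_RETRY_LATER;
    default:
        return ACTION_NONE;
    }
}
```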
Any help will be appreciated. Thanks in advance!
Nikil
  • Hi,

    We recently found an issue that corresponds to your description when connect/disconnect looping with a TLS connection:

    On the 21st connection/disconnection cycle the function gets stuck

    We are currently testing and implementing a fix for this in a newer modem fw version (v1.2.2). At this time, with the currently available modem fw (v1.2.0 or older), we unfortunately do not have a good workaround (a reset will help, but that is not a good fix).

    I must apologize for the inconvenience this has caused.

     

    Kind regards,

    Håkon

  • Hi Håkon, 

    Thanks a lot for the response. Do you have an estimate on when the new modem firmware will be out?

    We have a use case where we connect to AWS and send data every 10 minutes. Since I'm not able to keep a connection to AWS alive for more than 5 minutes (https://devzone.nordicsemi.com/support/252089), I'm trying a workaround where I send a message and then disconnect from AWS MQTT. I then reconnect after 10 minutes and do the whole cycle again. Doing this, however, the device needs a reset after every 20 connect cycles (every 200 minutes). Due to the high current consumption when the device resets and connects to the LTE network every 200 minutes, the lifetime of our device is reduced drastically. Could you please advise on possible solutions we can try until the new modem firmware is out? Thanks!

    Nikil

  • Hi,

     

    Nikil Rao said:
    Thanks a lot for the response. Do you have an estimate on when the new modem firmware will be out?

     Unfortunately, I do not have the exact timeline. If you contact Thomas Page ([email protected]) - he might have more details on this matter.

      

    Nikil Rao said:
    Doing this however, the device needs a reset after every 20 connect cycles (every 200 minutes). Due to the high current consumption when the device resets and connects to the LTE network every 200 minutes, the lifetime of our device reduces drastically. Could you please advice on possible solutions we can try until the new modem firmware is out? Thanks!

    My apologies for the inconvenience this bug is causing. I understand that a reset is costly.

    I unfortunately do not have a good workaround for your case. The problem is that in the network chain, you're NAT'ed through an LTE network, which might have its own timeouts (a "NAT timeout") on TCP connections, which would then override any keep-alive the server has on a strict IP level. If you provide a modem trace, we can verify whether it's the network restricting the keep-alive.

     

    Is it possible to use PSM in your application and keep the keep-alive under 327 sec? Or does this add up to a higher current consumption over time compared to your current application's connect/disconnect algorithm?

     

    Kind regards,

    Håkon

  • Hi, 

    Thanks a lot for the reply. I reproduced the AWS disconnect on the nrf9160dk and I've attached a modem trace. Could you please analyze it and let me know if it is indeed a NAT timeout?

    Yes, right now we have kept the message send interval at 300 seconds while we wait for the new modem firmware.

    trace-2020-08-31T08-41-21.471Z.bin

  • Hi Nikil,

     

    Judging by the trace, it seems that the server (AWS) sends a TCP RST (reset) after approx. 375 seconds, which indicates a keep-alive of 250 seconds (1.5 * 250 = 375).

    Is your local MQTT keep-alive set to 250 in the trace? If yes, can you please set it to 1200 and do a new trace?
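    The arithmetic behind that estimate can be written down as a small sketch. MQTT allows the broker to drop a client that has been silent for one and a half times the keep-alive interval; the function names below are made up for this example.

```c
#include <stdint.h>

/* Time of silence after which an MQTT broker may drop the client:
 * 1.5 times the keep-alive interval. */
static uint32_t broker_timeout_s(uint32_t keepalive_s)
{
    return keepalive_s * 3u / 2u;
}

/* Inverse: keep-alive implied by an observed broker timeout. */
static uint32_t keepalive_from_timeout_s(uint32_t timeout_s)
{
    return timeout_s * 2u / 3u;
}
```

    With these, a reset observed at 375 seconds maps back to a keep-alive of 250 seconds, and a keep-alive of 1200 seconds would put the broker timeout at 1800 seconds.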

     

    Kind regards,

    Håkon
