This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Socket send function hangs

Hi,

I have a custom board with nrf9160 and I'm using it to send messages to the AWS server via mqtt. I managed to reproduce a bug where the TLS socket send function hangs (similar to -https://devzone.nordicsemi.com/f/nordic-q-a/62249/hang-in-sendmsg-on-the-nrf9160). I was using v1.3 of ncs.

Steps to reproduce - 

  • Connect to LTE netwrok. 
  • Connect to AWS via MQTT
  • When a successful connection is established, send a disconnect message
  • When successfully disconnected, reconnect and keep repeating this loop
  • On the 21st connection/disconnection cycle the function gets stuck in the mqtt_client_tls_write function while calling the 
    send(client->transport.tls.sock, data + offset, datalen - offset, 0);
    function

I switched to the latest master branch of ncs and changed the send function to send(client->transport.tls.sock, data + offset, datalen - offset, MSG_DONTWAIT) Now the send function doesn't hang but returns a value of -1 and the errno is -EAGAIN when I follow the steps to reproduce this bug.

How do I handle the socket timeout gracefully? After this timeout, if I try to reconnect to AWS I get 

aws_iot: getaddrinfo, error -10
aws_iot: client_broker_init, error: -10

and I have to restart the device to get it working again

I also noticed an error case which doesn't seem to be handled in aws_iot.c. When the socket send function returns -1, then in file mqtt.c, the function client_connect calls client_disconnect which in turn calls disconnect_event_notify function. This notifies the aws_iot client with 

evt.type = MQTT_EVT_CONNACK;
evt.result = -ECONNREFUSED;
which is not handled properly in aws_iot.c. Handling this properly still causes the same error I mentioned before though.
Any help will be appreciated. Thanks in advance!
Nikil
  • Hi Håkon,

    I have CONFIG_MQTT_KEEPALIVE=1200 in my code. Moreover this is what I get when I enable MQTT logs - 

    [00:00:21.134,094]<dbg> net_mqtt_sock_tls.mqtt_client_tls_connect: (0x20020ff8): Created socket 1
    [00:00:31.134,979]<dbg> net_mqtt_sock_tls.mqtt_client_tls_connect: (0x20020ff8): Connect completed
    [00:00:31.135,101]<dbg> net_mqtt_enc.connect_request_encode: (0x20020ff8): Encoding Protocol Description. Str:MQTT Size:00000004.
    [00:00:31.135,101]<dbg> net_mqtt_enc.pack_utf8_str: (0x20020ff8): >> str_size:00000006 cur:0x2002919a, end:0x2002957d
    [00:00:31.135,131]<dbg> net_mqtt_enc.pack_uint16: (0x20020ff8): >> val:0004 cur:0x2002919a, end:0x2002957d
    [00:00:31.135,131]<dbg> net_mqtt_enc.connect_request_encode: (0x20020ff8): Encoding Protocol Version 04.
    [00:00:31.135,131]<dbg> net_mqtt_enc.pack_uint8: (0x20020ff8): >> val:04 cur:0x200291a0, end:0x2002957d
    [00:00:31.135,162]<dbg> net_mqtt_enc.pack_uint8: (0x20020ff8): >> val:00 cur:0x200291a1, end:0x2002957d
    [00:00:31.135,162]<dbg> net_mqtt_enc.connect_request_encode: (0x20020ff8): Encoding Keep Alive Time 04b0.
    [00:00:31.135,162]<dbg> net_mqtt_enc.pack_uint16: (0x20020ff8): >> val:04b0 cur:0x200291a2, end:0x2002957d
    [00:00:31.135,162]<dbg> net_mqtt_enc.connect_request_encode: (0x20020ff8): Encoding Client Id. Str:nrf-352656100388050 Size:00000013.
    [00:00:31.135,192]<dbg> net_mqtt_enc.pack_utf8_str: (0x20020ff8): >> str_size:00000015 cur:0x200291a4, end:0x2002957d
    [00:00:31.135,192]<dbg> net_mqtt_enc.pack_uint16: (0x20020ff8): >> val:0013 cur:0x200291a4, end:0x2002957d
    [00:00:31.135,223]<dbg> net_mqtt_enc.mqtt_encode_fixed_header: (0x20020ff8): << msg type:0x10 length:0x0000001f
    [00:00:31.135,223]<dbg> net_mqtt_enc.packet_length_encode: (0x20020ff8): >> length:0x0000001f cur:0x00000000, end:0x00000000
    [00:00:31.135,223]<dbg> net_mqtt_enc.mqtt_encode_fixed_header: (0x20020ff8): Fixed header length = 02
    [00:00:31.135,223]<dbg> net_mqtt_enc.pack_uint8: (0x20020ff8): >> val:10 cur:0x20029198, end:0x2002957d
    [00:00:31.135,253]<dbg> net_mqtt_enc.packet_length_encode: (0x20020ff8): >> length:0x0000001f cur:0x20029199, end:0x2002957d
    [00:00:31.135,986]<dbg> net_mqtt.client_connect: (0x20020ff8): Connect completed
    

    Encoding Keep Alive Time 04b0. shows that it is indeed 1200 seconds

  • Hi,

     

    The trace show a TCP reset coming from the AWS server after 375 seconds, but it might be that the timestamping has been corrupted in the modem trace (its highspeed and high throughput, no UART flow control). If you provide another trace showing the issue, I can try to see if this one also shows the same behavior.

    You're testing with mfw v1.2.0, right?

    Kind regards,

    Håkon

  • Hi,

    I've attached a trace I took today. Yes, I'm using mfw 1.2.0.

    Another interesting thing that I found while scouring the internet to find a solution for this problem was this - https://stackoverflow.com/questions/58322484/how-to-keep-tcp-connections-alive-on-aws-network-loadbalancer. I'm not sure if this is even related or not but maybe there's some additional TCP timeout apart from the MQTT timeout which I need to set somewhere. The TCP timeout mentioned in the link above is 350 seconds which seems to be exactly the duration above which I have problems. 

     trace-2020-09-01T14-40-23.284Z.bin

  • Hi Nikil,

     

    Sorry for the late response.

    The second trace here is also a bit strange, as there seem to be packets missing from the trace. The first one looks more complete than the second one.

    Based on the link that you sent, the behavior is consistent with the first trace (or the valid part of it atleast), where it takes approx. 375 seconds before the AWS server sends a TCP RST. It may indicate that network load balancer is active, as the trace shows different IPs to the same DNS when you try to initiate a connection the second time. If it is, the restriction here should apply: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

     

    Kind regards,

    Håkon

Related