
Socket send function hangs

Hi,

I have a custom board with an nRF9160, and I'm using it to send messages to an AWS server via MQTT. I managed to reproduce a bug where the TLS socket send function hangs (similar to https://devzone.nordicsemi.com/f/nordic-q-a/62249/hang-in-sendmsg-on-the-nrf9160). I was using v1.3 of ncs.

Steps to reproduce - 

  • Connect to the LTE network.
  • Connect to AWS via MQTT
  • When a successful connection is established, send a disconnect message
  • When successfully disconnected, reconnect and keep repeating this loop
  • On the 21st connection/disconnection cycle, execution gets stuck in the mqtt_client_tls_write function, at the call to
    send(client->transport.tls.sock, data + offset, datalen - offset, 0);

I switched to the latest master branch of ncs and changed the call to send(client->transport.tls.sock, data + offset, datalen - offset, MSG_DONTWAIT). Now, when I follow the steps above, send no longer hangs but returns -1 with errno set to EAGAIN.
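For reference, the usual POSIX pattern for an EAGAIN from a non-blocking send is to wait until the socket is writable, with a bounded timeout, and then retry. Below is a minimal sketch of that pattern, written against plain POSIX sockets. The helper name send_all_with_timeout and its timeout_ms parameter are hypothetical and not part of ncs or the MQTT library, and whether poll() behaves the same way on the nRF9160 TLS socket is an assumption that would need to be checked on the target.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not an ncs API): send the whole buffer without
 * blocking indefinitely. Returns 0 on success, -ETIMEDOUT if the socket
 * stays unwritable for timeout_ms during any single wait, or another
 * negative errno value on failure. */
int send_all_with_timeout(int sock, const uint8_t *data, size_t datalen,
                          int timeout_ms)
{
    size_t offset = 0;

    assert(data != NULL);

    while (offset < datalen) {
        ssize_t ret = send(sock, data + offset, datalen - offset,
                           MSG_DONTWAIT);

        if (ret >= 0) {
            offset += (size_t)ret; /* partial sends are normal */
            continue;
        }
        if (errno == EINTR) {
            continue; /* interrupted by a signal, just retry */
        }
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            return -errno; /* real error: let the caller close the socket */
        }

        /* Send buffer is full: wait until writable, or give up. */
        struct pollfd pfd = { .fd = sock, .events = POLLOUT };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc == 0) {
            return -ETIMEDOUT;
        }
        if (rc < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -errno;
        }
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
            return -ECONNRESET;
        }
    }

    return 0;
}
```

With this shape, a timeout becomes an ordinary error return, so the caller can close the socket and schedule a reconnect instead of hanging inside send. It does not by itself explain or fix the getaddrinfo failure described below.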

How do I handle the socket timeout gracefully? After this timeout, if I try to reconnect to AWS, I get:

aws_iot: getaddrinfo, error -10
aws_iot: client_broker_init, error: -10

and I have to restart the device to get it working again.

I also noticed an error case that doesn't seem to be handled in aws_iot.c. When the socket send function returns -1, client_connect in mqtt.c calls client_disconnect, which in turn calls disconnect_event_notify. This notifies the aws_iot client with:

evt.type = MQTT_EVT_CONNACK;
evt.result = -ECONNREFUSED;
which is not handled properly in aws_iot.c. Handling this properly still causes the same error I mentioned before though.
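As an illustration of the check in question, here is a minimal, self-contained sketch of an event handler that looks at the result of a CONNACK event before treating the client as connected. The demo_ types and the demo_handle_evt function are hypothetical stand-ins that only mirror the two fields quoted above (type and result); they are not the real definitions from mqtt.h or aws_iot.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins mirroring the evt.type / evt.result fields quoted
 * above. These are NOT the real MQTT library definitions. */
enum demo_evt_type {
    DEMO_EVT_CONNACK,
    DEMO_EVT_DISCONNECT,
};

struct demo_evt {
    enum demo_evt_type type;
    int result; /* 0 on success, negative errno on failure */
};

/* What the application should do next; also hypothetical. */
enum demo_action {
    DEMO_ACTION_NONE,
    DEMO_ACTION_CONNECTED,         /* safe to subscribe and publish */
    DEMO_ACTION_CLEANUP_AND_RETRY, /* close the socket, back off, reconnect */
};

enum demo_action demo_handle_evt(const struct demo_evt *evt)
{
    assert(evt != NULL);

    switch (evt->type) {
    case DEMO_EVT_CONNACK:
        /* A CONNACK event with a non-zero result (for example
         * -ECONNREFUSED) means the connection attempt failed, so it
         * must not be treated as a successful connect. */
        if (evt->result != 0) {
            return DEMO_ACTION_CLEANUP_AND_RETRY;
        }
        return DEMO_ACTION_CONNECTED;
    case DEMO_EVT_DISCONNECT:
        return DEMO_ACTION_CLEANUP_AND_RETRY;
    default:
        return DEMO_ACTION_NONE;
    }
}
```

Returning an action rather than reconnecting inside the handler is a design choice made for this sketch so the logic can be exercised in isolation.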
Any help will be appreciated. Thanks in advance!
Nikil
  • Hi,

     

    Nikil Rao said:
    Thanks a lot for the response. Do you have an estimate on when the new modem firmware will be out?

    Unfortunately, I do not have an exact timeline. If you contact Thomas Page ([email protected]), he might have more details on this matter.

      

    Nikil Rao said:
    With this approach, however, the device needs a reset after every 20 connect cycles (every 200 minutes). Because of the high current consumption when the device resets and reconnects to the LTE network every 200 minutes, the lifetime of our device is reduced drastically. Could you please advise on possible solutions we can try until the new modem firmware is out? Thanks!

    My apologies for the inconvenience this bug is causing. I understand that a reset is costly.

    Unfortunately, I do not have a good workaround for your case. The problem is that, in the network chain, you're NAT'ed through an LTE network, which might have its own timeout (a "NAT timeout") on TCP connections. That timeout would override any keep-alive the server has on a strict IP level. If you provide a modem trace, we can verify whether it's the network restricting the keep-alive.

     

    Would it be possible to use PSM in your application and keep the "keep alive" under 327 sec? Or would this add up to a higher current consumption over time compared to your current application's connect/disconnect algorithm?

     

    Kind regards,

    Håkon
