aws_iot_connect may not return

I have been developing an application using the AWS IoT library on build target nrf7002dk/nrf5340/cpuapp with nRF Connect SDK v2.7.0. The application connects to a Wi-Fi hotspot shared from my phone. Occasionally the call to aws_iot_connect never returns, and when that happens I have observed the following chain of calls.

aws_iot_connect
-> mqtt_helper_connect
-> client_connect
-> mqtt_connect
-> client_connect
-> mqtt_transport_connect
-> mqtt_client_tls_connect
-> zsock_connect
-> z_impl_zsock_connect
-> ztls_connect_ctx
-> tls_mbedtls_handshake with timeout K_FOREVER
-> wait_for_reason with reason MBEDTLS_ERR_SSL_WANT_READ
-> wait with timeout K_FOREVER, ZSOCK_POLLIN
-> zsock_poll
-> z_impl_zsock_poll
-> zsock_poll_internal
-> k_poll

And k_poll does not return.
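
To illustrate the blocking behaviour at the bottom of this chain, here is a minimal host-side sketch using POSIX poll(), not the Zephyr code itself. The helper names wait_readable, demo_silent_peer and demo_answering_peer are hypothetical, and a pipe stands in for a TLS peer. The assumption is only that a poll with an infinite timeout (-1 here, K_FOREVER in the chain above) cannot return while the peer stays silent, whereas a finite timeout gives the caller a chance to give up and retry.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable or timeout_ms elapses.
 * timeout_ms == -1 means "wait forever", comparable to K_FOREVER.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ret;

    assert(timeout_ms >= -1);

    ret = poll(&pfd, 1, timeout_ms);
    if (ret < 0) {
        return -1;          /* poll() itself failed */
    }
    if (ret == 0) {
        return 0;           /* nothing arrived before the timeout */
    }
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* A pipe whose write end stays silent stands in for a peer that never
 * sends the next handshake message. With a finite timeout the caller
 * regains control; with -1 this call would block indefinitely.
 */
static int demo_silent_peer(int timeout_ms)
{
    int fds[2];
    int ret;

    if (pipe(fds) != 0) {
        return -1;
    }
    ret = wait_readable(fds[0], timeout_ms);
    close(fds[0]);
    close(fds[1]);
    return ret;
}

/* Same setup, but one byte is written first, so poll() returns at once. */
static int demo_answering_peer(int timeout_ms)
{
    int fds[2];
    int ret;
    const char byte = 'x';

    if (pipe(fds) != 0) {
        return -1;
    }
    if (write(fds[1], &byte, 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    ret = wait_readable(fds[0], timeout_ms);
    close(fds[0]);
    close(fds[1]);
    return ret;
}
```

Calling demo_silent_peer(50) returns after roughly 50 ms with a timeout result, while demo_silent_peer(-1) would hang in the same way as the k_poll call described above.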

Is this something that should be expected to happen occasionally (in which case perhaps a case like this should be handled in nRF Connect SDK code or Zephyr code), or should this be preventable?

  • Hi Johan

    I assume you're referring to the k_poll() in line 2297 of sockets.c in \v2.7.0\zephyr\subsys\net\lib\sockets\sockets.c here, correct?

    It should eventually run into a timeout or be cancelled by an interrupt, but this doesn't happen I take it? Are you able to add a line of logging that prints a potential return code from this call and the calls before so we can see what's happening?

    Can you also share some more details on how often exactly "occasionally" is here?

    Best regards,

    Simon

  • Yes, that is the k_poll() I am referring to. In ztls_connect_ctx, the function tls_mbedtls_handshake is called with timeout K_FOREVER, which I think is forwarded as the timeout to k_poll. This seems to be confirmed by log messages that I added before and after line 2297 of sockets.c.

    The log message "k_poll1", placed before the call, is printed by the thread in which tls_mbedtls_handshake is called, but the log message "k_poll2", placed after it, never appears.

    Typically I have been able to reproduce this within 5 minutes by repeatedly resetting the DK and switching my phone hotspot off and on. The bug has become rarer since I started calling aws_iot_connect in response to suitable NET_EVENT_DNS_SERVER_ADD events instead of the NET_EVENT_L4_CONNECTED event (see "aws_iot sample: the purpose of delayed connect_work").
