Download client (via AWS FOTA) stops at 50% and can't reconnect to AWS

Hello,

I am using the AWS IoT library and I came across an issue when downloading a new image (FOTA) from AWS. The FOTA process starts so the image is downloaded up to 50% where apparently the peer closes the connection (received length is 0) and it the download client can't reconnect to the server. The overall FOTA process then fails. 

My board is the nrf52840DK with the Ethernet W5500 Arduino shield on top. MQTT pub and sub works fine (also with device shadow). I using nRF SDK 2.6.0. 

I attach here my prj.conf file 1072.prj.conf

Here is the error:

More details: as you can see from the prj.conf file, I am using the Zephyr built-in mbedTLS and all communications are with TLS on (MQTTS and HTTPS). 

As you see from the log at around 50% the download client prints "Peer closed connection, will reconnect" and then fails to reconnect throwing a errno 2 (No such file or directory).

Looking at the download client source code, the "Peer closed connection" error is when a received packet has length 0. 

It always fails in the same way. No idea why,

Please help.

Thank you.

Marco 

Parents
  • Update:

    I have managed to get a full download working. I have simply enabled the download client DBG log and it magically worked. Although if I disable the DBG level log then the issue appears again. I guess the issue has something to do with timing as the DBG level log should slow everything down due to the more verbose log being printed. Indeed I have also added these configs to not miss any LOG message:

    CONFIG_SEGGER_RTT_BUFFER_SIZE_UP=4096
    CONFIG_SEGGER_RTT_MODE_BLOCK_IF_FIFO_FULL=y
    CONFIG_LOG_BLOCK_IN_THREAD=y
    CONFIG_LOG_BLOCK_IN_THREAD_TIMEOUT_MS=-1

    Any thoughts? 

    Here is the new prj.conf file 04737.prj.conf

    and the full RTT log that shows that the download completed successfully. I also see the new application running as I made a specific application that prints " THIS IS THE NEW IMAGE FROM AWS S3!"

    download_full.log

    # SEGGER J-Link RTT Viewer V7.88e Terminal Log File
    # Compiled: 16:28:05 on May 31 2023
    # Logging started @ 15 Jun 2024 12:48:27
    00> [00:00:00.007,659] <inf> daliif: DALI IF READY!
    00> [00:00:00.016,052] <inf> eth_w5500: W5500 Initialized
    00> *** Booting nRF Connect SDK v3.5.99-ncs1 ***
    00> [00:00:00.017,456] <inf> demoapp: DALi Demo App.
    00>  Tag: 
    00>  Branch: conn_mng
    00>  Hash: @a14563a
    00>  Dirty: true
    00>  semver for MCUBOOT: ..
    00> [00:00:00.017,456] <inf> demoapp: THIS IS THE NEW IMAGE FROM AWS S3!
    00> [00:00:00.017,913] <inf> demoapp: Launching AWS test application: OK
    00> [00:00:00.017,974] <inf> dhcp_test: Network connectivity lost
    00> [00:00:00.018,066] <inf> dhcp_test: Starting DHCP on w5500@0: index=1
    00> [00:00:07.141,204] <inf> net_dhcpv4: Received: 192.168.1.117
    00> [00:00:07.141,418] <inf> dhcp_test:    Address[1]: 192.168.1.117
    00> [00:00:07.141,479] <inf> dhcp_test:     Subnet[1]: 255.255.255.0
    00> [00:00:07.141,540] <inf> dhcp_test:     Router[1]: 192.168.1.1
    00> [00:00:07.141,571] <inf> dhcp_test: Lease time[1]: 86400 seconds
    00> [00:00:07.141,784] <inf> dhcp_test: Network 
    00> [00:00:10.727,905] <inf> net_mqtt: Connect completed
    00> [00:00:10.824,127] <dbg> aws_test: mqtt_event_cb: MQTT event: CONNACK [0] result: 0
    00> [00:00:10.824,157] <inf> aws_test: Subscribing to 1 topic(s)
    00> [00:00:10.923,370] <dbg> aws_test: mqtt_event_cb: MQTT event: SUBACK [7] result: 0
    00> [00:00:10.927,062] <inf> aws_test: PUBLISHED on topic "elimo/data" [ id: 1 qos: 0 ], payload: 13 B
    00> [00:00:10.927,093] <dbg> aws_test: publish_message: Published payload:
    00>                                    7b 22 63 6f 75 6e 74 65  72 22 3a 30 7d          |{"counte r":0}   
    00> [00:00:11.028,839] <err> aws_test: Socket closed/error
    00> [00:00:11.031,707] <inf> net_mqtt_sock_tls: Closing socket 0
    00> [00:00:11.036,956] <dbg> aws_test: mqtt_event_cb: MQTT event: DISCONNECT [1] result: 0
    00> [00:00:11.036,987] <err> net_sock: invalid access on sock 0 by thread 0x200028b8
    00> [00:00:11.036,987] <inf> aws_test: mbedTLS heap usage: MAX 47488/65536 (82) CUR 0 (0)
    00> [00:00:12.037,078] <inf> aws_test: Resolve broker address...
    00> [00:00:12.042,022] <inf> aws_test: Resolved: 13.50.43.73:8883
    00> [00:00:12.042,053] <inf> aws_test: AWS client loop...
    00> [00:00:15.532,989] <inf> net_mqtt: Connect completed
    00> [00:00:15.652,862] <dbg> aws_test: mqtt_event_cb: MQTT event: CONNACK [0] result: 0
    00> [00:00:15.652,893] <inf> aws_test: Subscribing to 1 topic(s)
    00> [00:00:15.744,873] <dbg> aws_test: mqtt_event_cb: MQTT event: SUBACK [7] result: 0
    00> [00:00:15.748,535] <inf> aws_test: PUBLISHED on topic "elimo/data" [ id: 2 qos: 0 ], payload: 13 B
    00> [00:00:15.748,565] <dbg> aws_test: publish_message: Published payload:
    00>                                    7b 22 63 6f 75 6e 74 65  72 22 3a 30 7d          |{"counte r":0}   
    00> [00:00:15.850,402] <err> aws_test: Socket closed/error
    00> [00:00:15.853,240] <inf> net_mqtt_sock_tls: Closing socket 0
    00> [00:00:15.858,489] <dbg> aws_test: mqtt_event_cb: MQTT event: DISCONNECT [1] result: 0
    00> [00:00:15.858,489] <err> net_sock: invalid access on sock 0 by thread 0x200028b8
    00> [00:00:15.858,520] <inf> aws_test: mbedTLS heap usage: MAX 47488/65536 (82) CUR 0 (0)
    
    # Logging stopped @ 15 Jun 2024 12:48:59
    

  • Finally, after some investigation, I confirm that disabling one of the enabled DBG LOG re-creates the issue.

    I have also noticed that even during the full successful download, the disconnection happens but the download client is able to reconnect and keep downloading. Here is the related piece of RTT log:

    To recap, at 50% of download the download client allegedly experiences a disconnection and can't reconnect (errno -2). If then I add:

    CONFIG_LOG_BLOCK_IN_THREAD=y
    CONFIG_NET_SOCKETS_LOG_LEVEL_DBG=y
    CONFIG_MQTT_HELPER_LOG_LEVEL_DBG=y
    CONFIG_MQTT_LOG_LEVEL_DBG=y
    CONFIG_DOWNLOAD_CLIENT_LOG_LEVEL_DBG=y
    the download client is able to reconnect and the full FOTA process succeeds.
    Important: CONFIG_LOG_BLOCK_IN_THREAD=y is definitely crucial. 
  • Hi,

     

    Marco Russi said:
    I did not set CONFIG_LOG_MODE_IMMEDIATE=y anywhere so I guess that I am using the default mode. 

    The default is normally deferred logging, but just to be sure; please verify by checking your build-folder/zephyr/.config file.

    Marco Russi said:
    When you say that I am having issues with logging do you mean that the log is related to the download issue I have? How can CONFIG_LOG_BLOCK_IN_THREAD allow the download client to reconnect successfully? And could also the log cause the disconnection at exactly 50% ?

    If your logger does in-place logging, you are effectively printing and blocking all threads while printing. This will cause problems, especially if you log in interrupt context (SPI for instance)

     

    If you play around with the NET buffer sizes, do you see a difference in how the device behaves?

    For instance by setting "CONFIG_NET_BUF_DATA_SIZE=1024" or adjusting the amount of buffers available?

     

    Kind regards,

    Håkon

  • Hi, I have fixed the log as you suggested but the 50% download issue remains. I the tried to set CONFIG_NET_BUF_DATA_SIZE=1024 but that meant also setting CONFIG_NET_BUF_FIXED_DATA_SIZE=y. Unfortunately this increases RAM footprint a lot and linker fails. I anyway managed to make it but the fw crashes by stack overflow when DHCP starts (or when it gets response from the DHCP server). I then increased both my task and the system work queue stack but I can't solve the stack overflow issue. Note that I can't make stacks too big as RAM would not fit.

    Any suggestions?

  • I tried to set CONFIG_NET_BUF_DATA_SIZE=1024 and not set CONFIG_NET_BUF_FIXED_DATA_SIZE=y and it compiles fine. Stack overflow happen though.

  • Hi,

     

    As a generic note:

    If you have faults occurring, please report the full log.

     

    In this case, revert to your original NET_BUF_* configuration, and rather adjust the overall buffer size via CONFIG_NET_BUF_DATA_POOL_SIZE and see if this changes the behavior.

    We have not tested the download sample with other devices than the IP-capable nRF's (nRF91-series, nRF70-series), so I would expect that there could be some adjustments that you need to do to make it fit for the W5500, in terms of generic configuration.

     

    Kind regards,

    Håkon

  • Adding CONFIG_NET_BUF_DATA_POOL_SIZE implies enabling CONFIG_NET_BUF_VARIABLE_DATA_SIZE which is experimental. I have tried and a task goes in stak overflow. I would not expect that. 

    Anyway I am playing with all those config macros and I guess I'll need to spend days trying to find the right combination with little clue of what I am doing. 

    Thank you for your help so far. 

Reply Children
  • When experiencing a fatal error / fault, please share the output.

    Marco Russi said:
    Anyway I am playing with all those config macros and I guess I'll need to spend days trying to find the right combination with little clue of what I am doing. 

    The default setting for CONFIG_NET_BUF_DATA_SIZE is 128, try to atleast adjust it to 256, and see if the failure point moves from 50% to something else.

     

    Kind regards,

    Håkon

  • HI, I have tried that but same result.

    Anyway I have just found something interesting: at 50% the AWS server (peer) sends an HTTP header with "connection: close" so the download client reconnects as expected. Indeed the header length is 19 bytes longer than previous headers.

    This causes a reconnection attempt that fails due to some reason still to find out but the main problem is that the peer sends "connection: close" in the http header. 

    Any idea why that happens?

    NOTE: Ingore the BUBU part in the "Peer closed connection, ..." LOG line as I put that do better identify the line in the code. Indeed the piece of code that detects the disconnection request is function http_header_parse in http.c in the nrf download client library :

    p = strnstr(client->buf, "\r\nconnection: close", sizeof(client->buf));
    if (p) {
    LOG_HEXDUMP_DBG(client->buf, *hdr_len, "HTTP response");
    LOG_WRN("BUBU Peer closed connection, will re-connect");
    client->http.connection_close = true;
    }

    I have also added the LOG line to print the header when that happens. 

    Thanks. 

  • Hi,

    problem solved.

    Apparently the http fragmentation size must be 4096. I have added:

    CONFIG_DOWNLOAD_CLIENT_BUF_SIZE=4096
    CONFIG_DOWNLOAD_CLIENT_HTTP_FRAG_SIZE_4096=y
    and it now completes successfully. 
    If I define the fragmentation size to 1024 then I fails at 25%. The fact that it was originally failing at 50% is because the size was apparently 2048 by default. 
    What do you think ?
    Thanks. 
  • Hi,

     

    As previously mentioned, the download library is developed and tested with the IP-capable nRF devices in mind, so it highly likely needs to be adjusted/configured differently for your chosen ethernet bridge.

    What I can recommend, in addition to what you've already discovered, is to try to disable automatic ranges and see if this happens to have an impact: CONFIG_DOWNLOAD_CLIENT_RANGE_REQUESTS=n

      

    Kind regards,

    Håkon

Related