Download client (via AWS FOTA) stops at 50% and can't reconnect to AWS

Hello,

I am using the AWS IoT library and I came across an issue when downloading a new image (FOTA) from AWS. The FOTA process starts so the image is downloaded up to 50% where apparently the peer closes the connection (received length is 0) and it the download client can't reconnect to the server. The overall FOTA process then fails. 

My board is the nrf52840DK with the Ethernet W5500 Arduino shield on top. MQTT pub and sub works fine (also with device shadow). I using nRF SDK 2.6.0. 

I attach here my prj.conf file 1072.prj.conf

Here is the error:

More details: as you can see from the prj.conf file, I am using the Zephyr built-in mbedTLS and all communications are with TLS on (MQTTS and HTTPS). 

As you see from the log at around 50% the download client prints "Peer closed connection, will reconnect" and then fails to reconnect throwing a errno 2 (No such file or directory).

Looking at the download client source code, the "Peer closed connection" error is when a received packet has length 0. 

It always fails in the same way. No idea why,

Please help.

Thank you.

Marco 

Parents
  • Update:

    I have managed to get a full download working. I have simply enabled the download client DBG log and it magically worked. Although if I disable the DBG level log then the issue appears again. I guess the issue has something to do with timing as the DBG level log should slow everything down due to the more verbose log being printed. Indeed I have also added these configs to not miss any LOG message:

    CONFIG_SEGGER_RTT_BUFFER_SIZE_UP=4096
    CONFIG_SEGGER_RTT_MODE_BLOCK_IF_FIFO_FULL=y
    CONFIG_LOG_BLOCK_IN_THREAD=y
    CONFIG_LOG_BLOCK_IN_THREAD_TIMEOUT_MS=-1

    Any thoughts? 

    Here is the new prj.conf file 04737.prj.conf

    and the full RTT log that shows that the download completed successfully. I also see the new application running as I made a specific application that prints " THIS IS THE NEW IMAGE FROM AWS S3!"

    download_full.log

    # SEGGER J-Link RTT Viewer V7.88e Terminal Log File
    # Compiled: 16:28:05 on May 31 2023
    # Logging started @ 15 Jun 2024 12:48:27
    00> [00:00:00.007,659] <inf> daliif: DALI IF READY!
    00> [00:00:00.016,052] <inf> eth_w5500: W5500 Initialized
    00> *** Booting nRF Connect SDK v3.5.99-ncs1 ***
    00> [00:00:00.017,456] <inf> demoapp: DALi Demo App.
    00>  Tag: 
    00>  Branch: conn_mng
    00>  Hash: @a14563a
    00>  Dirty: true
    00>  semver for MCUBOOT: ..
    00> [00:00:00.017,456] <inf> demoapp: THIS IS THE NEW IMAGE FROM AWS S3!
    00> [00:00:00.017,913] <inf> demoapp: Launching AWS test application: OK
    00> [00:00:00.017,974] <inf> dhcp_test: Network connectivity lost
    00> [00:00:00.018,066] <inf> dhcp_test: Starting DHCP on w5500@0: index=1
    00> [00:00:07.141,204] <inf> net_dhcpv4: Received: 192.168.1.117
    00> [00:00:07.141,418] <inf> dhcp_test:    Address[1]: 192.168.1.117
    00> [00:00:07.141,479] <inf> dhcp_test:     Subnet[1]: 255.255.255.0
    00> [00:00:07.141,540] <inf> dhcp_test:     Router[1]: 192.168.1.1
    00> [00:00:07.141,571] <inf> dhcp_test: Lease time[1]: 86400 seconds
    00> [00:00:07.141,784] <inf> dhcp_test: Network 
    00> [00:00:10.727,905] <inf> net_mqtt: Connect completed
    00> [00:00:10.824,127] <dbg> aws_test: mqtt_event_cb: MQTT event: CONNACK [0] result: 0
    00> [00:00:10.824,157] <inf> aws_test: Subscribing to 1 topic(s)
    00> [00:00:10.923,370] <dbg> aws_test: mqtt_event_cb: MQTT event: SUBACK [7] result: 0
    00> [00:00:10.927,062] <inf> aws_test: PUBLISHED on topic "elimo/data" [ id: 1 qos: 0 ], payload: 13 B
    00> [00:00:10.927,093] <dbg> aws_test: publish_message: Published payload:
    00>                                    7b 22 63 6f 75 6e 74 65  72 22 3a 30 7d          |{"counte r":0}   
    00> [00:00:11.028,839] <err> aws_test: Socket closed/error
    00> [00:00:11.031,707] <inf> net_mqtt_sock_tls: Closing socket 0
    00> [00:00:11.036,956] <dbg> aws_test: mqtt_event_cb: MQTT event: DISCONNECT [1] result: 0
    00> [00:00:11.036,987] <err> net_sock: invalid access on sock 0 by thread 0x200028b8
    00> [00:00:11.036,987] <inf> aws_test: mbedTLS heap usage: MAX 47488/65536 (82) CUR 0 (0)
    00> [00:00:12.037,078] <inf> aws_test: Resolve broker address...
    00> [00:00:12.042,022] <inf> aws_test: Resolved: 13.50.43.73:8883
    00> [00:00:12.042,053] <inf> aws_test: AWS client loop...
    00> [00:00:15.532,989] <inf> net_mqtt: Connect completed
    00> [00:00:15.652,862] <dbg> aws_test: mqtt_event_cb: MQTT event: CONNACK [0] result: 0
    00> [00:00:15.652,893] <inf> aws_test: Subscribing to 1 topic(s)
    00> [00:00:15.744,873] <dbg> aws_test: mqtt_event_cb: MQTT event: SUBACK [7] result: 0
    00> [00:00:15.748,535] <inf> aws_test: PUBLISHED on topic "elimo/data" [ id: 2 qos: 0 ], payload: 13 B
    00> [00:00:15.748,565] <dbg> aws_test: publish_message: Published payload:
    00>                                    7b 22 63 6f 75 6e 74 65  72 22 3a 30 7d          |{"counte r":0}   
    00> [00:00:15.850,402] <err> aws_test: Socket closed/error
    00> [00:00:15.853,240] <inf> net_mqtt_sock_tls: Closing socket 0
    00> [00:00:15.858,489] <dbg> aws_test: mqtt_event_cb: MQTT event: DISCONNECT [1] result: 0
    00> [00:00:15.858,489] <err> net_sock: invalid access on sock 0 by thread 0x200028b8
    00> [00:00:15.858,520] <inf> aws_test: mbedTLS heap usage: MAX 47488/65536 (82) CUR 0 (0)
    
    # Logging stopped @ 15 Jun 2024 12:48:59
    

  • Finally, after some investigation, I confirm that disabling one of the enabled DBG LOG re-creates the issue.

    I have also noticed that even during the full successful download, the disconnection happens but the download client is able to reconnect and keep downloading. Here is the related piece of RTT log:

    To recap, at 50% of download the download client allegedly experiences a disconnection and can't reconnect (errno -2). If then I add:

    CONFIG_LOG_BLOCK_IN_THREAD=y
    CONFIG_NET_SOCKETS_LOG_LEVEL_DBG=y
    CONFIG_MQTT_HELPER_LOG_LEVEL_DBG=y
    CONFIG_MQTT_LOG_LEVEL_DBG=y
    CONFIG_DOWNLOAD_CLIENT_LOG_LEVEL_DBG=y
    the download client is able to reconnect and the full FOTA process succeeds.
    Important: CONFIG_LOG_BLOCK_IN_THREAD=y is definitely crucial. 
Reply
  • Finally, after some investigation, I confirm that disabling one of the enabled DBG LOG re-creates the issue.

    I have also noticed that even during the full successful download, the disconnection happens but the download client is able to reconnect and keep downloading. Here is the related piece of RTT log:

    To recap, at 50% of download the download client allegedly experiences a disconnection and can't reconnect (errno -2). If then I add:

    CONFIG_LOG_BLOCK_IN_THREAD=y
    CONFIG_NET_SOCKETS_LOG_LEVEL_DBG=y
    CONFIG_MQTT_HELPER_LOG_LEVEL_DBG=y
    CONFIG_MQTT_LOG_LEVEL_DBG=y
    CONFIG_DOWNLOAD_CLIENT_LOG_LEVEL_DBG=y
    the download client is able to reconnect and the full FOTA process succeeds.
    Important: CONFIG_LOG_BLOCK_IN_THREAD=y is definitely crucial. 
Children
  • Hi,

     

    What log mode are you using?

     

    If you are using log mode immediate (CONFIG_LOG_MODE_IMMEDIATE=y), then you'll do in-place logging; which will cause a timing skew in your firmware.

    What I would recommend is that you use "CONFIG_LOG_MODE_DEFERRED", to ensure that logging takes place in a dedicated thread.

    Marco Russi said:
    Important: CONFIG_LOG_BLOCK_IN_THREAD=y is definitely crucial.

    This also implies that you're having issues with logging. Instead; try increasing the dedicated log buffer size:

    CONFIG_LOG_BUFFER_SIZE=10240

     

    Kind regards,

    Håkon

  • Hi thank you.

    I did not set CONFIG_LOG_MODE_IMMEDIATE=y anywhere so I guess that I am using the default mode. 

    When you say that I am having issues with logging do you mean that the log is related to the download issue I have? How can CONFIG_LOG_BLOCK_IN_THREAD allow the download client to reconnect successfully? And could also the log cause the disconnection at exactly 50% ?

    Thanks. 

  • Hi,

     

    Marco Russi said:
    I did not set CONFIG_LOG_MODE_IMMEDIATE=y anywhere so I guess that I am using the default mode. 

    The default is normally deferred logging, but just to be sure; please verify by checking your build-folder/zephyr/.config file.

    Marco Russi said:
    When you say that I am having issues with logging do you mean that the log is related to the download issue I have? How can CONFIG_LOG_BLOCK_IN_THREAD allow the download client to reconnect successfully? And could also the log cause the disconnection at exactly 50% ?

    If your logger does in-place logging, you are effectively printing and blocking all threads while printing. This will cause problems, especially if you log in interrupt context (SPI for instance)

     

    If you play around with the NET buffer sizes, do you see a difference in how the device behaves?

    For instance by setting "CONFIG_NET_BUF_DATA_SIZE=1024" or adjusting the amount of buffers available?

     

    Kind regards,

    Håkon

  • Hi, I have fixed the log as you suggested but the 50% download issue remains. I the tried to set CONFIG_NET_BUF_DATA_SIZE=1024 but that meant also setting CONFIG_NET_BUF_FIXED_DATA_SIZE=y. Unfortunately this increases RAM footprint a lot and linker fails. I anyway managed to make it but the fw crashes by stack overflow when DHCP starts (or when it gets response from the DHCP server). I then increased both my task and the system work queue stack but I can't solve the stack overflow issue. Note that I can't make stacks too big as RAM would not fit.

    Any suggestions?

  • I tried to set CONFIG_NET_BUF_DATA_SIZE=1024 and not set CONFIG_NET_BUF_FIXED_DATA_SIZE=y and it compiles fine. Stack overflow happen though.

Related