Download client gets stuck in an invalid state after an EBADMSG

When testing the robustness of the FOTA downloader/download client I came across a bug. When the download client receives an unexpected/invalid packet the client correctly handles this with calling the callback with an error status. When the download is started again the download fails in another way,  and gets stuck. I'm downloading trough CoAP and am using NCS2.1.2.

Observations

To test this I set up a CoAP server using the CoAP package (NodeJS). I created an endpoint called '/fw/timeout'. The callback of this endpoint is empty which results in the stack responding with an empty CoAP ACK message (which is a valid message). I use the FOTA downloader to download this. The first attempt results in this log:

[00:00:35.408,782] <dbg> download_client: client_connect: family: 1, type: 2, proto: 17
[00:00:35.409,057] <inf> download_client: Connecting to coap://server_here:6666
[00:00:35.409,088] <dbg> download_client: client_connect: fd 1, addrlen 8, fam IPv4, port 6666
[00:00:35.410,827] <dbg> download_client: coap_request_send: CoAP next block: 0
[00:00:35.411,346] <inf> download_client: Downloading: fw/timeout [0]
[00:00:35.411,468] <dbg> download_client: download_thread: Receiving up to 600 bytes at 0x2000fe2d...
[00:00:39.358,551] <dbg> download_client: request_resend: Socket timeout, resending
[00:00:39.359,283] <dbg> download_client: coap_request_send: CoAP next block: 0
[00:00:39.368,072] <dbg> download_client: download_thread: Receiving up to 600 bytes at 0x2000fecd..
[00:00:41.052,703] <dbg> download_client: download_thread: Read 4 bytes from socket
[00:00:41.052,734] <err> download_client: Failed to get current from CoAP packet, err -2
[00:00:41.053,009] <err> fota_download: Download client error
[00:00:41.053,009] <dbg> fota_download: download_client_callback: No DFU target was initialized
[00:00:41.053,039] <err> fw_upd: FOTA download failed: 1

We see that after a first timeout four bytes are downloaded, this is the empty ACK. After this the client correctly exits with an error. This is the log of the second attempt:

[00:05:36.817,871] <dbg> download_client: client_connect: family: 1, type: 2, proto: 17
[00:05:36.818,206] <inf> download_client: Connecting to coap://server_here:6666
[00:05:36.818,237] <dbg> download_client: client_connect: fd 1, addrlen 8, fam IPv4, port 6666
[00:05:36.819,274] <dbg> download_client: coap_request_send: CoAP next block: 0
[00:05:36.819,793] <inf> download_client: Downloading: fw/timeout [0]
[00:05:36.820,312] <dbg> download_client: download_thread: Receiving up to 600 bytes at 0x2000fddd...
[00:05:36.820,343] <dbg> download_client: request_resend: Socket timeout, resending
[00:05:36.821,105] <dbg> download_client: coap_request_send: CoAP next block: 0
[00:05:36.821,655] <dbg> download_client: download_thread: Receiving up to 600 bytes at 0x2000fe5d...
[00:05:36.821,655] <dbg> download_client: request_resend: Socket timeout, resending
[00:05:36.822,418] <dbg> download_client: coap_request_send: CoAP next block: 0
[00:05:36.822,967] <dbg> download_client: download_thread: Receiving up to 600 bytes at 0x2000fedd...
[00:05:36.822,998] <dbg> download_client: request_resend: Socket timeout, resending
[00:05:36.823,730] <dbg> download_client: coap_request_send: CoAP next block: 0
[00:05:36.824,249] <dbg> download_client: download_thread: Receiving up to 600 bytes at 0x2000ff5d...
[00:05:36.824,279] <dbg> download_client: request_resend: Socket timeout, resending
[00:05:36.824,279] <err> download_client: CoAP max-retransmissions exceeded
[00:05:36.824,279] <wrn> fota_download: Download socket error. 2 retries left...

In this log we see that the download fails in another way. I verified using Wireshark that the server still responds in the same way. The 'Socket timeout' message comes immediately after the 'Receiving up to..' message. After the 'Download socket error. 2 retries left...' the FOTA/download client is stuck, a new download attempt can not be started.

Tracking down the bug

Before receiving trough recv, coap_get_recv_timeout is called in the function socket_recv in file download_client.c:435. In the faulty case this function returns zero, which causes the instant timeouts. This is caused by the coap.pending struct not being resetted after receiving the first EBADMSG. In the file coap.c:165 coap_pending_clear is called which resets the pending struct, but this is not called when the packet is invalid, as it fails in one of the check above coap_pending_clear. To fix this the clear function needs to be moved.

I'm not sure (yet) what causes the 'Download socket error. 2 retries left...' message, and why it becomes stuck after this, but I think it may be related to the bug above.

Related