nRF Cloud coap hangs after a few hours using nRF9160

Hi

I have been looking into this for a while and might just be missing something silly but I would appreciate any help. Our application uses a modified nrf cloud multi service sample and sends regular readings from a few sensors but we noticed after a few hours of operation no more readings are sent. We are using an nRF9160 DK.

I have managed to replicate this a few times so I rebuilt the latest sample from v2.6.0 using VS Code and the following settings modifications:

CONFIG_LOCATION_TRACKING=n

# Increased logging
CONFIG_COAP_LOG_LEVEL_DBG=y
CONFIG_LOCATION_LOG_LEVEL_DBG=y
CONFIG_NRF_CLOUD_COAP_LOG_LEVEL_DBG=y
CONFIG_MULTI_SERVICE_LOG_LEVEL_DBG=y

CONFIG_NRF_MODEM_LIB_TRACE=y

# Higher sample time reduces time till fail
CONFIG_SENSOR_SAMPLE_INTERVAL_SECONDS=30

It was able to send the fake temperature readings for about 30mins before it failed. On most messages I would see a line similar to below indicating that the message 

[00:24:11.460,845] <dbg> message_queue: enqueue_device_message: Messages queue status: 0/50
[00:24:11.460,906] <dbg> message_queue: enqueue_device_message: Adding device message to queue
[00:24:11.460,937] <dbg> message_queue: consume_device_message: Waiting for valid connection before transmitting device message
[00:24:11.460,968] <dbg> message_queue: consume_device_message: Attempting to transmit enqueued device message
[00:24:11.460,998] <dbg> message_queue: consume_device_message: Messages remaining in queue: 0
[00:24:11.461,212] <dbg> nrf_cloud_coap_transport: client_transfer: NON POST msg/d2c Content-Format:CBOR, 28 bytes out, Accept:none
[00:24:11.470,886] <dbg> nrf_cloud_coap_transport: client_transfer: Sent
bf 01 64 54 45 4d 50 02  fb 40 37 33 33 33 33 33 |..dTEMP. .@733333
33 03 1b 00 00 01 8e e7  a9 ea 06 ff             |3....... ....
+CSCON: 1
[00:24:12.794,952] <dbg> net_coap: recv_response: Received 12 bytes
[00:24:12.795,013] <dbg> nrf_cloud_coap_transport: client_callback: result_code:2.01, offset:0x0, len:0x0, last_block:1
[00:24:12.795,013] <dbg> nrf_cloud_coap_transport: client_callback: End of client transfer
[00:24:12.795,135] <dbg> message_queue: consume_device_message: Enqueued device message consumed successfully

But when it hangs we normally don't get any messages after the send for example:

[01:53:07.660,156] <dbg> message_queue: enqueue_device_message: Adding device message to queue
[01:53:07.668,884] <dbg> nrf_cloud_coap_transport: client_transfer: Sent
                                                   bf 01 64 54 45 4d 50 02  fb 40 35 c2 8f 5c 28 f5 |..dTEMP. .@5..\(.
                                                   c3 03 1b 00 00 01 8e e7  73 44 29 ff             |........ sD).    
+CSCON: 1

Below you can see the connection starts but no application data is ever sent and the application never calls the log line in recv_response about receiving bytes.

I have run AT commands on a different unit and it suggests the modem is still working but I would expect the coap client to timeout as opposed to just hang. It have tried waiting but processing the message queue can hang for hours. It is resolved with a board reset but what is most annoying is I cannot see any error logged. Reading through the code in nrf_cloud_coap_transport.c the coap_client_req function seems to have a max timeout of 5 seconds which should cause the message to retry and log an error.

I have attached a modem trace which should contain some details but I am not really sure how to debug this further. The only similar topic I can find on dev zone is quite old and I do not believe to be related: https://devzone.nordicsemi.com/f/nordic-q-a/64392/coap-client-sample---send-command-freezes-after-a-few-iterations

2703.trace-2024-04-16T15-43-06.535Z.mtrace

As this was occurring on a fresh clone of the sample I am not sure it's an issue with our application code but would be happy to try anything you would suggest.

Thanks,

Sam

  • Hi Sam, just letting you know we have started looking into this.

    Best regards,
    Håkon

  • Hi Sam, this is a known issue in NCS 2.6.0. We are working on a fix.

    The problem occurs for NON transfers when the response from the server never comes. The Zephyr coap_client library's coap_client_req() function handles NON transfer responses differently than CON transfers (no timeouts, no retries).

    In that situation, in nrf_cloud_coap_transport.c: client_transfer(), this line waits forever:


    (void)k_sem_take(&cb_sem, K_FOREVER); /* Wait for coap_client to exhaust retries */

    As a workaround, change the line above to:

    #define NON_RESP_WAIT_S 3

    /* Wait for coap_client to exhaust retries when reliable transfer selected,
     * otherwise wait a finite time because response might never come.
     */
    err = k_sem_take(&cb_sem, reliable ? K_FOREVER : K_SECONDS(NON_RESP_WAIT_S));
    if (!err) {
            LOG_DBG("Got callback");
    } else {
            LOG_DBG("Got timeout: %d", err);
            err = 0; /* Ignore, since caller selected non-reliable transfer */
    }

  • Hi Pete, thanks for your detailed response. We did consider a similar fix but I was unsure if it would cause any other problems with the coap client. Thank you very much for your help.

Related