nRF Cloud CoAP shadow GET fails for 7+ hours — DTLS gateway healthy, shadow service unresponsive

nRF Cloud — CoAP transport (device shadow)

Date of incident: 2026-06-22, ~13:04 UTC; still unresolved at 20:42 UTC (7h38m+)
Device: nRF9151, modem firmware mfw_nrf91x1_2.0.4, NCS v3.2.3
SIM: Soracom roaming on One NZ, PLMN 53001, Band 28, LTE-M

Summary

Our device experienced a sustained 7+ hour period where every CoAP shadow GET to coap.nrfcloud.com timed out with no response. The DTLS Connection ID session resumed correctly in ~12ms every cycle, confirming the DTLS gateway was alive. The failure seems to be at the CoAP application layer — the shadow service (or internal routing to it) was not responding.

Observed behaviour (235 consecutive cycles over 7h):

→ DTLS CID resume:               12ms  ✓ (gateway healthy)
→ CoAP GET /state/delta:         sent
→ 1st retransmit:                +3s   (no ACK from server)
→ 2nd retransmit:                +6s
→ 3rd retransmit:                +12s
→ 4th retransmit:                +21s
→ Timeout, no more retries:      +47s
→ nrf_cloud_coap error:          -116 (ETIMEDOUT in Zephyr's errno numbering)
→ Disconnect → wait 30s → reconnect → repeat
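
The retransmit pattern above looks like CoAP's exponential backoff. As a rough cross-check, here is a minimal sketch of that arithmetic using the RFC 7252 default transmission parameters (ACK_TIMEOUT 2 s, ACK_RANDOM_FACTOR 1.5, MAX_RETRANSMIT 4). The function names are hypothetical, and the values actually configured in the device's CoAP stack are an assumption here and may differ:

```python
# RFC 7252 section 4.8 defaults (assumed; the firmware's CoAP stack may
# be configured differently).
ACK_TIMEOUT_S = 2.0
ACK_RANDOM_FACTOR = 1.5
MAX_RETRANSMIT = 4


def retransmit_waits(initial_timeout_s, max_retransmit=MAX_RETRANSMIT):
    """Waits between transmissions for one confirmable message.

    The first `max_retransmit` entries are the waits before each
    retransmit; the last entry is the wait after the final retransmit
    before the sender gives up. Each wait doubles the previous one.
    """
    return [initial_timeout_s * (2 ** i) for i in range(max_retransmit + 1)]


def cumulative(waits):
    """Running totals: seconds since the first transmission."""
    totals = []
    elapsed = 0.0
    for wait in waits:
        elapsed += wait
        totals.append(elapsed)
    return totals


# The initial timeout is picked at random between these two bounds.
shortest = retransmit_waits(ACK_TIMEOUT_S)
longest = retransmit_waits(ACK_TIMEOUT_S * ACK_RANDOM_FACTOR)

print("shortest waits:", shortest, "give up after", cumulative(shortest)[-1], "s")
print("longest waits: ", longest, "give up after", cumulative(longest)[-1], "s")
```

With the largest initial timeout those defaults allow (3 s), the waits are 3, 6, 12, 24 and 48 s, 93 s in total. The +3s, +6s and +12s steps listed above are close to that pattern if they are read as gaps between transmissions rather than offsets from the first send.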

Log excerpt (one representative cycle):

[08:31:28] cloud_connection: Connected to nRF Cloud
[08:31:31] net_coap: Timeout, retrying send
[08:31:36] net_coap: Timeout, retrying send
[08:31:46] net_coap: Timeout, retrying send
[08:32:04] net_coap: Timeout, retrying send
[08:32:47] net_coap: Timeout, no more retries left
[08:32:47] nrf_cloud_coap: Shadow response processing error: -116
[08:32:47] shadow_support_coap: Failed to request shadow delta: -116
[08:32:47] cloud_connection: Communication error detected.
[08:32:47] cloud_connection: Disconnected from nRF Cloud
[08:32:47] cloud_connection: Retrying in 30 seconds...
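
To make the per-cycle timing explicit, the gaps between the log timestamps can be computed with a short script. This is only a sketch: the helper names are hypothetical, it assumes the `[HH:MM:SS]` prefix shown in the excerpt, and it does not handle a cycle that crosses midnight.

```python
from datetime import datetime

# Connect through final timeout, copied from the excerpt above.
LOG = """\
[08:31:28] cloud_connection: Connected to nRF Cloud
[08:31:31] net_coap: Timeout, retrying send
[08:31:36] net_coap: Timeout, retrying send
[08:31:46] net_coap: Timeout, retrying send
[08:32:04] net_coap: Timeout, retrying send
[08:32:47] net_coap: Timeout, no more retries left
"""


def parse_times(log_text):
    """Extract the [HH:MM:SS] prefix of each log line as a datetime."""
    times = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # skip anything without a timestamp prefix
        stamp = line[1:line.index("]")]
        times.append(datetime.strptime(stamp, "%H:%M:%S"))
    return times


def gaps_seconds(times):
    """Seconds between consecutive timestamps (same-day only)."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = gaps_seconds(parse_times(LOG))
print("gaps between events (s):", gaps)
print("connect to final timeout (s):", sum(gaps))
```

For this cycle the gaps are 3, 5, 10, 18 and 43 s, so 79 s pass between connecting and the final timeout. Adding the 30 s retry delay gives roughly 109 s per cycle before reconnect time, which is in line with 235 cycles over 7h38m (about 117 s each).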

What this rules out:

  • Device firmware bug: 0 crashes, 0 reboots across 235 cycles; CONFIG_NRF_CLOUD_COAP_DISCONNECT_ON_FAILED_REQUEST=y working as intended (disconnect and retry after each failed request)
  • Network/SIM issue: DTLS CID resume at 12ms every cycle; RSRP -87dBm, CE level 0
  • DTLS gateway issue: gateway is responding to CID resumes; failure is at the CoAP application layer

Impact:

  • 235 GNSS location fixes discarded (CoAP POST fails while shadow GET fails)
  • No cloud shadow updates for entire duration

Configuration:

CONFIG_NRF_CLOUD_COAP=y
CONFIG_NRF_CLOUD_COAP_DTLS_CID=y
CONFIG_NRF_CLOUD_COAP_DISCONNECT_ON_FAILED_REQUEST=y
CONFIG_LTE_RAI_REQ=y

Question

  1. Is the shadow GET the first request that requires a backend roundtrip after DTLS resume, or should a healthy backend always respond within the default CoAP timeout?