nRFCloud CoAP messages not received

Hello DevZone


I am currently facing an issue with devices already installed at my customer's site.
Devices are based on nRF9160, firmware based on SDK 2.8.0 and modem 1.3.6.
Once the devices are installed, they do not move.

Device works like this:
Sleep 15 min
Wakeup
Take a measurement
If modem was switched off (no PSM):
    -> Reconnect to nRFCloud: 

nrf_cloud_coap_disconnect() ;	// From any previous session, just to be sure
nrf_cloud_coap_init() ;
err = nrf_cloud_coap_connect(FIRMWARE_VERSION) ;

Send sensor measurement to nRFCloud via CoAP
If PSM is available:
    -> Go IDLE / PSM
If not:
    -> Switch off modem to disconnect from network


I have ~4000 units deployed with similar HW and FW, and everything is working as expected.
However, I have two devices facing the same problem: the messages sent by the devices are only partially received on nRFCloud. 
(There might be more affected devices that have not been detected yet.)


The two reported cases share one common thing : the MNO doesn't offer PSM with timings acceptable for the device operation.
Therefore, the device disconnects every time from the network after sending.


One device is located in Martinique (French Antilles) and has access to two LTE-M networks: Orange and SFR. Both have good coverage (ConEval and RSRP are good), and neither provides PSM.
If the device connects to SFR, everything works fine. However, if it connects to Orange, many messages are lost. It also seems that the losses increase over time.
I initially thought that it was a "bad network", a "bad antenna", or some restriction between my MVNO and the local provider.
Then the second case occurred, in Italy.
The device connects to Vodafone in LTE-M, no PSM.
Looking at the device log, all messages are sent properly.
On the SIM card side, the amount of data consumed and the connect/disconnect also show a normal behavior.
However, not all messages are received on nRFCloud.
This time, I have several other devices connected to the same network (but not the same cell), so I am quite confident that there is no MVNO/local operator issue.



CoAP messages are sent using nrf_cloud_coap_json_message_send(msg, false, false), which always returns 0. The payload is small: msg is a string of approximately 20 characters.
I understand that using the confirmable feature would be more secure. This has not been done so far, as the firmware was previously based on SDK 2.5.0 and 2.6.0, where this feature was not available.

What could be causing the issue?
I understand that without confirmable messages, some messages might be lost. However, we are talking about more than 50% of messages lost, even in good network conditions.

In the attached image, I am expecting a continuous flow of 4 messages / hour (and a bit more during the night).


My first idea is that the modem gets disconnected before the message is fully sent. However, this is not consistent with the data consumed by the SIM card (at least, it's not obvious).

Is there any explanation for this? Are there reported "bad network cells" that could explain this behavior?

Thanks for your help.

  • Hi,

    Thanks for the clear description and the plot. Good RSRP and connection evaluation values alongside missing messages in nRF Cloud usually mean we should look beyond bad signal alone. You mentioned that you are using non-confirmable CoAP, so there is no ACK from the server and the stack has no way to know whether the message was received. A return value of 0 from nrf_cloud_coap_json_message_send only means the UDP packet left the device stack, not that nRF Cloud received or stored it. Could you switch to confirmable CoAP (confirmable=true) on a test device and check the return values?
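
A minimal sketch of that test in C. On the device, the sender would be nrf_cloud_coap_json_message_send with its third argument set to true; here it is passed in as a function pointer so the logic can be built and checked off-target. The names send_with_retry, MAX_ATTEMPTS and fake_send are hypothetical and not part of the nRF Cloud library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Same argument shape as nrf_cloud_coap_json_message_send(message, topic, confirmable). */
typedef int (*json_send_fn)(const char *message, bool topic, bool confirmable);

#define MAX_ATTEMPTS 3 /* hypothetical application-level limit */

/* Send one confirmable message, retrying on error.
 * Returns 0 on success, otherwise the last error code from the sender.
 * *attempts_used receives the number of send calls that were made. */
static int send_with_retry(json_send_fn send, const char *msg, int *attempts_used)
{
	int err = -1;

	assert(send != NULL && msg != NULL && attempts_used != NULL);

	*attempts_used = 0;
	while (*attempts_used < MAX_ATTEMPTS) {
		(*attempts_used)++;
		err = send(msg, false, true); /* confirmable = true */
		if (err == 0) {
			return 0;
		}
		printf("send attempt %d failed: %d\n", *attempts_used, err);
	}
	return err;
}

/* Off-target stand-in for the real sender (assumption for illustration):
 * fails while fake_failures_left is positive, then succeeds. */
static int fake_failures_left;

static int fake_send(const char *message, bool topic, bool confirmable)
{
	(void)message;
	(void)topic;
	(void)confirmable;
	if (fake_failures_left > 0) {
		fake_failures_left--;
		return -1;
	}
	return 0;
}
```

With confirmable messages the return value should reflect whether an ACK arrived, so logging it per attempt shows directly how often the affected network drops a message. Application-level retries add modem on-time, so the limit is worth keeping small on battery-powered units.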

    Also, since the modem is powered off after each cycle, a full DTLS handshake and reauthentication with nRF Cloud is usually required on every reconnect. Could you also log the return value of nrf_cloud_coap_connect() each cycle to confirm it always succeeds before the send? 
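
One possible way to keep that log compact on a deployed unit is a small set of per-cycle counters, sketched below in C. The struct and function names (cycle_stats, cycle_record, cycle_format) are hypothetical; connect_err would be the return value of nrf_cloud_coap_connect() and send_err the return value of the send call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Running totals over all wake-up cycles (hypothetical helper type). */
struct cycle_stats {
	uint32_t cycles;       /* cycles recorded so far */
	uint32_t connect_fail; /* cycles where the connect call failed */
	uint32_t send_fail;    /* cycles where connect worked but send failed */
	int last_connect_err;
	int last_send_err;
};

/* Record the outcome of one cycle. If the connect failed, the send is
 * assumed to have been skipped and send_err is not counted. */
static void cycle_record(struct cycle_stats *s, int connect_err, int send_err)
{
	assert(s != NULL);

	s->cycles++;
	s->last_connect_err = connect_err;
	s->last_send_err = send_err;
	if (connect_err != 0) {
		s->connect_fail++;
	} else if (send_err != 0) {
		s->send_fail++;
	}
}

/* Format one log line; returns the snprintf() result. */
static int cycle_format(const struct cycle_stats *s, char *buf, size_t len)
{
	return snprintf(buf, len,
			"cycle=%u connect_err=%d send_err=%d connect_fail=%u send_fail=%u",
			(unsigned int)s->cycles, s->last_connect_err, s->last_send_err,
			(unsigned int)s->connect_fail, (unsigned int)s->send_fail);
}
```

Comparing the cycle count on the device with the number of messages stored in nRF Cloud over the same period shows whether the loss happens before or after the device reports success.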

    Can you provide a modem trace from an affected device during a failed send cycle? This will help us see whether the DTLS handshake, PDN attach, or UDP transmission is failing at the network level. Thanks

    Best Regards,
    Syed Maysum

  • Thanks for your response.

    Trying confirmable=true will of course be my next test.
    However, I'll have to check the impact on power consumption: I assume that the modem will be on for a longer period of time, to be able to receive the acknowledgement.

    The full handshake is performed at every reconnection, and the result of nrf_cloud_coap_connect() is always 0.

    I'll see what I can do for the modem trace (I am not familiar with this), since the device is currently being used by the customer.

  • Hi,

    Thanks for the update, and that power concern is completely valid. Using confirmable=true is a good next step. It can increase modem on time slightly (waiting for CoAP ACK / retries), so checking power impact on one test device is the right approach.
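
For a rough first estimate before measuring, the extra charge per day is the extra on-time per cycle times the average current during that time times the number of cycles per day. A small C sketch of that arithmetic follows; the function name extra_mah_per_day is hypothetical, and the on-time and current values have to come from a measurement on the test device.

```c
#include <assert.h>

/* Extra battery charge per day, in mAh, caused by keeping the modem on
 * longer in each cycle. Inputs are measured or assumed by the caller. */
static double extra_mah_per_day(double extra_on_time_s, double avg_current_ma,
				unsigned int cycles_per_day)
{
	assert(extra_on_time_s >= 0.0 && avg_current_ma >= 0.0);

	/* mA * s gives mAs; dividing by 3600 converts to mAh. */
	return extra_on_time_s * avg_current_ma * cycles_per_day / 3600.0;
}
```

With the 15-minute cycle described in the original post, cycles_per_day would be 96.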

    A modem trace captures low-level modem/network communication, which helps us see where the failure happens. A practical way is to build a trace-enabled firmware using snippets and capture with Cellular Monitor or nrfutil trace lte. If the affected unit is deployed and physical/debug access is not possible, that is understandable. In that case, could you try to reproduce the issue in the lab, on the same unit or an identical one with the same operator, and capture a trace there?

    Optional longer-term path: if you already use Memfault (or plan to), it can be used for remote observability and modem trace related workflows in production.

    Best Regards,
    Syed Maysum
