Asset Tracker V2 failed_data buffer not filling

I'm testing my own firmware based on the current Asset tracker V2 code (AWS cloud).

The code generally runs well, but when reception drops (simulated by disconnecting the antenna), the firmware does not detect that a packet has not reached AWS, which results in data loss.

The readme states:

The application has LTE and cloud connection awareness. Upon a disconnect from the cloud service, the application keeps the sensor data that has been buffered and empty the buffers in batch messages when the application reconnects to the cloud service.

However, I can't find where the firmware detects that the LTE link has failed or dropped.
After the antenna has been removed, RSRP drops to zero, the ring buffers fill with data, and cloud_module sends the data. data_module then ACKs the pending data. The data never gets moved to the failed_data list.

I feel like I'm missing something, but I'm not sure what it is. All data seems to be acked, no matter the connection status.

- Is this related to the MQTT QoS configuration?

- After a while, aws_iot logs "aws_iot.mqtt_evt_handler: MQTT_EVT_DISCONNECT: result = -116", which then triggers "modules_common: cloud: Message could not be enqueued, error code: -35" and a reboot. This of course also results in data loss. Why doesn't the modem detect that a packet was never acked? What are the timeouts, and where can I find them?

Thanks!

Jelmer

  • Hi,

    The application detects whether the device is connected to or disconnected from LTE/MQTT via events that are received in the various modules. The cloud_module and data_module should block or unblock data transmission depending on whether the device is connected to MQTT.

    The modem_module is responsible for propagating the events MODEM_EVT_LTE_CONNECTED and MODEM_EVT_LTE_DISCONNECTED, which are used in the cloud_module to govern cloud connection/re-connection and data transmission.

    In turn, the data_module listens to events from the cloud_module, more specifically CLOUD_EVT_CLOUD_CONNECTED and CLOUD_EVT_CLOUD_DISCONNECTED, and decides based on those events whether data should be encoded and sent to the cloud_module.

    It seems that in the described scenario the application is not able to detect that it is disconnected from LTE (no `MODEM_EVT_LTE_DISCONNECTED` event) and keeps on sending data, essentially filling the message queue used to schedule MQTT data transmission in the `cloud_module` thread. I suspect that the cloud_module thread is blocked in data_send, which makes the thread unable to process further message queue items. This eventually leads to the message queue filling up, which triggers a reboot.

    There is ongoing work to improve this by properly handling socket timeouts in case a send blocks for too long. This way the application can recover more gracefully.

    It would be most helpful if you could provide debug logs (alternatively modem trace logs) that could help us identify the exact issue at hand.
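
The event-driven gating described in this reply can be sketched in plain C as follows. This is a minimal illustration, not the Asset Tracker v2 source: `struct data_state`, `on_cloud_event`, `submit_sample` and `BUF_CAPACITY` are hypothetical names, the "send" is reduced to a counter, and a full buffer simply drops the new sample. Only the two event names are taken from the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the cloud_module events named in the thread. */
enum cloud_evt {
    CLOUD_EVT_CLOUD_CONNECTED,
    CLOUD_EVT_CLOUD_DISCONNECTED,
};

#define BUF_CAPACITY 8 /* hypothetical buffer size */

/* Hypothetical state kept by a data module. */
struct data_state {
    bool cloud_connected;
    int buffered[BUF_CAPACITY];
    size_t buffered_count;
    size_t sent_count; /* stands in for data handed to the cloud module */
};

/* Update the connection flag; on connect, flush buffered samples in one batch. */
void on_cloud_event(struct data_state *s, enum cloud_evt evt)
{
    s->cloud_connected = (evt == CLOUD_EVT_CLOUD_CONNECTED);

    if (s->cloud_connected) {
        s->sent_count += s->buffered_count;
        s->buffered_count = 0;
    }
}

/* Returns true if the sample was sent, false if it was buffered or dropped. */
bool submit_sample(struct data_state *s, int sample)
{
    if (s->cloud_connected) {
        s->sent_count++;
        return true;
    }

    if (s->buffered_count < BUF_CAPACITY) {
        s->buffered[s->buffered_count++] = sample;
    }
    /* In this sketch a full buffer silently drops the new sample. */
    return false;
}
```

The sketch also shows why the reported behaviour loses data: the gating depends entirely on the disconnect event arriving. If no `CLOUD_EVT_CLOUD_DISCONNECTED` is ever delivered, `cloud_connected` stays true and every sample is treated as sent.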

  • Thanks for your answer and clarification!

    After going through the events all the way back to the AT commands, it seems the culprit is this previously reported issue: https://devzone.nordicsemi.com/f/nordic-q-a/63267/getting-cereg-5-fffe-ffffffff-while-disconnected-from-the-network

    When I disconnect the antenna, RSRP drops and the modem reports:

    [00:01:01.965,759] <dbg> lte_lc.at_handler: +CEREG notification: +CEREG: 5,"FFFE","FFFFFFFF",7,,,"11100000","11100000"

    With a network registration status of 5 (registered, roaming), lte_lc never detects that there is an issue or a disconnect, so no message reaches the application to say that a disconnect occurred.

    The post I linked to mentions this would be resolved in firmware 1.3. Is this the case? We currently have devices with 1.2 in the field suffering from this issue.

  • Hi, modem firmware 1.3 has been released so this can be tested by flashing the latest version found on this page: https://www.nordicsemi.com/Products/Development-hardware/nRF9160-DK/Download.

    As for the devices you have in the field, the issue can be fixed with a FOTA update of either the modem firmware or the application, where the updated application includes functionality that interprets a cell ID reported as "FFFFFFFF" as a disconnect.
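
An application-side check of the kind suggested here could look roughly like the sketch below. It is an assumption-laden illustration, not the lte_lc implementation or the eventual SDK fix: `cereg_is_connected` and `INVALID_CELL_ID` are hypothetical names, and the parser only handles the notification layout shown in the log line earlier in the thread (status, 4-digit TAC, 8-digit cell ID, uppercase hex).

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Registration status values of +CEREG as defined in 3GPP TS 27.007. */
enum reg_status {
    REG_NOT_REGISTERED = 0,
    REG_HOME = 1,
    REG_SEARCHING = 2,
    REG_DENIED = 3,
    REG_UNKNOWN = 4,
    REG_ROAMING = 5,
};

/* Cell ID value reported in the thread while the antenna was disconnected. */
#define INVALID_CELL_ID "FFFFFFFF"

/*
 * Hypothetical helper: returns true only if the notification reports a
 * registered status AND a cell ID other than "FFFFFFFF".
 */
bool cereg_is_connected(const char *notif)
{
    int status = -1;
    char tac[5] = "";
    char cell_id[9] = "";

    /* Parse: +CEREG: <stat>,"<tac>","<ci>"... (trailing fields are ignored). */
    int n = sscanf(notif, "+CEREG: %d,\"%4[0-9A-F]\",\"%8[0-9A-F]\"",
                   &status, tac, cell_id);

    if (n < 1) {
        return false; /* not a +CEREG notification */
    }

    if (status != REG_HOME && status != REG_ROAMING) {
        return false; /* modem itself reports "not registered" */
    }

    if (n < 3) {
        return true; /* no location info present; trust the status */
    }

    /* Registered status with an invalid cell ID is treated as a disconnect. */
    return strcmp(cell_id, INVALID_CELL_ID) != 0;
}
```

With this check, the notification from the log above (`+CEREG: 5,"FFFE","FFFFFFFF",...`) is treated as disconnected even though the status field says 5, while a notification carrying a real cell ID is still treated as connected.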

  • Our units in the field are Revision 1 that are currently running modem firmware 1.2.3. So I'll test the issue with 1.3, but even if that fixes the problem, upgrading our rev1 chips to 1.3 isn't officially supported, so it wouldn't be a fix.

    Are there plans to update lte_lc.c and the link control library to interpret the cell ID instead of the network registration status? I doubt we're the only ones in this position with this issue, so an official bugfix in the SDK might be a good solution.

  • There are currently no plans to include such functionality in the link controller. But providing a solution that addresses this issue for all modem builds is appealing. I will discuss this with my co-workers and get back to you. Please note that it is still the vacation period in Norway, so it might take some time before I respond.

  • Thanks for your help.

    Some more arguments from our side to make the SDK compatible with both 1.2.3 and 1.3:

    - Currently MFW 1.2.3 wrongly reports the online status of the modem in the AT command. This has significant consequences for any user application running on this modem firmware. It seems the bug has been there for at least a year. We'd consider this a high-priority bug.
    - Luckily the bug can easily be fixed with a couple lines of code in the SDK, without breaking compatibility with the future modem firmware versions.
    - Upgrading the modem firmware to 1.3 might be a solution, but it is not supported by Nordic for the hardware in question, and Nordic has said it will keep supporting this hardware, since AFAIK there are already large deployments of these chips (rev 1) in the field.

    So we don't see why you wouldn't fix this bug with a patch in the SDK. The alternative would be to release MFW 1.2.4, which could fix the issue but would be extremely costly on both the Nordic side and the customer side (testing and certifying a whole new modem firmware, plus pushing large modem firmware upgrades to all the devices currently deployed, versus just pushing a smaller new version of the user app).

  • Hi! There is currently a fix in a PR for this issue. It would be great if you could test and review https://github.com/nrfconnect/sdk-nrf/pull/5353 to see if it resolves the issue you are seeing.

  • Took me a while to verify this, but that PR seems to fix my issue.

    Thanks to the team for implementing the fix!
