SIM card issues (CEREG 90), possibly due to freezing weather

Hello,

We have a set of ~10 units in the field (outside) in Belgium, which communicate via NB-IoT, and do an MQTT sync procedure every 4 hours.

During the freezing period around December 15th, 2 units went offline for about a week (the temperature was around -5 °C). The network operator (Orange) did not see anything specific occurring. The units reset themselves periodically (this is part of the firmware). They did not "disappear" on the same day.

One of the two units resumed some communication on December 19th, but not at the correct intervals (so it definitely still had some trouble). I went on site and noticed LTE connect errors pointing to "+CEREG: 90", which indicates an issue with the UICC (SIM card). Several manual resets (triggered via the reset line, not by unpowering the system) did not solve the issue. Eventually, I took the SIM card out of the slot for a second and put it back in, and everything started working fine.

The second unit never managed to communicate autonomously. During an on-site intervention, its SIM card slot was also opened and closed again, and the issue was fixed after a reset.

My first assumption is that the cold might have caused some issues: our PCB is in a casing but not yet fully coated / protected. However, it's very strange that the systems couldn't fully recover. So my questions:

  • We never triggered a full power-cycle of the system, since it's battery based and includes supercapacitors so it's not that easy. Could this have made a difference?
  • Is it possible that the SIM card or another part of the electronics gets stuck after e.g. a load of invalid commands/responses, and that a power cycle is required to fix the issue? (I still have to verify whether the 1.8 V supply actually drops during a reset.)
  • I'm using Modem FW 1.3.0 and SDK 1.7.1, which are certainly not the latest ones. Would it make sense to upgrade these?
  • Is it possible to reset the SIM card via a firmware command? Or would that be done implicitly by e.g. lte_lc_offline and "AT+CFUN=0"?
  • Since it's clearly not easy to reproduce this issue, have there been any similar issues or can you think of a way to reproduce the issue?

Again, the cold might be related to the issue: these units had been running for about 6 months without this specific problem, so it would be quite a coincidence for 2 units to develop the same issue within a window of only 1 or 2 days, and to stay affected for at least 1 week.

Best regards,

Sebastiaan

  • Hi,

    Thank you for providing such detailed information. I just wanted to make sure that the battery was able to provide sufficient peak outputs, even at lower temps.

    Based on the battery description and specification, it should have no problem driving the nrf9160, even at lower temperatures.

    Sebastiaan Merckx said:
    Both devices had exactly the same problem (CEREG:90), the second one was even fixed by just opening and closing the sim socket (not explicitly taking out the card) followed by a reset.

    Thank you for confirming.

    Does your firmware handle the "CEREG: 90" notification? What it could do is to catch this scenario, and try again x minutes later. This is however not a full fix (which would be to check or replace the SIM holder), but a way to detect the scenario.
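
    As an illustration of the detection part only, here is a minimal sketch in plain C. It assumes the application gets to see the raw "+CEREG" lines from the modem; the function names (cereg_parse_stat, cereg_is_uicc_failure) are hypothetical and not part of any SDK. It accepts both the notification form "+CEREG: <stat>[,...]" and the read-response form "+CEREG: <n>,<stat>[,...]", and flags status 90 so the caller can schedule a retry some minutes later.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Status value seen in this thread: not registered due to UICC (SIM) failure. */
#define CEREG_STAT_UICC_FAILURE 90

/*
 * Extract <stat> from a "+CEREG: ..." line (hypothetical helper).
 * Notification:  +CEREG: <stat>[,"<tac>","<ci>",<AcT>...]
 * Read response: +CEREG: <n>,<stat>[,...]
 * Returns -1 if the line does not contain a parsable +CEREG status.
 */
int cereg_parse_stat(const char *line)
{
    static const char prefix[] = "+CEREG:";
    const char *p;
    char *end;
    long first;

    if (line == NULL) {
        return -1;
    }
    p = strstr(line, prefix);
    if (p == NULL) {
        return -1;
    }
    p += sizeof(prefix) - 1;

    first = strtol(p, &end, 10);   /* strtol skips the leading space */
    if (end == p) {
        return -1;                 /* no number after the prefix */
    }

    /* A bare number in the second field means the first field was <n>
     * (read response). In a notification the second field is a quoted
     * string or absent, so the first number is already <stat>. */
    if (*end == ',') {
        const char *q = end + 1;

        while (*q == ' ') {
            q++;
        }
        if (isdigit((unsigned char)*q)) {
            return (int)strtol(q, NULL, 10);
        }
    }
    return (int)first;
}

/* True when the line reports the UICC failure status (90). */
bool cereg_is_uicc_failure(const char *line)
{
    return cereg_parse_stat(line) == CEREG_STAT_UICC_FAILURE;
}
```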

     

    Kind regards,

    Håkon

  • Hello Håkon,

    I don't handle the "CEREG: 90" notification specifically, but I have more generic recovery scenarios (e.g. the first connection attempt can take 1 minute, then 4 minutes, then 7 minutes, with a "lte_lc_offline" and a "lte_lc_deinit" in between, and a cold system reset if there is still no connection after the 3rd attempt).
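
    For readers who want to picture that schedule, here is a rough sketch of it as a pure decision function in C. It is not the actual firmware: the names (recovery_next_step, struct recovery_step, the enum values) are hypothetical, and only the 1 / 4 / 7 minute timeouts and the "reset after the 3rd attempt" rule come from the description above. The caller would still have to perform the real lte_lc_offline / lte_lc_deinit calls and the cold reset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Connection timeouts in minutes for attempts 1..3, as quoted above. */
static const unsigned int attempt_timeout_min[] = { 1, 4, 7 };

#define RECOVERY_MAX_ATTEMPTS \
    (sizeof(attempt_timeout_min) / sizeof(attempt_timeout_min[0]))

enum recovery_action {
    RECOVERY_CONNECT,    /* start a connection attempt */
    RECOVERY_COLD_RESET  /* all attempts failed: reset the whole system */
};

struct recovery_step {
    enum recovery_action action;
    unsigned int timeout_min;  /* only meaningful for RECOVERY_CONNECT */
    bool modem_reinit_first;   /* go offline and deinit before this attempt */
};

/*
 * Decide what to do next, given how many connection attempts have already
 * failed since the last cold reset (hypothetical helper).
 */
struct recovery_step recovery_next_step(unsigned int failed_attempts)
{
    struct recovery_step step = { RECOVERY_COLD_RESET, 0, false };

    if (failed_attempts < RECOVERY_MAX_ATTEMPTS) {
        step.action = RECOVERY_CONNECT;
        step.timeout_min = attempt_timeout_min[failed_attempts];
        /* The modem is only re-initialised between attempts,
         * not before the very first one. */
        step.modem_reinit_first = (failed_attempts > 0);
    }
    return step;
}
```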

    So I'm pretty sure that these devices have tried hundreds of times. You are right that it could help to implement a specific detection mechanism to get a better view of how often this occurs in the field (on other devices). I'll put that on low prio though.

    I also have a device running for almost 24 hours in the freezer now (yeah, seriously) and syncing every 5 minutes, which works fine so far. It's not one of the 2 faulty units though, so that can be another test we could plan.

    For now, I don't think we can make a lot of progress without reproductions, and I think it's best to close the ticket (or put it on hold) with the assumption that a mechanical issue causes the problem.

  • Hi,

    Sebastiaan Merckx said:

    I don't handle the "CEREG: 90" notification specifically, but I have more generic recovery scenarios (e.g. the first connection attempt can take 1 minute, then 4 minutes, then 7 minutes, with a "lte_lc_offline" and a "lte_lc_deinit" in between, and a cold system reset if there is still no connection after the 3rd attempt).

    So I'm pretty sure that these devices have tried hundreds of times. You are right that it could help to implement a specific detection mechanism to get a better view of how often this occurs in the field (on other devices). I'll put that on low prio though.

    Ok, that is good to know! What's important is that you are able to recover from the scenario, and it sounds like you have good control over that part!

    Sebastiaan Merckx said:

    I also have a device running for almost 24 hours in the freezer now (yeah, seriously) and syncing every 5 minutes, which works fine so far. It's not one of the 2 faulty units though, so that can be another test we could plan.

    For now, I don't think we can make a lot of progress without reproductions, and I think it's best to close the ticket (or put it on hold) with the assumption that a mechanical issue causes the problem.

    It does sound like you're on the right track for testing.

    Feel free to update this case at any point, if you have any questions or any new behavior/info pops up.

    Happy holidays!

    Cheers,

    Håkon

  • Just for completeness for anyone who might be interested:

    • We have not been able to reproduce the issue in the freezer.
    • Someone suggested that the issue might have been triggered (indirectly?) by a reset of the cell tower. The mobile operator suggested performing a "cancel location" on the SIM card. We tried this, and it was handled correctly by both the modem firmware and the higher-layer firmware, so that is not the trigger either (it would have been very surprising anyway, since the modem and SIM card are unpowered during a reset).