SIM card issues (CEREG: 90), possibly due to freezing weather

Hello,

We have about 10 units deployed outdoors in Belgium. They communicate over NB-IoT and run an MQTT sync procedure every 4 hours.

During the freezing period around December 15th, 2 units went offline for about a week (the temperature was around -5 °C). The network operator (Orange) did not see anything unusual. The units did reset themselves periodically, as designed in the firmware. They did not "disappear" on the same day.

One of the 2 units resumed some communication on December 19th, but not at the correct intervals (so it clearly still had problems). I went on site and saw LTE connect errors reporting "+CEREG: 90", which indicates a problem with the UICC (SIM card). Several manual resets (triggered via the reset line, not by removing power) did not solve the issue. Eventually I took the SIM card out of the slot for a second, put it back in, and everything started working fine.
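
For reference, the <stat> value can be picked out of an unsolicited +CEREG notification with a few lines of code. The sketch below is a host-side Python helper (for example, to scan captured modem logs). It assumes the notification format "+CEREG: <stat>[,...]" and that 90 is the nRF91-specific value for "not registered due to UICC failure"; the function names are hypothetical.

```python
import re
from typing import Optional

# <stat> value the nRF91 modem reports for "not registered due to UICC failure".
CEREG_STAT_UICC_FAILURE = 90

# Matches an unsolicited notification "+CEREG: <stat>[,...]".
# The response to the read command AT+CEREG? has the form
# "+CEREG: <n>,<stat>,..." and is deliberately NOT handled here.
_CEREG_NOTIF = re.compile(r"^\+CEREG:\s*(\d+)(?:,|$)")


def cereg_stat(line: str) -> Optional[int]:
    """Return <stat> from a +CEREG notification line, or None if it is not one."""
    m = _CEREG_NOTIF.match(line.strip())
    return int(m.group(1)) if m else None


def is_uicc_failure(line: str) -> bool:
    """True if the line is a +CEREG notification reporting a UICC failure."""
    return cereg_stat(line) == CEREG_STAT_UICC_FAILURE
```

Counting is_uicc_failure() matches over a captured log gives a first estimate of how often a device ends up in this state.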

The second unit never managed to communicate on its own. During an on-site intervention its SIM card slot was also opened and re-closed, and the issue was fixed after a reset.

My first assumption is that the cold caused the problem: our PCB is in a casing but not yet fully coated or protected. However, it is very strange that the systems could not recover on their own. So my questions:

  • We never triggered a full power cycle of the system, since it is battery-based and includes supercapacitors, which makes that difficult. Could this have made a difference?
  • Is it possible that the SIM card or another part of the electronics gets stuck after, e.g., a series of invalid commands/responses, so that a power cycle is required to recover? (I still have to verify whether the 1.8 V supply actually goes down on a reset.)
  • I'm using Modem FW 1.3.0 and SDK 1.7.1, which are certainly not the latest ones. Would it make sense to upgrade these?
  • Is it possible to reset the SIM card with a firmware command? Or does that happen implicitly on, e.g., lte_lc_offline and "AT+CFUN=0"?
  • Since this issue is clearly not easy to reproduce, have similar issues been reported, or can you think of a way to reproduce it?
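
Regarding the question about resetting the SIM from firmware: a minimal sketch of an explicit UICC power cycle. It assumes the AT+CFUN functional modes 40 (deactivate UICC) and 41 (activate UICC) from the nRF91 AT command reference are available in the modem firmware in use; that should be checked for Modem FW 1.3.0, as should whether the 1.8 V supply actually drops. send_at and sleep are hypothetical callbacks standing in for the firmware's AT interface and delay function; the sketch is in Python so the sequencing can be exercised without hardware.

```python
from typing import Callable, Optional

# ASSUMPTION: functional modes 40 (deactivate UICC) and 41 (activate UICC) as
# listed in the nRF91 AT command reference; verify them for the modem FW in use.
UICC_OFF = "AT+CFUN=40"
UICC_ON = "AT+CFUN=41"


def power_cycle_uicc(
    send_at: Callable[[str], str],
    sleep: Callable[[float], None],
    off_time_s: float = 1.0,
) -> Optional[str]:
    """Deactivate the UICC, wait, then reactivate it.

    send_at is a hypothetical callback that sends one AT command and returns
    the modem's final result code ("OK" or an error string).
    Returns None on success, otherwise the command not acknowledged with "OK".
    """
    if send_at(UICC_OFF).strip() != "OK":
        return UICC_OFF
    # Keep the UICC deactivated briefly, similar to pulling the card for a second.
    sleep(off_time_s)
    if send_at(UICC_ON).strip() != "OK":
        return UICC_ON
    return None
```

After reactivation, the application would reconnect through its normal lte_lc path.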

Again, I suspect the cold is related: these units had been running for about 6 months without this specific issue, so it would be quite a coincidence for 2 units to develop the same fault within a window of only 1 or 2 days and stay offline for at least a week.

Best regards,

Sebastiaan

  • Hi,

     

    One of the 2 units resumed some communication on December 19th, but not at the correct intervals (so it clearly still had problems). I went on site and saw LTE connect errors reporting "+CEREG: 90", which indicates a problem with the UICC (SIM card). Several manual resets (triggered via the reset line, not by removing power) did not solve the issue. Eventually I took the SIM card out of the slot for a second, put it back in, and everything started working fine.

    CEREG: 90 points to communication problems with the SIM itself.

    This sounds like a mechanical issue with the SIM card holder itself. Have you tried replacing the SIM holder to see if this has an effect on the stability over temperature?

    Is it possible that the SIM card or another part of the electronics gets stuck after, e.g., a series of invalid commands/responses, so that a power cycle is required to recover? (I still have to verify whether the 1.8 V supply actually goes down on a reset.)

    The SIM supply voltage is switched off a given time after eDRX or PSM is enabled, provided your SIM supports being powered down.

    I'm using Modem FW 1.3.0 and SDK 1.7.1, which are certainly not the latest ones. Would it make sense to upgrade these?

    I do not think this is the root cause of the problems you're seeing.

    Is it possible to reset the SIM card with a firmware command? Or does that happen implicitly on, e.g., lte_lc_offline and "AT+CFUN=0"?

    If your application actively uses PSM/eDRX, SIM communication only occurs on demand, which means the SIM is effectively reset each time the modem communicates with it.

    Since this issue is clearly not easy to reproduce, have similar issues been reported, or can you think of a way to reproduce it?

    Did both devices have exactly the same problem, i.e., was "+CEREG: 90" logged as the failure reason on both?

    since it is battery-based and includes supercapacitors, which makes that difficult.

    Can you share details on the battery type you're using and the supercaps on your board?

     

    Kind regards,

    Håkon

  • Hi Håkon,

    I'll let our HW engineer answer the questions about the battery type and supercaps, but I can already tell you that the power circuit is not trivial. The main battery is https://www.rrc-ps.com/power-management/batterieladegeraete/produkt/RRC2130, a dual-cell Li-ion pack of 4170 mAh. There is also a backup battery (single-cell, 1800 mAh), but it did not have to take over during most of the 1-week outage.

    Both devices had exactly the same problem (CEREG: 90). The second one was even fixed just by opening and closing the SIM socket (without taking out the card), followed by a reset.

    Given your information and Torje's shared experience above, a mechanical issue does seem more likely (possibly caused indirectly by the cold?).

  • Hi,
    That is my department :).
    The nRF is fed by two batteries: a hefty 2S LiPo of about 37 Wh, and a smaller 1S LiPo that takes over if the main battery is removed.
    Both go through an LTC3621 step-down converter, and its 3V3 output is stabilized by a 1 F capacitor (Digikey) to compensate for the voltage dips caused by the heavy current draw of a small DC motor further down the line.
    The capacitors on the nRF follow the datasheet and match the eval board.
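
    To put the 1 F buffer in perspective, a rough estimate of the 3V3 droop during a motor peak, assuming the capacitor supplies whatever current exceeds the converter's output limit I_conv for the duration Δt of the peak. The numbers in the second line are purely illustrative, not measured on this board:

```latex
\Delta V \approx \left(I_{\text{load}} - I_{\text{conv}}\right)\left(\frac{\Delta t}{C} + R_{\text{ESR}}\right)

% Illustrative values only: I_load - I_conv = 0.5 A, \Delta t = 0.2 s, C = 1 F, R_ESR = 0.1 \Omega
\Delta V \approx 0.5\,\text{A} \times \left(\frac{0.2\,\text{s}}{1\,\text{F}} + 0.1\,\Omega\right) = 0.15\,\text{V}
```

    Summing the capacitive and ESR terms gives a worst case at the end of the peak.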

  • Hi,

     

    Thank you for providing such detailed information. I just wanted to make sure that the battery can deliver sufficient peak current, even at low temperatures.

    Based on the battery description and specification, it should have no problem driving the nRF9160, even at lower temperatures.

    Sebastiaan Merckx said:
    Both devices had exactly the same problem (CEREG: 90). The second one was even fixed just by opening and closing the SIM socket (without taking out the card), followed by a reset.

    Thank you for confirming.

    Does your firmware handle the "+CEREG: 90" notification? It could catch this scenario and try again x minutes later. This is not a full fix (that would be to check or replace the SIM holder), but it is a way to detect the scenario.
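
    A sketch of what such a dedicated handler could look like, in Python for readability (on the device it would live in the LTE event handler). It assumes the <stat> value has already been extracted from the notification; the class name, the 5-minute value standing in for "x minutes", and reporting the counters with the next MQTT sync are illustrative assumptions, not an nRF Connect SDK API.

```python
from dataclasses import dataclass
from typing import Optional

CEREG_STAT_UICC_FAILURE = 90  # nRF91: not registered due to UICC failure
REGISTERED_STATS = (1, 5)     # 3GPP TS 27.007: registered, home network / roaming


@dataclass
class UiccFailureTracker:
    """Hypothetical "+CEREG: 90" handler state."""

    retry_delay_s: int = 5 * 60    # the "x minutes" from the thread; value is an assumption
    total_failures: int = 0        # e.g. reported with the next successful MQTT sync
    consecutive_failures: int = 0  # reset once the device registers again

    def on_cereg_stat(self, stat: int) -> Optional[int]:
        """Return the delay in seconds before retrying, or None if no retry is needed."""
        if stat == CEREG_STAT_UICC_FAILURE:
            self.total_failures += 1
            self.consecutive_failures += 1
            return self.retry_delay_s
        if stat in REGISTERED_STATS:
            self.consecutive_failures = 0
        return None
```

    Keeping a total and a consecutive count separately distinguishes a device that occasionally hits the fault from one that is stuck in it.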

     

    Kind regards,

    Håkon

  • Hello Håkon,

    I don't handle the "CEREG: 90" notification specifically, but I have more generic recovery scenarios: e.g., the first connection attempt may take up to 1 minute, the next 4 minutes, then 7 minutes, with an "lte_lc_offline" and an "lte_lc_deinit" in between, and a cold system reset if there is still no connection after the 3rd attempt.
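
    The generic escalation described above can be modelled in a few lines. This is a Python sketch of the behaviour as described, not the actual firmware; try_connect, offline_deinit and cold_reset are hypothetical callbacks that, on the device, would wrap the connect call with a timeout, lte_lc_offline followed by lte_lc_deinit, and the cold reset.

```python
from typing import Callable, Sequence

# Per-attempt connection timeouts described in the thread: 1, 4 and 7 minutes.
ATTEMPT_TIMEOUTS_S = (60, 240, 420)


def run_recovery(
    try_connect: Callable[[int], bool],
    offline_deinit: Callable[[], None],
    cold_reset: Callable[[], None],
    timeouts_s: Sequence[int] = ATTEMPT_TIMEOUTS_S,
) -> bool:
    """Escalating LTE recovery: returns True as soon as an attempt connects.

    Between attempts the modem is taken offline and the link controller is
    de-initialised; if every attempt fails, a cold system reset is requested.
    """
    for i, timeout_s in enumerate(timeouts_s):
        if i > 0:
            offline_deinit()  # on the device: lte_lc_offline(), then lte_lc_deinit()
        if try_connect(timeout_s):
            return True
    cold_reset()
    return False
```

    Keeping the timeouts in one table makes it easy to add a dedicated branch later, e.g. counting "+CEREG: 90" separately from ordinary coverage failures.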

    So I'm pretty sure these devices have tried hundreds of times. You are right that a specific detection mechanism could help to get a better view of how often this occurs in the field (on other devices). I'll give that low priority, though.

    I also have a device that has now been running in the freezer for almost 24 hours (yeah, seriously :)), syncing every 5 minutes, and it works fine so far. It's not one of the 2 faulty units though, so testing one of those could be a follow-up.

    For now, I don't think we can make much progress without a reproduction, so I think it's best to close the ticket (or put it on hold), assuming that a mechanical issue causes the problem.

  • Hi,

     

    Sebastiaan Merckx said:

    I don't handle the "CEREG: 90" notification specifically, but I have more generic recovery scenarios: e.g., the first connection attempt may take up to 1 minute, the next 4 minutes, then 7 minutes, with an "lte_lc_offline" and an "lte_lc_deinit" in between, and a cold system reset if there is still no connection after the 3rd attempt.

    So I'm pretty sure these devices have tried hundreds of times. You are right that a specific detection mechanism could help to get a better view of how often this occurs in the field (on other devices). I'll give that low priority, though.

    OK, that is good to know! What's important is that you are able to recover from the scenario, and it sounds like you have good control over that part!

     

    Sebastiaan Merckx said:

    I also have a device that has now been running in the freezer for almost 24 hours (yeah, seriously :)), syncing every 5 minutes, and it works fine so far. It's not one of the 2 faulty units though, so testing one of those could be a follow-up.

    For now, I don't think we can make much progress without a reproduction, so I think it's best to close the ticket (or put it on hold), assuming that a mechanical issue causes the problem.

    It does sound like you're on the right track for testing.

    Feel free to update this case at any point, if you have any questions or any new behavior/info pops up.

     

    Happy holidays!

     

    Cheers,

    Håkon
