Network stability debug help

We are running into a stability issue with our LTE connection. Our nRF9160 makes a call to our backend every 5 minutes. On start-up, the device connects and starts making these calls. After some time (90 minutes to 3 days) we stop seeing the communication to our backend. Once we power cycle the device, we start to see the calls coming in again. The 9160 application has the watchdog enabled and is not hanging, as I'm still seeing heartbeat printouts in RTT Viewer.

I'm getting the system set up to collect modem trace files with nRF Connect Trace Collector. Are there any additional debug outputs I should be collecting to help diagnose this network connection issue?

We're using:

NCS v2.1.2

MFW v1.3.3

Zephyr 3.1.99

On the AT&T network

  • Hello, 

    All logs along with modem traces are good for us to look into. Make sure that the logs correspond with the modem traces. What kind of application are you running on your device? Are you running this on custom HW or our development kit nRF9160DK?

    Thanks.

    Kind regards,
    Øyvind

  • Hello,

    Following up on this ticket. We've updated our code a good amount since the original posting, but are seeing possibly the same issue.

    We are seeing an issue where our LTE data stops reaching our backend. From the RTT Viewer prints, the nRF9160 dropped the network connection for a period of ~70 minutes before reconnecting. We didn't have a trace being collected for this, but I've set up a test board with trace capture running to catch it.

    We have a custom board that has an nRF9160 and an nRF5340. POST and GET commands are sent from the 5340 via UART to the 9160, and the 9160 returns data from the calls to the 5340. To save battery, we suspend the UART peripherals on both ICs when not in use, then have a shared GPIO act as a wakeup when one of the ICs has data to send over UART. We use the Zephyr ‘pm_device_action_run()’ function to suspend and resume the UART peripherals.

    I will post a modem trace once it's captured, but in the meantime we're wondering 1.) if there are any known issues, or additional areas we should look into regarding the 9160 disconnecting from the network, and 2.) are there any known issues with the UART peripherals on either the 9160 or 5340 that could prevent them from exiting low power mode?

    nRF9160, nRF5340

    NCS v2.4.0

    MFW v1.3.5

    AT&T SIM

  • Eric, my sincere apologies for the late reply. Thanks for reaching out to your RSM! I forgot to answer you back in your last reply, but I forwarded to our modem team on the same day.

    I need to verify what the issue is in the last modem trace. It does look like the reject cause is 7 - EPS services not allowed. Will update within the day (Thursday Norwegian time). 

    Kind regards,
    Øyvind

  • Hi Eric, 

    Our modem team has been looking into the modem logs and provided the following feedback:

    The UE loses the AT&T cell as the coverage decreases or the cell goes out of range. This can be due to, e.g., interference. It takes the UE some time to get back in touch with the AT&T cell. During this time the UE attempts to connect to neighboring cells (both T-Mo and Vzw), and those reject the UE with different EMM causes depending on when and how the UE attempts the attach. We are still working on the issue and are waiting for more feedback from our network experts.

    Kind regards,
    Øyvind

    Here is an update from our modem team. First from our carrier expert:

    The UE first tries T-Mo (311-490) and does a TAU. Since the MME in the T-Mo network has not seen this UE before (the UE's identity from the GUTI/S-TMSI/PTMSI is unknown), the MME responds with TAU REJECT with Cause “Cause: UE identity cannot be derived by the network (9)”.

    Next the UE attempts an attach to the same cell and T-Mo (311-490), and receives an attach REJECT with Cause “Cause: PLMN not allowed (11)”, likely because roaming in the T-Mo network is not allowed for this subscription.

    After this the UE tries a new AT&T cell, but on a FirstNet PLMN, and gets a REJECT with Cause “Cause: No Suitable Cells In tracking area (15)”. I believe the subscription has no FirstNet provisioned.

    The UE tries T-Mo again, but in another cell, and gets another REJECT with Cause “Cause: PLMN not allowed (11)”. Likely no roaming in the T-Mo network is allowed for this subscription.

    Then the UE tries Vzw (311-480) and, as expected, gets a REJECT with Cause “Cause: PLMN not allowed (11)”. Likely no roaming in the Vzw network is allowed for this subscription.

    Then the UE tries the same AT&T cell that recently rejected it, but on the non-FirstNet PLMN, and this succeeds. The service is resumed.

    Based on the above the UE works as expected.
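For readers following the reject sequence above, here is a minimal Python sketch that maps the EMM cause values quoted in this thread to their names and replays the attempts in the order the carrier expert describes. The names `EMM_CAUSES`, `ATTEMPTS`, `describe` and `rejects_before_success` are hypothetical helpers for illustration and are not part of any Nordic tool. The table only covers the four causes mentioned in this thread, and the procedure is recorded as "attempt" wherever the post does not say whether it was a TAU or an attach.

```python
# EMM cause values quoted in this thread (not a complete list).
EMM_CAUSES = {
    7: "EPS services not allowed",
    9: "UE identity cannot be derived by the network",
    11: "PLMN not allowed",
    15: "No suitable cells in tracking area",
}

# (network, procedure, cause) in the order described in the post above.
# cause is None when the network accepted the UE.
ATTEMPTS = [
    ("T-Mo 311-490", "TAU", 9),
    ("T-Mo 311-490", "attach", 11),
    ("AT&T FirstNet PLMN", "attempt", 15),
    ("T-Mo 311-490, other cell", "attempt", 11),
    ("Vzw 311-480", "attempt", 11),
    ("AT&T non-FirstNet PLMN", "attempt", None),
]


def describe(cause):
    """Return a readable outcome for one attempt."""
    if cause is None:
        return "accepted"
    name = EMM_CAUSES.get(cause, "cause not listed in this sketch")
    return f"rejected, cause {cause}: {name}"


def rejects_before_success(attempts):
    """Count rejected attempts that precede the first accepted one."""
    count = 0
    for _network, _procedure, cause in attempts:
        if cause is None:
            break
        count += 1
    return count


if __name__ == "__main__":
    for network, procedure, cause in ATTEMPTS:
        print(f"{network} ({procedure}): {describe(cause)}")
    print("rejects before service resumed:", rejects_before_success(ATTEMPTS))
```

Listing the sequence this way makes it easier to compare against a new trace: a different order, or a cause that is not in the table, would point to a different failure pattern than the one analyzed here.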

    Then our network specialist answered:

    The root cause here seems to be that RRC connection establishments suddenly start failing. The mapped RSRP is very good at all times, and if the use case is a lock then we're assuming the device doesn’t move at all.

    One example of continuous lower layer failures in RRC connection establishment:

    08:47:26.976961  NAS_PDU_SERVICE_REQUEST [c75cb551]
    08:47:31.023805  ERRC_EST_REJ_s { header : { msg_id : ERRC_EST_REJ, sender : TASK_ERRC, receiver : TASK_EMMSM } }
    08:47:31.024019  NAS_PDU_SERVICE_REQUEST [c75cb551]
    08:47:33.034853  ERRC_EST_REJ_s { header : { msg_id : ERRC_EST_REJ, sender : TASK_ERRC, receiver : TASK_EMMSM } }
    08:47:33.035097  NAS_PDU_SERVICE_REQUEST [c75cb551]
    08:47:35.045930  ERRC_EST_REJ_s { header : { msg_id : ERRC_EST_REJ, sender : TASK_ERRC, receiver : TASK_EMMSM } } 

    Due to these failures, as per 3GPP, the modem attempts to connect to other networks and gets rejected as expected. In the end, the modem returns to AT&T. The lower layer failures have disappeared and everything seems to work smoothly.
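The spacing between each service request and its reject in the excerpt above can be checked with a short script. This is only a sketch: it assumes the text layout of the lines pasted in this post (a `HH:MM:SS.ffffff` timestamp followed by the message name), which is not necessarily an export format of any trace tool, and `LINE_RE`, `parse_events` and `reject_delays` are hypothetical names.

```python
import re
from datetime import datetime

# The trace lines quoted in the post above, copied as plain text.
TRACE = """\
08:47:26.976961  NAS_PDU_SERVICE_REQUEST [c75cb551]
08:47:31.023805  ERRC_EST_REJ_s { header : { msg_id : ERRC_EST_REJ, sender : TASK_ERRC, receiver : TASK_EMMSM } }
08:47:31.024019  NAS_PDU_SERVICE_REQUEST [c75cb551]
08:47:33.034853  ERRC_EST_REJ_s { header : { msg_id : ERRC_EST_REJ, sender : TASK_ERRC, receiver : TASK_EMMSM } }
08:47:33.035097  NAS_PDU_SERVICE_REQUEST [c75cb551]
08:47:35.045930  ERRC_EST_REJ_s { header : { msg_id : ERRC_EST_REJ, sender : TASK_ERRC, receiver : TASK_EMMSM } }
"""

# Timestamp, whitespace, then the message name (assumed layout).
LINE_RE = re.compile(r"^\s*(\d{2}:\d{2}:\d{2}\.\d{6})\s+(\w+)")


def parse_events(trace):
    """Return (timestamp, message_name) pairs for lines that match."""
    events = []
    for line in trace.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            events.append((stamp, match.group(2)))
    return events


def reject_delays(events):
    """Seconds from each service request to the reject that follows it."""
    delays = []
    pending = None
    for stamp, name in events:
        if name == "NAS_PDU_SERVICE_REQUEST":
            pending = stamp
        elif name.startswith("ERRC_EST_REJ") and pending is not None:
            delays.append((stamp - pending).total_seconds())
            pending = None
    return delays


if __name__ == "__main__":
    for delay in reject_delays(parse_events(TRACE)):
        print(f"service request rejected after {delay:.3f} s")
```

On this excerpt the first reject arrives roughly 4 seconds after the request and the next two roughly 2 seconds after theirs, with each new service request following its reject within a millisecond.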

    Routing to L1 for investigations.

    Kind regards,
    Øyvind

  • Thank you Øyvind,

    Does the trace have timestamps of when these UE connection attempts occur? As well as the last network communication on that trace?

    What we’re seeing is the UE (nRF9160) becomes non responsive, and the watchdog does not reset the device. We are only able to regain network connectivity after we reboot the device. Your assessment says the modem successfully reconnects with AT&T at the end, but we are not seeing this.

    Also, the last line from the network specialist is 'Routing to L1 for investigation', what does this mean?

    All the best,

    Eric

  • Hi Eric, 

    I will ask our team to provide logs with time stamps. But might not have any before Monday.

    ERob said:
    Also, the last line from the network specialist is 'Routing to L1 for investigation', what does this mean?

    Sorry for the confusion. There are several layers inside of the modem based on 3GPP.  L1 is the "Physical Layer" of modem, and there is a team of experts who will look into this. L2 is the "Data Link Layer". FYI, this is more internal, but still an important part of the investigation and good to know in regards to the time it takes.

    There are still ongoing discussions as to why the device fails to connect to the cell, as this is not clear in the modem trace.

    Connection establishments probably fail because there are interfering neighbor cells. The RSRP of the cells is very good but the SNR is weak. Because of the good RSRP, repetitions or ce-level 1 are not triggered.

    Have you done any HW review of your design? The antenna design might affect the performance. If not done, would it be possible for you to upload your design files? I can forward to our HW design experts to verify if the antenna design is sufficient. 

    ERob said:
    We are only able to regain network connectivity after we reboot the device. Your assessment says the modem successfully reconnects with AT&T at the end, but we are not seeing this.

    I will ask our modem team about this as well. 

    Kind regards,
    Øyvind

  • Hey Øyvind,

    This antenna design was tested in a CTIA certified lab and met all of our performance requirements. The antenna has not changed since our initial installation, so it does not explain why this issue started happening after 3 months of successful device operation. We do not believe that RF noise is the reason for this issue.

    Looking at your previous message, we have the following comments. It is strange the device is attempting TAU on T-Mo. The device should not attempt this unless it is registered to the network, and it should not be able to register to a T-Mo network with the AT&T SIM that is installed. Also, shouldn’t the modem filter the available towers by PLMN, which would prioritize the AT&T towers? It is strange that the device is attempting to attach to T-Mo at all, and especially strange that it is attempting to attach to VZW which uses band 13 instead of band 12.

    Were you able to get any timestamp data? I'm not seeing any datetime data in the trace when viewing in nRF Connect or Wireshark.

    All the best,

    Eric

  • Eric, 

    ERob said:
    The device should not attempt this unless it is registered to the network, and it should not be able to register to a T-Mo network with the AT&T SIM that is installed.

    I'm working with our modem team to find out why this happens. It is not unusual for a device to try to connect to other cell towers, but it is unusual that your stationary device does this.

    ERob said:
    This antenna design was tested in a CTIA certified lab and met all of our performance requirements. The antenna has not changed since our initial installation, so it does not explain why this issue started happening after 3 months of successful device operation. We do not believe that RF noise is the reason for this issue.

    Thanks for clarifying. We are looking into all potential issues to help you out of this situation.

    ERob said:
    Were you able to get any timestamp data? I'm not seeing any datetime data in the trace when viewing in nRF Connect or Wireshark

    Still working on this on my side. If you are not able to see this in Wireshark after converting the trace in Cellular Monitor, then we need to look at other possibilities.

    Kind regards,
    Øyvind 

    Hey Eric, small update from our side. It is still not clear what is causing the disconnects. Our team in the US has been looped in to perform more tests in the area of the failing devices. Will update as soon as I have more information/feedback from the team.

    Thanks for your patience, and we apologize for the inconvenience this is causing. 

    Kind regards,
    Øyvind

  • Hey Oyvind, any update on this investigation?

  • Hello Eric, from what I was informed and have understood, this issue is related to your application and it was fixed in the beginning of this year. Due to that I think we should close this DevZone ticket, and if there still are issues on your side please register a new DevZone ticket with relevant modem traces and application logs from your failing devices.

    Kind regards,
    Øyvind
