NRF9160 NCS v1.7.0 CAT M1 Data Issue

Hello,

We are currently certifying our module for Verizon in the US.  As such, we are updating our software to work with MFW v1.3.0 and the corresponding NCS version v1.7.0.  The upgrade went relatively smoothing and things are generally working as we'd expect.  Our device will connect either over the CATM1 or NBIOT network (configurable) and then connect to an MQTT broker and send data up.  However, we have encountered an issue and require some assistance in troubleshooting it.

When we connect specifically over Vodafone (AT&T in my area) using CATM1 (using SYSTEMMODE set to both CATM1/NBIOT supported w/ LTE Preference = M1 Preferred), we will see our device successfully register to roaming.  It will then successfully open the TCP connection to the MQTT broker and send an MQTT Connect message.  However we never hear back from the server with the CONNACK.  Our device currently waits ~20 seconds before it gives up after not receiving a CONNACK.  Now running the EXACT same code and just swapping out the LTEPreference to "NB Preferred", we can successfully get a CONNACK and send/receive MQTT data.  When using the SAME MFW version (v1.3.0) and the older NCS version (v1.4.2) we can successfully connect over MQTT using CATM1.  This tells me its specifically something with the NCS version, likely something in the nrfxlib Modem Library.  I'm not sure if its necessarily a bug, but it could be some subtle way in which the modem library interacts with the Vodafone network while on CAT M1.

I've got some modem trace pcap files below that may help diagnose this issue.  In both tests, my device (IP 10.49.0.225) is connecting to server w/ IP 192.168.4.13.   You will need to filter the MQTT broker server pcap w/ "ip.src_host == 10.49.0.225 or ip.dst_host == 10.49.0.225". 

I've run two tests:

  1. Connect over NBIOT and successfully use MQTT
    1. Modem Trace = /cfs-file/__key/communityserver-discussions-components-files/4/VZW_5F00_AT130_5F00_NB_5F00_Success_5F00_AT130.pcapng
    2. PCAP from MQTT Broker Server = /cfs-file/__key/communityserver-discussions-components-files/4/VZW_5F00_AT130_5F00_NB_5F00_Success_5F00_DemoServer.pcap
  2. Connect over M1 and FAIL to use MQTT
    1. Modem Trace = /cfs-file/__key/communityserver-discussions-components-files/4/VZW_5F00_AT130_5F00_M1_5F00_Failure_5F00_AT130.pcapng
    2. PCAP from MQTT Broker Server = /cfs-file/__key/communityserver-discussions-components-files/4/VZW_5F00_AT130_5F00_M1_5F00_Failure_5F00_DemoServer.pcap

You can see in the NBIOT pcap files they connection and CONNACK happen as expected.  However with the M1 logs, you can see a couple things:

  1. The TCP SYN Handshake occur successfully. 
  2. The NRF9160 will send the MQTT Connect packet and it looks like i'd expect (screenshot 1). 
  3. Server logs (screenshot 2) looks like the TCP Payload is all zeros.  The other difference I could note is that when its received the "Differentiated Services Field" is set differently on the server side and looks invalid (or unknown).  

It ultimately seems like the NRF Modem Library version in conjunction with the Vodafone CATM1 network is causing us problems.  However I'm not sure how or why this could be happening.  I was hoping your team might have some insight on the issue?  Any insight or recommendations would be much appreciated!

SCREENSHOT 1

SCREENSHOT 2

Parents
  • Hi,

    I'll need the modem team to take a look at this, but I have a few questions first:

    Do you also have the raw modem traces? They contain more information than what is available in the .pcaps, and I wouldn't be surprised if the modem team requests raw traces.

    You say that you are working on Verizon certification, but from what I could see, you are using a Vodafone SIM card?

    Are you using the carrier library?

    You are also using modem fw v1.3.0 with rev1 HW, while the modem fw only has been certified with rev2 HW.

    If you also have traces where you change the NCS version, that would also be great, so that we can compare what happens in them.

    Best regards,

    Didrik

  • Hello Didrik,


    I've attached a raw modem trace of both the issue on NCS v1.7.0 and of a successfully connection on NCS v1.4.2.  Hopefully that gets you the info you need to do a comparison.

    You are correct, I am currently testing using a Vodafone SIM card.  I'm currently waiting on my VZW device, so I'm currently just testing out on the Vodafone network.  We'll have a common software between both our Verizon and Vodafone variants, so ultimately it needs to work for both network types, hence why I'm testing on Vodafone.

    Correct, we are building using the carrier lib.

    Correct, we are using rev1 HW in combination with the mfw v1.3.0.  We were given a waiver by Nordic to use Rev1 HW with this certified version of firmware, so we should be covered.

    Let me know if there's any other info you require.  This is a high priority issue, as it affects our Verizon certification deadline so any information you can provide would be much appreciated!  Thanks!

  • I've looked at the AT commands and socket calls in the two traces, and I've noticed some differences.

    First, both traces contains a lot of AT commands. The sequence is mostly the same, and I don't think the differences should have any meaningfull effect.

    However, when the application starts doing socket operations, things starts to differ.

    While the old application opens a socket, connects it and starts sending data to the server, the new application opens a socket, connect it, then closes it 20s later without sending any data. This sequence happens another time, before the the application opens the socket a third time, where it finally sends some data (the MQTT Connect command). No response is received in the next 20 seconds, so the application closes the socket. Then, ~6 seconds later, %CESQ reports that the modem has lost the LTE signal, although it is regained shortly after. But, this is enough to disconnect the device from the network. the modem tries to reattach with a Tracking Area Update, but is rejected with EMM reject cause 9 "UE identity cannot be derived by the network". It then tries to connect to other networks, but are rejected with either 15 "no suitable cells in tracking area" or 11 "PLMN not allowed".

    Do you know what causes the application to close the socket without sending any data?

    I don't see the same open/close pattern in the VZW_AT130_M1_Failure modemtrace, but I do a similar situation where %CESQ reports no signal, followed by +CGEV showing the PDN bearer has been deactivated, and a TAU reject with reject cause 9.

    I'll give these traces to the modem team as well, to see if they know the reason for the TAU reject, or if there is something I have missed.

  • Hello Didrik,

    I apologize, but I'm having problems following along.  Specifically I'm looking at the "NEW_NCS_AT130_M1_Failure.pcap" (attached below) generated using the "Trace Collector v2" from the modem trace of the same name above.  What I see is that immediately after the LTE connection is established, the TCP socket is opened and an "MQTT Connect" command is sent to the server.  It then sits there without a response. 

    This is ultimately the problem I'd like to focus on.  When I watched the "Connect"' command be received on the server, it had a "nulled" out payload and an invalid/unknown "Differentiated Services Field".  This explains why we're never getting a CONACK back from the server.  Maybe it has something to do with the "RRCConnectionRelease" that is received ~4 seconds later, but this is where I'd like input as I'm not terribly familiar with that level.  

    Ultimately using the same MQTT socket connection/creation code, it worked on NCS v1.4.2 and not NCS v1.7.0.  Hopefully that can help narrow down possibilities or give some insight into what we might be doing incorrectly. 

    As an FYI, the 20 seconds is the timeout we have for the MQTT connection.  If a CONACK is not received in 20 seconds, the MQTT session is closed.

    I look forward to hearing back from you and any insights you have into this problem.

    /cfs-file/__key/communityserver-discussions-components-files/4/NEW_5F00_NCS_5F00_AT130_5F00_M1_5F00_Failure.pcapng

  • Jameson said:
    I apologize, but I'm having problems following along.  Specifically I'm looking at the "NEW_NCS_AT130_M1_Failure.pcap" (attached below) generated using the "Trace Collector v2" from the modem trace of the same name above.  What I see is that immediately after the LTE connection is established, the TCP socket is opened and an "MQTT Connect" command is sent to the server.  It then sits there without a response. 

    You are correct in that it doesn't show up in the .pcap, but when I look at the communication between the application core and the modem, I see this pattern:

    14:57:59.929125 SOCKET_RPC_OPEN_REQ_MSG 
    14:57:59.929277 SOCKET_RPC_OPEN_RES_MSG err_code = 0, new_socket_id = 0
    14:57:59.929979 SOCKET_RPC_CONNECT_REQ_MSG socket_id = 0, destport = 1883, destaddr_len = 4
    14:58:00.296526 SOCKET_RPC_CONNECT_RES_MSG_CB socket_id = 0, err_code = 0
    14:58:20.309282 SOCKET_RPC_CLOSE_REQ_MSG socket_id = 0
    14:58:20.309679 SOCKET_RPC_CLOSE_RES_MSG socket_id = 0, err_code = 0

    Anyway, because it isn't in the other traces (including the failing ones), I don't think this is the cause of the problem.

    The attach and TAU rejects also does not seem related to your issue. Here is what the modem team had to say about it:

    "Attach/TAU rejects are not related to the original issue. Modem does Higher priority PLMN search (according to 3GPP, nothing special) and the highest priority PLMN in SIM card is found (310 260 - according to internet, i.e. not 100% sure, it would be SLO Cellular Inc / Cellular One of San Luis [This is probably a typo. When I look up 310 260 I get T-Mobile. SLO is 311 260. -Didrik]). However, the network rejects the TAU and also following Attach attempt. Then it takes some time before modem again finds AT&T and registers to network. All this can happen. Unfortunately no suitable tracking areas of the highest priority network (310 260) were found."

    I have also discussed this issue with the NCS team, but they also don't know what could cause this issue.

    So, here are some more questions to help us narrow down the cause:

    Do you see the same behaviour (not being able to send TCP data/connect to servers) with other samples (e.g. the mqtt_simple sample) in the SDK?

    Do you see the issue if you do not use your APN?

    Are you able to connect to other servers through your APN?

    Do you see the issue if you use another network?

    Do you see the issue if you use another SIM card?

    Ideally, there would be a way for us to reproduce the problem here in Norway or Finland, but depending on if the network, SIM card or APN is needed, that might not be possible.

    But, in case that is not possible, could you use MQTT sample as base and apply the necessary changes that provokes this scenario (e.g. custom APN, custom server). Then use git bisect on the underlying NCS until we hit the failing commit?

  • Hi, I just wanted to check how it is going?

    Have you been able to narrow down what parts are needed to reproduce this issue, e.g. the APN, SIM card, network, server, etc.?

    Or been able to reproduce it with a simpler application?

    Or find what commit to NCS introduces the problem?

    Is there something I can help you with?

    Best regards,

    Didrik

  • Hello Didrik,

    I've been able to isolate where the problem is occurring.  I built the MQTT example for the NRF9160 DK, pointed to our production servers, and used one of our production SIMs.  I did NOT see the issue occur using this method. 

    I was next able to build the MQTT example for the NRF9160 DK but switch out the UART lines in the DK device tree.  That allowed for me to run that build on our production device and at least validate the MQTT communcation to our production servers.  Interestingly, I did NOT experience the issue on this test either.  This tells me its something in the way our application is being built that is interacting with the MQTT library/modem library that is causing the problem.

    After various tests to compare the devkit vs our build, I was able to isolate the issue to be one of the following:

    • Something specific with our board
    • Something in our bootloader (we use SM, but do some additional validation on top of SPM)

    One thing I did identify is that our bootloader is currently giving us the error "Could not initialize secure services (err -28)".  I've so far been unsuccessful at determining the issue, but had a couple questions based on testing.

    • Do you think this issue might be caused by this secure services error?
    • Any thoughts to the -28 error and how we might resolve that?
    • Based on what I've narrowed it down to, any other thoughts as to what might be causing the issue?
Reply
  • Hello Didrik,

    I've been able to isolate where the problem is occurring.  I built the MQTT example for the NRF9160 DK, pointed to our production servers, and used one of our production SIMs.  I did NOT see the issue occur using this method. 

    I was next able to build the MQTT example for the NRF9160 DK but switch out the UART lines in the DK device tree.  That allowed for me to run that build on our production device and at least validate the MQTT communcation to our production servers.  Interestingly, I did NOT experience the issue on this test either.  This tells me its something in the way our application is being built that is interacting with the MQTT library/modem library that is causing the problem.

    After various tests to compare the devkit vs our build, I was able to isolate the issue to be one of the following:

    • Something specific with our board
    • Something in our bootloader (we use SM, but do some additional validation on top of SPM)

    One thing I did identify is that our bootloader is currently giving us the error "Could not initialize secure services (err -28)".  I've so far been unsuccessful at determining the issue, but had a couple questions based on testing.

    • Do you think this issue might be caused by this secure services error?
    • Any thoughts to the -28 error and how we might resolve that?
    • Based on what I've narrowed it down to, any other thoughts as to what might be causing the issue?
Children
  • Jameson said:
    Do you think this issue might be caused by this secure services error?

    One of the possible causes we have speculated on is that there is some timing difference triggering a race condition somewhere. I guess an error/early return might cause this timing difference.

    This is just speculation though. Normally, I would probably say that "no, as long as the data makes it to the modem, it should work", but that clearly isn't the case here. 

    Jameson said:
    Any thoughts to the -28 error and how we might resolve that?

    Assuming the error comes from spm_secure_services_init, the error code originally comes from nrf_cc3xx_platform_ctr_drbg_init. However, from what I can see from the ctr_drbg_init function, it should not return that error code.

    If it is a "normal" errno.h error code, it is ENOSPC "No space left on device". 

    I'll try to dig more into the ctr_drbg_init function on Monday.

    What "bootloader" are you using? What changes did you do to it?

  • Have you modified spm_secure_services_init somehow?

    Assuming that the -28 error is ENOSPC, and doesn't come from a library that uses different error codes, these are the most likely sources of the error:

    fprotect_area or fprotect_are_no_access

    sms_register_listener

    i2c_transfer

    jwt_init_builder

    fs_register

    nvs_write

    Are you using any of those?

  • Hello again. Have you made any progress?

    There is also a possibility that the -28 error comes from nrf_cc3xx_platform_ctr_drbg_init. To verify if that is the case, we need to see your project configuration.

    Could you share your .config or autoconf.h file?

  • Hi, how are the testing going?

    Would it be easier for you to share details if I make this ticket private?

    Best regards,

    Didrik

Related