NRF9160 NCS v1.7.0 CAT M1 Data Issue

Hello,

We are currently certifying our module for Verizon in the US.  As such, we are updating our software to work with MFW v1.3.0 and the corresponding NCS version v1.7.0.  The upgrade went relatively smoothing and things are generally working as we'd expect.  Our device will connect either over the CATM1 or NBIOT network (configurable) and then connect to an MQTT broker and send data up.  However, we have encountered an issue and require some assistance in troubleshooting it.

When we connect specifically over Vodafone (AT&T in my area) using CATM1 (using SYSTEMMODE set to both CATM1/NBIOT supported w/ LTE Preference = M1 Preferred), we will see our device successfully register to roaming.  It will then successfully open the TCP connection to the MQTT broker and send an MQTT Connect message.  However we never hear back from the server with the CONNACK.  Our device currently waits ~20 seconds before it gives up after not receiving a CONNACK.  Now running the EXACT same code and just swapping out the LTEPreference to "NB Preferred", we can successfully get a CONNACK and send/receive MQTT data.  When using the SAME MFW version (v1.3.0) and the older NCS version (v1.4.2) we can successfully connect over MQTT using CATM1.  This tells me its specifically something with the NCS version, likely something in the nrfxlib Modem Library.  I'm not sure if its necessarily a bug, but it could be some subtle way in which the modem library interacts with the Vodafone network while on CAT M1.

I've got some modem trace pcap files below that may help diagnose this issue.  In both tests, my device (IP 10.49.0.225) is connecting to server w/ IP 192.168.4.13.   You will need to filter the MQTT broker server pcap w/ "ip.src_host == 10.49.0.225 or ip.dst_host == 10.49.0.225". 

I've run two tests:

  1. Connect over NBIOT and successfully use MQTT
    1. Modem Trace = /cfs-file/__key/communityserver-discussions-components-files/4/VZW_5F00_AT130_5F00_NB_5F00_Success_5F00_AT130.pcapng
    2. PCAP from MQTT Broker Server = /cfs-file/__key/communityserver-discussions-components-files/4/VZW_5F00_AT130_5F00_NB_5F00_Success_5F00_DemoServer.pcap
  2. Connect over M1 and FAIL to use MQTT
    1. Modem Trace = /cfs-file/__key/communityserver-discussions-components-files/4/VZW_5F00_AT130_5F00_M1_5F00_Failure_5F00_AT130.pcapng
    2. PCAP from MQTT Broker Server = /cfs-file/__key/communityserver-discussions-components-files/4/VZW_5F00_AT130_5F00_M1_5F00_Failure_5F00_DemoServer.pcap

You can see in the NBIOT pcap files they connection and CONNACK happen as expected.  However with the M1 logs, you can see a couple things:

  1. The TCP SYN Handshake occur successfully. 
  2. The NRF9160 will send the MQTT Connect packet and it looks like i'd expect (screenshot 1). 
  3. Server logs (screenshot 2) looks like the TCP Payload is all zeros.  The other difference I could note is that when its received the "Differentiated Services Field" is set differently on the server side and looks invalid (or unknown).  

It ultimately seems like the NRF Modem Library version in conjunction with the Vodafone CATM1 network is causing us problems.  However I'm not sure how or why this could be happening.  I was hoping your team might have some insight on the issue?  Any insight or recommendations would be much appreciated!

SCREENSHOT 1

SCREENSHOT 2

  • Hello Didrik,

    I apologize, but I'm having problems following along.  Specifically I'm looking at the "NEW_NCS_AT130_M1_Failure.pcap" (attached below) generated using the "Trace Collector v2" from the modem trace of the same name above.  What I see is that immediately after the LTE connection is established, the TCP socket is opened and an "MQTT Connect" command is sent to the server.  It then sits there without a response. 

    This is ultimately the problem I'd like to focus on.  When I watched the "Connect"' command be received on the server, it had a "nulled" out payload and an invalid/unknown "Differentiated Services Field".  This explains why we're never getting a CONACK back from the server.  Maybe it has something to do with the "RRCConnectionRelease" that is received ~4 seconds later, but this is where I'd like input as I'm not terribly familiar with that level.  

    Ultimately using the same MQTT socket connection/creation code, it worked on NCS v1.4.2 and not NCS v1.7.0.  Hopefully that can help narrow down possibilities or give some insight into what we might be doing incorrectly. 

    As an FYI, the 20 seconds is the timeout we have for the MQTT connection.  If a CONACK is not received in 20 seconds, the MQTT session is closed.

    I look forward to hearing back from you and any insights you have into this problem.

    /cfs-file/__key/communityserver-discussions-components-files/4/NEW_5F00_NCS_5F00_AT130_5F00_M1_5F00_Failure.pcapng

  • Jameson said:
    I apologize, but I'm having problems following along.  Specifically I'm looking at the "NEW_NCS_AT130_M1_Failure.pcap" (attached below) generated using the "Trace Collector v2" from the modem trace of the same name above.  What I see is that immediately after the LTE connection is established, the TCP socket is opened and an "MQTT Connect" command is sent to the server.  It then sits there without a response. 

    You are correct in that it doesn't show up in the .pcap, but when I look at the communication between the application core and the modem, I see this pattern:

    14:57:59.929125 SOCKET_RPC_OPEN_REQ_MSG 
    14:57:59.929277 SOCKET_RPC_OPEN_RES_MSG err_code = 0, new_socket_id = 0
    14:57:59.929979 SOCKET_RPC_CONNECT_REQ_MSG socket_id = 0, destport = 1883, destaddr_len = 4
    14:58:00.296526 SOCKET_RPC_CONNECT_RES_MSG_CB socket_id = 0, err_code = 0
    14:58:20.309282 SOCKET_RPC_CLOSE_REQ_MSG socket_id = 0
    14:58:20.309679 SOCKET_RPC_CLOSE_RES_MSG socket_id = 0, err_code = 0

    Anyway, because it isn't in the other traces (including the failing ones), I don't think this is the cause of the problem.

    The attach and TAU rejects also does not seem related to your issue. Here is what the modem team had to say about it:

    "Attach/TAU rejects are not related to the original issue. Modem does Higher priority PLMN search (according to 3GPP, nothing special) and the highest priority PLMN in SIM card is found (310 260 - according to internet, i.e. not 100% sure, it would be SLO Cellular Inc / Cellular One of San Luis [This is probably a typo. When I look up 310 260 I get T-Mobile. SLO is 311 260. -Didrik]). However, the network rejects the TAU and also following Attach attempt. Then it takes some time before modem again finds AT&T and registers to network. All this can happen. Unfortunately no suitable tracking areas of the highest priority network (310 260) were found."

    I have also discussed this issue with the NCS team, but they also don't know what could cause this issue.

    So, here are some more questions to help us narrow down the cause:

    Do you see the same behaviour (not being able to send TCP data/connect to servers) with other samples (e.g. the mqtt_simple sample) in the SDK?

    Do you see the issue if you do not use your APN?

    Are you able to connect to other servers through your APN?

    Do you see the issue if you use another network?

    Do you see the issue if you use another SIM card?

    Ideally, there would be a way for us to reproduce the problem here in Norway or Finland, but depending on if the network, SIM card or APN is needed, that might not be possible.

    But, in case that is not possible, could you use MQTT sample as base and apply the necessary changes that provokes this scenario (e.g. custom APN, custom server). Then use git bisect on the underlying NCS until we hit the failing commit?

  • Hi, I just wanted to check how it is going?

    Have you been able to narrow down what parts are needed to reproduce this issue, e.g. the APN, SIM card, network, server, etc.?

    Or been able to reproduce it with a simpler application?

    Or find what commit to NCS introduces the problem?

    Is there something I can help you with?

    Best regards,

    Didrik

  • Hello Didrik,

    I've been able to isolate where the problem is occurring.  I built the MQTT example for the NRF9160 DK, pointed to our production servers, and used one of our production SIMs.  I did NOT see the issue occur using this method. 

    I was next able to build the MQTT example for the NRF9160 DK but switch out the UART lines in the DK device tree.  That allowed for me to run that build on our production device and at least validate the MQTT communcation to our production servers.  Interestingly, I did NOT experience the issue on this test either.  This tells me its something in the way our application is being built that is interacting with the MQTT library/modem library that is causing the problem.

    After various tests to compare the devkit vs our build, I was able to isolate the issue to be one of the following:

    • Something specific with our board
    • Something in our bootloader (we use SM, but do some additional validation on top of SPM)

    One thing I did identify is that our bootloader is currently giving us the error "Could not initialize secure services (err -28)".  I've so far been unsuccessful at determining the issue, but had a couple questions based on testing.

    • Do you think this issue might be caused by this secure services error?
    • Any thoughts to the -28 error and how we might resolve that?
    • Based on what I've narrowed it down to, any other thoughts as to what might be causing the issue?
  • Jameson said:
    Do you think this issue might be caused by this secure services error?

    One of the possible causes we have speculated on is that there is some timing difference triggering a race condition somewhere. I guess an error/early return might cause this timing difference.

    This is just speculation though. Normally, I would probably say that "no, as long as the data makes it to the modem, it should work", but that clearly isn't the case here. 

    Jameson said:
    Any thoughts to the -28 error and how we might resolve that?

    Assuming the error comes from spm_secure_services_init, the error code originally comes from nrf_cc3xx_platform_ctr_drbg_init. However, from what I can see from the ctr_drbg_init function, it should not return that error code.

    If it is a "normal" errno.h error code, it is ENOSPC "No space left on device". 

    I'll try to dig more into the ctr_drbg_init function on Monday.

    What "bootloader" are you using? What changes did you do to it?

Related