Fundamental eDRX design choices by Verizon breaking functionality

The Problem

We have been struggling with what appears to be a rather fundamental issue using MQTT and eDRX on Verizon. We've investigated this quite heavily, with traces taken from our nRF9160 and our MQTT broker, as well as by Verizon on their equipment. We now believe the crux of the issue is expressed rather well by this line in section 6.4 of the GSMA LTE-M Deployment Guide:

Currently there is no support for HLCom feature for LTE-M Deployments. Thus, it means that in case when a LTE-M device is in either PSM or eDRX, mobile terminating messages, depending on MNO choice the messages will either be buffered or discarded.

In other words, if you have a server trying to send TCP/IP traffic to a mobile device while it is in an eDRX sleep cycle, there is no requirement for the carrier to buffer the data, and they are totally free to discard it. One of our technical contacts at Verizon has confirmed that Verizon discards everything other than SMS, and he is not aware of any plans to enable buffering.

(AT&T does buffer TCP/IP, for at least some time duration, since these issues do not occur on that carrier.)

This issue wasn't immediately apparent, since our MQTT connections are over TCP, and the network stack on the server does normal TCP retries. Everything "just works" if one of the retries happens to be sent during the eDRX paging window. But as you increase the eDRX cycle time, the percentage of the time when the node is able to receive versus asleep becomes smaller and smaller. Worse, the way most TCP stacks implement retries, the retries become less and less frequent the longer it has been since the initial transmission. This means that for a message initially sent right at the beginning of the sleep cycle, you have even worse chances of a retry being delivered during the paging window as the eDRX cycle length increases.
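To make the retry-versus-paging-window argument concrete, here is a minimal Python sketch of it. It is only a model under stated assumptions, not a measurement. The retransmission timer starts at 1 s, doubles each time and is capped at 120 s with 15 retries, which roughly mirrors common Linux defaults, but real stacks derive the timer from the measured round-trip time. It also assumes any segment sent inside the paging window is delivered and anything sent outside it is discarded. The function names are made up for illustration.

```python
def retry_times(initial_rto=1.0, max_rto=120.0, max_retries=15):
    """Send times (seconds after the first send) under exponential backoff."""
    times = [0.0]
    rto = initial_rto
    t = 0.0
    for _ in range(max_retries):
        t += rto
        times.append(t)
        rto = min(rto * 2, max_rto)  # back off, capped at max_rto
    return times


def first_delivery(send_offset, cycle, window, times):
    """Time of the first transmission that lands in a paging window, else None.

    Assumption: the device is reachable during [k*cycle, k*cycle + window)
    and the carrier discards everything sent outside those windows.
    send_offset is where in the eDRX cycle the first transmission happens.
    """
    for t in times:
        if (send_offset + t) % cycle < window:
            return t
    return None


def delivery_stats(cycle, window, step=1.0, times=None):
    """Sweep the first-send offset across one cycle.

    Returns (fraction of offsets eventually delivered, worst delivery delay).
    """
    if times is None:
        times = retry_times()
    offsets = [i * step for i in range(int(cycle / step))]
    results = [first_delivery(o, cycle, window, times) for o in offsets]
    delivered = [r for r in results if r is not None]
    worst = max(delivered) if delivered else None
    return len(delivered) / len(offsets), worst


if __name__ == "__main__":
    # Hypothetical cycle lengths with a 5.12 s paging window.
    for cycle in (20.48, 81.92, 327.68):
        frac, worst = delivery_stats(cycle, 5.12)
        print(f"cycle={cycle}s delivered={frac:.0%} worst_delay={worst}s")
```

Sweeping the cycle length shows how the later, widely spaced retries are the only ones left to hit the window once the first few have missed.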

The primary symptom that led us to investigate this is rather complicated.  After a server-to-mobile TCP message misses the eDRX window and falls back to very long, slow retries, the mobile device can still send messages to the server and the server will see them, but the server cannot send messages back to the mobile device. Looking at traces, what is actually happening is that the later messages from the server are being received and buffered by the TCP stack on the mobile device but not passed to the application layer, since TCP delivers data to the application in order. The TCP stack is still missing the earlier server-to-mobile message and will not hand up any data after that packet. Unfortunately, once in this state, the chance of a retry of the earlier missed message landing in an eDRX paging window becomes extremely small, and usually the TCP socket appears uni-directional until it times out.

One final symptom we see when a packet is discarded is that the Verizon towers will often page the mobile device to generate a service request, bringing it out of the normal eDRX cycle into a full RRC connection, but then not provide any data.  Nordic confirmed this from our traces in this ticket.  Verizon has confirmed that this is intentional, and they provide the symbolic page as an informative message that something was discarded.

How Do We Fix This?

First up, I understand that TCP was never intended for intermittent connections. Moving to UDP would at least fix the symptom of breaking traffic in one direction due to a lost packet. Unfortunately, MQTT requires TCP. The UDP alternative, MQTT-SN, would avoid this issue, but there is no readily-available open-source implementation yet and the development work to get there is not trivial. (The OASIS group did adopt MQTT-SN back in October, so it's not dead, just on everyone's back burner...)
 
One idea we've proven out by hand is that once we are in the broken state, we disable eDRX until we receive the next retry of the lost TCP packet. That releases all of the TCP data buffered in the modem's network stack, and bi-directional communication begins working again.  Unfortunately, we (as the application) aren't always aware when we've missed a packet. A possible fix would be for the modem firmware to provide a notification to the application layer whenever it receives a paging request, or preferably a paging request without data.  We could then, at the application layer, disable eDRX for some time and/or send an MQTT keep-alive to the broker to ensure a working round-trip TCP path (i.e. no lost packets) before going back to low power mode.
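The "stay out of eDRX until a round trip works" idea could be sketched as a small state machine like the one below. This is only an illustration in Python, not what we run on the nRF9160. The class, the action names and the 150 s timeout are all hypothetical, and the caller is assumed to map the returned actions onto real modem and MQTT calls (for example the standard AT+CEDRXS command for the eDRX part). The timeout is chosen to be longer than a 120 s retransmission cap on the server, which is an assumption about the server's TCP stack.

```python
import enum


class State(enum.Enum):
    LOW_POWER = "low_power"        # eDRX enabled, nothing outstanding
    PROBING = "probing"            # eDRX disabled, waiting for PINGRSP
    RECONNECTING = "reconnecting"  # gave up on the socket, opening a new one


class RoundTripGuard:
    """Keeps eDRX off until a keep-alive round trip proves the TCP path works.

    Each method returns a list of action names for the caller to execute.
    """

    def __init__(self, probe_timeout=150.0):
        self.state = State.LOW_POWER
        self.probe_timeout = probe_timeout
        self.deadline = None

    def on_hint(self, now):
        """Any hint that a downlink packet may have been lost."""
        if self.state is State.LOW_POWER:
            self.state = State.PROBING
            self.deadline = now + self.probe_timeout
            return ["disable_edrx", "send_pingreq"]
        return []  # already awake and probing

    def on_pingrsp(self, now):
        """PINGRSP reached the application, so in-order delivery is unblocked."""
        if self.state is State.PROBING:
            self.state = State.LOW_POWER
            self.deadline = None
            return ["enable_edrx"]
        return []

    def on_tick(self, now):
        """Periodic timer: give up on the socket if the probe never completes."""
        if self.state is State.PROBING and now >= self.deadline:
            self.state = State.RECONNECTING
            self.deadline = None
            return ["reconnect"]
        return []

    def on_connected(self, now):
        """A fresh MQTT session is up, so low power mode is safe again."""
        if self.state is State.RECONNECTING:
            self.state = State.LOW_POWER
            return ["enable_edrx"]
        return []
```

The open question from the paragraph above remains: something still has to call `on_hint`, and without a reliable notification from the modem the only fallback is a periodic timer.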

Other ideas get more complicated. One involves having the broker send an SMS to the node in parallel with any MQTT data, which would cause the node to stay out of eDRX until a round-trip communication is done. Another is somehow having the server guess or know when the node is in the paging window and time its transmissions to match.  Verizon in particular is pushing for these latter options, but they are more complicated to implement.

I'm curious if the cellular experts at Nordic have any other ideas on ways to get around this problem.

My personal opinion of all this? Verizon is shooting themselves in the foot by making eDRX only useful for SMS, especially when their competitors are buffering TCP/IP, but I doubt anyone will change their minds.

  • Okay, in my digging through various bits of Linux TCP documentation and source code, I've seen that receiving two ACKs for the same TCP sequence  number is taken by Linux (in most situations) as a hint to immediately re-transmit the packet just *after* that sequence number.  Maybe when our system detects it is in this packet-lost state, the modem could send an extra ACK to the previously ACKed sequence number. That should trigger an immediate resend from the server that would hopefully land in the current RRC period.

  • Another clarification: it looks like sometimes the Linux server stops sending new TCP data down to the mobile device when it knows that the prior packet hasn't been ACKed.  The end result looks similar, but in this case data is being buffered on the server's TCP stack instead of on the mobile client's stack...  This will make it harder for the mobile client to detect that it's in a lost packet state.

    Again, the best way to detect might be to send an application level message that should get an immediate response, like an MQTT keep-alive, and correlate the fact that the TCP sequence has been ACKed but the PINGRSP was never received as a hint that we're stuck.

    I've been playing with Linux TCP sysctls on our server today, so I'm not sure if this is changed behavior because of one of those or if I was just lucky before.

  • I believe that MQTT (at least over TCP) is fundamentally incompatible with eDRX, as even the networks that support buffering may limit the number of buffered packets. For example, one network here in the Netherlands told us that they buffer a single packet per eDRX cycle, for up to the duration of one eDRX cycle, and attempt to deliver it once at the very next PTW. If more than one packet is sent to the end device during the idle phase of the eDRX cycle, the newer packet replaces the one already in the buffer. I believe such behaviour will cause issues when trying to maintain any TCP socket.

    I am curious how Verizon handles Non-IP Data Delivery with eDRX, as that might be a better alternative to using SMS delivery.

  • One final symptom we see when a packet is discarded is that the Verizon towers will often page the mobile device to generate a service request, bringing it out of the normal eDRX cycle into a full RRC connection, but then not provide any data.

    This behavior is extremely inconsistent and frequently doesn't happen.  We're now wondering if we misunderstood our Verizon contact concerning this being intentional, or if it just behaves very differently than we expect.  In either case, we don't think it will be usable as a hint that the server is trying to reach the mobile device.

  • I saw the CSCON suggestion from Nordic on this ticket, but I am still hoping for more commentary or wisdom.

    The CSCON does flip to 1 when we get the symbolic page from Verizon, but the symbolic page doesn't seem totally reliable, so I am hoping for either another method or some indication that the symbolic paging could be improved.
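For anyone wanting to experiment with the CSCON hint discussed above, the idea of spotting an RRC connection that we did not start and that delivered no data could be sketched roughly as follows. This is an illustrative Python sketch, not modem firmware code. It assumes unsolicited notifications were enabled with AT+CSCON=1 and arrive as lines of the form "+CSCON: <mode>", and that only those notifications are fed in, not responses to a read command. The class and method names are hypothetical. Device-initiated signalling such as a periodic tracking area update would also look like a connection with no data, so expect false positives on top of the missed pages described above.

```python
import re

# Matches an unsolicited "+CSCON: <mode>" line (assumed notification format).
CSCON_RE = re.compile(r"^\+CSCON:\s*(\d)")


class SymbolicPageDetector:
    """Flags RRC connections we did not start and that carried no downlink data."""

    def __init__(self):
        self.connected = False
        self.rx_bytes = 0
        self.local_wakeup = False

    def note_uplink(self):
        """The application is transmitting, so the RRC connection is ours."""
        self.local_wakeup = True

    def note_downlink(self, nbytes):
        """The application received nbytes from the socket."""
        self.rx_bytes += nbytes

    def feed(self, line):
        """Feed one notification line.

        Returns True when an RRC connection has just ended that was not
        preceded by our own uplink and delivered no application data.
        """
        match = CSCON_RE.match(line.strip())
        if not match:
            return False
        mode = int(match.group(1))
        if mode == 1:
            # Connection opened: start counting downlink bytes from zero.
            self.connected = True
            self.rx_bytes = 0
            return False
        suspicious = (
            self.connected and not self.local_wakeup and self.rx_bytes == 0
        )
        self.connected = False
        self.local_wakeup = False
        return suspicious
```

A True result would only ever be a hint to go and check the round-trip path, never proof that something was discarded.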
