Fundamental eDRX design choices by Verizon breaking functionality

The Problem

We have been struggling with what appears to be a rather fundamental issue using MQTT and eDRX on Verizon. We've investigated this quite heavily, with traces taken from our nRF9160 and our MQTT broker, as well as by Verizon on their equipment. We now believe the crux of the issue is expressed rather well by this line in section 6.4 of the GSMA LTE-M Deployment Guide:

Currently there is no support for HLCom feature for LTE-M Deployments. Thus, it means that in case when a LTE-M device is in either PSM or eDRX, mobile terminating messages, depending on MNO choice the messages will either be buffered or discarded.

In other words, if you have a server trying to send TCP/IP traffic to a mobile device while it is in an eDRX sleep cycle, there is no requirement for the carrier to buffer the data, and they are totally free to discard it. One of our technical contacts at Verizon has confirmed that Verizon discards everything other than SMS, and he is not aware of any plans to enable buffering.

(AT&T does buffer TCP/IP, for at least some time duration, since these issues do not occur on that carrier.)

This issue wasn't immediately apparent, since our MQTT connections are over TCP, and the network stack on the server does normal TCP retries. Everything "just works" if one of the retries happens to be sent during the eDRX paging window. But as you increase the eDRX cycle time, the fraction of time in which the node can receive, as opposed to being asleep, becomes smaller and smaller. Worse, most TCP stacks back off exponentially, so the retries become less and less frequent the longer it has been since the initial transmission. This means that for a message initially sent right at the beginning of the sleep cycle, the chance of any retry landing in the paging window gets even worse as the eDRX cycle length increases.
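
To make the backoff-versus-window effect concrete, here is a rough sketch of the arithmetic. It is a toy model and not taken from our traces: it assumes a doubling retransmission timeout with roughly Linux-like defaults (1 s initial RTO, 120 s cap, 15 retries), zero network latency, and a device that is reachable only for `window` seconds at the start of each `cycle`. All function names and parameter values are illustrative assumptions.

```python
def retransmit_times(initial_rto=1.0, max_rto=120.0, max_retries=15):
    """Send times (seconds after the first transmission) of the original
    segment and each retry, doubling the timeout up to max_rto.
    The defaults are assumptions, not measured values."""
    times = [0.0]
    rto = initial_rto
    for _ in range(max_retries):
        times.append(times[-1] + rto)
        rto = min(rto * 2, max_rto)
    return times


def hits_in_window(send_times, first_send_offset, cycle, window):
    """Send times that land inside a paging window.
    first_send_offset is how far into the eDRX cycle the first
    transmission happens; the window opens at offset 0 of every cycle."""
    return [t for t in send_times
            if (first_send_offset + t) % cycle < window]


def delivery_probability(cycle, window, steps=1000, **rto_kwargs):
    """Fraction of evenly spaced first-send offsets for which at least
    one transmission (original or retry) lands in a paging window."""
    times = retransmit_times(**rto_kwargs)
    delivered = sum(
        1 for i in range(steps)
        if hits_in_window(times, i * cycle / steps, cycle, window)
    )
    return delivered / steps


if __name__ == "__main__":
    # Illustrative LTE-M eDRX cycle lengths with a 1.28 s paging window.
    for cycle in (20.48, 81.92, 327.68, 655.36):
        p = delivery_probability(cycle, window=1.28)
        print(f"cycle {cycle:7.2f} s: delivered for {p:.0%} of start offsets")
```

Running it with your own cycle length, paging time window and server retry settings should give a feel for how quickly the odds drop. It ignores the fact that a real device stays connected for a while after any successful exchange, so treat the numbers as a rough lower bound for intuition only.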

The primary symptom that led us to investigate this is rather complicated. After a server-to-mobile TCP message misses the eDRX window and falls back to very long, slow retries, the mobile device can still send messages to the server and the server will see them, but the server cannot get messages back to the mobile application. Looking at traces, what is actually happening is that the later messages from the server are being received and buffered by the TCP stack on the mobile device but not passed to the application layer, since TCP mandates in-order delivery. The TCP stack is still missing the earlier server-to-mobile message and will not hand up any data that follows it. Unfortunately, once in this state, the chance of a retry of the earlier missed message landing in an eDRX paging window becomes extremely small, and usually the TCP socket appears uni-directional until it times out.
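
The in-order behaviour described above can be sketched with a toy receiver. This is not the modem's actual TCP stack, just a minimal model of the rule that data is only handed to the application once every earlier byte has arrived. The class and method names are made up for illustration.

```python
class InOrderReceiver:
    """Toy model of TCP in-order delivery using byte sequence numbers."""

    def __init__(self, next_seq=0):
        self.next_seq = next_seq   # next byte the application is waiting for
        self.pending = {}          # seq -> payload, received but not yet deliverable
        self.delivered = []        # payloads handed up to the application

    def on_segment(self, seq, payload):
        if seq < self.next_seq:
            return                 # old duplicate, already delivered
        self.pending[seq] = payload
        # Hand up everything that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            data = self.pending.pop(self.next_seq)
            self.delivered.append(data)
            self.next_seq += len(data)


rx = InOrderReceiver()
rx.on_segment(5, b"world")   # later message arrives, first one was discarded
print(rx.delivered)          # nothing reaches the application yet
rx.on_segment(0, b"hello")   # the retry of the lost segment finally lands
print(rx.delivered)          # both payloads are released, in order
```

The second segment sits in `pending` for as long as the first one keeps missing the paging window, which matches the one-directional socket we see.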

One final symptom we see when a packet is discarded is that the Verizon towers will often page the mobile device to generate a service request, bringing it out of the normal eDRX cycle into a full RRC connection, but then not provide any data.  Nordic confirmed this from our traces in this ticket.  Verizon has confirmed that this is intentional, and they provide the symbolic page as an informative message that something was discarded.

How Do We Fix This?

First up, I understand that TCP was never intended for intermittent connections. Moving to UDP would at least fix the symptom of breaking traffic in one direction due to a lost packet. Unfortunately, MQTT requires TCP. The UDP alternative, MQTT-SN, would avoid this issue, but there is no readily available open-source implementation yet and the development work to get there is not trivial. (OASIS did adopt MQTT-SN back in October, so it's not dead, just on everyone's back burner...)
 
One idea we've proven out by hand: once we are in the broken state, we disable eDRX until we receive the next retry of the lost TCP packet. That releases all of the TCP data buffered in the modem's network stack, and bi-directional communication begins working again. Unfortunately, we (as the application) aren't always aware when we've missed a packet. A possible fix would be for the modem firmware to notify the application layer whenever it receives a paging request, or preferably a paging request without data. We could then, at the application layer, disable eDRX for some time and/or send an MQTT keep-alive to the broker to ensure a working round-trip TCP path (i.e. no lost packets) before going back to low power mode.
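
A minimal sketch of that application-layer logic, written in Python for readability instead of as firmware code. Everything here is hypothetical: `send_at` and `send_ping` are callables you would supply, `on_page_hint` stands in for the modem notification that does not exist today, and the default AT strings are only the bare mode forms of the standard `AT+CEDRXS` command (a real setup would likely need the access technology and requested eDRX value as well).

```python
class EdrxGuard:
    """Leave eDRX on a hint, prove the TCP path with a keep-alive round
    trip, then return to eDRX. All hooks are hypothetical."""

    def __init__(self, send_at, send_ping,
                 enable_cmd="AT+CEDRXS=1", disable_cmd="AT+CEDRXS=0"):
        self.send_at = send_at        # assumed: sends one AT command string
        self.send_ping = send_ping    # assumed: sends an MQTT keep-alive (PINGREQ)
        self.enable_cmd = enable_cmd
        self.disable_cmd = disable_cmd
        self.awaiting_pong = False

    def on_page_hint(self):
        """Call when we suspect downlink data was dropped, e.g. a page
        without data or an application-level timeout."""
        if self.awaiting_pong:
            return                    # already recovering
        self.send_at(self.disable_cmd)
        self.send_ping()
        self.awaiting_pong = True

    def on_pong(self):
        """Call when the keep-alive response reaches the application.
        TCP delivers in order, so every earlier segment has arrived too."""
        if not self.awaiting_pong:
            return
        self.awaiting_pong = False
        self.send_at(self.enable_cmd)


# Usage with print stubs in place of a real modem and MQTT client:
guard = EdrxGuard(send_at=print, send_ping=lambda: print("PINGREQ"))
guard.on_page_hint()   # disables eDRX, sends the keep-alive
guard.on_pong()        # round trip confirmed, eDRX re-enabled
```

The keep-alive response is a convenient marker because it can only be handed to the application after any earlier lost segment has been retransmitted and received. A real version would also want a timeout in case the response never arrives.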

Other ideas get more complicated. One is having the broker send an SMS to the node in parallel with any MQTT data, which would cause the node to stay out of eDRX until a round-trip communication is done. Another is somehow having the server guess or know when the node is in its paging window and time its transmissions to land there. Verizon in particular is pushing for these latter options, but they are more complicated to implement.
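
For the window-timing option, the server-side arithmetic itself is simple; the hard part is knowing the phase. A rough sketch, assuming the server has somehow learned a reference timestamp `window_ref` at which one paging window opened, and that clocks and latency are stable enough for a small guard margin to cover the error. The function and its parameters are illustrative, not an existing API.

```python
def seconds_until_reachable(now, window_ref, cycle, window, guard=0.0):
    """How long the server should hold a message so that it is sent
    inside the device's next paging window.

    now, window_ref: timestamps in seconds on the same clock
    cycle:           eDRX cycle length in seconds
    window:          paging time window length in seconds
    guard:           margin kept clear at both edges of the window
    """
    phase = (now - window_ref) % cycle   # position inside the current cycle
    if phase < guard:
        return guard - phase             # window just opened, wait out the margin
    if phase < window - guard:
        return 0.0                       # inside the usable part of the window
    return cycle - phase + guard         # wait for the next window


# Example: 40 s cycle, 5 s window, 1 s guard on each edge.
print(seconds_until_reachable(now=100.0, window_ref=0.0,
                              cycle=40.0, window=5.0, guard=1.0))
```

This only helps if the first transmission hits the window, since the TCP retries that follow are not under the application's control.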

I'm curious if the cellular experts at Nordic have any other ideas on ways to get around this problem.

My personal opinion of all this? Verizon is shooting themselves in the foot by making eDRX only useful for SMS, especially when their competitors are buffering TCP/IP, but I doubt anyone will change their minds.

Parents
  • One final symptom we see when a packet is discarded is that the Verizon towers will often page the mobile device to generate a service request, bringing it out of the normal eDRX cycle into a full RRC connection, but then not provide any data.

    This behavior is extremely inconsistent and frequently doesn't happen. We're now wondering if we misunderstood our Verizon contact concerning this being intentional, or if it just behaves very differently than we expect. In either case, we don't think it will be usable as a hint that the server is trying to reach the mobile device.
