MQTT Not Receiving Subscribe Message for 30+ Seconds

My setup:
SDK version: 2.5.0.v
Modem version: 1.3.0.


I am using MQTT to subscribe to a topic, for example, the LED state. The plan is to set the LED state on the server side and send the new state to the device. I need a relatively fast reaction time, so I am not using any power-saving modes (e.g., eDRX or PSM). Usually, I receive this message within 2.54 seconds, because in RRC idle the modem checks the PDCCH every 2.54 seconds. I am using an LTE-M network, and my MQTT keep-alive time is 3 minutes.


This setup works well most of the time. However, I occasionally see a long delay of 30 to 60 seconds before the message arrives, which is unacceptable for my use case.
What could be causing this delay?

I also captured a trace while the problem was happening:

1. Between timestamps 606 and 744, I published the new state from the server, but the message was not received.

2. At timestamps 754 and 757, I sent the button state from the device to the server to check if the TCP communication was still open. The server successfully received the device state.

3. Finally, at timestamp 793, I received the delayed publish message from the server.
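
To see how often this happens, the delay can be computed per message instead of being read from the trace by hand. Below is a minimal sketch in Python, assuming the server logs a publish time and the device logs a receive time for each message ID, and that both clocks are synchronized. The `Delivery` record, the `slow_deliveries` helper, and the 5-second threshold are hypothetical names and values chosen for illustration; they are not part of the SDK or the broker.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    msg_id: int
    published_s: float  # server-side publish time, in seconds
    received_s: float   # device-side receive time, in seconds

    @property
    def delay_s(self) -> float:
        # End-to-end delivery delay for this message
        return self.received_s - self.published_s


def slow_deliveries(deliveries, threshold_s=5.0):
    """Return the deliveries slower than threshold_s, worst first."""
    slow = [d for d in deliveries if d.delay_s > threshold_s]
    return sorted(slow, key=lambda d: d.delay_s, reverse=True)
```

Feeding it the matched publish/receive pairs from a longer test run would show whether the slow deliveries cluster around particular times, which can then be looked up in the modem trace.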


PS:
In the trace file, I also see a lot of TCP retransmissions and similar errors. Is this normal?

PS1:
I also checked the MQTT broker debug log on the server side and did not see any delays. Everything looks OK.

UPDATE:
I found one more interesting thing. The modem is constantly switching between two towers. The SNR of one tower is somewhere between 8 dB (fair) and 10 dB (good), while the other tower's SNR ranges from -5 dB to 0 dB (poor). When the device missed the message, it was connected to the tower with poor SNR. This could explain why the MQTT message was not received. 
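
To check this against more than one missed message, the serving cell at the time of each miss can be looked up from a log of signal samples. This is only a sketch, assuming the device periodically records `(time_s, cell_id, snr_db)` tuples sorted by time; the tuple layout and the function names are assumptions for illustration, not a modem or SDK API.

```python
from bisect import bisect_right


def serving_cell_at(samples, t):
    """Return the latest (time_s, cell_id, snr_db) sample at or before t.

    samples must be sorted by time. Returns None if t is before the
    first sample.
    """
    times = [s[0] for s in samples]
    i = bisect_right(times, t)
    return samples[i - 1] if i else None


def count_cell_changes(samples):
    """Count how many times the serving cell differs between samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:])
               if prev[1] != cur[1])
```

Calling `serving_cell_at` with the publish time of each delayed message shows whether the misses really line up with the poor-SNR tower, and `count_cell_changes` gives a rough measure of how often the modem switches.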



 7838.output2.pcapng

  • From the modem team discussion:

    "Both cells customer mentions seem to be usable from L1 point of view (Pagings received in RRC Idle and RRCConnectionReleases received in RRC Connected mode)."

    "Cell PCI 354 is clearly weaker and SNR is worse. This cell is anyway selected always by the cell search/init sync algorithm after RRC Connection. In RRC Idle re-selection algorithm then changes back to better cell PCI 406. This is most likely normal behavior as both cells are good enough."

    "NMEAS sees about identical RSRP and RSRQ for PCIs 354 and 406 - they fluctuate mostly within +-3 dB. In the last quarter of the log PCI 354 has clearly worse RSRP/RSRQ few times. So yes - you would see many reselections between the two."

    "After changing to PCI 406 no pages are received for the device until first mobile originated connection. It seems that network is not sending DL data to correct tower. Or we are missing pages. "

    OK, so the UE is camped on the better cell when the data is not received. The UE receives pages for other devices and the SNR is above 0 dB, so it is very unlikely that the UE would miss pages. The most probable reason is therefore the one you mentioned: “It seems that network is not sending DL data to correct tower”

    "UE is connected to PCI 354 and after RRC Connection is released, UE changes to PCI 406 so likely NW keeps sending pages for UE only to PCI 354 and not for 406. Sounds like NW configuration problem."

    "According to cellmapper.net both cells are in the same tower 442 near St. Gertrude's New Church, Riga. The strange thing is that in the same tower there is cells 354, 355, 356 forming probably 3 different sectors - and in addition - cell 406. Might be a misconfiguration problem on operator side."

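The pattern described in the discussion above (no pages for the device after a reselection until the first mobile-originated connection) can be searched for in a longer log. The following is a rough sketch, assuming the trace has been reduced to a time-ordered list of `(time_s, kind)` events, where `kind` is one of `"reselect"`, `"page"` (a page addressed to this device), or `"mo_connection"`. The event format and the `unpaged_windows` name are hypothetical.

```python
def unpaged_windows(events):
    """Find windows with no page between a reselection and the next
    mobile-originated connection.

    events: time-ordered list of (time_s, kind) tuples.
    Returns a list of (start_s, end_s) windows.
    """
    windows = []
    start = None
    for t, kind in events:
        if kind == "reselect":
            start = t      # window restarts at the latest reselection
        elif kind == "page":
            start = None   # device was paged on this cell, so no gap
        elif kind == "mo_connection" and start is not None:
            windows.append((start, t))
            start = None
    return windows
```

If the delayed MQTT messages consistently fall inside these windows, that would support the theory that the network keeps paging the device on the previous cell.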