MQTT publish POLLERR error

Hi,

I'm using nrf9160 MQTT at commands to send real-time data to the cloud. I run into the issue where the MQTT publish may fail (frequency varies around 1 out of 500 publishes). One thread suggests the MQTT fails if the message is too long, however, I don't think this is my case. Any suggestions on how to debug this issue?

Thanks!

Ye-Sheng

The debug message from the modem

This is the working one:

00> [00:18:17.373,504] <inf> slm_at_host: Enter datamode
00> [00:18:17.550,323] <inf> slm_at_host: time limit reached
00> [00:18:17.550,384] <inf> slm_at_host: Raw send 1479
00> [00:18:17.550,415] <dbg> net_mqtt: mqtt_publish: (0x2001b2b8): [CID 0x2001b5cc]:[State 0x06]: >> Topic size 0x00000015, Data size 0x000005c7
00> [00:18:17.550,476] <dbg> net_mqtt_enc: pack_utf8_str: (0x2001b2b8): >> str_size:00000017 cur:0x2001f163, end:0x2001f35e
00> [00:18:17.550,506] <dbg> net_mqtt_enc: pack_uint16: (0x2001b2b8): >> val:0015 cur:0x2001f163, end:0x2001f35e
00> [00:18:17.550,506] <dbg> net_mqtt_enc: mqtt_encode_fixed_header: (0x2001b2b8): << msg type:0x30 length:0x000005de
00> [00:18:17.550,537] <dbg> net_mqtt_enc: packet_length_encode: (0x2001b2b8): >> length:0x000005de cur:(nil), end:(nil)
00> [00:18:17.550,537] <dbg> net_mqtt_enc: mqtt_encode_fixed_header: (0x2001b2b8): Fixed header length = 03
00> [00:18:17.550,537] <dbg> net_mqtt_enc: pack_uint8: (0x2001b2b8): >> val:30 cur:0x2001f160, end:0x2001f35e
00> [00:18:17.550,567] <dbg> net_mqtt_enc: packet_length_encode: (0x2001b2b8): >> length:0x000005de cur:0x2001f161, end:0x2001f35e
00> [00:18:17.550,567] <dbg> net_mqtt: client_write_msg: (0x2001b2b8): [0x2001b5cc]: Transport writing message.
00> [00:18:17.554,443] <dbg> net_mqtt: client_write_msg: (0x2001b2b8): [0x2001b5cc]: Transport write complete.
00> [00:18:17.554,473] <dbg> net_mqtt: mqtt_publish: (0x2001b2b8): [CID 0x2001b5cc]:[State 0x06]: << result 0x00000000
00> [00:18:17.554,504] <inf> slm_mqtt: datamode send: 0
00> [00:18:18.542,633] <inf> slm_at_host: time limit reached
00> [00:18:18.542,724] <inf> slm_at_host: Raw send 1478
00> [00:18:18.542,724] <dbg> net_mqtt: mqtt_publish: (0x2001b2b8): [CID 0x2001b5cc]:[State 0x06]: >> Topic size 0x00000015, Data size 0x000005c6
00> [00:18:18.542,785] <dbg> net_mqtt_enc: pack_utf8_str: (0x2001b2b8): >> str_size:00000017 cur:0x2001f163, end:0x2001f35e
00> [00:18:18.542,816] <dbg> net_mqtt_enc: pack_uint16: (0x2001b2b8): >> val:0015 cur:0x2001f163, end:0x2001f35e
00> [00:18:18.542,816] <dbg> net_mqtt_enc: mqtt_encode_fixed_header: (0x2001b2b8): << msg type:0x30 length:0x000005dd
00> [00:18:18.542,846] <dbg> net_mqtt_enc: packet_length_encode: (0x2001b2b8): >> length:0x000005dd cur:(nil), end:(nil)
00> [00:18:18.542,846] <dbg> net_mqtt_enc: mqtt_encode_fixed_header: (0x2001b2b8): Fixed header length = 03
00> [00:18:18.542,846] <dbg> net_mqtt_enc: pack_uint8: (0x2001b2b8): >> val:30 cur:0x2001f160, end:0x2001f35e
00> [00:18:18.542,877] <dbg> net_mqtt_enc: packet_length_encode: (0x2001b2b8): >> length:0x000005dd cur:0x2001f161, end:0x2001f35e
00> [00:18:18.542,877] <dbg> net_mqtt: client_write_msg: (0x2001b2b8): [0x2001b5cc]: Transport writing message.
00> [00:18:18.546,844] <dbg> net_mqtt: client_write_msg: (0x2001b2b8): [0x2001b5cc]: Transport write complete.
00> [00:18:18.546,875] <dbg> net_mqtt: mqtt_publish: (0x2001b2b8): [CID 0x2001b5cc]:[State 0x06]: << result 0x00000000
00> [00:18:18.546,875] <inf> slm_mqtt: datamode send: 0
00> [00:18:18.546,905] <inf> slm_at_host: datamode off pending
00> [00:18:18.557,098] <inf> slm_at_host: Exit datamode

This is the failed one.
00> [00:18:19.397,705] <inf> slm_at_host: Enter datamode
00> [00:18:19.573,089] <inf> slm_at_host: time limit reached
00> [00:18:19.573,150] <inf> slm_at_host: Raw send 1466
00> [00:18:19.573,150] <dbg> net_mqtt: mqtt_publish: (0x2001b2b8): [CID 0x2001b5cc]:[State 0x06]: >> Topic size 0x00000015, Data size 0x000005ba
00> [00:18:19.573,272] <dbg> net_mqtt_enc: pack_utf8_str: (0x2001b2b8): >> str_size:00000017 cur:0x2001f163, end:0x2001f35e
00> [00:18:19.573,303] <dbg> net_mqtt_enc: pack_uint16: (0x2001b2b8): >> val:0015 cur:0x2001f163, end:0x2001f35e
00> [00:18:19.573,303] <dbg> net_mqtt_enc: mqtt_encode_fixed_header: (0x2001b2b8): << msg type:0x30 length:0x000005d1
00> [00:18:19.573,333] <dbg> net_mqtt_enc: packet_length_encode: (0x2001b2b8): >> length:0x000005d1 cur:(nil), end:(nil)
00> [00:18:19.573,333] <dbg> net_mqtt_enc: mqtt_encode_fixed_header: (0x2001b2b8): Fixed header length = 03
00> [00:18:19.573,333] <dbg> net_mqtt_enc: pack_uint8: (0x2001b2b8): >> val:30 cur:0x2001f160, end:0x2001f35e
00> [00:18:19.573,364] <dbg> net_mqtt_enc: packet_length_encode: (0x2001b2b8): >> length:0x000005d1 cur:0x2001f161, end:0x2001f35e
00> [00:18:19.573,364] <dbg> net_mqtt: client_write_msg: (0x2001b2b8): [0x2001b5cc]: Transport writing message.
00> [00:18:19.577,117] <dbg> net_mqtt: client_write_msg: (0x2001b2b8): [0x2001b5cc]: Transport write complete.
00> [00:18:19.577,148] <dbg> net_mqtt: mqtt_publish: (0x2001b2b8): [CID 0x2001b5cc]:[State 0x06]: << result 0x00000000
00> [00:18:19.577,148] <inf> slm_mqtt: datamode send: 0
00> [00:18:20.137,756] <err> slm_mqtt: POLLERR
00> [00:18:20.137,786] <inf> net_mqtt_sock_tls: Closing socket 0
00> [00:18:20.139,099] <inf> slm_mqtt: MQTT thread terminated
00> [00:18:20.176,391] <inf> slm_at_host: time limit reached
00> [00:18:20.176,452] <inf> slm_at_host: Raw send 11
00> [00:18:20.176,483] <dbg> net_mqtt: mqtt_publish: (0x2001b2b8): [CID 0x2001b5cc]:[State 0x00]: >> Topic size 0x00000015, Data size 0x0000000b
00> [00:18:20.176,544] <dbg> net_mqtt: mqtt_publish: (0x2001b2b8): [CID 0x2001b5cc]:[State 0x00]: << result 0xffffff80
00> [00:18:20.176,574] <inf> slm_mqtt: datamode send: -128
00> [00:18:20.176,574] <wrn> slm_at_host: Raw send failed, 11 dropped
00> [00:18:20.176,605] <inf> slm_at_host: datamode off pending
00> [00:18:20.186,767] <inf> slm_at_host: Exit datamode

  • Hi Ye-Sheng,

    Thanks for checking with us. POLLERR means "Poll results in error conditions" in documents, so it does not give too much useful information. The error could be caused by a client, server, or something wrong with the connection.

    1) To help us get a better understanding of the behavior of the device. Could you record a modem trace including this failure? Please also share the whole log printout and tell us which MFW and NCS version you are using.

    2) It is also worth trying to test your client and server separately. like nRF9160 device+Publibe MQTT broker/ MQTT Client on PC+ You MQTT broker Cloud server.

    Best regards,

    Charlie

  • Hi Charlie,

    Thanks for the info. Since this is on my custom board, I had to use RTT backend to log it. Please see the attached file, and let me know what you think.

    Thanks!

    Ye-Sheng

    -- 

    I was using:

    modem FW: 1.3.5

    SDK: 2.4.0

    modem_trace0.bin

  • Hi Ye-Sheng,

    Thanks for sharing the modem trace. I will try to find someone to review it.

    Another finding is that the MQTT codes in SLM are using the absolute maximum value so I don’t know if it could trigger some undesired behavior in some edge cases. Our development team has got notified. I would suggest trying with a shorter keep-alive interval to see if that changes anything. So in slm_at_mqtt.c changing err = poll(&fds, 1, mqtt_keepalive_time_left(&client)); to e.g. err = poll(&fds, 1, mqtt_keepalive_time_left(&client) / 2);this is just a wild guess worth trying.

    I am still curious about whether this issue comes from the connection/server side, looking forward to seeing some test results from your side.

    Best regards,

    Charlie

  • Hi Charlie,

    Thanks for the suggestion.

    I couldn't tell the difference with the change, it still fails occasionally. We have other devices sending data to the server via WiFi, and we don't see this issue.

    another piece of information is the MQTT return code was -1, 113 (ECONNABORTED). Any other things I could try?

    Thanks!

  • Hi Ye-Sheng,

    Thanks for the feedback. We have reviewed your modem trace.

    Unfortunately, the trace only shows a working TLS connection, without any breaks, from the error case log, it is clear that the socket is closed, so this would be visible in the modem trace file if it happened during that time.

    In some implementations, POLLERR usually refers to TCP reset. I am not sure if that is case with our modem, but since the error comes in a situation that we have sent the MQTT package and are idle, I would suspect that the server closes the connection with TCP reset.

    There are multiple reasons why this could take place and I would recommend checking the logs in the server side.

    However, after the failure, trace shows a curious length for data send:

    00> [00:18:20.176,452] <inf> slm_at_host: Raw send 11

    If that is not intended, the reason for failure could be that the whole MQTT package was not in the data_buf when the timing was triggered and the remote end is terminating the connection due to receiving an incomplete Pub-message. Is there a possibility that the application/script that takes care of the data sending would send the data in parts?

    Another possibility would be an issue in UART, or SLM UART implementation. SLM UART implementation was (very) recently refactored, so you can give the SLM main branch a go, and see if that would help.

    You can also set CONFIG_NRF_MODEM_LOG=y to get libmodem’s logs to get more clues why poll() returns POLLERR.

    Best regards,

    Charlie

Related