AWS IoT MQTT disconnect while awaiting acknowledgement

I'm using nRF Connect SDK v2.0.x with a Thingy 91. I have established my own AWS account, created a thing with the suggested security policy, and produced certificates per the instructions at https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/nrf/libraries/networking/aws_iot.html. Using the instructions given at https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/nrf/samples/nrf9160/aws_iot/README.html I have provisioned  my thingy 91 and then correctly set the options in the sample's prj.conf to reflect my AWS endpoint and thing name. I compile and flash with no problems.

My application collects data from a sensor and sends it to my AWS endpoint. I require a Quality of Service of 1, so the firmware waits for an acknowledgment after returning from the aws_iot_send function. The app has successfully published. But lately between the time the aws_iot_send function returns successfully and AWS sends an ACK, an AWS_IOT_EVT_DISCONNECTED occurs. Because of the disconnect I never get an acknowledgement and my app fails.

This is a debug log of what is occurring:

[5s] Starting LED Task.

[5s] Starting LED Control Task.

[5s] Starting Button Task.

[5s] Starting Microphone Task.

[5s] Starting LTE Task.

[5s] Starting State Manager Task.

[5s] - - - - - - - - - - - - - - - - - - - Processing INITIALIZE - ACTIONS - - - - - - - - - - - - - - - - - - -

[5s] 1970-01-01 - 00:00:00

[5s] Starting Recording Schedule Task.

[5s] Waiting for LTE connection for up to 60000 ms
[00:00:05.624,267] <wrn> date_time: Valid time not currently available

[17s] LTE cell changed: Cell ID: 12629783, Tracking area: 22242


[17s] RRC mode: Connected


[19s] RRC mode: Idle


[19s] LTE cell changed: Cell ID: 5492495, Tracking area: 267


[19s] RRC mode: Connected


[23s] Network registration status: Connected - roaming


[23s] LTE Connected.

[23s] Waiting for modem UTC time for up to 60000 ms

[23s] PSM parameter update: TAU: 3240, Active time: -1


[23s] Obtained modem UTC time.

[23s] Modem time obtained.

[23s] No longer waiting for 5 second button hold.

[23s] - - - - - - - - - - - - - - - - - - - Processing INITIALIZE - EXIT - - - - - - - - - - - - - - - - - - -

[23s] 2023-04-18 - 20:14:13

[23s] - - - - - - - - - - - - - - - - - - - Processing SLEEP - ENTRY - - - - - - - - - - - - - - - - - - -

[23s] 2023-04-18 - 20:14:13

[23s] - - - - - - - - - - - - - - - - - - - Processing SLEEP - ACTIONS - - - - - - - - - - - - - - - - - - -

[23s] 2023-04-18 - 20:14:13

[23s] Obtained modem UTC time.

[29s] RRC mode: Idle

                                          .
                                          .
                                          .
                                          .
                                          .

[69s] - - - - - - - - - - - - - - - - - - - Processing TRANSMIT - ENTRY - - - - - - - - - - - - - - - - - - -

[69s] 2023-04-18 - 20:14:59

[69s] Trying to subscribe to topic "result/7C:14:24:9F:60:4F".

[69s] Added subscription topic result/7C:14:24:9F:60:4F.

[69s] MQTT Socket: -100

[69s] MQTT Client ID Length: 0

[69s] Waiting for connection to MQTT Broker for up to 15000 ms.

[69s] 
Async AWS IoT Event: 

[69s] AWS_IOT_EVT_CONNECTING. Unhandled event.


[70s] RRC mode: Connected

[00:01:10.459,960] <dbg> aws_iot: broker_init: IPv4 Address found 34.236.127.7

[74s] 
Async AWS IoT Event: 

[74s] AWS_IOT_EVT_CONNECTED: 


[74s] Persistent session not enabled.


[75s] 
Async AWS IoT Event: 

[75s] AWS_IOT_EVT_READY


[75s] Connection succeeded.

[75s] - - - - - - - - - - - - - - - - - - - Processing TRANSMIT - ACTIONS - - - - - - - - - - - - - - - - - - -

[75s] 2023-04-18 - 20:15:04

[75s] - - - - - - - - - - 3 seconds left to transmit - - - - - - - - - -

[75s] Unpopulated Upload JSON:
{"CMD":"Upload","CMDSEQ":18,"DVC":"7C:14:24:9F:60:4F","RECID":{"RECTYPE":"Scheduled","SCHEDTIME":99,"RECSEQ":23},"RECUTC":"2023-04-18 - 20:14:56","RECLEN":3,"SLICEID":{"SLICESEQ":1,"SLICEOFFSET":0,"SLICELEN":1},"SLICEDATA":"[]","SLICESUM":2147483647}



[75s] Saving NV Transmit Failure Retry Count: 1
[00:01:14.250,122] <inf> net_mqtt: Connect completed
[00:01:14.250,183] <dbg> aws_iot: connect_client: Using send socket timeout of 60 seconds
[00:01:14.250,213] <dbg> aws_iot: aws_iot_cloud_poll: AWS IoT broker connection request sent
[00:01:14.620,788] <dbg> aws_iot: mqtt_evt_handler: MQTT client connected!
[00:01:14.634,582] <dbg> aws_iot: topic_subscribe: Subscribing to application topic: result/7C:14:24:9F:60:4F
[00:01:15.009,521] <dbg> aws_iot: mqtt_evt_handler: MQTT_EVT_SUBACK: id = 20793 result = 0
[00:01:15.009,552] <dbg> aws_iot: aws_iot_send: Using message ID 33081 set by the library
[00:01:15.009,582] <dbg> aws_iot: aws_iot_send: Publishing to topic: $aws/things/poc_3/shadow/get
[00:01:15.010,498] <dbg> aws_iot: device_shadow_document_request: Device shadow document requested

[75s] Attempting to transmit for up to 11000ms

[75s] Publishing 21700 bytes to topic "$aws/rules/upload"
[00:01:15.866,821] <dbg> aws_iot: aws_iot_send: Using message ID 8195 set by the application
[00:01:15.866,851] <dbg> aws_iot: aws_iot_send: Publishing to topic: $aws/rules/upload

[77s] aws_iot_send was successful.

[77s] Waiting for ACK for up to 10000 ms.

[79s] 
Async AWS IoT Event: 

[79s] AWS_IOT_EVT_DISCONNECTED

[00:01:19.677,246] <err> net_mqtt_rx: [CID 0x20017d9c]: Connection closed.
[00:01:19.677,276] <inf> net_mqtt_sock_tls: Closing socket 0
[00:01:19.691,772] <dbg> aws_iot: mqtt_evt_handler: MQTT_EVT_DISCONNECT: result = -128
[00:01:19.697,784] <err> aws_iot: Cloud MQTT input error: -128

[85s] RRC mode: Idle

*** Booting Zephyr OS build v3.0.99-ncs1  ***

I have a watch-dog timer around my publish code so if the publish code doesn't return within a specified time the app resets/reboots the thingy 91. That's why you see the Zephyr information at the end of the log.

This is the RAW output from the Trace Collector V2 preview of the transmit:

trace-2023-04-18T20-14-36_aws_disconnect_waiting_for_ack.bin

This is the prj.conf file for this app:

CONFIG_I2S=y
#CONFIG_I2S_LITEX_RX_BLOCK_COUNT=500
CONFIG_I2S_LITEX_DATA_BIG_ENDIAN=n
CONFIG_ETH_LITEETH=n
CONFIG_ETH_LITEETH_0=n
CONFIG_I2S_LOG_LEVEL_ERR=n

CONFIG_PRINTK=y
CONFIG_STDOUT_CONSOLE=y
CONFIG_FLASH=y
CONFIG_SPI_NOR=y

CONFIG_CPLUSPLUS=y 

CONFIG_FLASH_LOG_LEVEL_DBG=n
CONFIG_PWM=y

CONFIG_JSON_LIBRARY=y

CONFIG_MAIN_STACK_SIZE=8000

CONFIG_RESET_ON_FATAL_ERROR=y

# Memory Heap Pool Size
#CONFIG_HEAP_MEM_POOL_SIZE=60000

# AWS IoT
# The only CONFIG_AWS_IOT options I think need to be configured are:
#    CONFIG_AWS_IOT
#    CONFIG_AWS_IOT_BROKER_HOST_NAME
#    CONFIG_AWS_IOT_CLIENT_ID_APP
#    CONFIG_AWS_IOT_CLIENT_ID_STATIC
#    CONFIG_AWS_IOT_APP_SUBSCRIPTION_LIST_COUNT
#    CONFIG_AWS_IOT_SEC_TAG
#    CONFIG_AWS_IOT_LOG_LEVEL_DBG
#    
CONFIG_AWS_IOT=y

CONFIG_AWS_IOT_BROKER_HOST_NAME="xxxxxxxxxxxx-ats.iot.xx-xxxxx-1.amazonaws.com"

CONFIG_AWS_IOT_CLIENT_ID_APP=n

CONFIG_AWS_IOT_CLIENT_ID_STATIC="xxxxxx"

CONFIG_AWS_IOT_SEC_TAG=310

CONFIG_AWS_IOT_APP_SUBSCRIPTION_LIST_COUNT=1
# This is the value from the NCS v2.0.0, AWS IoT sample
#CONFIG_AWS_IOT_APP_SUBSCRIPTION_LIST_COUNT=2

# These options are configured in the NCS v2.0.0 AWS IoT sample
# but I don't use them
#CONFIG_AWS_IOT_TOPIC_UPDATE_DELTA_SUBSCRIBE=y
#CONFIG_AWS_IOT_LAST_WILL=y
#CONFIG_AWS_IOT_TOPIC_GET_ACCEPTED_SUBSCRIBE=y
#CONFIG_AWS_IOT_TOPIC_GET_REJECTED_SUBSCRIBE=y


# Logging
CONFIG_LOG=y
# Originally none of these were enabled
CONFIG_AWS_IOT_LOG_LEVEL_DBG=y                 #1
CONFIG_NET_LOG=y                               #2
CONFIG_NET_NATIVE=n                            #3
CONFIG_MQTT_LOG_LEVEL_DBG=n                    #4
CONFIG_NET_SOCKETS_LOG_LEVEL_DBG=n             #5
CONFIG_NRF_MODEM_LIB_LOG_LEVEL_DBG=n           #6
CONFIG_NRF_MODEM_LIB_DEBUG_SHM_TX_ALLOC=n      #7
CONFIG_NRF_MODEM_LIB_DEBUG_ALLOC=n             #8

# Modem Trace
CONFIG_NRF_MODEM_LIB_TRACE_ENABLED=y

# MQTT
CONFIG_MQTT_LIB=y
CONFIG_MQTT_LIB_TLS=y
CONFIG_MQTT_KEEPALIVE=1200

# Networking
CONFIG_NETWORKING=y

# LTE link control
CONFIG_LTE_LINK_CONTROL=y
CONFIG_LTE_NETWORK_MODE_LTE_M=y
CONFIG_LTE_AUTO_INIT_AND_CONNECT=n

# Date Time library
CONFIG_DATE_TIME=y
CONFIG_DATE_TIME_MODEM=y
CONFIG_DATE_TIME_NTP=n

# HW Info
CONFIG_HWINFO=y

# Modem library
CONFIG_NRF_MODEM_LIB=y
#CONFIG_NRF_MODEM_LIB_SENDMSG_BUF_SIZE=2048
#CONFIG_NRF_MODEM_LIB_HEAP_SIZE=1024
CONFIG_NRF_MODEM_LIB_SHMEM_TX_SIZE=32768
CONFIG_MODEM_ANTENNA_LOG_LEVEL_DBG=n
CONFIG_MODEM=y
CONFIG_MODEM_LOG_LEVEL_DBG=n
CONFIG_MODEM_CONTEXT=y
CONFIG_MODEM_CONTEXT_VERBOSE_DEBUG=y

CONFIG_MQTT_CLEAN_SESSION=y

# Modem information
CONFIG_MODEM_INFO=y

# Reboot
CONFIG_REBOOT=y

I need help trying to figure out why the AWS_IOT_EVT_DISCONNECTED is happening.

Could it be because the message ID's are different as shown in the following log file lines:

[00:01:15.009,552] <dbg> aws_iot: aws_iot_send: Using message ID 33081 set by the library

                                             and

[00:01:15.866,821] <dbg> aws_iot: aws_iot_send: Using message ID 8195 set by the application ?

I don't see any place for me to set .message_id in the aws_iot_data structure that I send to aws_iot_send so the first log line about the library setting the message ID seems right. But why don't they both use the same ID then and how did the second log line get the idea that the application set the ID it used?

Any insight or advice would be greatly appreciated.

Parents
  • The message id is set by pseudorandom, get_sys_tick_u32() or something like that by the aws_iot library, unless set manually. The two messages listed there are separate messages. If sent with QoS 1 you should be getting PUBACK for those same ID's if it's successfully received by the broker.

    Have you checked that you're sending with QoS 1 or 2? PUBACK won't be sent for QoS 0

  • I am setting the qos member of the aws_iot_data structure to MQTT_QOS_1_AT_LEAST_ONCE.

    I expect to get a AWS_IOT_EVT_PUBACK event in the callback function I provided when I aws_iot_init not a AWS_IOT_EVT_DISCONNECTED.

    My app has three large buffers to send. It tries an aws_iot_send with the first and if successful will move on to the next and so on. I have tried a .qos of MQTT_QOS_0_AT_MOST_ONCE, and not waited on an acknowledgement/PUBACK. In this scenario, while the aws_iot_send returns successfully the first and second time, the AWS_IOT_EVT_DISCONNECTED happens before the third buffer and that one naturally fails since the socket isn't connected.

    For my app, the AWS_IOT_EVT_DISCONNECTED always occurs 2 to 3 seconds after the first aws_iot_send.

  • The aws_iot_send uses Zephyr native mqtt_publish() which is non blocking I believe. Successful return does not guarantee that the message was successfully sent or received, I think it returns after it's been written to the socket or, if an error occurred before that (e.g client state was disconnected before the mqtt_publish() call).

    I just saw a bit similar behaviour with one of our demo devices, it started disconnecting from AWS with the reason CONNECTION_LOST, and this happening as it was receiving and updating the Thing Shadow. Are you able to delete the Thing Shadow from the AWS-side?

Reply
  • The aws_iot_send uses Zephyr native mqtt_publish() which is non blocking I believe. Successful return does not guarantee that the message was successfully sent or received, I think it returns after it's been written to the socket or, if an error occurred before that (e.g client state was disconnected before the mqtt_publish() call).

    I just saw a bit similar behaviour with one of our demo devices, it started disconnecting from AWS with the reason CONNECTION_LOST, and this happening as it was receiving and updating the Thing Shadow. Are you able to delete the Thing Shadow from the AWS-side?

Children
Related