Handle Weak LTE Connection

My application is running on a nRF9160 using NCS v1.7.0 and modem FW 1.3.0. Sometimes, when our device is in an area with bad LTE coverage, I believe we're overflowing some buffer on either the modem side or somewhere deeper down in the networking stack on our application side. For example, our device will connect to LTE, connect to MQTT, then start sending data. The device sends data (about 500 bytes) every minute. We call aws_iot_send, receive no errors, and continue doing this for several minutes. As far as the device can tell based on MQTT events and LTE_LC events, the device is still connected. I can tell the service is bad because we don't receive PUBACKs for every single message sent and the ones that are received take a while to show up. We eventually get an LTE and MQTT disconnect event, so we stop queuing up messages. Then, when we connect back to the network, we try to reconnect to MQTT or we do reconnect to MQTT and send messages again, but the application crashes and the device reboots. I have a sample log here:

[00:00:44.006,042] <inf> aws_mgr: Sending message on topic $aws/things/H9_7Nwn8/shadow/update
[00:00:47.265,625] <wrn> data_mgr: App msg id 5 for type 3 timed out
[00:00:47.271,575] <wrn> data_mgr: App msg id 6 for type 26 timed out
[00:00:49.005,676] <wrn> data_mgr: App msg id 7 for type 7 timed out
[00:00:54.010,437] <inf> data_mgr: Saving data type 26 (48 bytes) at index 0
[00:00:54.010,437] <inf> data_mgr: Sync data requested
[00:00:54.010,559] <dbg> data_mgr.create_sync_messages: Able to queue msg id 8
[00:00:54.010,681] <dbg> data_mgr.create_sync_messages: Able to queue msg id 9
[00:00:54.010,803] <dbg> data_mgr.create_sync_messages: Able to queue msg id 10
[00:00:54.010,925] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/s
[00:00:54.013,275] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/m2
[00:00:54.014,526] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/b3
[00:00:54.017,120] <dbg> data_mgr.send_oldest_flash_data: sizeof enum app_data_type_t: 1
[00:00:54.017,120] <dbg> data_mgr.send_oldest_flash_data: Type 26 read from flash at nvs id 2
[00:00:54.017,272] <dbg> data_mgr.send_oldest_flash_data: Able to queue msg id 11
[00:00:54.022,735] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/b3
[00:00:59.010,559] <wrn> data_mgr: App msg id 8 for type 3 timed out
[00:00:59.010,681] <wrn> data_mgr: App msg id 9 for type 25 timed out
[00:00:59.010,803] <wrn> data_mgr: App msg id 10 for type 26 timed out
[00:00:59.017,272] <wrn> data_mgr: App msg id 11 for type 26 timed out
[00:01:04.004,302] <inf> data_mgr: Saving data type 26 (48 bytes) at index 1
[00:01:14.009,979] <inf> data_mgr: Saving data type 26 (48 bytes) at index 2
[00:01:24.009,979] <inf> data_mgr: Saving data type 26 (48 bytes) at index 3
[00:01:32.506,317] <inf> data_mgr: Puback received 5
[00:01:32.507,263] <inf> aws_mgr: AWS_IOT_EVT_READY
[00:01:32.508,850] <inf> data_mgr: Puback received 6
[00:01:32.509,185] <inf> lte_mgr: UUID resp e7f4d5d4-f607-43a6-b383-cfdb6d725c82
[00:01:32.516,754] <dbg> data_mgr.send_reported_shadow_msg: Able to queue msg id 12
[00:01:32.516,845] <dbg> data_mgr.send_get_job_msg: Able to queue msg id 13
[00:01:33.067,108] <err> crypto: Couldn't load server auth pubkey err -140
[00:01:33.067,108] <inf> band_v2: Band hasn't exchanged ECDSA public keys with server, sending message
[00:01:33.618,530] <dbg> data_mgr.send_cert_init_msg: Able to queue msg id 14
[00:01:33.619,262] <inf> aws_mgr: Sending message on topic $aws/things/H9_7Nwn8/shadow/update
[00:01:33.619,720] <inf> data_mgr: Puback received 7
[00:01:33.626,373] <inf> aws_mgr: Sending message on topic $aws/things/H9_7Nwn8/jobs/$next/get
[00:01:33.627,441] <err> aws_mgr: Free topic $aws/things/H9_7Nwn8/jobs/$next/get
[00:01:33.627,593] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/cert-init
[00:01:33.628,753] <err> aws_mgr: Free topic band/H9_7Nwn8/cert-init
[00:01:34.009,979] <inf> data_mgr: Saving data type 26 (48 bytes) at index 4
[00:01:37.516,754] <wrn> data_mgr: App msg id 12 for type 7 timed out
[00:01:37.516,876] <wrn> data_mgr: App msg id 13 for type 15 timed out
[00:01:38.618,560] <wrn> data_mgr: App msg id 14 for type 17 timed out
[00:01:43.992,065] <inf> band_v2: Check in with cloud
[00:01:44.543,884] <err> crypto: Couldn't load server auth pubkey err -140
[00:01:44.543,884] <inf> band_v2: Band hasn't exchanged ECDSA public keys with server, sending message
[00:01:45.094,909] <dbg> data_mgr.send_cert_init_msg: Able to queue msg id 15
[00:01:45.095,092] <inf> data_mgr: Saving data type 26 (48 bytes) at index 5
[00:01:45.095,123] <inf> data_mgr: Sync data requested
[00:01:45.095,214] <dbg> data_mgr.create_sync_messages: Able to queue msg id 16
[00:01:45.095,245] <wrn> data_mgr: Sync msg id 15 for type 17 already queud
[00:01:45.095,367] <dbg> data_mgr.create_sync_messages: Able to queue msg id 17
[00:01:45.095,489] <dbg> data_mgr.create_sync_messages: Able to queue msg id 18
[00:01:45.095,703] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/cert-init
[00:01:45.102,813] <err> aws_mgr: Free topic band/H9_7Nwn8/cert-init
[00:01:45.102,905] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/s
[00:01:45.104,339] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/m2
[00:01:45.106,140] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/b3
[00:01:45.107,513] <dbg> data_mgr.send_oldest_flash_data: sizeof enum app_data_type_t: 1
[00:01:45.107,513] <dbg> data_mgr.send_oldest_flash_data: Type 26 read from flash at nvs id 2
[00:01:45.107,666] <dbg> data_mgr.send_oldest_flash_data: Able to queue msg id 19
[00:01:45.107,757] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/b3
[00:01:50.094,909] <wrn> data_mgr: App msg id 15 for type 17 timed out
[00:01:50.095,245] <wrn> data_mgr: App msg id 16 for type 3 timed out
[00:01:50.095,367] <wrn> data_mgr: App msg id 17 for type 25 timed out
[00:01:50.095,489] <wrn> data_mgr: App msg id 18 for type 26 timed out
[00:01:50.107,666] <wrn> data_mgr: App msg id 19 for type 26 timed out
[00:01:54.009,979] <inf> data_mgr: Saving data type 26 (48 bytes) at index 6
[00:02:04.033,782] <inf> data_mgr: Saving data type 26 (48 bytes) at index 7
[00:02:14.010,314] <inf> data_mgr: Saving data type 26 (48 bytes) at index 8
[00:02:15.823,547] <inf> lte_mgr: RRC mode: Idle
[00:02:16.362,304] <inf> lte_mgr: RRC mode: Connected
[00:02:21.365,753] <inf> lte_mgr: RRC mode: Idle
[00:02:22.346,466] <inf> lte_mgr: RRC mode: Connected
[00:02:24.009,857] <inf> data_mgr: Saving data type 26 (48 bytes) at index 9
[00:02:34.009,948] <dbg> data_mgr.flush_data_to_flash: Flushing 10 elements type 26 buffer to flash starting at index 0
[00:02:34.015,411] <inf> data_mgr: Saving data type 26 (48 bytes) at index 0
[00:02:37.326,690] <err> supl_support: Unable to connect, errno 116
[00:02:37.590,911] <inf> data_mgr: Puback received 8
[00:02:37.623,046] <inf> data_mgr: Puback received 10
[00:02:43.992,065] <inf> band_v2: Check in with cloud
[00:02:44.544,067] <err> crypto: Couldn't load server auth pubkey err -140
[00:02:44.544,067] <inf> band_v2: Band hasn't exchanged ECDSA public keys with server, sending message
[00:02:45.092,102] <dbg> data_mgr.send_cert_init_msg: Able to queue msg id 20
[00:02:45.092,315] <inf> data_mgr: Saving data type 26 (48 bytes) at index 1
[00:02:45.092,315] <inf> data_mgr: Sync data requested
[00:02:45.092,437] <dbg> data_mgr.create_sync_messages: Able to queue msg id 21
[00:02:45.092,437] <wrn> data_mgr: Sync msg id 20 for type 17 already queud
[00:02:45.092,559] <dbg> data_mgr.create_sync_messages: Able to queue msg id 22
[00:02:45.092,681] <dbg> data_mgr.create_sync_messages: Able to queue msg id 23
[00:02:45.092,895] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/cert-init
[00:02:45.099,975] <err> aws_mgr: Free topic band/H9_7Nwn8/cert-init
[00:02:45.100,067] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/s
[00:02:45.101,409] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/m2
[00:02:45.102,722] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/b3
[00:02:45.105,224] <dbg> data_mgr.send_oldest_flash_data: sizeof enum app_data_type_t: 1
[00:02:45.105,224] <dbg> data_mgr.send_oldest_flash_data: Type 26 read from flash at nvs id 2
[00:02:45.105,377] <dbg> data_mgr.send_oldest_flash_data: Able to queue msg id 24
[00:02:45.105,468] <inf> aws_mgr: Sending message on topic band/H9_7Nwn8/b3
[00:02:50.092,102] <wrn> data_mgr: App msg id 20 for type 17 timed out
[00:02:50.092,437] <wrn> data_mgr: App msg id 21 for type 3 timed out
[00:02:50.092,559] <wrn> data_mgr: App msg id 22 for type 25 timed out
[00:02:50.092,681] <wrn> data_mgr: App msg id 23 for type 26 timed out
[00:02:50.105,377] <wrn> data_mgr: App msg id 24 for type 26 timed out
[00:02:54.009,918] <inf> data_mgr: Saving data type 26 (48 bytes) at index 2
[00:02:54.262,786] <inf> lte_mgr: RRC mode: Idle
[00:02:55.916,748] <inf> aws_mgr: AWS_IOT_EVT_DISCONNECTED
[00:02:55.916,778] <inf> aws_mgr: AWS_IOT_EVT_DISCONNECTED reason 3
[00:02:55.916,809] <err> aws_mgr: Couldn't send AWS msg error -11
[00:02:56.491,363] <inf> lte_mgr: RRC mode: Connected
[00:02:58.922,882] <err> at_cmd: Response buffer not large enough
[00:03:04.010,040] <inf> data_mgr: Saving data type 26 (48 bytes) at index 3
[00:03:07.937,866] <inf> lte_mgr: RRC mode: Idle
[00:03:12.907,714] <inf> lte_mgr: RRC mode: Connected
[00:03:14.010,040] <inf> data_mgr: Saving data type 26 (48 bytes) at index 4
[00:03:24.010,009] <inf> data_mgr: Saving data type 26 (48 bytes) at index 5
[00:03:34.010,009] <inf> data_mgr: Saving data type 26 (48 bytes) at index 6
[00:03:35.055,480] <inf> lte_mgr: RRC mode: Idle
[00:03:43.560,272] <inf> lte_mgr: reg status home or roam ? 0
[00:03:43.560,302] <inf> lte_mgr: LTE network registration status 4
[00:03:43.560,302] <inf> lte_mgr: LTE cell changed: Cell ID: -1, Tracking area: -1
[00:03:43.605,133] <inf> lte_mgr: reg status home or roam ? 1
[00:03:43.605,316] <inf> lte_mgr: LTE network registration status 1
[00:03:43.605,316] <inf> lte_mgr: LTE cell changed: Cell ID: 44417027, Tracking area: 44297
[00:03:43.605,438] <dbg> date_time.new_date_time_get: Updating date time UTC...
[00:03:43.605,438] <dbg> date_time.new_date_time_get: Time successfully obtained
[00:03:43.605,468] <inf> lte_mgr: DATE_TIME_OBTAINED_EXT
[00:03:43.605,468] <inf> aws_mgr: AWS Status 4
[00:03:43.605,499] <inf> aws_mgr: LTE Link up, connecting to IoT Core
[00:03:43.610,260] <inf> aws_mgr: AWS_IOT_EVT_CONNECTING
[00:03:43.962,280] <inf> data_mgr: Saving data type 5 (148 bytes) at index 0
[00:03:43.992,095] <inf> band_v2: Check in with cloud
[00:03:44.554,992] <err> crypto: Couldn't load server auth pubkey err -140
[00:03:44.554,992] <inf> band_v2: Band hasn't exchanged ECDSA public keys with server, sending message
[00:03:45.101,501] <wrn> data_mgr: Couldn't queue msg id 25 err -67
[00:03:45.101,898] <inf> lte_mgr: RRC mode: Connected
[00:03:45.102,203] <inf> data_mgr: Saving data type 26 (48 bytes) at index 7
[00:03:45.102,203] <inf> data_mgr: Sync data requested
[00:03:45.102,508] <wrn> data_mgr: Couldn't queue msg id 26 err -67
[00:03:45.102,661] <wrn> data_mgr: Couldn't queue msg id 26 err -67
[00:03:45.102,783] <wrn> data_mgr: Couldn't queue msg id 26 err -67
[00:03:45.102,935] <wrn> data_mgr: Couldn't queue msg id 26 err -67
[00:03:45.111,541] <dbg> data_mgr.send_oldest_flash_data: sizeof enum app_data_type_t: 1
[00:03:45.111,572] <dbg> data_mgr.send_oldest_flash_data: Type 26 read from flash at nvs id 2
[00:03:45.111,724] <wrn> data_mgr: Couldn't queue msg id 26 err -67
[00:03:54.010,009] <inf> data_mgr: Saving data type 26 (48 bytes) at index 8
[00:04:04.010,040] <inf> data_mgr: Saving data type 26 (48 bytes) at index 9
[00:04:14.010,040] <dbg> data_mgr.flush_data_to_flash: Flushing 10 elements type 26 buffer to flash starting at index 0
[00:04:14.015,777] <inf> data_mgr: Saving data type 26 (48 bytes) at index 0
[00:04:24.010,040] <inf> data_mgr: Saving data type 26 (48 bytes) at index 1
[00:00:00.001,037] <inf> regulator_fixed: vdd-pwr-ctrl sync: 0
*** Booting Zephyr OS build v2.6.99-ncs1  ***

[00:00:00.008,453] <inf> fs_nvs: 2 Sectors of 4096 bytes
[00:00:00.008,483] <inf> fs_nvs: alloc wra: 0, f20
[00:00:00.008,483] <inf> fs_nvs: data wra: 0, 154
[00:00:00.008,544] <inf> band_v2: RR: 0x00010000
[00:00:00.008,575] <inf> band_v2: reset for reason other than deep-sleep or watchdog
[00:00:00.008,575] <inf> app_watchdog: Watchdog set up with a 60s interval
[00:00:00.766,448] <inf> fs_nvs: 20 Sectors of 4096 bytes
[00:00:00.766,448] <inf> fs_nvs: alloc wra: 0, eb8
[00:00:00.766,448] <inf> fs_nvs: data wra: 0, 64a
[00:00:00.887,969] <inf> fs_nvs: 762 Sectors of 4096 bytes
[00:00:00.887,969] <inf> fs_nvs: alloc wra: 0, f68
[00:00:00.888,000] <inf> fs_nvs: data wra: 0, 8dc
[00:00:00.890,350] <err> band_v2: Rebooted unexpectedly

Another piece of info is that we call lte_lc_neighbor_cell_measurement every 5 min. Any help with how I could better handle this would be appreciated.

Related