Devices intermittently fail to connect to nRF Cloud MQTT broker (err: -111)

Hi,

Intermittently, after a reset/restart some of our devices are unable to connect to the nRF Cloud MQTT broker:

[00:00:09.235,626] <inf> app_cloud_connection: Network connectivity gained!
[00:00:10.236,114] <inf> app_cloud_connection: Network is ready
[00:00:10.236,145] <inf> app_cloud_connection: Connecting to nRF Cloud
[00:01:00.567,901] <wrn> app_application: Cloud not ready within 60 seconds, proceeding with app startup
[00:01:10.252,746] <err> nrf_cloud_transport: Could not connect to nRF Cloud MQTT Broker mqtt.nrfcloud.com, port: 45858. err: -111
[00:01:10.252,807] <inf> app_cloud_connection: Disconnecting from nRF Cloud
[00:01:10.257,385] <inf> app_cloud_connection: Could not connect to nRF Cloud

[00:01:10.414,062] <inf> app_cloud_connection: Retrying nRF Cloud connection in 30 seconds...

Notes:

  • This affects multiple devices.
  • The error is intermittent. The devices connected fine just before the restart.
  • Once the error occurs, the device keeps retrying but gets the same error indefinitely.
  • The LTE-M connection itself appears to be working, as the log shows, and we also see data flow in the SIM provider portal.
  • After another restart, the devices connect fine again.

Once a device is connected, we typically don't see this error.

It feels like a temporary network routing, firewall, or broker issue prevents the device from connecting once, and the device's network stack then gets stuck in a state where every reconnect attempt fails, even after the temporary problem has gone away. Restarting the device clears that state and allows the device to reconnect instantly.

Have you seen this before? Is there a workaround other than restarting the device?

Thanks

Parents
  • Communication has continued through email.

    Summary:

    • They are using NCS v2.9.0-7787b2649840, which had a bug in how the port value was presented in the serial log: 45858 (0xB322) is 8883 (0x22B3) with its bytes swapped due to the use of hton. This was fixed in this PR in v3.3.0-preview1:
    • I have asked them to share a serial log captured with CONFIG_NRF_CLOUD_LOG_LEVEL_DBG enabled when the error happens.
    • I have also asked them to upgrade the MFW to 2.0.4 on a separate board, try to reproduce the error, and share that log as well, again with CONFIG_NRF_CLOUD_LOG_LEVEL_DBG enabled.
    • The error happens after a firmware upgrade of the application through SWD. I recommended disabling both the transmit and receive RF circuits and deactivating LTE and GNSS before the reset. This can be achieved by calling lte_lc_power_off() and waiting 10 seconds before the reset.
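
The port arithmetic in the first point can be checked with a small host-side C sketch. This is not code from the nRF Connect SDK; `swap16` and `print_port_views` are hypothetical helper names used only for illustration. It shows that swapping the two bytes of 8883 gives 45858, which is the value that appeared in the log.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: swap the two bytes of a 16-bit value.
 * On a little-endian CPU this is the effect of converting a port
 * to network byte order and then printing it as a host integer. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Print a port as configured and as it looks with its bytes swapped. */
void print_port_views(uint16_t port)
{
    uint16_t swapped = swap16(port);

    printf("port %u (0x%04X) -> swapped %u (0x%04X)\n",
           (unsigned)port, (unsigned)port,
           (unsigned)swapped, (unsigned)swapped);
}
```

Calling `print_port_views(8883)` prints both views of the port, so a log line showing 45858 does not mean the device was using the wrong port.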

    Regards,

    Pascal.

Children
  • Hi, 

    Thanks, Pascal. To close the loop on this:

    We still see the err: -111 occasionally (with and without SWD) and have not yet been able to reproduce it reliably or find the root cause. Sometimes it recovers by itself, but not always. We have found that the sequence of 

    conn_mgr_all_if_disconnect(true)
    lte_lc_offline()
    lte_lc_normal()
    conn_mgr_all_if_connect(true)

    usually restores the cloud connection in these circumstances (without a reboot). 
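
For reference, the sequence above could be wrapped roughly as follows. This is a minimal sketch, not our production code: `run_steps`, `recover_cloud_link` and the `step_*` wrappers are hypothetical names, and stopping at the first failing call is an assumption added here. The host branch only exists so the ordering logic can be compiled and exercised off-target.

```c
#include <stdbool.h>
#include <stddef.h>

#ifdef __ZEPHYR__
/* Target build: wrap the four calls named in the post. */
#include <zephyr/net/conn_mgr_connectivity.h>
#include <modem/lte_lc.h>

static int step_if_down(void) { return conn_mgr_all_if_disconnect(true); }
static int step_offline(void) { return lte_lc_offline(); }
static int step_normal(void)  { return lte_lc_normal(); }
static int step_if_up(void)   { return conn_mgr_all_if_connect(true); }
#else
/* Host build: stand-ins that always succeed (illustration only). */
static int step_if_down(void) { return 0; }
static int step_offline(void) { return 0; }
static int step_normal(void)  { return 0; }
static int step_if_up(void)   { return 0; }
#endif

typedef int (*step_fn)(void);

/* Run the steps in order; return 0 if all succeed, otherwise the
 * error code of the first step that fails (assumed policy). */
int run_steps(const step_fn *steps, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        int err = steps[i]();

        if (err != 0) {
            return err;
        }
    }
    return 0;
}

/* Hypothetical entry point: disconnect, cycle the modem, reconnect. */
int recover_cloud_link(void)
{
    static const step_fn steps[] = {
        step_if_down, step_offline, step_normal, step_if_up,
    };

    return run_steps(steps, sizeof(steps) / sizeof(steps[0]));
}
```

A zero return only means every request was accepted. The conn_mgr calls are asynchronous, so the application still has to wait for the connectivity events before retrying the nRF Cloud connection.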

    We will also try MFW 2.0.4.

    Best,

    Terrence

  • Hello Terrence,

    conn_mgr is asynchronous, so how you use it depends on the events it raises. The following example demonstrates its use:
    Your connection handler should catch the disconnection event (NET_EVENT_L4_DISCONNECTED), and your application/business logic should call conn_mgr_all_if_connect(true) when it is time to re-establish the connection. Using conn_mgr_all_if_disconnect(true) achieves the same result as using the lte_lc_power_off() wrapper, so for the sake of clarity you should neither combine them nor call them one after the other.
    Regards,
    Pascal.