Various errors during medium-term deployment of cellular-enabled sensor device
We've prototyped an agricultural gas sensor based on the nRF9160, which takes readings every 15 minutes and uploads them to a server through MQTT. The firmware is loosely based on the "mqtt_simple" sample. We're using modem firmware v1.1.2 (the highest supported on Verizon) and NCS tag v1.2.1. (We tried NCS v1.3.0 briefly but had trouble getting it to work.) The control loop in our code currently connects and disconnects each time a data point needs to be uploaded. We considered eDRX, but other users reported issues with it on Verizon, so we decided to hold off for now. We're using LTE-M with SIM cards obtained through ThingSpace.
Our problems seem to happen after the device has been working for a while. One of the devices communicated for a full week, and others fail a few times per week depending on the data rate. We have one prototype connected to a debugger (i.e. the nRF9160 devkit), using a two-minute data collection interval instead, and we've seen a number of errors occur on it. Most of these errors manifested as crashes and seemed to be due to too many "printk()" calls through the Segger RTT (most likely stack smashing or something similar). This is an entirely separate topic which should be resolved, but it is of less concern to us now that we've slowed down the debug message rate. Furthermore, the non-debugged devices don't have "printk()" calls enabled at all, so this wouldn't have affected them. We're now aware of five remaining, unresolved errors:
1. "mqtt_connect()" returns -128 and won't recover until the nRF9160 is totally reset. This needs to be fixed. It's remotely possible that this and #2 are caused by the debugger.
2. "mqtt_connect()" returns -115 but does recover shortly. We'd like to avoid this happening if possible, but aren't sure how.
3. Opening a socket for AT commands fails with errno=12, which is "out of memory". This appears to be non-recoverable without a reset. At first it shows up as being unable to check the signal strength while connected, although the check still works when unconnected (e.g. during a tower search). This suggests that being connected takes up a certain amount of memory, which is sensible. Then another piece of memory leaks shortly thereafter and "AT+CFUN=1" also fails. The memory space assigned to "malloc()" doesn't seem to be leaking at all, so the leak is probably internal to the modem. Overall, using AT commands has been a bit of a struggle.
4. NTP sockets fail to open after a certain number of connect/disconnect cycles. This is probably related to #3. We haven't been using NTP because of this for quite a while, and instead put a time request through to our MQTT server. This happened on modem firmware v1.2.0, actually, and most likely using an iBasis SIM card.
5. An unknown hang in the main thread. The UI thread, which handles LED display pattern timing and on/off switch status, keeps running. We haven't seen this recently through the debugger, although we believe it happened at some point in the past, possibly while the code was still under development. Pausing in the debugger would bring us to an idle loop with no call stack shown at all, making it hard to know what exactly occurred. The whole purpose of the separate UI thread was to keep the LEDs blinking properly during the sometimes-laggy AT commands. This is our main concern with the deployed devices.
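One way to at least detect the hang in item 5 on deployed devices would be a software heartbeat: the main thread bumps a counter whenever it makes progress, and the UI thread (which keeps running during the hang) checks that counter on its existing timer. The following is only a minimal sketch under assumptions. The struct and function names are hypothetical and are not part of NCS or Zephyr, the threshold is arbitrary, and what happens once a hang is reported (logging the last known step, rebooting, or no longer feeding a hardware watchdog) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical heartbeat state shared between the main thread (the only
 * writer of `count`) and the UI thread (the reader). Not an NCS/Zephyr API. */
struct heartbeat {
    volatile uint32_t count;   /* bumped by the main thread only */
    uint32_t last_seen;        /* counter value at the previous check */
    uint32_t stale_checks;     /* consecutive checks with no progress */
    uint32_t max_stale_checks; /* stale checks needed to report a hang */
};

static void heartbeat_init(struct heartbeat *hb, uint32_t max_stale_checks)
{
    hb->count = 0;
    hb->last_seen = 0;
    hb->stale_checks = 0;
    hb->max_stale_checks = max_stale_checks;
}

/* Main thread: call at the top of the control loop and between long steps
 * (sensor read, LTE attach, MQTT connect, publish, disconnect). */
static void heartbeat_kick(struct heartbeat *hb)
{
    hb->count++;
}

/* UI thread: call periodically. Returns true once the counter has not
 * moved for max_stale_checks consecutive calls. */
static bool heartbeat_check(struct heartbeat *hb)
{
    uint32_t now = hb->count;

    if (now != hb->last_seen) {
        /* Progress since the last check: reset the stale counter. */
        hb->last_seen = now;
        hb->stale_checks = 0;
        return false;
    }
    if (hb->stale_checks < hb->max_stale_checks) {
        hb->stale_checks++;
    }
    return hb->stale_checks >= hb->max_stale_checks;
}
```

The threshold would need to be longer than the slowest legitimate blocking step (a tower search or a laggy AT command), otherwise the monitor reports hangs that are really just slow operations. Recording which step last called heartbeat_kick() would also narrow down where the main thread stops.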
More details and context can be provided for the first four points if needed, including screenshots of SES. Our full code can be shared but it would be best to do this confidentially (we'll open a private ticket). We additionally had some questions:
A. Is there some material limit on how many times a device can connect without resetting? Why is the Asset Tracker (at least the older version) coded to reset the whole application if a connection fails?
B. In the "mqtt_simple" sample, if "broker_init()" fails via "getaddrinfo()", the application still tries to connect to MQTT. What then is the expected behavior of "mqtt_connect()"? Will it crash? We escape from the connection sequence in our newest code if "getaddrinfo()" fails (although we use a direct IP address anyway), but our deployed devices don't have this modification yet.
C. There's been talk of the BSD library (see Emil Lenngren's recent thread) affecting certain low-level calls. How do we determine which/whose BSD library we're using?
D. Are any commands available to casually monitor the status of the modem's inner workings, such as available memory (to detect a leak)? Do positive "socket()" return values offer any insight? Does the modem firmware run on a different processor completely?
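To make question B concrete, the guard we describe (escaping the connection sequence when "getaddrinfo()" fails) looks roughly like the sketch below. It is written against plain POSIX sockets so that it stands alone, and it is not our production code or the sample's code: the helper name resolve_broker() and its parameters are hypothetical, and the ai_flags argument is only there so that a build using a direct IP address can pass AI_NUMERICHOST and skip DNS.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve the broker address and report failure to
 * the caller instead of letting the connect step run with a stale or
 * uninitialized address. Returns 0 on success, -1 on failure. */
static int resolve_broker(const char *host, const char *port, int ai_flags,
                          struct sockaddr_storage *out, socklen_t *out_len)
{
    struct addrinfo hints;
    struct addrinfo *result = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* IPv4 only in this sketch */
    hints.ai_socktype = SOCK_STREAM; /* MQTT runs over TCP */
    hints.ai_flags = ai_flags;

    if (getaddrinfo(host, port, &hints, &result) != 0 || result == NULL) {
        return -1; /* caller must not attempt to connect */
    }

    /* Use the first result and release the list. */
    memcpy(out, result->ai_addr, result->ai_addrlen);
    *out_len = result->ai_addrlen;
    freeaddrinfo(result);
    return 0;
}

/* Intended caller pattern (pseudocode for the upload cycle):
 *
 *   if (resolve_broker(host, port, flags, &addr, &len) != 0) {
 *       // skip mqtt_connect() this cycle and retry at the next interval
 *       return;
 *   }
 *   // fill in the MQTT client's broker address, then call mqtt_connect()
 */
```

The point of returning an error rather than logging and continuing is that the connect step is never reached with an address that was not filled in during this cycle.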
We're working on getting a modem trace, but this means rewriting our code not to use the serial gas sensor so that it can run on the nRF9160 devkit. Our devkit also uses modem firmware v1.2.0, and apparently downgrading is impossible, so this could affect the results and we likely won't be able to connect to Verizon. Thanks for any help you can provide, and feel free to ask any questions.