NRF9160 SLM self reboot cause

Hello,

I'm using the SLM application with an NRF9160.  I'm sending AT commands for MQTT publish.  After ~10hours the device reboots.

(22:56:29.123) AT%XTEMP?
(22:56:29.123) AT#XMQTTPUB="sensor/255/battery/V","11.59",0,0
(22:56:29.162) CPU Temperature: 28 deg
(22:56:29.162) AT#XMQTTPUB="sensor/255/cpu_temp/C","28",0,0
(22:56:30.130) SLM: [11:00:38.999,267] <ESC>[0m<dbg> slm_at_host: cmd_send: RX
(22:56:30.130) SLM:                                       41 54 25 58 54 45 4d 50  3f                      |AT%XTEMP ?       <ESC>[0m
(22:56:30.156) SLM: [11:00:39.008,331] <ESC>[0m<dbg> slm_at_host: slm_at_send_indicate: TX
(22:56:30.176) SLM:                                       0d 0a 25 58 54 45 4d 50  3a 20 32 38 0d 0a 0d 0a |..%XTEMP : 28....
(22:56:30.176) SLM:                                       4f 4b 0d 0a                                      |OK..             <ESC>[0m
(22:56:30.194) SLM: [11:00:39.008,392] <ESC>[0m<dbg> slm_at_host: cmd_send: RX
(22:56:30.194) SLM:                                      41 54 23 58 4d 51 54 54  0 42 3d 22 73 65 6e |AT#XMQTT PUB="sen
(22:56:30.213) SLM:                                       73 6f 72 2f 32 35 35 2f   61 74 74 65 72 79 2f |sor/255/ battery/
(22:56:30.213) SLM:                                       56 22 2c 22 31 31 2e 35   22 2c 30 2c 30       |V","11.5 9",0,0  <ESC>[0m
(22:56:30.213) SLM: [11:00:39.015,869] <ESC>[0m<dbg> slm_at_host: slm_at_send_indicate:X
(22:56:30.213) SLM:                                       0d 0a 4f 4b 0d 0a                             |..OK..           <ESC>[0m
(22:56:30.237) SLM: [11:00:39.016,815] <ESC>[0m<dbg> slm_at_host: cmd_send: RX
(22:56:30.237) SLM:                                    41 54 23 58 4d 51 54 54  50 42 3d 22 73 65 6e |AT#XMQTT PUB="sen
(22:56:30.268) SLM:                                       73 6f 72 2f 32 35 35 2f  70 75 5f 74 65 6d 70 |sor/255/ cpu_temp
(22:56:30.268) SLM:                                       2f 43 22 2c 22 32 38 22   30 2c 30             |/C","28" ,0,0    <ESC>[0m
(22:56:30.268) SLM: [11:00:39.017,639] <ESC>[0m<dbg> slm_at_host: slm_at_send_indicate:X
(22:56:30.268) SLM:                                       0d 0a 4f 4b 0d 0a                             |..OK..           <ESC>[0m
(22:56:35.795) SLM: *** Booting nRF Connect SDK v2.8.0-a2386bfc8401 ***
(22:56:35.831) SLM: *** Using Zephyr OS v3.7.99-0bc3393fb112 ***
(22:56:35.831) SLM: [00:00:00.252,044] <ESC>[0m<inf> at_cmd_custom: Custom AT commands enabled with 70 entries.<ESC>[0m
(22:56:35.831) SLM: [00:00:00.252,075] <ESC>[0m<dbg> slm: main: RR: 0x00000000<ESC>[0m
(22:56:35.831) SLM: [00:00:00.257,873] <ESC>[0m<inf> fs_nvs: 2 Sectors of 4096 bytes<ESC>[0m
(22:56:35.831) SLM: [00:00:00.257,873] <ESC>[0m<inf> fs_nvs: alloc wra: 0, fe8<ESC>[0m
(22:56:35.854) SLM: [00:00:00.257,873] <ESC>[0m<inf> fs_nvs: data wra: 0, 0<ESC>[0m
(22:56:36.227) SLM: Ready
(22:56:36.227) SLM: [00:00:00.491,821] <ESC>[0m<inf> slm: lib_modem init: 0<ESC>[0m
(22:56:36.249) SLM: [00:00:00.491,912] <ESC>[0m<inf> mcuboot_util: Image index: 0, Swap type: none<ESC>[0m
(22:56:36.249) SLM: [00:00:00.491,912] <ESC>[0m<inf> slm: Serial LTE Modem<ESC>[0m
(22:56:36.275) SLM: [00:00:00.585,327] <ESC>[0m<inf> nrf_cloud_info: Device ID: 504d5632-3736-4649-808b-2321133ae3a6<ESC>[0m
(22:56:36.275) SLM: [00:00:00.590,576] <ESC>[0m<inf> nrf_cloud_info: IMEI:      350457793853522<ESC>[0m
(22:56:36.275) SLM: [00:00:00.682,861] <ESC>[0m<inf> nrf_cloud_info: UUID:      504d5632-3736-4649-808b-2321133ae3a6<ESC>[0m
(22:56:36.275) SLM: [00:00:00.688,598] <ESC>[0m<inf> nrf_cloud_info: Modem FW:  mfw_nrf9160_1.3.7<ESC>[0m
(22:56:36.307) SLM: [00:00:00.688,629] <ESC>[0m<inf> nrf_cloud_info: Protocol:          MQTT<ESC>[0m
(22:56:36.307) SLM: [00:00:00.688,659] <ESC>[0m<inf> nrf_cloud_info: Download protocol: HTTPS<ESC>[0m
(22:56:36.307) SLM: [00:00:00.688,690] <ESC>[0m<inf> nrf_cloud_info: Sec tag:           16842753<ESC>[0m
(22:56:36.307) SLM: [00:00:00.688,720] <ESC>[0m<inf> nrf_cloud_info: Host name:         mqtt.nrfcloud.com<ESC>[0m
(22:56:36.335) SLM: [00:00:00.688,812] <ESC>[0m<inf> slm_uart_handler: UART baud: 11520 d/p/s-bits: 3/0/1 HWFC: 0<ESC>[0m
(22:56:36.335) SLM: [00:00:00.688,964] <ESC>[0m<dbg> slm_at_host: slm_at_send_indicate:
(22:56:36.373) SLM:                                       52 65 61 64 79 0d 0a                          |Ready..          <ESC>[0m
(22:56:36.373) SLM: [00:00:00.688,964] <ESC>[0m<inf> slm_at_host: at_host init done<ESC>[0m[00:00:00.689,025] <ESC>[0m<dbg> slm: configure_power_pin_interrupt: Configured interrupt (0x3400000) on power pin (6).<ESC>[0m
(22:56:59.010) AT%XTEMP?

The setup AT commands from powerup are as follows:

(11:55:51.312) SLM: *** Booting nRF Connect SDK v2.8.0-a2386bfc8401 ***
(11:55:51.343) SLM: *** Using Zephyr OS v3.7.99-0bc3393fb112 ***
(11:55:51.343) SLM: [00:00:00.252,014] <ESC>[0m<inf> at_cmd_custom: Custom AT commands enabled with 70 entries.<ESC>[0m
(11:55:51.365) SLM: [00:00:00.252,044] <ESC>[0m<dbg> slm: main: RR: 0x00000000<ESC>[0m
(11:55:51.365) SLM: [00:00:00.257,843] <ESC>[0m<inf> fs_nvs: 2 Sectors of 4096 bytes<ESC>[0m
(11:55:51.365) SLM: [00:00:00.257,843] <ESC>[0m<inf> fs_nvs: alloc wra: 0, fe8<ESC>[0m
(11:55:51.365) SLM: [00:00:00.257,843] <ESC>[0m<inf> fs_nvs: data wra: 0, 0<ESC>[0m
(11:55:51.743) SLM: Ready
(11:55:51.743) SLM: [00:00:00.491,577] <ESC>[0m<inf> slm: lib_modem init: 0<ESC>[0m
(11:55:51.772) SLM: [00:00:00.491,668] <ESC>[0m<inf> mcuboot_util: Image index: 0, Swap type: none<ESC>[0m
(11:55:51.772) SLM: [00:00:00.491,668] <ESC>[0m<inf> slm: Serial LTE Modem<ESC>[0m
(11:55:51.800) SLM: [00:00:00.585,083] <ESC>[0m<inf> nrf_cloud_info: Device ID: 504d5632-3736-4649-808b-2321133ae3a6<ESC>[0m
(11:55:51.800) SLM: [00:00:00.590,332] <ESC>[0m<inf> nrf_cloud_info: IMEI:      350457793853522<ESC>[0m
(11:55:51.800) SLM: [00:00:00.682,617] <ESC>[0m<inf> nrf_cloud_info: UUID:      504d5632-3736-4649-808b-2321133ae3a6<ESC>[0m
(11:55:51.800) SLM: [00:00:00.688,354] <ESC>[0m<inf> nrf_cloud_info: Modem FW:  mfw_nrf9160_1.3.7<ESC>[0m
(11:55:51.832) SLM: [00:00:00.688,385] <ESC>[0m<inf> nrf_cloud_info: Protocol:          MQTT<ESC>[0m
(11:55:51.832) SLM: [00:00:00.688,415] <ESC>[0m<inf> nrf_cloud_info: Download protocol: HTTPS<ESC>[0m
(11:55:51.832) SLM: [00:00:00.688,446] <ESC>[0m<inf> nrf_cloud_info: Sec tag:           16842753<ESC>[0m
(11:55:51.832) SLM: [00:00:00.688,476] <ESC>[0m<inf> nrf_cloud_info: Host name:         mqtt.nrfcloud.com<ESC>[0m
(11:55:51.857) SLM: [00:00:00.688,568] <ESC>[0m<inf> slm_uart_handler: UART baud: 115200 d/p/s-bits: 3/0/1 HWFC: 0<ESC>[0m
(11:55:51.857) SLM: [00:00:00.688,720] <ESC>[0m<dbg> slm_at_host: slm_at_send_indicate:TX
(11:55:51.857) SLM:                                       52 65 61 64 79 0d 0a                          |Ready..          <ESC>[0m
(11:55:51.857) SLM: [00:00:00.688,720] <ESC>[0m<inf> slm_at_host: at_host init done<ESC>[0m[00:00:00.688,781] <ESC>[0m<dbg> slm: configure_power_pin_interrupt: Configured interrupt (0x3400000) on power pin (6).<ESC>[0m
(11:56:00.334) AT%MDMEV=1
(11:56:00.542) AT+CGDCONT=1,"IP","gigsky-02"
(11:56:00.734) AT%XSYSTEMMODE=1,1,0,0
(11:56:00.942) AT%XSIM=1
(11:56:01.134) AT+CFUN=1
.(11:56:01.342) AT+CGSN=1
(11:56:01.560) AT+CEMODE?
(11:56:01.743) +CGSN: "350457793853
(11:56:01.743) +CEMODE: 2
(11:56:01.743) AT+CMEE=1
(11:56:01.950) AT+CNEC=24
(11:56:02.142) AT+CIND=1,1,1
(11:56:02.351) %XSIM: 1
(11:56:02.351) AT+CEREG=5                         
(11:56:02.581) AT%CESQ=1
(11:56:02.750) SLM:        0d 0a 4f 4b
(11:56:02.750) AT+CSCON=1
(11:56:02.958) AT+COPS=0
(11:56:03.167) %MDMEV: SEARCH STATUS 2
(11:56:03.167) AT+CGEREP=1
(11:56:08.110) %CESQ: 39,1,10,1
(11:56:08.110) +CEREG: 2,"1812","0033751E",7
(11:56:08.110) %CESQ: 35,1,4,0
(11:56:08.398) +CSCON: 1
(11:56:10.382) +CGEV: ME PDN ACT 0,0
(11:56:10.477) +CNEC_ESM: 50,0
(11:56:10.510) +CEREG: 5,"1812","0033751E",7,,,"11100000","11100000"

The application is programmed using the basic SLM code, with a couple of config changes to output the debug to UART and disable the mx25r64 external flash in the overlay (no idea if I needed to, but I just have the NRF chip and an ATMEGA on my PCB).  The code is built in VSCode from the example application nrf9160dk_nrf9160_ns - no idea whether the NS version is correct, again I got it to compile so called it good.  The "SLM:" prefix in the logs above is to help me see what comes from the NRF and what is printed by the ATMEGA chip.

The application is running from a PSU rather than batteries - I don't think it will be a brown-out, but interested if there's a way to check?

(09:05:49.531) AT%HWVERSION
(09:05:49.738) %HWVERSION: nRF9160 SICA B1A
(09:05:49.738) AT%SHORTSWVER
(09:05:49.947) %SHORTSWVER: nrf9160_1.3.7

I don't have a Segger programmer/RTT, but could maybe justify one if it would help diagnose the issue.

Can you please suggest any likely causes and whether there are any config flags I could set to help diagnose the issue?

Thank you for any assistance.

Parents
  • Hi Hakon,

    I've not seen the reset again following the change to print out the RESETREAS register unfortunately.

    I have however seen the other issue lots - I think there's a problem in the SLM, or down to the way I'm using it.  It appears to get stuck, perhaps there's a timeout somewhere I could change?

    The general pattern is to request the temperature, then send an MQTT publish, for example;

    (02:06:45.772) AT%XTEMP?
    (02:06:45.806) CPU Temperature: 23 deg
    (02:06:45.806) OK
    (02:06:47.772) AT#XMQTTPUB="sensor/255/cpu_temp/C","23",0,0
    (02:06:47.794) OK

    Then after some time, there is no response to the AT#MQTTPUB command.  The reason for UART_RX_RDY failure appears to be that the buffer is full because I had continued to send AT commands - so I have changed the code to wait for OK/ERROR response instead.

    Now after some time waiting for OK, there is a heap overflow where the CESQ isn't output by the SLM but is stored internally;

    (02:44:17.964) SLM: [04:06:22.879,882] <ESC>[1;33m<wrn> at_monitor: No heap space for incoming notification: %CESQ: 28,1,11,1
    (02:44:28.236) SLM: [04:06:33.144,195] <ESC>[1;33m<wrn> at_monitor: No heap space for incoming notification: %CESQ: 29,1,3,0
    (02:44:38.860) SLM: [04:06:43.771,484] <ESC>[1;33m<wrn> at_monitor: No heap space for incoming notification: %CESQ: 29,1,13,1
    ...

    Then, after some hours (of nothing being output rather than the heap error), the OK/ERROR finally arrives and notifications start coming again;

    (05:56:02.744) #XMQTTEVT: 1,-113
    (05:56:02.788) ERROR
    (05:56:02.788) ERROR
    (05:56:02.788) %XTEMP: 21
    (05:56:02.788) CPU Temperature: 21 deg
    (05:56:02.788) OK
    (05:56:02.788) %CESQ: 28,1,7,1
    

    I've looked at the SLM code and can't see an obvious reason for the pause... but then I don't really understand it.  Also how can I get to see the MQTT error output via UART (eg "NET_ERR"), which might help?

    Thanks

Reply
  • Hi Hakon,

    I've not seen the reset again following the change to print out the RESETREAS register unfortunately.

    I have however seen the other issue lots - I think there's a problem in the SLM, or down to the way I'm using it.  It appears to get stuck, perhaps there's a timeout somewhere I could change?

    The general pattern is to request the temperature, then send an MQTT publish, for example;

    (02:06:45.772) AT%XTEMP?
    (02:06:45.806) CPU Temperature: 23 deg
    (02:06:45.806) OK
    (02:06:47.772) AT#XMQTTPUB="sensor/255/cpu_temp/C","23",0,0
    (02:06:47.794) OK

    Then after some time, there is no response to the AT#MQTTPUB command.  The reason for UART_RX_RDY failure appears to be that the buffer is full because I had continued to send AT commands - so I have changed the code to wait for OK/ERROR response instead.

    Now after some time waiting for OK, there is a heap overflow where the CESQ isn't output by the SLM but is stored internally;

    (02:44:17.964) SLM: [04:06:22.879,882] <ESC>[1;33m<wrn> at_monitor: No heap space for incoming notification: %CESQ: 28,1,11,1
    (02:44:28.236) SLM: [04:06:33.144,195] <ESC>[1;33m<wrn> at_monitor: No heap space for incoming notification: %CESQ: 29,1,3,0
    (02:44:38.860) SLM: [04:06:43.771,484] <ESC>[1;33m<wrn> at_monitor: No heap space for incoming notification: %CESQ: 29,1,13,1
    ...

    Then, after some hours (of nothing being output rather than the heap error), the OK/ERROR finally arrives and notifications start coming again;

    (05:56:02.744) #XMQTTEVT: 1,-113
    (05:56:02.788) ERROR
    (05:56:02.788) ERROR
    (05:56:02.788) %XTEMP: 21
    (05:56:02.788) CPU Temperature: 21 deg
    (05:56:02.788) OK
    (05:56:02.788) %CESQ: 28,1,7,1
    

    I've looked at the SLM code and can't see an obvious reason for the pause... but then I don't really understand it.  Also how can I get to see the MQTT error output via UART (eg "NET_ERR"), which might help?

    Thanks

Children
  • Can I see the code? Also, if possible can you please capture a modem trace of all relevant errors?

  • Hi Hakon,

    I think I'm understanding where the issue comes from.  I have now seen 24h running, and it looks like the original problem was caused by my misunderstanding that the SLM application can only receive one command at a time, and when there was an issue I was overloading it with repeated commands.  I changed my code to wait for OK/ERROR responses and it is a lot more stable.

    However, you recall from my last message that it sometimes takes a long time for the MQTT PUB to respond, this is still an issue.

    When the MQTT delay occurs, the rest of the SLM appears to run in the background but the messages aren't output until the command is complete (eg after 1h).  I have therefore disabled some debugging (hex output) and turned off %CESQ messages.  This has removed corrupted messages that were being received by the Arduino after some time.

    Please see the below log:

    (01:56:55.953) AT#XMQTTPUB="sensor/255/battery/V","11.66",0,0
    (01:58:55.932) Stuck cmd: AT#XMQTTPUB="sensor/255/battery/V","11.66",0,0		<----- (No response received)
    (02:39:46.635) #XMQTTEVT: 1,-113
    (02:39:46.635) MQTT Disconnected.ERROR
    (02:39:46.635) 
    (02:39:46.635) +CEREG: 5,"33EF","0031E215",9,,,"11100000","00101000"
    (02:39:46.635) Connected
    (02:39:46.665) +CEREG: 5,"1BFA","0036821F",9,,,"11100000","00101000"
    (02:39:46.665) Connected
    (02:39:46.665) +CEREG: 5,"33EF","0031E215",9,,,"11100000","00101000"
    (02:39:46.665) Connected
    (02:39:46.665) +CEREG: 5,"1BFA","0036821F",9,,,"11100000","00101000"
    (02:39:46.678) Connected
    (02:39:46.678) +CEREG: 5,"33EF","0031E215",9,,,"11100000","00101000"
    (02:39:46.678) Connected
    (02:39:46.678) +CEREG: 5,"1BFA","0036821F",9,,,"11100000","00101000"
    (02:39:46.678) Connected
    (02:39:46.678) +CEREG: 5,"33EF","0031E215",9,,,"11100000","00101000"
    (02:39:46.695) Connected
    (02:39:46.695) +CEREG: 5,"1BFA","0036821F",9,,,"11100000","00101000"
    (02:39:46.695) Connected
    (02:39:46.695) +CEREG: 5,"33EF","0031E215",9,,,"11100000","00101000"
    (02:39:46.695) Connected
    (02:39:46.695) +CEREG: 5,"1BFA","0036821F",9,,,"11100000","00101000"
    (02:39:46.710) Connected
    (02:39:46.710) +CEREG: 5,"33EF","0031E215",9,,,"11100000","00101000"
    (02:39:46.710) Connected
    (02:39:46.710) +CEREG: 5,"1812","0033751E",7,,,"11100000","00101000"
    (02:39:46.710) Connected
    (02:39:46.710) +CGEV: ME PDN DEACT 0
    (02:39:46.730) +CNEC_EMM: 9
    (02:39:46.730) +CEREG: 2,"1812","0033751E",7,0,9
    (02:39:46.730) Not Connected: 2
    (02:39:47.131) AT%XTEMP?
    (02:39:47.149) CPU Temperature: 30 deg
    (02:39:47.149) OK
    (02:39:47.643) 
    (02:39:47.643) SLM: [10:00:56.345,458] <ESC>[1;31m<err> slm_at_host: AT command failed: -113<ESC>[0m
    (02:39:47.643) SLM: [10:00:56.345,733] <ESC>[1;31m<err> slm_mqtt: ERROR: mqtt_live -128<ESC>[0m
    (02:39:47.643) SLM: [10:00:56.345,733] <ESC>[0m<inf> slm_mqtt: MQTT thread terminated<ESC>[0m
    (02:39:48.139) AT%XTEMP?
    (02:39:48.139) CPU Temperature: 30 deg
    (02:39:48.139) OK

    In it you can see that XMQTTPUB did not receive an OK/ERROR for >40 minutes.  It was sent at 01:56:55 and the ERROR was received at 02:39.46.  In that time no other commands were sent to the SLM.

    Once the SLM started responding, lots of CEREG messages come in.  I believe these were buffered by the SLM but couldn't be sent earlier because the MQTT was blocking.

    In this case the SLM application was effectively stuck for 40minutes even though there was a valid connection (CEREG: 5).  There must either be a faulty timeout in the SLM, or some kind of bug.  There is no other debug code.

    I'm unable to get any extra debug/monitoring because this application only has UART, no RTT.  I can try to add RTT but unsure of the process.  As I said before, I'm happy to buy a Segger, but unsure how to implement it.  I'm happy to add the full log, but I don't think there's much info in there that will help - there are limited SLM debug messages and it basically shows it sending PUB commands for 24h (~2000 commands) before one fails, which is in the log above.

    Is there a reason the SLM blocks for such a large time, or is there a delay that can be shorted?

    Thanks

  • Just to follow up on this, I don't think it's a network issue.  I've added code to detect the stuck command (no OK/ERROR within 2 minutes) and reboot the NRF.  The rebooted SLM application connects to the network and starts working again straight away.  If I go back to waiting for the OK/ERROR response from the MQTTPUB, it can take anywhere between 40mins and a couple of hours to resume working,

  • Would it be possible to capture a modem trace (PCAP)? That could reveal why you get the MQTT disconnect.

  • I can but the issue is I don't know how.

    I have some pins broken out to headers that I can collect data from (P0.26, P0.30-31, COEX0-2, VIO/SCLK/SDATA, MAGPIO0-2).  Could it be configured to use on any of those?  I assume then a TTL/USB converter & collect the data via a serial port and the trace app?

Related