(2.6.1 update) lte_lc_connect_async crashes [Illegal use of the EPSR]

Migrating to 2.6.1 we have everything working and the project builds. But as soon as lte_lc_connect_async is called the app now crashes. The Faulting instruction address seems like it might be datetime related? I have tried doubling the main thread stack size, the modem thread stack size, and the system queue stack size but nothing seems to have changed it. If I comment out the LTE connection the app runs normally (but doesn't connect to LTE).

[00:00:31.390,899] <wrn> modem: Functional mode changed to 1
[00:00:31.391,693] <inf> app_event_manager: MODEM_EVT_LTE_CONNECTING
[00:00:31.392,517] <wrn> modem: -><- LTE CONNECTING....
[00:00:31.434,234] <err> os: ***** USAGE FAULT *****
[00:00:31.434,234] <err> os:   Illegal use of the EPSR
[00:00:31.434,265] <err> os: r0/a1:  0x200213a8  r1/a2:  0x200139c0  r2/a3:  0x200139c0
[00:00:31.434,295] <err> os: r3/a4:  0x0002b800 r12/ip:  0x0ccccccc r14/lr:  0x00029edb
[00:00:31.434,326] <err> os:  xpsr:  0x60000000
[00:00:31.434,356] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x20021398  s[ 2]:  0x20021398  s[ 3]:  0x00029a11
[00:00:31.434,356] <err> os: s[ 4]:  0x2002136f  s[ 5]:  0x7959db2c  s[ 6]:  0x008739b0  s[ 7]:  0x0000f750
[00:00:31.434,387] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:31.434,417] <err> os: s[12]:  0xffffffff  s[13]:  0xffffffff  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:31.434,417] <err> os: fpscr:  0x00000000
[00:00:31.434,417] <err> os: Faulting instruction address (r15/pc): 0x0002b800
[00:00:31.434,478] <err> os: >>> ZEPHYR FATAL ERROR 35: Unknown error on CPU 0
[00:00:31.434,509] <err> os: Current thread: 0x200139c0 (sysworkq)
[00:00:32.131,866] <err> fatal_error: Resetting system

Zephyr Map

.text.date_time_core_notify_event
                0x000000000002b868       0x1c modules/nrf/lib/date_time/lib..__nrf__lib__date_time.a(date_time_core.c.obj)
 .text.date_time_lte_ind_handler
                0x000000000002b884       0x38 modules/nrf/lib/date_time/lib..__nrf__lib__date_time.a(date_time_core.c.obj)
                0x000000000002b884                date_time_lte_ind_handler
 .text.date_time_core_schedule_update
                0x000000000002b8bc       0x54 modules/nrf/lib/date_time/lib..__nrf__lib__date_time.a(date_time_core.c.obj)

Parents
  • Hi Colin,

    May I know which NCS version your previous codes are based on? 

    Have you also updated to the latest modem firmware(MFW1.3.6 for nRF9160 at current time)?

    Best regards,

    Charlie

  • Hi Charlie,

    I did a bunch of testing today. We are using MFW1.3.4 but it doesn't seem to be related to this. I was able to load the lte_ble_gateway sample on the same device (built with 2.6.1 and it works fine) + some other info below.

    I was able to trace the error to this line:

    lte_lc.c (line 546)
    SDK v2.6.1

    /* +CSCON notifications */
    err = nrf_modem_at_printf(cscon);
    if (err) {
    	LOG_WRN("Failed to enable RRC notifications (+CSCON), error %d", err);
    	return -EFAULT;
    }

    SDK v2.5.1

    /* +CSCON notifications */
    err = nrf_modem_at_printf(cscon);
    if (err) {
    	char buf[50];
    
    	/* AT+CSCON is supported from modem firmware v1.1.0, and will
    	 * not work for older versions. If the command fails, RRC
    	 * mode change notifications will not be received. This is not
    	 * considered a critical error, and the error code is therefore
    	 * not returned, while informative log messages are printed.
    	 */
    	LOG_WRN("AT+CSCON failed (%d), RRC notifications are not enabled", err);
    	LOG_WRN("AT+CSCON is supported in nRF9160 modem >= v1.1.0");
    
    	err = nrf_modem_at_cmd(buf, sizeof(buf), "AT+CGMR");
    	if (err == 0) {
    		char *end = strstr(buf, "\r\nOK");
    
    		if (end) {
    			*end = '\0';
    		}
    
    		LOG_WRN("Current modem firmware version: %s", buf);
    	}
    }
    
    return 0;

    If I comment that command it seems to connect and run stable. But there is still some strange behaviour I can't quite figure out. It almost seems like the LOG_WRN calls are leading to the crash? But as far as I know it is not returning an error so those shouldn't even be called.

    If I leave just err = nrf_modem_at_printf(cscon); and don't check the error it seems to run stable...

    Very confusing but hopefully on the right track.

    I also tried sending AT+CSCON=1 command manually and it returned OK and didn't crash the app.

  • Hi Colin,

    Good to see so much useful information.

    CRD said:
    I was able to load the lte_ble_gateway sample on the same device (built with 2.6.1 and it works fine) + some other info below.

    Do you mean the same issue works fine with NCSv2.6.1 lte_ble_gateway sample on your custom device?

    The codes you shared are related to your custom codes?

    CRD said:
    If I leave just err = nrf_modem_at_printf(cscon); and don't check the error it seems to run stable...

    This also sounds very weird to me. Could you repeat this issue with your codes on a nRF9160DK? Just share a minimal part of your codes that can repeat this issue if you can.

    Best regards,

    Charlie

  • Okay I have found out more. (my app is based off the aws asset tracked v2 project btw).

    When I was connecting previously it was after a delay. So:

    • main.c was calling nrf_modem_lib_init() on bootup
    • After 30 seconds the modem module thread would call lte_lc_connect_async
    • This crashes immediately after changing the CFUN to 1

    During my testing I modified the code (I tried running them from the same thread, from main etc etc..) but at some point I ran the lte_lc_connect_async() at bootup time immediately after initializing the modem. This connects to the network..... but then crashes after about 30 seconds the same way.

    I have tried recreating this issue in the lte_ble_gateway sample, and even copied all my modem module code into it but strangely it works without any issues in that case... This sample isn't running multiple module threads so maybe that is part of the issue?

    I added the AT_HOST_LIBRARY=y so I could log out the AT commands that are being sent during connection and before the crash through the RTT logs:

    LTE CONNECTING....
    %CESQ: 34,1,0,0
    modem_rsrp_handler: Incoming RSRP status message, RSRP value is -106
    +CEREG: 0
    lte_evt_handler: LTE cell changed: Cell ID: -1, Tracking area: -1
    +CEREG: 2,"<hidden>","<hidden>",7
    lte_evt_handler: PSM parameter update: TAU: 2, Active time: 0
    lte_evt_handler: LTE cell changed: Cell ID: <hidden>, Tracking area: <hidden>
    +CSCON: 1
    lte_evt_handler: RRC mode: Connected
    %CESQ: 41,2,4,0
    modem_rsrp_handler: Incoming RSRP status message, RSRP value is -99
    +CGEV: ME PDN ACT 0,0
    modem: PDN LTE CONNECTED
    +CNEC_ESM: 50,0
    pdn_event_handler: Event: PDP context 0, PDN type IPv4 only allowed
    %MDMEV: SEARCH STATUS 2
    +CEREG: 5,"<hidden>","<hidden>",7,,,"00001010","11000001"
    lte_evt_handler: Network registration status: Connected - roaming
    lte_evt_handler: PSM parameter update: TAU: 1152000, Active time: 20
    %XTIME: ,"42504191339469","01"
    
    (About 10 seconds pass)
    
    %CESQ: 255,0,255,0
    +CSCON: 0
    [00:00:27.201,293] <err> os: ***** USAGE FAULT *****
    [00:00:27.201,324] <err> os:   Illegal use of the EPSR
    [00:00:27.201,354] <err> os: r0/a1:  0x20020ff8  r1/a2:  0x20013aa8  r2/a3:  0x20013aa8
    [00:00:27.201,354] <err> os: r3/a4:  0x0002b000 r12/ip:  0x0ccccccc r14/lr:  0x00029c07
    [00:00:27.201,385] <err> os:  xpsr:  0x60000000
    [00:00:27.201,385] <err> os: s[ 0]:  0x00000003  s[ 1]:  0x2001cb08  s[ 2]:  0x2001cb04  s[ 3]:  0x00029317
    [00:00:27.201,416] <err> os: s[ 4]:  0x00000003  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
    [00:00:27.201,446] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
    [00:00:27.201,477] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
    [00:00:27.201,477] <err> os: fpscr:  0x00000000
    [00:00:27.201,507] <err> os: Faulting instruction address (r15/pc): 0x0002b000
    [00:00:27.201,538] <err> os: >>> ZEPHYR FATAL ERROR 35: Unknown error on CPU 0
    [00:00:27.201,599] <err> os: Current thread: 0x20013aa8 (sysworkq)
    [00:00:27.733,703] <err> fatal_error: Resetting system

Reply
  • Okay I have found out more. (my app is based off the aws asset tracked v2 project btw).

    When I was connecting previously it was after a delay. So:

    • main.c was calling nrf_modem_lib_init() on bootup
    • After 30 seconds the modem module thread would call lte_lc_connect_async
    • This crashes immediately after changing the CFUN to 1

    During my testing I modified the code (I tried running them from the same thread, from main etc etc..) but at some point I ran the lte_lc_connect_async() at bootup time immediately after initializing the modem. This connects to the network..... but then crashes after about 30 seconds the same way.

    I have tried recreating this issue in the lte_ble_gateway sample, and even copied all my modem module code into it but strangely it works without any issues in that case... This sample isn't running multiple module threads so maybe that is part of the issue?

    I added the AT_HOST_LIBRARY=y so I could log out the AT commands that are being sent during connection and before the crash through the RTT logs:

    LTE CONNECTING....
    %CESQ: 34,1,0,0
    modem_rsrp_handler: Incoming RSRP status message, RSRP value is -106
    +CEREG: 0
    lte_evt_handler: LTE cell changed: Cell ID: -1, Tracking area: -1
    +CEREG: 2,"<hidden>","<hidden>",7
    lte_evt_handler: PSM parameter update: TAU: 2, Active time: 0
    lte_evt_handler: LTE cell changed: Cell ID: <hidden>, Tracking area: <hidden>
    +CSCON: 1
    lte_evt_handler: RRC mode: Connected
    %CESQ: 41,2,4,0
    modem_rsrp_handler: Incoming RSRP status message, RSRP value is -99
    +CGEV: ME PDN ACT 0,0
    modem: PDN LTE CONNECTED
    +CNEC_ESM: 50,0
    pdn_event_handler: Event: PDP context 0, PDN type IPv4 only allowed
    %MDMEV: SEARCH STATUS 2
    +CEREG: 5,"<hidden>","<hidden>",7,,,"00001010","11000001"
    lte_evt_handler: Network registration status: Connected - roaming
    lte_evt_handler: PSM parameter update: TAU: 1152000, Active time: 20
    %XTIME: ,"42504191339469","01"
    
    (About 10 seconds pass)
    
    %CESQ: 255,0,255,0
    +CSCON: 0
    [00:00:27.201,293] <err> os: ***** USAGE FAULT *****
    [00:00:27.201,324] <err> os:   Illegal use of the EPSR
    [00:00:27.201,354] <err> os: r0/a1:  0x20020ff8  r1/a2:  0x20013aa8  r2/a3:  0x20013aa8
    [00:00:27.201,354] <err> os: r3/a4:  0x0002b000 r12/ip:  0x0ccccccc r14/lr:  0x00029c07
    [00:00:27.201,385] <err> os:  xpsr:  0x60000000
    [00:00:27.201,385] <err> os: s[ 0]:  0x00000003  s[ 1]:  0x2001cb08  s[ 2]:  0x2001cb04  s[ 3]:  0x00029317
    [00:00:27.201,416] <err> os: s[ 4]:  0x00000003  s[ 5]:  0x00000000  s[ 6]:  0x00000000  s[ 7]:  0x00000000
    [00:00:27.201,446] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
    [00:00:27.201,477] <err> os: s[12]:  0x00000000  s[13]:  0x00000000  s[14]:  0x00000000  s[15]:  0x00000000
    [00:00:27.201,477] <err> os: fpscr:  0x00000000
    [00:00:27.201,507] <err> os: Faulting instruction address (r15/pc): 0x0002b000
    [00:00:27.201,538] <err> os: >>> ZEPHYR FATAL ERROR 35: Unknown error on CPU 0
    [00:00:27.201,599] <err> os: Current thread: 0x20013aa8 (sysworkq)
    [00:00:27.733,703] <err> fatal_error: Resetting system

Children
  • Hi Colin,

    Ok, have you tried to run lte connection actions in a standalone thread? but I think it is not necessary. 

    Our Cellular: UDP — nRF Connect SDK 2.6.1 documentation (nordicsemi.com) sample shows simple usage of lte_lc_connect_async and it is not multi-threads application, could you read through its source codes to get some hint?

    Best regards,

    Charlie

  • I finally made some big progress. I used the amr-zephyr-eabi-addr2line.exe program to work back the fault. I finally got this:

    arm-zephyr-eabi-addr2line.exe -e ./build/debug/zephyr/zephyr.elf -a 0x00025ab7
    0x00025ab7
    C:/ncs/v2.6.1/nrf/lib/lte_link_control/lte_lc_helpers.c:124 (discriminator 343)

    If I comment out the SYS_SLIST_FOR_EACH_CONTAINER_SAFE macro (see below) in the function "event_handler_list_dispatch" the app now connects to lte and works normally. As far as I can tell this seems to be unchanged from SDK 2.5.1 so I don't see how its now causing a problem... something deeper in fails with slist peek_next().

    void event_handler_list_dispatch(const struct lte_lc_evt *const evt)
    {
    	struct event_handler *curr, *tmp;
    
    	if (event_handler_list_is_empty()) {
    		return;
    	}
    
    	k_mutex_lock(&list_mtx, K_FOREVER);
    
    	/* Dispatch events to all registered handlers */
    	LOG_DBG("Dispatching event: type=%d", evt->type);
    	// SYS_SLIST_FOR_EACH_CONTAINER_SAFE(&handler_list, curr, tmp, node) {
    	// 	LOG_DBG(" - handler=0x%08X", (uint32_t)curr->handler);
    	// 	curr->handler(evt);
    	// }
    	LOG_DBG("Done");
    
    	k_mutex_unlock(&list_mtx);
    }

  • To add to this: with those lines commented out I get good LTE connections now but I no longer get the datetime updated after connecting to LTE which was one of my original hunches as a potential cause of the fault messages. So some event sent by that macro for datetime updates is leading to the kernel crash.

Related