(2.6.1 update) lte_lc_connect_async crashes [Illegal use of the EPSR]

After migrating to 2.6.1, everything works and the project builds. But as soon as lte_lc_connect_async is called, the app crashes. The faulting instruction address looks like it might be date_time related. I have tried doubling the main thread stack size, the modem thread stack size, and the system queue stack size, but none of that changed anything. If I comment out the LTE connection, the app runs normally (but doesn't connect to LTE).

[00:00:31.390,899] <wrn> modem: Functional mode changed to 1
[00:00:31.391,693] <inf> app_event_manager: MODEM_EVT_LTE_CONNECTING
[00:00:31.392,517] <wrn> modem: -><- LTE CONNECTING....
[00:00:31.434,234] <err> os: ***** USAGE FAULT *****
[00:00:31.434,234] <err> os:   Illegal use of the EPSR
[00:00:31.434,265] <err> os: r0/a1:  0x200213a8  r1/a2:  0x200139c0  r2/a3:  0x200139c0
[00:00:31.434,295] <err> os: r3/a4:  0x0002b800 r12/ip:  0x0ccccccc r14/lr:  0x00029edb
[00:00:31.434,326] <err> os:  xpsr:  0x60000000
[00:00:31.434,356] <err> os: s[ 0]:  0x00000000  s[ 1]:  0x20021398  s[ 2]:  0x20021398  s[ 3]:  0x00029a11
[00:00:31.434,356] <err> os: s[ 4]:  0x2002136f  s[ 5]:  0x7959db2c  s[ 6]:  0x008739b0  s[ 7]:  0x0000f750
[00:00:31.434,387] <err> os: s[ 8]:  0x00000000  s[ 9]:  0x00000000  s[10]:  0x00000000  s[11]:  0x00000000
[00:00:31.434,417] <err> os: s[12]:  0xffffffff  s[13]:  0xffffffff  s[14]:  0x00000000  s[15]:  0x00000000
[00:00:31.434,417] <err> os: fpscr:  0x00000000
[00:00:31.434,417] <err> os: Faulting instruction address (r15/pc): 0x0002b800
[00:00:31.434,478] <err> os: >>> ZEPHYR FATAL ERROR 35: Unknown error on CPU 0
[00:00:31.434,509] <err> os: Current thread: 0x200139c0 (sysworkq)
[00:00:32.131,866] <err> fatal_error: Resetting system
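
A note for anyone reading a similar dump, offered as a sketch rather than a diagnosis. On Cortex-M, Zephyr prints "Illegal use of the EPSR" for the UsageFault INVSTATE flag, which is typically raised when the core tries to execute with the Thumb (T) bit of the EPSR clear. In the log above the stacked xpsr is 0x60000000 (bit 24, the T bit, is clear) and pc equals r3 (0x0002b800, an even address), which would be consistent with an indirect call through a pointer whose low bit is not set, for example a corrupted or stale function pointer. The helper names below are hypothetical and exist only to spell out that check; they are not part of Zephyr or the nRF Connect SDK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 24 of xPSR is the EPSR T (Thumb state) bit on Cortex-M. */
#define XPSR_T_BIT (UINT32_C(1) << 24)

/* True when the stacked xPSR says the core was in Thumb state. */
static bool xpsr_thumb_state(uint32_t xpsr)
{
	return (xpsr & XPSR_T_BIT) != 0;
}

/* True when a value is usable as a BX/BLX target on a Thumb-only core:
 * bit 0 of the target must be set. */
static bool is_thumb_branch_target(uint32_t addr)
{
	return (addr & 1u) != 0;
}

/* Hypothetical helper: flags the "branched through a bad pointer" pattern,
 * i.e. T bit clear, and pc equal to a register holding an even address. */
static bool looks_like_bad_indirect_call(uint32_t xpsr, uint32_t pc,
					 uint32_t branch_reg)
{
	return !xpsr_thumb_state(xpsr) && pc == branch_reg &&
	       !is_thumb_branch_target(branch_reg);
}
```

With the values from the log, `looks_like_bad_indirect_call(0x60000000, 0x0002b800, 0x0002b800)` is true, while lr (0x00029edb) is odd and so is a normal Thumb return address. That points at the value in r3 and whoever stored it, not at the code located at 0x0002b800.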

Zephyr Map

.text.date_time_core_notify_event
                0x000000000002b868       0x1c modules/nrf/lib/date_time/lib..__nrf__lib__date_time.a(date_time_core.c.obj)
 .text.date_time_lte_ind_handler
                0x000000000002b884       0x38 modules/nrf/lib/date_time/lib..__nrf__lib__date_time.a(date_time_core.c.obj)
                0x000000000002b884                date_time_lte_ind_handler
 .text.date_time_core_schedule_update
                0x000000000002b8bc       0x54 modules/nrf/lib/date_time/lib..__nrf__lib__date_time.a(date_time_core.c.obj)
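
To check whether the faulting address really lands in one of these date_time functions, the start/size pairs from the excerpt can be compared against it. The sketch below hard-codes only the three entries shown above; the struct and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One map-file entry: start address, size in bytes, symbol name. */
struct map_symbol {
	uint32_t start;
	uint32_t size;
	const char *name;
};

/* Entries copied from the map excerpt above (not the full map). */
static const struct map_symbol date_time_symbols[] = {
	{ 0x0002b868u, 0x1cu, "date_time_core_notify_event" },
	{ 0x0002b884u, 0x38u, "date_time_lte_ind_handler" },
	{ 0x0002b8bcu, 0x54u, "date_time_core_schedule_update" },
};

/* Returns the name of the symbol whose [start, start + size) range
 * contains addr, or NULL if none of the listed symbols does. */
static const char *symbol_for_address(uint32_t addr)
{
	size_t count = sizeof(date_time_symbols) / sizeof(date_time_symbols[0]);

	for (size_t i = 0; i < count; i++) {
		const struct map_symbol *sym = &date_time_symbols[i];

		if (addr >= sym->start && addr - sym->start < sym->size) {
			return sym->name;
		}
	}
	return NULL;
}
```

For the faulting pc, `symbol_for_address(0x0002b800)` returns NULL: the address sits 0x68 bytes before date_time_core_notify_event, so it belongs to whichever symbol precedes these three in the full map, which is not shown here.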

  • I finally made some big progress. I used arm-zephyr-eabi-addr2line.exe to trace the fault back and got this:

    arm-zephyr-eabi-addr2line.exe -e ./build/debug/zephyr/zephyr.elf -a 0x00025ab7
    0x00025ab7
    C:/ncs/v2.6.1/nrf/lib/lte_link_control/lte_lc_helpers.c:124 (discriminator 343)

    If I comment out the SYS_SLIST_FOR_EACH_CONTAINER_SAFE macro (see below) in the function "event_handler_list_dispatch", the app connects to LTE and works normally. As far as I can tell this code is unchanged from SDK 2.5.1, so I don't see how it's now causing a problem... something deeper inside fails with slist peek_next().

    void event_handler_list_dispatch(const struct lte_lc_evt *const evt)
    {
    	struct event_handler *curr, *tmp;
    
    	if (event_handler_list_is_empty()) {
    		return;
    	}
    
    	k_mutex_lock(&list_mtx, K_FOREVER);
    
    	/* Dispatch events to all registered handlers */
    	LOG_DBG("Dispatching event: type=%d", evt->type);
    	// SYS_SLIST_FOR_EACH_CONTAINER_SAFE(&handler_list, curr, tmp, node) {
    	// 	LOG_DBG(" - handler=0x%08X", (uint32_t)curr->handler);
    	// 	curr->handler(evt);
    	// }
    	LOG_DBG("Done");
    
    	k_mutex_unlock(&list_mtx);
    }

  • To add to this: with those lines commented out I now get good LTE connections, but the datetime is no longer updated after connecting to LTE. Datetime was one of my original hunches as a potential cause of the fault messages, so it looks like some event dispatched by that macro for datetime updates is leading to the kernel crash.

  • Hi Colin,

    OK, it seems the cause is in the nRF library, but I am not comfortable with you modifying the library code.

    Have you tried porting the UDP sample to your custom board? Or can you test a minimal application that reproduces this issue on an nRF9160 DK?

    These tests will help identify whether the hardware makes the difference.

    Best regards,

    Charlie

  • I have finally resolved this but I don't fully understand what was wrong. I will put my best guess here in case anyone else finds this and has a similar issue.

    My device also runs BLE. When the device boots up it reads the stored BLE name from persistent memory and triggers an event to let the BLE module know it can start advertising. However, the function in my ble module that receives this event uses a k_work_schedule with a 3 second delay.

    Even though BLE starts well before the LTE connection attempt (about 30 seconds earlier) and this delayable work item should be long gone, it was somehow leading to the EPSR issue. If I remove the k_work_schedule for starting BLE, everything works normally. I have now reworked the code to make sure everything is initialized instead of relying on the arbitrary 3 second delay, which allowed me to remove the k_work_schedule and resolved the issue.

    I find it very strange that this code would have impacted anything (especially an unrelated call happening 30 seconds later on a different thread). It's almost like there was a memory leak or corruption from the k_work_schedule. Very confusing.

  • Hi Colin,

    Thanks for the update. Yes, it is very strange.

    I just wonder: are you doing something time-consuming or even blocking in your work handler function? Have you used logging to check when it actually finishes?

    Best regards,

    Charlie 
