NCS 3.2.1 - crashes with assert in "nrf_modem_os_timedwait() - k_sem_take"

nRF9160-DK, mfw 1.3.7, NCS 3.2.1

A UDP client (CoAP) exchanges data with "sendto(... MSG_DONTWAIT)", "poll(...)" and "recvfrom(...)". It was working with several NCS versions, including 2.6.4, 2.9.2 and 3.1.2.
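
For reference, the exchange follows roughly this pattern (a minimal sketch, assuming POSIX socket names are enabled; socket setup, the peer address, and most error handling are omitted, and the names are illustrative rather than the actual code from dtls_client.c):

#include <errno.h>
#include <zephyr/net/socket.h>

static int exchange(int sock, const struct sockaddr *peer, socklen_t peer_len,
                    const uint8_t *req, size_t req_len)
{
    uint8_t resp[512];
    struct pollfd fds = { .fd = sock, .events = POLLIN };

    /* non-blocking send of the CoAP request */
    if (sendto(sock, req, req_len, MSG_DONTWAIT, peer, peer_len) < 0) {
        return -errno;
    }

    /* wait up to 10 s for the response */
    if (poll(&fds, 1, 10000) <= 0) {
        return -ETIMEDOUT;
    }

    return recvfrom(sock, resp, sizeof(resp), 0, NULL, NULL);
}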

After migrating to NCS 3.2.1, however, it crashes with an assert on the second call of sendto(). I have already enlarged the main stack, but without success.

The debugger shows the following call stack:

assert_post_action(const char * file, unsigned int line) (/home/achim/ncs/v3.2.1/zephyr/lib/os/assert.c:43)
z_swap_irqlock(unsigned int key) (/home/achim/ncs/v3.2.1/zephyr/kernel/include/kswap.h:210)
z_swap_irqlock(unsigned int key) (/home/achim/ncs/v3.2.1/zephyr/kernel/include/kswap.h:202)
z_impl_k_sem_take(struct k_sem * sem, k_timeout_t timeout) (/home/achim/ncs/v3.2.1/zephyr/kernel/sem.c:158)
k_sem_take(struct k_sem * sem) (/home/achim/ncs/v3.2.1/coaps-client/build_nrf9160dk_nrf9160_ns/coaps-client/zephyr/include/generated/zephyr/syscalls/kernel.h:1158)
nrf_modem_os_timedwait(uint32_t context, int32_t * timeout) (/home/achim/ncs/v3.2.1/nrf/lib/nrf_modem_lib/nrf_modem_os.c:201)
_req_forward (Unknown Source:0)
nrf_sendto (Unknown Source:0)
nrf9x_socket_offload_sendto(void * obj, const void * buf, size_t len, int flags, const struct sockaddr * to, socklen_t tolen) (/home/achim/ncs/v3.2.1/nrf/lib/nrf_modem_lib/nrf9x_sockets.c:702)
z_impl_zsock_sendto(int sock, const void * buf, size_t len, int flags, const struct sockaddr * dest_addr, socklen_t addrlen) (/home/achim/ncs/v3.2.1/zephyr/subsys/net/lib/sockets/sockets.c:342)
zsock_sendto(socklen_t addrlen, const struct sockaddr * dest_addr, int flags, size_t len, const void * buf, int sock) (/home/achim/ncs/v3.2.1/coaps-client/build_nrf9160dk_nrf9160_ns/coaps-client/zephyr/include/generated/zephyr/syscalls/socket.h:260)
send_to_peer(dtls_app_data_t * app, const uint8_t * data, size_t len) (/home/achim/ncs/v3.2.1/coaps-client/src/dtls_client.c:1194)
recvfrom_peer(dtls_app_data_t * app) (/home/achim/ncs/v3.2.1/coaps-client/src/dtls_client.c:1381)
dtls_loop(dtls_app_data_t * app, int reboot) (/home/achim/ncs/v3.2.1/coaps-client/src/dtls_client.c:2376)
main() (/home/achim/ncs/v3.2.1/coaps-client/src/dtls_client.c:3110) 

The assert that fires is:

__ASSERT(arch_irq_unlocked(key) ||
         _current->base.thread_state & (_THREAD_DUMMY | _THREAD_DEAD),
         "Context switching while holding lock!");

I don't see how the application code could cause that assert.

Any ideas?

  • Hello,

    The spin lock validation was added after SDK v3.1.x by this commit: https://github.com/nrfconnect/sdk-zephyr/commit/5183fc5693832b6aa1fe1cab06e628681300d0fa. Does your application implement any custom drivers or declare any interrupt handlers directly?
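
    By "directly" I mean registering an ISR yourself instead of going through a driver API, e.g. something like this (a hypothetical example; the IRQ number and priority are made up):

    #include <zephyr/irq.h>

    #define MY_IRQn     42 /* hypothetical interrupt line */
    #define MY_IRQ_PRIO 3

    static void my_isr(const void *arg)
    {
        (void)arg;
        /* ... */
    }

    void install_isr(void)
    {
        IRQ_CONNECT(MY_IRQn, MY_IRQ_PRIO, my_isr, NULL, 0);
        irq_enable(MY_IRQn);
    }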

    Best regards,

    Vidar

  • Does your application implement any custom driver

    No.

    declare any interrupt handlers directly?

    I use a GPIO interrupt registered with "gpio_init_callback"; I'm not sure if that counts as "directly".
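
    For reference, the registration looks roughly like this (a sketch; "button" stands in for my actual struct gpio_dt_spec):

    #include <zephyr/drivers/gpio.h>

    static struct gpio_callback button_cb_data;

    /* runs in interrupt context; heavy work is deferred to a work queue */
    static void button_isr(const struct device *port, struct gpio_callback *cb,
                           gpio_port_pins_t pins)
    {
        /* ... */
    }

    /* registration, e.g. in main(): */
    gpio_init_callback(&button_cb_data, button_isr, BIT(button.pin));
    gpio_add_callback(button.port, &button_cb_data);
    gpio_pin_interrupt_configure_dt(&button, GPIO_INT_EDGE_TO_ACTIVE);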

  • Could you share your application's generated configuration? You will find it in your build folder under the following path: build/<application name>/zephyr/.config

  • I already found it. I zipped it and added a "z" in front, because uploading ".config" was refused.

  • Thanks. I did not find any potentially problematic settings in the config. Are you by any chance calling irq_lock()/arch_irq_lock()/__set_BASEPRI() in your code? Please also try profiling the stack usage in your threads by adding the following to your project configuration:

    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_ANALYZER_AUTO=y
    CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5
    CONFIG_THREAD_NAME=y

    There should be at least 128 bytes of headroom on each stack. You may also enable CONFIG_MPU_STACK_GUARD=y in addition to the built-in stack guard, e.g.:
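
    # MPU-based guard, in addition to the built-in stack-limit guard
    CONFIG_MPU_STACK_GUARD=y
    CONFIG_BUILTIN_STACK_GUARD=y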

  • I've added the configs above. Without

    CONFIG_SPIN_VALIDATE=n

    I only see one analyzer output, and it crashes 2 s after that.

    With CONFIG_SPIN_VALIDATE=n it runs; the output below was taken after about 30 s and a couple of sent messages:

    Thread analyze:
     trace_thread        : STACK: unused 528 usage 240 / 768 (31 %); CPU: 1 %
                         : Total CPU cycles used: 12068
     thread_analyzer     : STACK: unused 480 usage 544 / 1024 (53 %); CPU: 0 %
                         : Total CPU cycles used: 921
     shutdown            : STACK: unused 1856 usage 192 / 2048 (9 %); CPU: 0 %
                         : Total CPU cycles used: 0
     cmd_workq           : STACK: unused 640 usage 1408 / 2048 (68 %); CPU: 0 %
                         : Total CPU cycles used: 190
     io_workq            : STACK: unused 924 usage 1124 / 2048 (54 %); CPU: 0 %
                         : Total CPU cycles used: 74
     uart_workq          : STACK: unused 968 usage 184 / 1152 (15 %); CPU: 0 %
                         : Total CPU cycles used: 5
     sh_cmd_workq        : STACK: unused 1864 usage 184 / 2048 (8 %); CPU: 0 %
                         : Total CPU cycles used: 1
     lte_lc_work_q       : STACK: unused 840 usage 184 / 1024 (17 %); CPU: 0 %
                         : Total CPU cycles used: 0
     sysworkq            : STACK: unused 1112 usage 936 / 2048 (45 %); CPU: 0 %
                         : Total CPU cycles used: 134
     logging             : STACK: unused 1632 usage 416 / 2048 (20 %); CPU: 0 %
                         : Total CPU cycles used: 3886
     idle                : STACK: unused 336 usage 48 / 384 (12 %); CPU: 96 %
                         : Total CPU cycles used: 1102218
     main                : STACK: unused 5080 usage 3112 / 8192 (37 %); CPU: 1 %
                         : Total CPU cycles used: 21338
     ISR0                : STACK: unused 1592 usage 456 / 2048 (22 %)

  • I also tested with modem trace disabled, but it crashes as well.

  • You appear to have plenty of headroom in your stacks, which is good. However, it is interesting that the assertion is this consistent and happens when calling sendto(). I have searched through the code, but I don't see any locks being acquired that could explain what you are seeing. Are you able to debug and check what the exact key value is when the assertion is raised? This might be easiest by temporarily patching the kswap code to trap the CPU when the key value is not 0.

    E.g.,

    if (key != 0) {
        __ASM("nop"); // <-- place breakpoint here
        while (1);
    }

  • Are you able to debug and check what the exact key value is when the assertion is raised?

    I will try, but if I remember correctly, it's "optimized out" ...

    ...

    The "key" is 32.

    And _current->base.thread_state is 2, which is defined as:

    /* Thread is waiting on an object */
    #define _THREAD_PENDING (BIT(1))
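
    If I read the Zephyr ARM code correctly, the key is the BASEPRI value that was saved when the lock was taken, and arch_irq_unlocked() simply compares it against 0, so key == 32 (0x20) means interrupts were still masked via irq_lock() (or a direct BASEPRI write) when z_swap_irqlock() captured it:

    /* paraphrased from Zephyr's ARM asm_inline_gcc.h */
    static ALWAYS_INLINE bool arch_irq_unlocked(unsigned int key)
    {
        /* This convention works for both PRIMASK and BASEPRI */
        return key == 0U;
    }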

  • Thanks. I think we can be fairly certain now that this is not caused by memory corruption. However, as mentioned earlier, I cannot find anything in our source code that would explain why the lock is already held when this thread is suspended. Are you debugging in VS Code with debug optimizations enabled? If so, could you post a picture of the call stacks?

    EDIT: see the answer from Bjarki. I did not include the coaps client library when looking for irq locks earlier.
