ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0

Request

I need help figuring out why I'm getting the infamous ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0. Any ideas on what I should investigate next would be greatly appreciated. I'm currently stuck and my deadline is approaching quickly. My leading theory is that it may be a deadlock from calling bt_gatt_write_without_response() while the ATT request queue is full, but I don't know how to confirm that or rule it out.

Problem Statement

This weekend I ran a test in which I sent the same 574-byte packet every 1.5 seconds over USB CDC to the nRF52840 DK's USB device port (i.e. the one on the long side of the DK). My firmware application reset 51 times in 9 hours and 34 minutes due to a ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0. The time between resets varied randomly between 4 seconds and 92 minutes, with an average of 10.6 minutes.

Log(s)

Attached is my RTT log with the Auto Thread Analyzer set to report every 5 seconds and HEAP usage statistics.  From my analysis, it doesn't seem to be a heap or stack size issue.

4846.2024-04-20 11.14 Debugging Dongle Reset w Thread Analyzer.log

Firmware App Info

My firmware application started from the Multi-NUS application. It is a Long Range Bluetooth LE USB dongle application that needs to receive a ~600-byte packet every 1 second from the USB Host (over USB CDC) and send it to 3 concurrently connected peripherals over the Coded PHY, while asynchronously receiving a 53-byte packet from each connected peripheral every 500 ms and forwarding those over USB CDC to the USB Host.
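
For a rough sense of the load described above, the application-level data rates can be worked out with a few lines of C. This is only a sketch: rate_bytes_per_s() is a hypothetical helper, 600 is taken as the upper estimate for the "~600 byte" packet, and protocol overhead (ATT/L2CAP headers, retransmissions) is ignored.

```c
#include <stdio.h>

/*
 * Hypothetical helper: application payload rate in bytes per second for a
 * packet of `bytes` bytes exchanged every `period_ms` with `links` peers.
 */
static unsigned int rate_bytes_per_s(unsigned int bytes, unsigned int period_ms,
				     unsigned int links)
{
	return (bytes * links * 1000u) / period_ms;
}

/* Print the two directions using the figures quoted in the post. */
static void print_budget(void)
{
	unsigned int downlink = rate_bytes_per_s(600, 1000, 3); /* host -> 3 peripherals */
	unsigned int uplink = rate_bytes_per_s(53, 500, 3);     /* 3 peripherals -> host */

	printf("downlink: %u B/s (%u bit/s)\n", downlink, downlink * 8u);
	printf("uplink:   %u B/s (%u bit/s)\n", uplink, uplink * 8u);
}
```

With these inputs the downlink comes to 1800 B/s and the uplink to 318 B/s of payload. The nominal Coded PHY bit rates are 125 kbit/s (S=8) and 500 kbit/s (S=2), but the achievable application throughput is lower and depends on connection interval and packet overhead.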

Using the Multi-NUS application as my starting point, I have made the following modifications:

  1. Upgraded Multi-NUS from NCS v1.4.1 to NCS v2.5.0 with help from Wes ... Thanks Wes!
  2. Added USB CDC support using the Peripheral UART sample as a guide
  3. Changed scanning and connecting to use only the Coded PHY (i.e. it scans on the Coded PHY only, for new connections that support the NUS LE service), using the Bluetooth: Central Heart Rate Monitor with Coded PHY sample as a guide.
  4. Extended nus_client.c to support sending data to the RX characteristic of the NUS server by calling bt_gatt_write_without_response() instead of bt_gatt_write() by adding the following new function.
    int bt_nus_client_send_without_response(struct bt_nus_client *nus_c, const uint8_t *data,
    					    uint16_t len)
    {
    	int err;

    	LOG_DBG("Sending data without response");

    	/* Nothing to write to without an active connection. */
    	if (!nus_c->conn) {
    		return -ENOTCONN;
    	}

    	/* Allow only one write at a time per client instance. */
    	if (atomic_test_and_set_bit(&nus_c->state, NUS_C_RX_WRITE_PENDING)) {
    		return -EALREADY;
    	}

    	/* Record the write details in rx_write_params, which is passed to on_sent() below. */
    	nus_c->rx_write_params.func = on_sent;
    	nus_c->rx_write_params.handle = nus_c->handles.rx;
    	nus_c->rx_write_params.offset = 0;
    	nus_c->rx_write_params.data = data;
    	nus_c->rx_write_params.length = len;

    	/* Unsigned Write Command: the peer sends no ATT response. */
    	err = bt_gatt_write_without_response(nus_c->conn, nus_c->handles.rx, data, len, false);
    	if (err) {
    		LOG_ERR("Write without response failed (err %d)", err);
    	}

    	/* There is no response to wait for, so call the sent handler directly. */
    	on_sent(nus_c->conn, err, &nus_c->rx_write_params);

    	return err;
    }
  5. Increased the BT_NUS_UART_BUFFER_SIZE in the Kconfig file to 1034
  6. Modified multi_nus_send() to broadcast up to MTU size chunks for when incoming USB packets are greater than the MTU size (currently 189, but I'm still tuning).
  7. More than doubled all stack and heap sizes I could find in prj.conf file and Kconfig file
  8. Added reporting of HEAP runtime stats using sys_heap_runtime_stats_get()
  9. Enabled & configured the Thread Analyzer module (auto report interval set to minimum value ... 5 seconds)
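
The chunking described in item 6 could look roughly like the sketch below. It is not the code from the attached project: send_in_chunks(), chunk_send_fn and count_bytes() are hypothetical names, and it assumes that 189 is the negotiated ATT MTU, in which case each write without response can carry at most MTU - 3 bytes (1-byte opcode plus 2-byte attribute handle). If 189 is already the payload size, the subtraction does not apply.

```c
#include <stddef.h>
#include <stdint.h>

/* ATT Write Command header: 1-byte opcode + 2-byte attribute handle. */
#define ATT_WRITE_CMD_HDR_LEN 3u

/* Hypothetical callback type: sends one chunk, returns 0 on success. */
typedef int (*chunk_send_fn)(const uint8_t *chunk, size_t len, void *ctx);

/*
 * Split `data` into chunks that each fit one write without response at the
 * given ATT MTU and pass them to `send` in order.
 * Returns the number of chunks sent, -1 for an unusable MTU, or the first
 * non-zero value returned by `send`.
 */
static int send_in_chunks(const uint8_t *data, size_t len, uint16_t att_mtu,
			  chunk_send_fn send, void *ctx)
{
	if (att_mtu <= ATT_WRITE_CMD_HDR_LEN) {
		return -1;
	}

	size_t max_chunk = (size_t)att_mtu - ATT_WRITE_CMD_HDR_LEN;
	size_t offset = 0;
	int chunks = 0;

	while (offset < len) {
		size_t n = len - offset;

		if (n > max_chunk) {
			n = max_chunk;
		}

		int err = send(data + offset, n, ctx);

		if (err) {
			return err; /* stop at the first failed chunk */
		}

		offset += n;
		chunks++;
	}

	return chunks;
}

/* Example callback: only adds up the bytes it is handed (ctx is a size_t *). */
static int count_bytes(const uint8_t *chunk, size_t len, void *ctx)
{
	(void)chunk;
	*(size_t *)ctx += len;
	return 0;
}
```

Under that assumption, a 574-byte packet at an ATT MTU of 189 is split into four chunks: three of 186 bytes and one of 16 bytes. In the real application the callback would be the place to call the per-connection send function and to decide what to do when a chunk cannot be queued.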

Project Upload

3122.Long Range Multi-NUS Dongle Prj.zip

Parents
  • Hi, 

    From the log, it is your logging thread that is causing the fault.

    In general, you can use arm-none-eabi-addr2line -e build-folder/zephyr/zephyr.elf 0xFAULTINGADDR to resolve the address to a source file and line number. In this case, I suspect the logging thread is running out of memory.

    Could you try adjusting the logging thread stack size via the config option CONFIG_LOG_PROCESS_THREAD_STACK_SIZE? Check its current value in the .config file under build/zephyr, then increase it and see whether that removes the error.

    Regards,
    Amanda H.

  • Hi Amanda,

    Good catch!  I can't believe I glossed over that. 

    CONFIG_LOG_PROCESS_THREAD_STACK_SIZE was set to 2048.  So, I just doubled it to 4096.  The test has been running for 15 minutes now and so far so good!

    Would you happen to have any theories why the Thread Analyzer module never reported more than 28% stack usage for the logging thread in my attached log?  At some point in the near future I will need to tune all the stack sizes prior to production.  I'm wondering if I have it configured correctly.  Here's how I have it configured in my prj.conf file:

    # Enable & Configure the Thread Analyzer
    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_ANALYZER_AUTO=y
    CONFIG_THREAD_NAME=y
    CONFIG_THREAD_ANALYZER_USE_LOG=y
    CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5
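
As background for the stack usage question, stack usage figures of this kind are commonly derived from a fill pattern: the stack is pre-filled with a known byte, and the bytes that still hold that value are counted later. The sketch below illustrates the general technique in plain C. It is not Zephyr's implementation; the function names and the 0xAA fill byte are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sentinel value for this sketch. */
#define STACK_FILL_BYTE 0xAAu

/* Pre-fill the whole stack buffer with the sentinel before the thread runs. */
static void stack_fill(uint8_t *stack, size_t size)
{
	memset(stack, STACK_FILL_BYTE, size);
}

/*
 * Count the sentinel bytes still intact, starting from the lowest address.
 * On Arm Cortex-M the stack grows downward, so the untouched region sits at
 * the low end of the buffer.
 */
static size_t stack_unused(const uint8_t *stack, size_t size)
{
	size_t unused = 0;

	while (unused < size && stack[unused] == STACK_FILL_BYTE) {
		unused++;
	}

	return unused;
}

/* High-water mark as a percentage of the stack size (rounded down). */
static unsigned int stack_used_percent(const uint8_t *stack, size_t size)
{
	return (unsigned int)(((size - stack_unused(stack, size)) * 100u) / size);
}
```

Two properties of this kind of measurement are worth keeping in mind when reading the reports. It only sees bytes that were actually written, so a region that was reserved but never written still counts as unused. It is also a snapshot taken at the report interval, so usage that peaks right before a fault may never appear in a periodic report.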
