Kernel panic on nRF9151 when connecting to AWS

Hi,

I'm developing an application running on an nRF9151 that connects to AWS. I'm using the aws_iot library and my code is based on the aws_iot sample application. Although the application has connected to AWS successfully at times, for the last day or so I've been getting kernel panics when calling the aws_iot_connect() function as shown below. According to the log, the fault occurs during interrupt handling in the idle thread. I've traced the faulting instruction address to assert.c. Apart from opening the LTE connection before calling aws_iot_connect(), my application isn't doing anything else (that I am aware of) when the kernel panic occurs. I have MQTT_HELPER_STACK_SIZE set to 4096. (I have also tried doubling it to 8192, but the kernel panic still occurred.)

I've also tried increasing the sizes of the main stack, the workqueue stack and the heap.  None of these helped.

I suspect that the kernel panic happens just at the point that the connection to AWS is made. If the connection attempt times out, there is no kernel panic. The last line in the log before the panic is always the same:

<dbg> mqtt_helper: mqtt_state_set: State transition: MQTT_STATE_DISCONNECTED --> MQTT_STATE_TRANSPORT_CONNECTING

(I did find an ASSERT in the function that sets the MQTT state and tried commenting it out, but that didn't fix the problem.)

As mentioned above, I have seen this same code connect successfully in the past.  Is there something at the AWS end that could cause the panic (for example, a malformed message)?  Unfortunately, I don't know enough about the details of MQTT to know if this is plausible.

Do you have any ideas as to what the problem might be, or what else I can try to get better visibility of it?

Thanks

Scott

  • Edit: Oops, just noticed you already traced the instruction to assert.c. Maybe the link below will still help.

    Try looking up where the faulting function is located with arm-none-eabi-addr2line; this (usually) gives the source file and line number:

    arm-none-eabi-addr2line -e build-folder/zephyr/zephyr.elf 0x000612d2

    For such an interrupt error, this discussion might help:

    zephyr-fatal-error-4-kernel-panic-on-cpu-0
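
    For more visibility around the fault itself, a few standard Zephyr debug options may also help. The fragment below is only a sketch of prj.conf additions; none of these options were mentioned earlier in this thread, and whether they fit in your flash and RAM budget is an assumption.

    ```
    # Print log messages synchronously, so that lines queued by deferred
    # logging are not lost when the fault occurs
    CONFIG_LOG_MODE_IMMEDIATE=y

    # Show thread names instead of bare addresses in the fault dump
    CONFIG_THREAD_NAME=y

    # Periodically report the stack usage of every thread
    CONFIG_THREAD_ANALYZER=y
    CONFIG_THREAD_ANALYZER_AUTO=y
    ```

    Immediate logging changes timing and increases stack use in the calling context, so it is best treated as a temporary debugging aid.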

  • Hi,

    Thanks for the suggestions.

    I had a read through the discussion you linked to.  However, I don't think it explains the problem that I'm having, because I'm not doing anything (that I'm aware of) under interrupt.  Task scheduling, including connecting to AWS, is all done using the system workqueue.  The aws_iot library does make use of an event_handler but all that does is print out the event to the log (and I'm using deferred logging).  I guess it's possible that there is something in the aws_iot library or the mqtt_helper that's attempting to take a semaphore under interrupt but that seems unlikely as I'm sure other people would have encountered the same problem before me.

    Scott

  • Having said that, there is something odd going on within the aws_iot_connect() function. CONFIG_AWS_IOT_CONNECT_TIMEOUT is set to 30s by default, but the connection attempt takes around 150s to time out. It appears that the kernel panic always occurs on the second call to aws_iot_connect(), after the first has timed out.

  • I disabled CONFIG_ASSERT and now I can see that the problem is insufficient heap space for an incoming notification. I have tried increasing HEAP_MEM_POOL_SIZE to 65536 bytes, but I'm still seeing the same message. Is there another heap setting that I need to change?
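
    One candidate, offered as an assumption rather than a confirmed diagnosis: a message of this kind is emitted by the nRF Connect SDK AT monitor library, which allocates incoming modem notifications from its own dedicated heap rather than from the system heap sized by HEAP_MEM_POOL_SIZE. If that library is the source of the message, the setting to try would be along these lines (the value is illustrative, not a recommendation):

    ```
    # Heap used by the AT monitor library for incoming AT notifications;
    # separate from CONFIG_HEAP_MEM_POOL_SIZE
    CONFIG_AT_MONITOR_HEAP_SIZE=1024
    ```

    This would also be consistent with the fault being reported in interrupt context, since notifications from the modem are dispatched from an interrupt handler.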
