Kernel panic on nRF9151 when connecting to AWS

Hi,

I'm developing an application running on an nRF9151 that connects to AWS.  I'm using the aws_iot library and my code is based on the aws_iot sample application.  While I have been able to successfully connect to AWS at times, for the last day or so I've been getting kernel panics when calling the aws_iot_connect() function, as shown below.  According to the log, the fault occurs during interrupt handling in the idle thread.  I've traced the faulting instruction address to assert.c.  Apart from opening the LTE connection before calling aws_iot_connect(), my application isn't doing anything else (that I am aware of) when the kernel panic occurs.  I have MQTT_HELPER_STACK_SIZE set to 4096.  (I have also tried doubling it to 8192, but the kernel panic still occurred.)

I've also tried increasing the sizes of the main stack, the workqueue stack and the heap.  None of these helped.

I have a suspicion that the kernel panic happens just at the point when the connection to AWS is made.  If the connection attempt times out, there is no kernel panic.  The last line in the log before the panic is always the same, i.e.

<dbg> mqtt_helper: mqtt_state_set: State transition: MQTT_STATE_DISCONNECTED --> MQTT_STATE_TRANSPORT_CONNECTING

(I did find an ASSERT in the function that sets the MQTT state and tried commenting it out, but that didn't fix the problem.)

As mentioned above, I have seen this same code connect successfully in the past.  Is there something at the AWS end that could cause the panic (for example, a malformed message)?  Unfortunately, I don't know enough about the details of MQTT to know if this is plausible.

Do you have any ideas as to what the problem might be, or what else I can try to get better visibility of it?

Thanks

Scott

  • Edit: Oops, just noticed you already traced the instruction to assert.c. Maybe the link below will still help.

    Try looking up where the function is located with arm-none-eabi-addr2line; this (usually) gives the source file and line number:

    arm-none-eabi-addr2line -e build-folder/zephyr/zephyr.elf 0x000612d2
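
    If the plain lookup only points at assert.c, a slightly more verbose call may show which function and which inlined callers led there. This is only a sketch: -a, -f, -i and -p are standard GNU addr2line options, while the ELF path and the address are the placeholders from the command above and need to match your own build and fault log.

```shell
# Placeholders taken from the command above; adjust for your build and log.
ELF="build-folder/zephyr/zephyr.elf"
ADDR="0x000612d2"

if command -v arm-none-eabi-addr2line >/dev/null 2>&1 && [ -f "$ELF" ]; then
    # -a: print the address, -f: print the function name,
    # -i: also list inlined callers, -p: one readable line per frame
    arm-none-eabi-addr2line -e "$ELF" -a -f -i -p "$ADDR"
else
    echo "arm-none-eabi-addr2line or $ELF not found; adjust the paths" >&2
fi
```

    The inlined callers listed by -i can be more useful than the assert.c line itself, since they show which check actually fired.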

    For such an interrupt error, this discussion might help:

    zephyr-fatal-error-4-kernel-panic-on-cpu-0

  • Hi,

    Thanks for the suggestions.

    I had a read through the discussion you linked to.  However, I don't think it explains the problem that I'm having, because I'm not doing anything (that I'm aware of) under interrupt.  Task scheduling, including connecting to AWS, is all done using the system workqueue.  The aws_iot library does make use of an event_handler but all that does is print out the event to the log (and I'm using deferred logging).  I guess it's possible that there is something in the aws_iot library or the mqtt_helper that's attempting to take a semaphore under interrupt but that seems unlikely as I'm sure other people would have encountered the same problem before me.

    Scott
