nrf9160 secure services causing a crash

I have been trying to use the secure services to read random numbers from the CC310 and have been getting unpredictable crashes. I was able to modify the secure_services sample code to reproduce the crash, but it's the oddest thing: the combination of CONFIG_LOG=y and CONFIG_DK_LIBRARY=y, along with a call to dk_buttons_init(), is enough to cause the exception. No buttons need to be pressed, and I don't generate any log messages myself.

I'm using NCS 1.2.0 on an nRF9160 DK. Can anybody explain what's happening? The modified secure_services sample is attached.

Mike

secure_services.zip

  • The behavior changes to "working" when I increase CONFIG_DK_LIBRARY_BUTTON_SCAN_INTERVAL in prj.conf from the default of 10 to 100. I can't even come up with an explanation for this.
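For reference, the change described above amounts to a single line in prj.conf. This is only a sketch of that one setting; the unit is assumed to be milliseconds, and the rest of the sample's configuration is left as it was.

```conf
# Scan the DK buttons less often (default reported in this thread: 10).
# Assumption: the value is in milliseconds.
CONFIG_DK_LIBRARY_BUTTON_SCAN_INTERVAL=100
```

A longer interval makes it less likely that the scan fires during a secure call, so it probably hides the problem rather than removing it.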

  • Hi.

    What an interesting problem you have found.

    I have done some testing myself, and the fault only happens if the buttons are initialized. Logging does not matter, other than for printing the fault message.

    It also only happens when we try to get random numbers and not any of the other secure services.

    Based on your comment regarding the scan interval, I expect it is a race condition related to the transition between the secure and non-secure domains.

    However, I will have to look deeper into it next week to be able to pinpoint the cause of the problem.

    Best regards,

    Didrik

  • I'm glad you see it too. I looked into it a little bit, and it seemed to be related to button_scan_fn or maybe Zephyr's handling of the workqueue, but that's as far as I got.

  • Hi again.

    I have found another way of avoiding the crash: changing the system workqueue's priority to a positive priority (e.g. 1).

    This changes the system workqueue thread from a cooperative one to a preemptible one. That allows other threads (with higher priorities) to interrupt the system workqueue.

    You can read more here: https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/zephyr/reference/kernel/threads/index.html#thread-priorities
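As a sketch, the priority change could be made in prj.conf as shown here. It assumes the Zephyr Kconfig symbol CONFIG_SYSTEM_WORKQUEUE_PRIORITY, which is not named in this thread, and uses the example value 1 from the post above.

```conf
# Assumption: CONFIG_SYSTEM_WORKQUEUE_PRIORITY sets the system workqueue
# thread priority. Negative values are cooperative, non-negative values
# are preemptible.
CONFIG_SYSTEM_WORKQUEUE_PRIORITY=1
```

Note that this also changes the scheduling of every other work item submitted to the system workqueue, so it is a workaround to evaluate rather than a fix.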

    However, I have not been able to find the root cause of the crash. I will continue to investigate together with the SDK team.

    Best regards,

    Didrik

  • Hi.

    A quick update:

    The error does not seem to be linked to the buttons library in particular, but to the workqueue.

    It seems the workqueue thread is not allowed to run when the timeout happens while in secure mode, and consequently an error happens once the main thread is unloaded.

    We are investigating ways to solve this.

  • Hi again.

    There has been some further progress in finding the root cause, but it does not look like there is an easy fix, so it will probably take some time until it is properly fixed.

    Here is an explanation of what probably happens, given by one of our developers:

    "Short story: if a Zephyr thread context-switch occurs while doing a Secure call, the Zephyr kernel may crash.

    Elaboration:

    This might actually be happening in the scenario reported in this issue. I believe so because it seems the secure service takes some time to finish, so an NSPE Zephyr system timer may fire in the meantime and lead to a thread reschedule point.

     

    When in a regular function call, the Zephyr Cortex-M swap logic (running in PendSV IRQ, possibly tail-chained from a higher priority IRQ context) will simply stack the current execution context into the thread's stack, tweak PSP, and eventually return into the new thread's execution context, using that thread's stack(ed) information.

    When in a Secure function call, however, things are not as simple as in the above scenario. The swap logic will still change the NSPE context, but it will return to the SPE to finish the secure service. When the secure service is complete, the TrustZone logic will attempt to branch to the NSPE (most likely using a BXNS LR instruction). However, since the NSPE context has changed, the service call does not return to the context it was called from.

     

    This bug may be causing system crashes (possibly such as the one we are currently observing here), but it is also a direct security violation, since the secure service result is exposed to an NSPE context other than the caller."

    We are looking into what is the best way to ensure that a secure call always returns to the caller.
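One possible stopgap, not confirmed anywhere in this thread, is to lock the Zephyr scheduler around the secure call so that no thread swap can happen before the call returns. The sketch below assumes the Zephyr kernel calls k_sched_lock() and k_sched_unlock(); the function-pointer type, the wrapper name, and the host-side stand-ins are hypothetical and exist only so the example builds outside a Zephyr tree.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __ZEPHYR__
#include <zephyr.h> /* real k_sched_lock() / k_sched_unlock() */
#else
#include <assert.h>
/* Host-side stand-ins (hypothetical) so the sketch builds off-target. */
static int sched_lock_depth;
static void k_sched_lock(void) { sched_lock_depth++; }
static void k_sched_unlock(void)
{
	assert(sched_lock_depth > 0);
	sched_lock_depth--;
}
#endif

/* Hypothetical signature for a secure service that fills a buffer. */
typedef int (*secure_call_fn)(uint8_t *out, size_t len, size_t *olen);

/*
 * Run a secure call with the scheduler locked, so the calling thread is
 * not swapped out before the call returns. Interrupts still run.
 */
static int secure_call_locked(secure_call_fn fn, uint8_t *out, size_t len,
			      size_t *olen)
{
	int err;

	k_sched_lock();
	err = fn(out, len, olen);
	k_sched_unlock();

	return err;
}

#ifndef __ZEPHYR__
/* Host-only fake secure service: records the lock depth it observed. */
static int lock_depth_seen;
static int fake_secure_rng(uint8_t *out, size_t len, size_t *olen)
{
	lock_depth_seen = sched_lock_depth;
	for (size_t i = 0; i < len; i++) {
		out[i] = (uint8_t)i;
	}
	*olen = len;
	return 0;
}
#endif
```

On target, the function passed in would be the secure service used by the sample. The trade-off is that higher-priority threads, including the system workqueue, are delayed for as long as the secure service takes.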

    Best regards,

    Didrik
