CONFIG_PM_DEVICE_RUNTIME breaks nrf9151

We have been trying to upgrade the nRF Connect SDK from v3.1.1 to v3.2.1 and ran into the following issue:

When we enable CONFIG_PM_DEVICE_RUNTIME=y, the device becomes inaccessible via J-Link after flashing.

It is easy to reproduce on the nrf9151dk with the blinky sample by adding the following config:

CONFIG_PM_DEVICE=y
CONFIG_PM_DEVICE_RUNTIME=y
CONFIG_PM_DEVICE_POWER_DOMAIN=y
CONFIG_POWER_DOMAIN=y


After flashing, the device becomes inaccessible via J-Link:

nrfutil device reset
❌ Failed to reset 1051215211, Failed to attach to target: The Application core access port is protected
Error: One or more reset tasks failed:
 * 1051215211: Failed to attach to target: The Application core access port is protected (All) (NotAvailableBecauseProtection)


The operation was unavailable for some of the devices because they had readback protection enabled. These devices might be possible to unlock using the `recover` subcommand. See `nrfutil device recover --help` for more information.

If CONFIG_PM_DEVICE_RUNTIME is disabled, everything works as expected. 
  • Hello,

    You are likely meeting the conditions for erratum #36 ("Access port gets locked in WFI and WFE") now that you have started optimising for power consumption. As long as the UART receiver is kept on, the HF clock request is held, which prevents the system from entering the lowest power modes and thus masks the problem. In ncs/v3.2.1/modules/hal/nordic/nrfx/bsp/stable/mdk/system_nrf91_approtect.h, please try commenting out the ifdef at line 55 to enable the workaround for this erratum (we are working on adding a Kconfig setting to enable this workaround).

    That said, I don't immediately see why you would experience this only with 3.2.1 and not with 3.1.1. Have you by any chance measured whether both projects have the same floor current in idle?

    Best regards,

    Vidar

  • Hi Vidar, enabling the workaround solves the APPROTECT issue, but then we run into other issues: for example, logging stops working while the shell still works. We haven't figured out what is wrong yet.

    That said, setting CONFIG_PM_DEVICE_RUNTIME_DEFAULT_ENABLE=n (this option is new in v3.2.x) solves everything: logging, the other issues, and also the APPROTECT issue, without any workarounds or changes to system files (system_nrf91_approtect.h).

    BTW, now that everything seems to be working, our idle current is about the same for 3.1.1 and 3.2.1, around 70 µA.

    BR,
    Alexey

  • Hi Alexey,

    Thank you. With CONFIG_PM_DEVICE_RUNTIME and CONFIG_PM_DEVICE_RUNTIME_DEFAULT_ENABLE selected, the driver automatically starts in the suspended state, and it is the responsibility of the subsystem or application using the device to request it first using pm_device_runtime_get().

    - In logger backend: https://github.com/nrfconnect/sdk-zephyr/commit/5bf8edc85f83882a1380163aa10d81c2c3493848

    - In Shell backend: https://github.com/nrfconnect/sdk-zephyr/commit/a8bc748c250b0f8479e883300fe6becfc1d6bfb3

    Best regards,

    Vidar
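
To illustrate the get-before-use rule described in the reply above, here is a minimal, self-contained C model. It is not Zephyr code: struct model_dev and the model_* functions are hypothetical stand-ins that loosely imitate the reference counting behind pm_device_runtime_get() and its counterpart pm_device_runtime_put(). The assumption that a suspended device silently drops output is a simplification; real drivers may behave differently.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a device under runtime PM (not a Zephyr type). */
struct model_dev {
    const char *name;
    unsigned int usage; /* outstanding "get" requests */
    bool active;        /* false means suspended */
};

/* Imitates pm_device_runtime_get(): the first user resumes the device. */
static int model_get(struct model_dev *dev)
{
    if (dev->usage++ == 0) {
        dev->active = true;
    }
    return 0;
}

/* Imitates pm_device_runtime_put(): the last user suspends the device.
 * Returns -1 (a model-only error code) on an unbalanced put. */
static int model_put(struct model_dev *dev)
{
    if (dev->usage == 0) {
        return -1;
    }
    if (--dev->usage == 0) {
        dev->active = false;
    }
    return 0;
}

/* Model assumption: writes to a suspended device are dropped.
 * Returns the number of characters printed, or 0 if dropped. */
static int model_write(struct model_dev *dev, const char *msg)
{
    if (!dev->active) {
        return 0;
    }
    return printf("[%s] %s\n", dev->name, msg);
}

/* Pattern a backend would follow: take a reference, write, release it. */
static int model_log(struct model_dev *dev, const char *msg)
{
    int written;

    model_get(dev);
    written = model_write(dev, msg);
    model_put(dev);
    return written;
}
```

For a device that starts suspended (usage 0, active false), model_write() returns 0 until something calls model_get(), while model_log() always gets its message out because it brackets the write with get and put.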

  • Hi Vidar, yes, I agree. But at the same time, I expect the core subsystems to handle this properly.

    IMHO, the root cause is that the nRF SDK appears to be based on unstable Zephyr revisions (taken from the unstable main branch) rather than on official stable Zephyr releases. As a result, issues that briefly exist in Zephyr main can end up being shipped in an SDK release.
    For example, in SDK 3.1.x:
     * There was a bug in posix/options/clock_common.c. This issue existed in Zephyr main only for a short period of time, yet the SDK was released with it. The problem appeared after K_SPINLOCK was temporarily replaced with a semaphore and was later reverted back to K_SPINLOCK. No official Zephyr release contained this bug — only the nRF SDK did.
     * Another issue involved pm_device_runtime_put_async(), where the internal usage counter kept increasing and never decreased. Again, this problem does not exist in released Zephyr versions.


    We have also already identified an issue in SDK 3.2.x (you can consider this an official bug report for that release):

    In ncs/zephyr/subsys/shell/shell.c, the function state_collect() is missing the shell lock/unlock calls around the bypass logic. This causes incorrect behavior. The stable Zephyr implementation includes proper locking, while the SDK version does not. Once again, this issue exists only in the SDK and not in Zephyr releases.

    nrf SDK buggy version:

    static void state_collect(const struct shell *sh)
    {
        ...
                /* no z_shell_unlock()/z_shell_lock() around the bypass callback */
                z_flag_cmd_ctx_set(sh, true);
                bypass(sh, buf, count);
                z_flag_cmd_ctx_set(sh, false);
        ...
    }

    zephyr official stable version:

    static void state_collect(const struct shell *sh)
    {
        ...
                z_flag_cmd_ctx_set(sh, true);
                z_shell_unlock(sh);
                bypass(sh, buf, count);
                z_shell_lock(sh);
                z_flag_cmd_ctx_set(sh, false);
        ...
    }


    Taken together, these examples suggest that there may be gaps in the current QA or integration process when pulling changes from Zephyr into the SDK. Using stable Zephyr releases as a base — or applying stricter validation before SDK releases — would likely prevent these kinds of regressions.

    Best regards,
    Alexey

  • The issue you are observing is not related to unstable or stable branches of Zephyr; it's simply that our fork of Zephyr, as part of NCS, is behind upstream Zephyr by around 1-4 months. We run nearly all of Zephyr's test cases on real hardware as part of our release, so any issue in Zephyr that is covered by a test case would be caught, and we would fix it in upstream Zephyr and cherry-pick the fix to NCS.

    The shell example you are showing is a case that is not covered by any test, and as such it can only be caught by someone running into it in the field. This is unrelated to the timelines of Zephyr or NCS releases.

    Interestingly, the "zephyr official stable version" actually broke one of my applications, since I was using shell bypass within a command handler, something which now results in a deadlock. It really looks like we need a few test cases to cover bypass mode in general...

    NCS is a fork, so it will always be a bit behind Zephyr, which means we don't have the latest fixes, and we also don't have the latest bugs. We try to keep the delay to a minimum.

  • Thanks for the clarification. I agree that missing test coverage is likely the main reason why issues like the shell bypass case slip through. Areas that are not exercised by automated tests will inevitably depend on real-world usage to reveal problems.

    That said, the observation that prompted my original comment still stands: the issues I mentioned did not make it into official Zephyr releases, but they did appear in the NCS SDK. So even if they originated in upstream Zephyr at some point in time, they were already fixed again before a stable Zephyr release was cut.

    From the perspective of someone using the SDK, this creates the impression that the integration point between Zephyr and NCS sometimes happens during a window where a regression briefly exists upstream but has not yet been corrected in the fork.

    In any case, thanks for the explanation and for looking into this.

    Best regards,
    Alexey
