CONFIG_PM_DEVICE_RUNTIME breaks nrf9151

We have been trying to upgrade the nRF Connect SDK from v3.1.1 to v3.2.1 and have run into the following issue:

When we enable CONFIG_PM_DEVICE_RUNTIME=y, the device becomes inaccessible via J-Link after flashing.

It is easy to reproduce on the nrf9151dk with the blinky sample by adding the following config:

CONFIG_PM_DEVICE=y
CONFIG_PM_DEVICE_RUNTIME=y
CONFIG_PM_DEVICE_POWER_DOMAIN=y
CONFIG_POWER_DOMAIN=y


After flashing, the device becomes inaccessible via J-Link:

nrfutil device reset
❌ Failed to reset 1051215211, Failed to attach to target: The Application core access port is protected
Error: One or more reset tasks failed:
 * 1051215211: Failed to attach to target: The Application core access port is protected (All) (NotAvailableBecauseProtection)


The operation was unavailable for some of the devices because they had readback protection enabled. These devices might be possible to unlock using the `recover` subcommand. See `nrfutil device recover --help` for more information.

If CONFIG_PM_DEVICE_RUNTIME is disabled, everything works as expected. 
  • Hi Vidar, Yes, I agree. But at the same time, I expect the core subsystem to handle this properly.

    IMHO, the root cause is that the nRF SDK appears to be based on unstable Zephyr revisions (taken from the unstable main branch) rather than on official stable Zephyr releases. As a result, issues that briefly exist in Zephyr main can end up being shipped in an SDK release.
    For example, SDK 3.1.X:
     * There was a bug in posix/options/clock_common.c. This issue existed in Zephyr main only for a short period of time, yet the SDK was released with it. The problem appeared after K_SPINLOCK was temporarily replaced with a semaphore and was later reverted back to K_SPINLOCK. No official Zephyr release contained this bug — only the nRF SDK did.
     * Another issue involved pm_device_runtime_put_async(), where the internal usage counter kept increasing and never decreased. Again, this problem does not exist in released Zephyr versions.


    We have also already identified an issue in SDK 3.2.x (you can consider this an official bug report for that release):

    In ncs/zephyr/subsys/shell/shell.c, the function state_collect() is missing the shell lock/unlock calls around the bypass logic. This causes incorrect behavior. The stable Zephyr implementation includes proper locking, while the SDK version does not. Once again, this issue exists only in the SDK and not in Zephyr releases.

    nRF Connect SDK version (buggy):

    static void state_collect(const struct shell *sh)
    {
        ...
                z_flag_cmd_ctx_set(sh, true);
                bypass(sh, buf, count);
                z_flag_cmd_ctx_set(sh, false);
        ...
    }

    Zephyr official stable version:

    static void state_collect(const struct shell *sh)
    {
        ...
                z_flag_cmd_ctx_set(sh, true);
                z_shell_unlock(sh);
                bypass(sh, buf, count);
                z_shell_lock(sh);
                z_flag_cmd_ctx_set(sh, false);
        ...
    }
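
To make the difference between the two variants concrete, here is a minimal standalone sketch of the general pattern of releasing a lock around a callback. It is not the Zephyr shell code: `toy_shell`, `toy_try_lock`, `toy_unlock`, `toy_needs_lock` and `toy_collect` are hypothetical names, and the lock is a plain single-threaded flag that refuses a second acquisition. Zephyr's own mutex may behave differently (for example with respect to recursive locking by the owner), so treat this only as an illustration of why code that needs the lock during the callback cannot get it when the caller keeps holding it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a shell instance; NOT Zephyr's struct shell. */
struct toy_shell {
    bool locked;           /* toy non-recursive lock, single-threaded demo only */
    bool other_ctx_got_in; /* did the code run from the callback get the lock? */
};

typedef void (*toy_bypass_cb)(struct toy_shell *sh);

/* Take the toy lock if it is free; report failure instead of blocking. */
static bool toy_try_lock(struct toy_shell *sh)
{
    if (sh->locked) {
        return false;
    }
    sh->locked = true;
    return true;
}

static void toy_unlock(struct toy_shell *sh)
{
    sh->locked = false;
}

/* Stand-in for a bypass callback whose work needs the shell lock. */
static void toy_needs_lock(struct toy_shell *sh)
{
    sh->other_ctx_got_in = toy_try_lock(sh);
    if (sh->other_ctx_got_in) {
        toy_unlock(sh);
    }
}

/*
 * Models the processing loop, which holds the lock while it runs.
 * With release_around_cb == true the lock is dropped before the callback
 * and re-taken afterwards; with false it stays held for the whole call.
 * Returns whether the callback was able to take the lock.
 */
static bool toy_collect(struct toy_shell *sh, toy_bypass_cb cb,
                        bool release_around_cb)
{
    sh->other_ctx_got_in = false;
    (void)toy_try_lock(sh);

    if (release_around_cb) {
        toy_unlock(sh);
    }
    cb(sh);
    if (release_around_cb) {
        (void)toy_try_lock(sh);
    }

    toy_unlock(sh);
    return sh->other_ctx_got_in;
}
```

Calling `toy_collect(&sh, toy_needs_lock, false)` models the variant without unlock/lock around the callback, and passing `true` models the variant with them. In real code the failed acquisition would typically be a blocking wait rather than an immediate failure.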


    Taken together, these examples suggest that there may be gaps in the current QA or integration process when pulling changes from Zephyr into the SDK. Using stable Zephyr releases as a base — or applying stricter validation before SDK releases — would likely prevent these kinds of regressions.

    Best regards,
    Alexey

     

  • The issue you are observing is not related to unstable or stable branches of Zephyr; it's simply that our fork of Zephyr, which is part of NCS, is behind upstream Zephyr by around 1-4 months. We run nearly all of Zephyr's test cases on real hardware as part of our release, so any issue in Zephyr that is covered by a test case would be caught, and we would fix it in upstream Zephyr and cherry-pick the fix to NCS.

    The example with the shell you are showing is such a case: it's not covered by any test case and, as such, can only be caught by someone running into it in the field. This is unrelated to the timelines of Zephyr or NCS releases.

    Interestingly, the "zephyr official stable version" actually broke one of my applications, since I was using shell bypass within a command handler, something which now results in a deadlock. It really looks like we need a few test cases to cover bypass mode in general...

    NCS is a fork, so it will always be a bit behind Zephyr, which means we don't have the latest fixes, and we also don't have the latest bugs. We try to keep the delay to a minimum.

  • Thanks for the clarification. I agree that missing test coverage is likely the main reason why issues like the shell bypass case slip through. Areas that are not exercised by automated tests will inevitably depend on real-world usage to reveal problems.

    That said, the observation that prompted my original comment still stands: the issues I mentioned did not make it into official Zephyr releases, but they did appear in the NCS SDK. So even if they originated in upstream Zephyr at some point in time, they were already fixed again before a stable Zephyr release was cut.

    From the perspective of someone using the SDK, this creates the impression that the integration point between Zephyr and NCS sometimes happens during a window where a regression briefly exists upstream but has not yet been corrected in the fork.

    In any case, thanks for the explanation and for looking into this.

    Best regards,
    Alexey
