Inconsistent QSPI flash behavior between APP and MCUBOOT

It seems that I'm getting different results reading QSPI flash from the APP vs MCUBOOT, with the exact same configuration. Writes appear to work fine from either, but MCUBOOT isn't able to read the flash, whereas the APP can.

Initially I was trying to track down why uploaded DFU images weren't able to be validated by MCUBOOT, which looks like this:

I'm starting to think it's because MCUBOOT isn't properly reading the image from flash. I am able to upload images via SERIAL directly to MCUBOOT, or via BLE to MCUMGR in the APP. In either case, I can see the uploaded image through MCUMGR from the APP, but neither shows up in the image list when accessed from MCUBOOT. For example:

Initial image list from BLE (APP/MCUMgr):

Initial image list from SERIAL (MCUBOOT):

Upload image 105CFF via BLE and see it was uploaded to slot 1:

Image list with both images uploaded via SERIAL:

Upload image 777E0 via SERIAL, and view the image list:

View the image list via BLE and see slot 1 was in fact replaced with 777E0:

The DTS is identical for both applications.  Here is the PM config:

app:
  address: 0x12200
  end_address: 0xfe000
  region: flash_primary
  size: 0xebe00
external_flash:
  address: 0xec000
  end_address: 0x400000
  region: external_flash
  size: 0x314000
mcuboot:
  address: 0x0
  end_address: 0x12000
  region: flash_primary
  size: 0x12000
mcuboot_pad:
  address: 0x12000
  end_address: 0x12200
  placement:
    align:
      start: 0x1000
    before:
    - mcuboot_primary_app
  region: flash_primary
  size: 0x200
mcuboot_primary:
  address: 0x12000
  end_address: 0xfe000
  orig_span: &id001
  - mcuboot_pad
  - app
  region: flash_primary
  size: 0xec000
  span: *id001
mcuboot_primary_app:
  address: 0x12200
  end_address: 0xfe000
  orig_span: &id002
  - app
  region: flash_primary
  size: 0xebe00
  span: *id002
mcuboot_secondary:
  address: 0x0
  device: DT_CHOSEN(nordic_pm_ext_flash)
  end_address: 0xec000
  placement:
    align:
      start: 0x4
  region: external_flash
  share_size:
  - mcuboot_primary
  size: 0xec000
settings_storage:
  address: 0xfe000
  end_address: 0x100000
  placement:
    align:
      start: 0x1000
    before:
    - end
  region: flash_primary
  size: 0x2000
sram_primary:
  address: 0x20000000
  end_address: 0x20040000
  region: sram_primary
  size: 0x40000
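
As a quick sanity check on the layout above, here is a minimal sketch that verifies each partition's size against its address range and confirms that the primary and secondary slots match. The values are copied from the PM config; the dictionary layout and the `check_layout` helper are purely illustrative and not part of any Nordic tooling.

```python
# Partition values copied from the PM config above: (address, end_address, size).
# The structure and helper name are illustrative assumptions, not a real tool.
PARTITIONS = {
    "mcuboot":             (0x0,        0x12000,    0x12000),
    "mcuboot_pad":         (0x12000,    0x12200,    0x200),
    "app":                 (0x12200,    0xfe000,    0xebe00),
    "mcuboot_primary":     (0x12000,    0xfe000,    0xec000),
    "mcuboot_primary_app": (0x12200,    0xfe000,    0xebe00),
    "mcuboot_secondary":   (0x0,        0xec000,    0xec000),
    "external_flash":      (0xec000,    0x400000,   0x314000),
    "settings_storage":    (0xfe000,    0x100000,   0x2000),
    "sram_primary":        (0x20000000, 0x20040000, 0x40000),
}


def check_layout(parts):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Every partition's size must equal end_address - address.
    for name, (start, end, size) in parts.items():
        if end - start != size:
            problems.append(f"{name}: size 0x{size:x} != 0x{end - start:x}")
    # MCUboot needs the secondary slot to be the same size as the primary slot.
    if parts["mcuboot_primary"][2] != parts["mcuboot_secondary"][2]:
        problems.append("mcuboot_primary and mcuboot_secondary sizes differ")
    # The remaining external flash should start where the secondary slot ends.
    if parts["mcuboot_secondary"][1] != parts["external_flash"][0]:
        problems.append("external_flash does not start at the end of mcuboot_secondary")
    return problems


print(check_layout(PARTITIONS) or "layout is self-consistent")
```

With the values listed above, the check reports no problems, so the partition layout itself does not look like the cause.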

Any ideas?  Thank you!

Chris

  • Hi Chris,

    Abhijith is out of office, and I will continue to support you in his absence.

    I have gone over the entire thread and here I will attempt to summarize where we are at. Please let me know if something is wrong.

    • The issue is multiple bit flips, roughly one every 400 bytes of data.
    • Under these conditions, you are having the issue:
      • Custom board.
      • NCS v2.6.4 and NCS v2.7.0.
        • Built without Sysbuild on both
      • External flash.
      • Bluetooth transfer.
    • With the above conditions, but using UART transfer instead, the issue doesn't happen.
    • The custom board can run the spi_flash_test.zip test app that Abhijith sent, proving that the physical custom board and the DTS setup are good
      • The successful UART transfer proves it as well.

    Maybe I missed something, but I would like to check these details:

    Hieu

  • Hi Hieu,

    I know there's a lot to look through in the history, so let me try to summarize the most recent and relevant updates, and we can go from there.

    My most recent updates lead us to believe:

    1. the bit flips happen during the upload, not the swap
    2. the bit flips do not happen when uploading via serial DFU, but only during BLE DFU
    3. if the initial image uploaded is good (i.e. using serial DFU) then there is no issue with the swap
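
To make observations like these repeatable, one option is to diff the original signed image against a raw readback of the secondary slot after each upload. The following is a minimal sketch of such a comparison. It assumes the readback has already been obtained as a binary dump (how that dump is produced is outside this sketch), and the `find_bit_flips` helper is a hypothetical name, not part of MCUmgr or MCUboot.

```python
def find_bit_flips(expected: bytes, actual: bytes):
    """Return (offset, expected_byte, actual_byte, flipped_bits) for each differing byte."""
    flips = []
    for offset, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            # XOR leaves a 1 in every bit position that differs.
            flips.append((offset, e, a, bin(e ^ a).count("1")))
    return flips


# Tiny synthetic example. Real use would load both files with open(path, "rb").read()
# and truncate the flash dump to the length of the original image.
good = bytes([0xFF, 0x12, 0x34, 0x56])
bad = bytes([0xFF, 0x10, 0x34, 0x57])

for offset, e, a, n in find_bit_flips(good, bad):
    print(f"0x{offset:06x}: expected 0x{e:02x}, got 0x{a:02x} ({n} bit(s) flipped)")
```

Running the same comparison after a serial upload and after a BLE upload of the same image gives a direct count of corrupted bytes per transport.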

    The project that I'm using enables the following Kconfig options:

    CONFIG_NCS_SAMPLE_MCUMGR_BT_OTA_DFU=y
    CONFIG_MCUMGR_MGMT_NOTIFICATION_HOOKS=y
    CONFIG_MCUMGR_GRP_OS_RESET_HOOK=y

    I just don't understand why the image doesn't upload correctly using BLE DFU.

    Thanks,

    Chris

  • Hi Chris,

    Thank you for the update. The new information makes me more confident in the following guess. Could you check it?

    Hieu said:
    How is the board powered?
    • I wonder if the high DFU throughput causes the supply voltage to dip enough to create issues with the external flash.
      • This can be verified by powering the board from a strong external power supply.

    Also, to explain a bit about the motivations for these requests:

    Hieu said:
    Can the issue be reproduced on a nRF52840 DK?

    This should be done with (relatively) the same firmware you are testing on your custom board. The purpose is to rule out any corner case in software. Such a corner case is quite unlikely, but it is good to be able to rule it out.

    Hieu said:
    Can the issue be reproduced with this app: ncs-inter/l9/l9_e3_sol/qspi at v2.9.0-v2.7.0 · NordicDeveloperAcademy/ncs-inter?

    This is to see whether the software size and/or other configuration options make a difference, and to gather more clues.

    Best regards,

    Hieu

  • Hi Hieu,

    The board should have sufficient power; I've tried a variety of power sources.

    As you suggested, I tried a "bare bones" app which just configures QSPI and DFU using BT. Sure enough, the DFU works with no problem.

    There must be something else in the configuration of my real app that is causing it to behave incorrectly, but I don't have any idea what it might be.

    Thanks,

    Chris

  • Hi Chris,

    I am running out of ideas. There is no previous record of such a problem except very very rare cases of damaged flash memory.

    With how many units are you having this issue? And how many units overall have you tested?

    Do you have the issue with a DK?

    For the bare bones app, if you add some huge static int array to bloat it up and DFU, does the issue happen?

    Best regards,

    Hieu

  • Hi Hieu,

    Thanks, I'm also at a bit of a loss, but I'm continuing to test to try to narrow down what set of circumstances trigger the behavior.  It's slow going, as you know.

    Unfortunately I don't have a DK to test with, which is a shame because that would be an interesting test. I do have two board variants I'm testing with. On one (32Mbit), it seems to happen regardless of the code loaded. On the other (16Mbit), it only happens with the full code, not the bare bones setup I previously referenced.

    I'm not ruling out the power issues you suggested before. I'm wondering if the difference could be as slight as having extra peripherals enabled in our code, being just enough to cause the bit flips. To reduce the likelihood, I have reduced the SCK to 2 MHz and specified single data-line read/write, but that doesn't seem to help.

    Without a quantitative way to determine that, I'm limited to empirical deduction. If you have any suggestions for how to improve that testing, I'd love to hear them. I'll keep you posted, and thanks for your help!
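
One way to make this testing a little more quantitative is to tally the direction and bit position of every flipped bit between a known-good image and a flash readback. NOR flash programming can only clear bits (1 to 0), so a strong bias toward one direction, or toward particular bit positions, may offer a clue about whether the corruption happens on write or on read. This is only a sketch: `classify_flips` is a hypothetical helper, and how to interpret the resulting counts is an assumption to be validated against the actual hardware.

```python
from collections import Counter


def classify_flips(expected: bytes, actual: bytes):
    """Tally flipped bits by direction ("1->0" / "0->1") and by bit index (0..7)."""
    direction = Counter()
    bit_position = Counter()
    for e, a in zip(expected, actual):
        diff = e ^ a  # 1 in every bit position that differs
        for bit in range(8):
            mask = 1 << bit
            if diff & mask:
                bit_position[bit] += 1
                # If the expected bit was set, it was read back as cleared.
                direction["1->0" if e & mask else "0->1"] += 1
    return direction, bit_position


# Tiny synthetic example. Real use would load the image and the dump from files.
good = bytes([0xFF, 0x12, 0x34, 0x56])
bad = bytes([0xFF, 0x10, 0x34, 0x57])

direction, bit_position = classify_flips(good, bad)
print(dict(direction))
print(dict(bit_position))
```

Collecting these tallies across several uploads, boards, and power sources would show whether the pattern shifts with any one variable.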

    Thanks,

    Chris

  • Hi Hieu,

    After having narrowed this down as far as we have, I think I'm convinced it must be a hardware issue that only crops up in certain software configurations, causing it to masquerade as a software issue.

    While I have no way to prove it at the moment, I do suspect power stability for the flash chip may be the culprit.  I can't come up with any other explanation, and I think we've exhausted all of the other possibilities.

    Thank you for your help!  I think we can consider this resolved, and I'll try to understand why it's an issue in our case.  Thanks again!

    Chris

  • Hi Chris,

    It's just my job to help, but I don't feel like I was able to support much this time. Thank you too, for the pleasant cooperation.

    In that case, I will close this case. We do provide hardware review, so if you feel the need for it, please feel free to open a new private DevZone case.

    Hieu
