Error while erasing external QSPI flash with nordic,qspi-nor driver

I have found a bug in SDK v2.6.0 at zephyr/drivers/flash/nrf_qspi_nor.c while erasing the external MX25R64 flash.

Environment:

nRF5340 DK Hardware (MX25R64 external flash over QSPI with nordic,qspi-nor driver)
NCS SDK v2.6.0 (sdk-zephyr tag v3.5.99-ncs1)

I have a custom project which uses the FLASH_MAP API to erase the external flash in a loop in 64 KB increments (offsets 0x0, 0x10000, and so on). The first erase succeeds, but the second one, at offset 0x10000, fails. I am fairly sure this is a timing issue. It does NOT occur on the nRF7002 DK, which communicates with the same flash chip via SPI instead of QSPI and therefore uses the generic JEDEC-compatible SPI NOR driver instead of a Nordic driver.

I tracked the failure down to qspi_nor_write_protection_set() in zephyr/drivers/flash/nrf_qspi_nor.c, specifically the first call, which sends SPI_NOR_CMD_WREN. There are two issues with the original function:

- The return code from qspi_send_cmd() is ignored and replaced with -EIO in all failure cases; this makes it impossible to distinguish the root cause.
- In the problem case, qspi_send_cmd() returns -EBUSY, which suggests the chip is not yet ready for the operation.

static int qspi_nor_write_protection_set(const struct device *dev,
					 bool write_protect)
{
	int rc = 0;
	struct qspi_cmd cmd = {
		.op_code = ((write_protect) ? SPI_NOR_CMD_WRDI : SPI_NOR_CMD_WREN),
	};

	if (qspi_send_cmd(dev, &cmd, false) != 0) {
		rc = -EIO;
	}

	return rc;
}

I implemented a local workaround that retries the operation while -EBUSY is returned, up to a maximum number of attempts. With it, the command fails with -EBUSY on the first attempt and then returns 0 on the second attempt:

static int qspi_nor_write_protection_set(const struct device *dev,
					 bool write_protect)
{
	int rc = 0;
	unsigned int attempts = 3;
	struct qspi_cmd cmd = {
		.op_code = ((write_protect) ? SPI_NOR_CMD_WRDI : SPI_NOR_CMD_WREN),
	};

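	while (attempts) {
		rc = qspi_send_cmd(dev, &cmd, false);
		/* Debug trace: logged on every attempt, including success. */
		LOG_ERR("Write protect[0x%02x]: %d (%u)", cmd.op_code, rc, attempts);
		if (rc != -EBUSY) {
			/* Success or a non-retryable error: return rc as-is. */
			break;
		}
		/* Still busy: report -EIO if the attempts run out. */
		rc = -EIO;
		--attempts;
	}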

	return rc;
}

Possible solutions:

Send in a loop while -EBUSY is received
Query the WEL bit in the status register after the command has been sent to verify the status is updated.
- From the data sheet, the Write Status Register Cycle time is ~10ms up to 20ms max.
Query the chip before doing an erase to see if it is in a state where the commands may fail.
- Macronix data sheet v1.6 section 8 states that "Before a command is issued, status register should be checked to ensure device is ready for the intended
  operation.". See also section 10.8 "Status Register" (p30).
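
A rough sketch of the second and third options, for illustration only. It runs against a simulated status register rather than the real nrf_qspi_nor.c code, so `sim_flash`, `sim_read_status()`, `sim_send_wren()`, `wait_until_ready()` and `write_enable_checked()` are all hypothetical names, and the poll budget is arbitrary. The WIP (bit 0) and WEL (bit 1) positions follow the usual SPI NOR status register layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SR_WIP 0x01u /* Write In Progress (status register bit 0) */
#define SR_WEL 0x02u /* Write Enable Latch (status register bit 1) */

/* Hypothetical stand-in for the flash chip. */
struct sim_flash {
	unsigned int busy_polls; /* status reads remaining until WIP clears */
	bool wel;                /* write enable latch */
};

/* Simulated RDSR: each read brings the pending operation closer to done. */
static uint8_t sim_read_status(struct sim_flash *f)
{
	uint8_t sr = f->wel ? SR_WEL : 0;

	if (f->busy_polls > 0) {
		f->busy_polls--;
		sr |= SR_WIP;
	}
	return sr;
}

/* Simulated WREN: rejected with -EBUSY while an operation is in progress. */
static int sim_send_wren(struct sim_flash *f)
{
	if (f->busy_polls > 0) {
		return -EBUSY;
	}
	f->wel = true;
	return 0;
}

/* Poll the status register until WIP clears or the poll budget runs out.
 * A real driver would sleep or yield between polls.
 */
static int wait_until_ready(struct sim_flash *f, unsigned int max_polls)
{
	assert(f != NULL);

	for (unsigned int i = 0; i < max_polls; i++) {
		if ((sim_read_status(f) & SR_WIP) == 0) {
			return 0;
		}
	}
	return -ETIMEDOUT;
}

/* Check readiness first, send WREN, then confirm that WEL was latched. */
static int write_enable_checked(struct sim_flash *f, unsigned int max_polls)
{
	int rc = wait_until_ready(f, max_polls);

	if (rc != 0) {
		return rc;
	}
	rc = sim_send_wren(f);
	if (rc != 0) {
		return rc; /* propagate the real error instead of a blanket -EIO */
	}
	return (sim_read_status(f) & SR_WEL) ? 0 : -EIO;
}
```

In this model a bare `sim_send_wren()` issued right after an erase returns -EBUSY, while `write_enable_checked()` waits for WIP to clear and then succeeds, which matches the behaviour seen with the retry loop.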

0 BertL over 1 year ago

This issue may be related to a QSPI erase bug I encountered: qspi_nor: Failed to schedule device sleep: -16

Will follow this topic for possible fixes.

0 Daniel K over 1 year ago in reply to BertL
If you want to see if it's related, maybe add the -EBUSY retry loop to your local checkout of the SDK. Having a loop will be better than a fixed delay.

My local workaround at the application level is to check for the -EIO return value that the nordic,qspi-nor driver returns in this case:
unsigned int attempts = 3;

while (attempts) {
	err = flash_area_erase(fa, offset, erase_size);
	if (err == -EIO) {
		--attempts;
	} else {
		// Stop on success or other error.
		break;
	}
}

Performance-wise, the result is exactly the same as with the same flash chip on the nRF7002 DK.
0 Sigurd Hellesvik over 1 year ago in reply to Daniel K

Thanks for the report!

I will create an internal ticket on this, but first:
Have you tried to increase CONFIG_NORDIC_QSPI_NOR_TIMEOUT_MS?

Regards,
Sigurd Hellesvik
0 Daniel K over 1 year ago in reply to Sigurd Hellesvik

I tried 0 for CONFIG_NORDIC_QSPI_NOR_TIMEOUT_MS, which still failed (this is effectively the same as a 500 ms timeout, since NRFX QSPI will use 50000 attempts at a 10 µs delay). 2000 did work, but I don't think a timeout is the best way to go here.

The failure occurs in this section of nrfx_qspi_cinstr_xfer():

if (!m_cb.activated && qspi_activate(true) == NRFX_ERROR_TIMEOUT)
{
    return NRFX_ERROR_TIMEOUT;
}

I was looking into the generic SPI NOR driver to see what timeout should be used and noticed that the concept of a timeout duration does not apply there. nrfx_qspi uses the NRFX_WAIT_FOR() macro to wait for completion, but the nrfx_spi HAL does not. So a timeout as a concept only applies to the QSPI peripheral.
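
To illustrate what a bounded wait of this kind amounts to, here is a minimal sketch of attempt-limited polling. It is not the nrfx macro itself: `wait_for()`, `worst_case_wait_ms()`, `ready_after_n_polls()` and `fake_delay_us()` are hypothetical helpers, and the figures assume 50000 attempts with a 10 µs delay per attempt, which is what makes the 500 ms total work out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical condition callback: returns true once the peripheral is ready. */
typedef bool (*ready_fn)(void *ctx);

/* Hypothetical delay hook; a real target would busy-wait or sleep here. */
typedef void (*delay_us_fn)(uint32_t us);

/* Poll ready() up to `attempts` times, delaying `delay_us` after each miss.
 * Returns true if the condition became true within the budget.
 */
static bool wait_for(ready_fn ready, void *ctx, uint32_t attempts,
		     uint32_t delay_us, delay_us_fn delay)
{
	assert(ready != NULL && delay != NULL);

	for (uint32_t i = 0; i < attempts; i++) {
		if (ready(ctx)) {
			return true;
		}
		delay(delay_us);
	}
	return false;
}

/* Worst-case wait in milliseconds for a given attempt count and delay. */
static uint32_t worst_case_wait_ms(uint32_t attempts, uint32_t delay_us)
{
	return (uint32_t)(((uint64_t)attempts * delay_us) / 1000u);
}

/* Simulation helpers: accumulate the delay and become ready after N polls. */
static uint32_t elapsed_us;

static void fake_delay_us(uint32_t us)
{
	elapsed_us += us;
}

static bool ready_after_n_polls(void *ctx)
{
	uint32_t *remaining = ctx;

	if (*remaining == 0) {
		return true;
	}
	(*remaining)--;
	return false;
}
```

The point of the sketch is that the timeout is only an upper bound on polling: if the condition cannot become true until the flash finishes its erase, a larger budget hides the problem rather than fixing it.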

The nRF7002 QSPI driver also leaves timeout=0 in nrfx_qspi_init(). It is not clear why a timeout is only used on QSPI for the flash chip and not for the nRF7002 or for SPI peripherals.

Zephyr has a generic SPI driver interface, but not a QSPI interface, and there is no jedec,qspi-nor driver. So there is no other generic implementation to compare nrf_qspi_nor against.
0 Sigurd Hellesvik over 1 year ago in reply to Daniel K

Daniel K said:
2000 did work, but I don't think a timeout is the best way to go here.

Since this works, that kind of means that the code works, right?

So what you want is not a bug report, but you suggest an improvement to our driver, right?
0 Daniel K over 1 year ago in reply to Sigurd Hellesvik

The defect is that an error is reported with out-of-the-box settings on the nRF5340 DK, and increasing the timeout does not fix the root cause. After an erase, the driver does not block the caller until the erase operation is verified to be complete, so the next command fails if it is issued too soon.
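
A minimal sketch of the behaviour I would expect, using a simulated flash rather than the actual driver. All names here (`sim_flash`, `sim_start_erase()`, `sim_is_busy()`, `erase_block_blocking()`, `erase_range()`) are hypothetical, and the number of polls an erase takes is made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 0x10000u /* 64 KB erase block, as in the report */

/* Hypothetical flash model: an erase keeps the chip busy for some polls. */
struct sim_flash {
	unsigned int busy_polls;      /* polls remaining until the erase ends */
	unsigned int polls_per_erase; /* how long one erase stays busy */
};

/* Returns nonzero while the simulated erase is still in progress. */
static int sim_is_busy(struct sim_flash *f)
{
	if (f->busy_polls > 0) {
		f->busy_polls--;
		return 1;
	}
	return 0;
}

/* Start an erase; rejected with -EBUSY if the previous one is not done. */
static int sim_start_erase(struct sim_flash *f, uint32_t offset)
{
	(void)offset; /* the model does not track contents */

	if (f->busy_polls > 0) {
		return -EBUSY;
	}
	f->busy_polls = f->polls_per_erase;
	return 0;
}

/* Erase one block and block the caller until the chip reports completion. */
static int erase_block_blocking(struct sim_flash *f, uint32_t offset,
				unsigned int max_polls)
{
	int rc;

	assert(f != NULL);

	rc = sim_start_erase(f, offset);
	if (rc != 0) {
		return rc;
	}
	for (unsigned int i = 0; i < max_polls; i++) {
		if (!sim_is_busy(f)) {
			return 0;
		}
	}
	return -ETIMEDOUT;
}

/* Erase `count` consecutive blocks, stopping at the first error. */
static int erase_range(struct sim_flash *f, uint32_t start,
		       unsigned int count, unsigned int max_polls)
{
	for (unsigned int i = 0; i < count; i++) {
		int rc = erase_block_blocking(f, start + i * BLOCK_SIZE,
					      max_polls);

		if (rc != 0) {
			return rc;
		}
	}
	return 0;
}
```

In this model, two back-to-back `sim_start_erase()` calls reproduce the reported pattern (the first succeeds, the second returns -EBUSY), while `erase_range()` completes both blocks because each erase only returns once the chip is idle again.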