nrf54: (Ab)using SPIM as a shift-register - between-TXLIST latency

cwriter 1 month ago

I'm trying to drive a 74HC595-type shift register with a SPIM instance on the nRF54L15.

I connect MOSI to the shift register data input, CLK to the shift register clock, and CSN to the output-stage-enable register (i.e. latch from the input shifter to the output). I use the maximum frequency of SPIM22, namely a prescaler of 2, to reach 8 MHz on the output, aiming for a 1 MHz update rate on the shift register outputs. To set the outputs, I use EasyDMA in List mode with a MAXCNT of 1. With CSNPOL Low, CSN is driven HIGH at the end of the transfer, thus latching the shifted byte into the output stage automatically. I set IFTIMING.CSNDUR to 0.

Because I have a fixed number of bytes to transfer (~240), I use TIMER22 as a byte counter, together with a PPI group in DPPI20. One PPI channel links the SPI ENDED event to the SPI START task and to the TIMER22 COUNT task; I add this channel to a PPIGROUP. I then program a CC register with the number of bytes and link its compare event, through another PPI channel, to the PPIGROUP DISABLE task, which breaks the loop and stops transferring after the programmed number of bytes.

I was expecting _some_ delay caused by the CSN and rearming. But I'm observing >1us of delay between clock pulse sequences (i.e. SPI transfer active), resulting in an oscilloscope-measured time of 2us (instead of slightly more than 1us expected) between CSN pulses, halving the aimed-for frequency. Essentially, the bus idles for 1 us after every byte (which also takes 1us to transfer).

I'm a bit confused about these delays. I know that SHORTS.END_START is supposed to use IFTIMING.CSNDUR as a delay in peripheral clock cycles, so this should be 1/16 us (or 1/8 us) - not a full microsecond, so the peripheral is supposed to be faster. From what I understand from the documentation, the counter is also counting with a max delay of 1 from the PCLK, which is also 16MHz - but what I'm seeing looks like a clock of 1 MHz dominating the results.
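To make the numbers above concrete, here is a minimal sketch of the timing arithmetic in plain C (no hardware access). The function names are made up for illustration; the 8 MHz clock, the roughly 1 us observed gap, the 1/8 us expected gap and the 240-byte frame are the figures from this post.

```c
#include <assert.h>
#include <stdint.h>

/* Time needed to clock `bits` bits at `sck_hz`, in nanoseconds. */
uint32_t transfer_ns(uint32_t bits, uint32_t sck_hz)
{
    assert(sck_hz > 0);
    return (uint32_t)(((uint64_t)bits * 1000000000ull) / sck_hz);
}

/* Period between two latch pulses: one byte on the wire plus the idle gap. */
uint32_t byte_period_ns(uint32_t sck_hz, uint32_t gap_ns)
{
    return transfer_ns(8u, sck_hz) + gap_ns;
}

/* Total time for `n_bytes` bytes, in microseconds (rounded down). */
uint32_t frame_us(uint32_t n_bytes, uint32_t sck_hz, uint32_t gap_ns)
{
    return (uint32_t)(((uint64_t)n_bytes * byte_period_ns(sck_hz, gap_ns)) / 1000u);
}
```

With these helpers, `byte_period_ns(8000000u, 1000u)` gives the 2000 ns period seen on the oscilloscope, and `frame_us(240u, 8000000u, 1000u)` gives 480 us for the whole frame, versus 270 us if the gap were only 125 ns.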

Could you kindly point me to my mistake? I'm thinking about using the DMA TX END (rather than the SPI END) event to queue the next transfer, but I'm a bit hesitant regarding the safety of such a hack. Would the timings guarantee that the latency is enough to trigger only after the ENDED event? What would happen if the START task is triggered before (or in the same cycle where) the SPI ENDED event fires?

Is there another, better way? I was thinking about using the SPIM to transfer one array of 240 bytes at once, and pulsing the shift-latch using a different timer, but I'm a bit wary of the ultra-tight and hard-to-debug timing requirements caused by this approach, given that my understanding of the PPI latencies already seems to be off.

cwriter 1 month ago in reply to Håkon Alseth

Hi

Thank you for the suggestions!

Håkon Alseth said:
I would recommend setting up a test with TIMER/GPIOTE/PPI controlled CSN pin in this scenario, where you start a timer based on the NRF_SPIMx->EVENTS_STARTED event, and clear/set the GPIOTE IN channel when each byte is finished (based on a set capture/compare value).
Håkon Alseth said:
Stalling would be in the 16 MHz cycle range, so in the 62.5 ns range, and depend on other DMA activity happening at the exact same time.

I'll try this in a few days. I would like to use DMA on other peripherals (UART/I2C) concurrently, but I'll see if I can stop those transfers during the sensitive regions. Even at 16 MHz, I would be able to only sustain at most 4 such conflicts, no? Even if I were to set CSN in the PCLK clock cycle before the first SPI CLK posedge, this could lead to issues.

Håkon Alseth said:
You can potentially think about adding a "dummy byte" in every other byte, if this has a negative impact on the timing requirements of the sequence as a whole.

I think this would add another "1us" of waiting, which would degrade into (roughly) the same effect as the 1us setup time.

Håkon Alseth said:
Is this a requirement, ie. that the display must run on 0.75*8 = 6 MHz? This is hard to obtain based on a 16 MHz peripheral clock.

Unfortunately, Sharp does not seem to provide information for small-scale use cases. From what I can see, it looks like the datasheet is (mainly) modeled to match the driver IC implementation. I think this given frequency is the max clock BCK can be driven with, and it's the "minimum" clock to still adhere to the refresh rate constraints. From a technical standpoint, the Memory-In-Pixel displays (from what I could gather from the public documentation, anyway) behave a bit like a shift register, and are mainly passive components with SRAM cells where the digital clock speed does not really matter. The LCD part (VCOM/VA/VB) is quite easy to achieve, and this one is also timing-sensitive.

The datasheet uses a bit of a weird logic because the listed frequency is "rate of BCK polarity changes", not "rate of BCK rising edges", so the full target using a 595 shifter in-between would be 0.75 MHz * 8 (bits) * 2 (2 edges per cycle) = 12 MHz. This is unobtainium on 16 MHz with a prescaler of 2, but I'll take 8 MHz gladly.
So in short, driving the pixel clock at 0.5 MHz instead of 0.75 MHz (or shifting into the shift register at 8 instead of 12 MHz) should be all ok - it's just that the image transfer time increases.
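The conversion between the display clock and the SPI clock can be sketched as below. This only encodes the formula used above (8 bits per byte through the 595, times 2 edges per cycle); the function names are hypothetical, and I have not re-checked the factor of 2 against the datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* SPI clock needed to feed the 595 for a given BCK frequency:
 * 8 bits per byte, times 2 because both BCK edges count (assumption from above). */
uint32_t required_sck_hz(uint32_t bck_hz)
{
    return bck_hz * 8u * 2u;
}

/* BCK frequency that a given SPI clock can sustain under the same assumption. */
uint32_t achievable_bck_hz(uint32_t sck_hz)
{
    return sck_hz / (8u * 2u);
}

/* Image transfer time relative to the datasheet target, in percent. */
uint32_t transfer_time_percent(uint32_t target_bck_hz, uint32_t actual_bck_hz)
{
    assert(actual_bck_hz > 0);
    return (uint32_t)(((uint64_t)target_bck_hz * 100u) / actual_bck_hz);
}
```

For the values in this thread, `required_sck_hz(750000u)` is 12 MHz, `achievable_bck_hz(8000000u)` is 0.5 MHz, and the image transfer then takes 150 % of the time it would take at the datasheet frequency.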

Håkon Alseth said:
Depending on the timing requirements of the display, this "no delay between bytes" could potentially be problematic? Example: if the pulse on "CSN" must occur for every byte _and_ must be 125 ns, and the bit width of the upcoming sequence is then 125 ns, this leaves the possibility of a skew/overlap, or am I misunderstanding?

The 595 can start shifting into the input register without affecting the outputs, so the display will not see any issue (the display clock BCK can safely be PPI'd + TIMER'd + GPIOTE'd after data was shifted into the output stage of the 595 where there is essentially a 1us window, minus a tiny setup time on the 595's outputs). The issue when clocking uninterrupted is in the (fake) CSN. The 595 shifts to the output stage only on rising edges, so the (fake) CSN pin _could_ essentially just be a 2 MHz 50% duty PWM signal - but the rising edge must be at least 24ns (i.e. at least one PCLK cycle) after the last data bit was shifted (with a rising SPI CLK edge), and must be at least one cycle before the next rising SPI CLK. So yes, there are not many PPI delay cycles that will work when clocking at full rate.
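The behaviour described above (shifting does not disturb the outputs until the latch sees a rising edge) can be illustrated with a small software model of a 595-style register. This is a simplified sketch, not a timing-accurate model: the type and function names are invented for illustration, the output bit order is simplified, and a simultaneous rising edge on both clocks is modelled as the latch capturing the shift stage from before that shift.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t shift;   /* input shift stage */
    uint8_t output;  /* storage/output stage seen by the display */
    bool    srclk;   /* previous level of the shift clock (SPI CLK) */
    bool    rclk;    /* previous level of the latch clock (fake CSN) */
} hc595_t;

/* Apply new pin levels; both stages react to rising edges only. */
void hc595_step(hc595_t *dev, bool ser, bool srclk, bool rclk)
{
    bool shift_edge = srclk && !dev->srclk;
    bool latch_edge = rclk && !dev->rclk;

    if (latch_edge) {
        dev->output = dev->shift;   /* copy before any shift in this step */
    }
    if (shift_edge) {
        dev->shift = (uint8_t)((dev->shift << 1) | (ser ? 1u : 0u));
    }
    dev->srclk = srclk;
    dev->rclk = rclk;
}

/* Clock one byte in, MSB first, without touching the latch line. */
void hc595_shift_byte(hc595_t *dev, uint8_t value)
{
    for (int bit = 7; bit >= 0; --bit) {
        bool ser = ((value >> bit) & 1u) != 0u;
        hc595_step(dev, ser, false, dev->rclk);  /* clock low, data set up */
        hc595_step(dev, ser, true, dev->rclk);   /* rising edge shifts the bit */
    }
}

/* Pulse the latch line low then high without touching the shift clock. */
void hc595_latch(hc595_t *dev)
{
    hc595_step(dev, false, dev->srclk, false);
    hc595_step(dev, false, dev->srclk, true);    /* rising edge updates outputs */
}
```

Starting from a zeroed `hc595_t`, calling `hc595_shift_byte(&dev, 0xA5)` leaves `dev.output` unchanged; only `hc595_latch(&dev)` moves 0xA5 to the outputs, and a following `hc595_shift_byte` for the next byte again leaves the outputs alone.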

Håkon Alseth said:
What I am trying to ask: Can the "CSN" signal overlap with the upcoming byte safely?

CSN: The rising edge must be after the rising edge of the last bit of each SPI byte, and before the rising edge of the first bit. High time / Low time is lenient (the 595 is pos-edge sampling).
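As a rough sketch of how tight this window is, the helper below counts the PCLK tick offsets at which a latch edge would satisfy both margins. It assumes a 16 MHz PCLK and a prescaler of 2, so two consecutive rising SPI CLK edges are 2 PCLK ticks apart when clocking without a gap; the function name and the tick-based view are illustrative assumptions, not something taken from the documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Number of PCLK tick offsets, counted from the last rising SPI CLK edge of a
 * byte, at which the latch edge is at least `min_after` ticks after that edge
 * and at least `min_before` ticks before the first rising edge of the next
 * byte. `spacing_ticks` is the distance between those two rising edges. */
uint32_t valid_latch_ticks(uint32_t spacing_ticks,
                           uint32_t min_after,
                           uint32_t min_before)
{
    if (spacing_ticks < min_after + min_before) {
        return 0u;  /* window too small: no offset satisfies both margins */
    }
    return spacing_ticks - min_after - min_before + 1u;
}
```

With uninterrupted clocking, `valid_latch_ticks(2u, 1u, 1u)` leaves exactly one usable tick, which is why any skew is a problem. If the bus idles for about 1 us (16 extra ticks) between bytes, `valid_latch_ticks(18u, 1u, 1u)` leaves 17.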

Håkon Alseth said:
Another option would be to use the NRF_I2S peripheral to bit bang this implementation, as it generates a word sync (left/right).

Yes, this would be almost perfect! Unfortunately, the I2S does not support a 4-bit transfer window, and the 595 only samples on the rising edge - therefore, this would also include a dummy byte, and the data rate would again be relatively low.

Håkon Alseth said:
This indicates that the sequence must be in-order, i.e. cannot run in parallel?

I did not quite get this, sorry. It can be "parallel" in that there is no need to be able to stall the transfer - but if the transfer is not stalled a little bit, timing the CSN posedge within 1 SPI clock cycle will certainly be hard, especially since it should tolerate SPI clock skew.

The key issue I have is that I'm afraid of desyncs when not timing the events with the actual clock on the SPI clock line. Essentially, I'm looking for some kind of re-synchronization.
I was considering the following other options which would link actual SPI bus behavior to events - maybe you could tell me if this could work? (Unfortunately, I fail to get more information about these elements from the documentation - they seem to be underdocumented for the SPIM compared to the TWI and UART counterparts, and the DMA.MATCH is referenced to the EasyDMA section, which does not discuss it at all).

SPIM SHORT
Just to be clear: You're confirming that the SPIM.SHORTS[END_START] also suffers from a 1us delay? There was a similar question here for an older revision, which seems to indicate that the delay is present even with the short: NRF5340 SPIM turnaround time

SPIM RX MATCH event
If the SPIM MISO line is not connected (no pin configured in the CONFIG register), will the DMA read a defined value (maybe 0)? If so, is it possible to program the DMA.RX.MATCH event to a single 0 byte value which could trigger after every byte, and PPI the CSN toggle? Does this work without setting a DMA read pointer (i.e. with DMA.RX.MAXCNT=0), or is it safe to read into the same buffer that was used to send? Does it match on 4-byte-values only or does it match on single bytes? What is the delay of DMA.RX.MATCH?

SPIM SUSPEND + RESUME
Does a SUSPEND task triggered immediately after starting assert CSN when the byte ends? Does RESUME also have the 1us latency or can the SPIM resume immediately? Even if this does not assert the CSN, this could be an option to PPI the CSN, and potentially stretch the SPI clock in order to get more time in between, i.e. START -> SUSPEND (byte still transfers) -> (Transfer ends) -> PPI'd + TIMER'd + GPIOTE'd CSN triggers -> PPI'd RESUME + TIMER.CLEAR for the next byte. Unfortunately, there seems to be no SUSPENDED event - is this correct? - but the timer could be used to work around this.
In short: Is there any documentation about the SUSPEND mechanism for SPIM available?

SPIM CLK SNIFFING
Not possible from what I take from the datasheet, but worth a try: Is it possible to connect the input buffer of the GPIO to a GPIOTE counting rising edges while the SPI peripheral is clocking it? If yes, this could be used to count the rising edges, and simply trigger on every 8th rise.

SPIM + PPI (assuming no SPI clock/DMA skew at all)
Just to verify: What values would need to be set? Is the PPI triggering on every rising edge of the PCLK (16 MHz) or is it double-clocked on rising and falling edges (32 MHz)? Would I need to set a delay of 7 cycles due to a 1 cycle PPI delay in the same domain?

I2S + PPI
Looking at the I2S peripheral, there is no delay of 1us, and it seems to be possible to clock at the same frequency. Would it be possible to set MAXCNT to 1 and chain the transfers through PPI to achieve a higher throughput than using SPI? Does the I2S peripheral automatically update the TXD pointer internally? Given that the description reads: "EVENTS_TXPTRUPD: TDX.PTR register has been copied to internal double-buffers. When the I2S module is started and TX is enabled, this event will be generated for every ceil(RXTXD.MAXCNT/4) words that are sent on the SDOUT pin.", it should trigger on every start when MAXCNT=1 - or did I misread?

Kind regards

Håkon Alseth 1 month ago in reply to cwriter

Hi,

Just to avoid any misunderstanding here: there is no peripheral/DPPI connection that fully fits your use-case, meaning there will be side-effects with all proposed solutions.

To sum up your choices and their side-effects:

* SPIM has a start-up delay, which reduces throughput.

* DPPI/TIMER/GPIOTE will be problematic to sync to your byte clock, which can be problematic when latching towards the external circuitry.

* NRF_I2S will require a dummy transaction, where you use the LRCK as "CSN".

cwriter said:
SPIM SHORT
Just to be clear: You're confirming that the SPIM.SHORTS[END_START] also suffers from a 1us delay? There was a similar question here for an older revision, which seems to indicate that the delay is present even with the short: NRF5340 SPIM turnaround time

Yes, the delay will be present when you trigger TASKS_START, either via CPU or DPPI.

cwriter said:
Yes, this would be almost perfect! Unfortunately, the I2S does not support a 4-bit transfer window, and the 595 only samples on the rising edge - therefore, this would also include a dummy byte, and the data rate would again be relatively low.

This solution will not have the alignment/skew issue that you mention here:

cwriter said:
The 595 can start shifting into the input register without affecting the outputs, so the display will not see any issue (the display clock BCK can safely be PPI'd + TIMER'd + GPIOTE'd after data was shifted into the output stage of the 595 where there is essentially a 1us window, minus a tiny setup time on the 595's outputs). The issue when clocking uninterrupted is in the (fake) CSN. The 595 shifts to the output stage only on rising edges, so the (fake) CSN pin _could_ essentially just be a 2 MHz 50% duty PWM signal - but the rising edge must be at least 24ns (i.e. at least one PCLK cycle) after the last data bit was shifted (with a rising SPI CLK edge), and must be at least one cycle before the next rising SPI CLK. So yes, there are not many PPI delay cycles that will work when clocking at full rate.

But will require one dummy byte per transferred byte.

cwriter said:
I2S + PPI
Looking at the I2S peripheral, there is no delay of 1us, and it seems to be possible to clock at the same frequency. Would it be possible to set MAXCNT to 1 and chain the transfers through PPI to achieve a higher throughput than using SPI? Does the I2S peripheral automatically update the TXD pointer internally? Given that the description reads: "EVENTS_TXPTRUPD: TDX.PTR register has been copied to internal double-buffers. When the I2S module is started and TX is enabled, this event will be generated for every ceil(RXTXD.MAXCNT/4) words that are sent on the SDOUT pin.", it should trigger on every start when MAXCNT=1 - or did I misread?

Alignment is a problem if you plan to send only 1 byte:

https://docs.nordicsemi.com/bundle/ps_nrf54L15/page/i2s.html#ariaid-title9

It is recommended to transfer more than one byte if at all possible.

As you mention, the DMA registers are double-buffered here, so as long as you update the registers before the current transaction is done, they will be latched automatically by the hardware peripheral.

Kind regards,

Håkon