SPI double buffering on TX

I have a use case where I need to send a sequence of 2-byte bursts as SPI master. The data is generated on the fly, and the algorithm is fast enough to saturate the 8Mbit/s SPI on the nRF52832, so I'd like to optimize throughput and latency by interleaving the data generation with the communication.

If I use the naive approach (I don't need the incoming data; in fact, MISO is not wired at all):

      NRF_SPI0->TXD = data >> 8;
      while(!NRF_SPI0->EVENTS_READY);
      (void)NRF_SPI0->RXD;
      NRF_SPI0->EVENTS_READY = 0x0UL;

      NRF_SPI0->TXD = (uint8_t)data;
      while(!NRF_SPI0->EVENTS_READY);
      (void)NRF_SPI0->RXD;
      NRF_SPI0->EVENTS_READY = 0x0UL;

I get about 1.3us between the bytes, and thus an overall transfer rate of ~3Mbit/s.

Since my bursts are just 2 bytes, following section 48.1.3 of the nRF52832 PS, I should be able to just write both bytes to NRF_SPI0->TXD back to back, but I couldn't get that to work reliably. Actually, I can, with an explicit delay, as in:

    [the loop] {
      uint16_t data = produce();
      NRF_SPI0->TXD = data >> 8;
      NRF_SPI0->TXD = (uint8_t)data;
      delay_us(2);
    }

It pretty much works, with the SPI peripheral generating a nice back-to-back 16-tick transfer and about 0.8us between the bursts (a 2-byte transfer starts about every 2.8us, since I have the 2us wait and ~0.8us to generate the next 16 bits), reaching close to 7Mbit/s.

The trouble is, I can't find a reliable way to wait before starting the next 16-bit burst other than that 2us delay. With the explicit delay, I am wasting time that could have been spent in produce() (since 0.8us is the best case and sometimes it takes longer).

In an ideal case, I'd be able to do something like:

    [the loop] {
      uint16_t data = produce();
      wait_for_SPI_idle();
      NRF_SPI0->TXD = data >> 8;
      NRF_SPI0->TXD = (uint8_t)data;
    }

But no matter how I play with NRF_SPI0->EVENTS_READY, I can't construct a reliable "wait_for_SPI_idle()" or similar functionality.
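To make the goal concrete, here is a minimal sketch of the kind of wait I have in mind. It is not verified on hardware. It assumes that RXD is double buffered like TXD and that the READY event for the second byte is only raised once the first byte has been read out of RXD, so it clears the event before reading RXD rather than after. The names spi_regs_t, spi_tx_t, spi_send16 and spi_wait_idle are made up for illustration; spi_regs_t is only a stand-in so the snippet compiles off-target, and on the chip NRF_SPI0 would be used instead.

```c
#include <stdint.h>

/* Hypothetical stand-in for the three registers used below, so the sketch
 * builds off-target. On the nRF52832, NRF_SPI_Type and NRF_SPI0 from the
 * vendor headers would be used instead. */
typedef struct {
    volatile uint32_t EVENTS_READY;
    volatile uint32_t RXD;
    volatile uint32_t TXD;
} spi_regs_t;

/* Hypothetical helper state: the register block plus a software count of
 * bytes written to TXD whose READY event has not been consumed yet. */
typedef struct {
    spi_regs_t *regs;
    uint8_t     pending;
} spi_tx_t;

/* Queue both bytes back to back, relying on the double-buffered TXD.
 * Must only be called when nothing is pending. */
static inline void spi_send16(spi_tx_t *tx, uint16_t data)
{
    tx->regs->TXD = (uint8_t)(data >> 8);
    tx->regs->TXD = (uint8_t)data;
    tx->pending = 2;
}

/* Block until every queued byte has produced a READY event.
 * Assumption: the next READY is only raised after RXD has been read, so the
 * event is cleared BEFORE the read. Clearing it afterwards could wipe out
 * the READY that belongs to the second byte. */
static inline void spi_wait_idle(spi_tx_t *tx)
{
    while (tx->pending > 0) {
        while (!tx->regs->EVENTS_READY) { /* spin until a byte completes */ }
        tx->regs->EVENTS_READY = 0;
        (void)tx->regs->RXD;   /* discard the byte; MISO is not connected */
        tx->pending--;
    }
}
```

With this, the loop body would be produce(), then spi_wait_idle(&tx), then spi_send16(&tx, data). On the very first iteration nothing is pending, so the wait returns immediately.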

Any ideas?