SPI transfers cause BLE connection timeout, but only on some units and especially when cold?

I'm running into a wall trying to find the root cause of a problem that we only started noticing after a small production run of a custom device built around the nRF52840 (specifically the Rigado/u-blox BMD-340 module).

In short, we have an IMU on the board (Hillcrest/CEVA BNO085) communicating over SPI at 2 MHz, and when we actually use the IMU (reading ~16-byte bursts about 400 times/sec), the BLE connection drops with a 0x08 timeout error. If we don't communicate with the IMU, we can maintain a stable connection as long as we want. Both the IMU and the Bluetooth module are driven by the same external SiTime 32.768 kHz TCXO (SIT1552AI-JE-DCC-32.768E). The clock signal looks perfectly clean on all units, as far as I can measure, though I may not have the right equipment to measure it adequately.

The problem only happens on about 5% of the units we built, and not always with equal severity. Most of our devices have no issues. Most of the units that do exhibit the problem do so only rarely; so far we've found only two that exhibit it every time, right away, as soon as we connect and start using the IMU. I have one of these on my desk in a test jig.

I'm using SoftDevice S140 v7.2.0 with SDK 17.1.0. In addition to the radio, I've also configured TWI(0), SPI(1), and SPI(2). Currently, only SPI(1) is actively used; SPI(2) goes to a different peripheral IC that is not yet implemented in firmware. In case it's relevant, the SPI pins in question are P1.13 (MISO), P1.14 (MOSI), and P1.15 (SCK), and the CS pin for the slave device in question is P1.12.

Let me walk you through my troubleshooting efforts.

1. I am certain the disconnection reason given is a supervision timeout; I have plenty of debug output in place to confirm this (a minimal sketch of the reason-code check appears after this list). Other than that error code in the disconnection event, the SoftDevice doesn't generate any other helpful debug logs, even with the SD log level set to debug (4).

2. I've captured multiple sniffer traces using an nRF52840 USB dongle and Wireshark, but nothing helpful came of that. All it shows is normal communication up until the peripheral simply stops transmitting, then resumes advertising as intended after the disconnection. There's no catastrophic crash, hard fault, or watchdog reset (the WDT expires after 2 seconds); the firmware seems to keep humming along fine other than the loss of BLE communication, and it will happily reconnect again afterwards.

3. I've tried sending dummy data over the BLE connection at the same rate as what we capture and process from the IMU (~3200 bytes/sec), and the data transfer itself doesn't appear to cause any issues. I can do that all day long, and it stays connected. Further, if I simply gather data from the IMU over SPI but don't send it over the air, the connection still drops.

4. My non-blocking SPI transfer sets a volatile "xfer_done" boolean in the event handler, and the calling code waits on "sd_app_evt_wait()" in a while loop until the transfer finishes, so SoftDevice events aren't ignored while waiting (see the transfer/wait sketch after this list). I was previously using "__WFE()" instead, and I was really hoping that change would fix it, but it had no effect.

5. I've tried both blocking and non-blocking SPI master implementations. My sdk_config.h has EasyDMA enabled for the SPI instance, but I'm not using the newer NRFX_SPIM implementation (the relevant flags are shown in the config sketch after this list). Should I change this?

6. After stumbling across this post from earlier in 2023, I tried using a heat gun on my test device to warm it up significantly (probably 40 deg. C; room temperature is more like 25 deg. C), AND IT STARTED WORKING. Once it cooled down again, the instant-timeout issue came back. This is the most interesting result so far, because it actually explains some behavior we saw but couldn't figure out: the first time we really noticed the problem was during some tests involving outdoor use, where it was cooler than indoors. BUT WHY? I can see why temperature could affect a clock signal and therefore link stability, but why would it ONLY matter when the IMU is in use via the SPI peripheral?

7. Changing the LFCLK source to SYNTH in sdk_config.h eliminates the problem entirely, at the expense of current consumption (the change is shown in the config sketch after this list). This is why I still suspect the clock has something to do with it, but I can't figure out what could actually be wrong.
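For completeness, here is a minimal sketch of the kind of reason-code check I mean in point 1, using the standard SDK 17.1.0 / S140 names (simplified for illustration, not my exact handler):

```c
// Minimal illustrative sketch: log the HCI reason code on every disconnect.
// Standard SDK 17.1.0 / S140 names; logging simplified for illustration.
#include "ble.h"
#include "ble_hci.h"
#include "nrf_log.h"

static void ble_evt_handler(ble_evt_t const * p_ble_evt, void * p_context)
{
    if (p_ble_evt->header.evt_id == BLE_GAP_EVT_DISCONNECTED)
    {
        uint8_t reason = p_ble_evt->evt.gap_evt.params.disconnected.reason;
        NRF_LOG_INFO("Disconnected, reason 0x%02X", reason);

        if (reason == BLE_HCI_CONNECTION_TIMEOUT)   // 0x08: supervision timeout
        {
            NRF_LOG_WARNING("Supervision timeout: link dropped without a normal termination");
        }
    }
}

// Registered with the SoftDevice dispatcher in the usual way, e.g.:
// NRF_SDH_BLE_OBSERVER(m_ble_observer, 3, ble_evt_handler, NULL);
```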
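For point 4, the transfer/wait pattern is essentially the following (a trimmed sketch using the legacy nrf_drv_spi driver from SDK 17.1.0; the instance number, buffers, and init details are simplified placeholders):

```c
// Trimmed sketch of the non-blocking transfer + wait pattern from point 4.
// Legacy nrf_drv_spi driver (SDK 17.1.0); the driver is assumed to have been
// initialized elsewhere with spi_evt_handler as its event handler.
#include <stdbool.h>
#include "nrf_drv_spi.h"
#include "nrf_soc.h"
#include "app_error.h"

static const nrf_drv_spi_t m_spi = NRF_DRV_SPI_INSTANCE(1);
static volatile bool       m_xfer_done = false;

static void spi_evt_handler(nrf_drv_spi_evt_t const * p_event, void * p_context)
{
    if (p_event->type == NRF_DRV_SPI_EVENT_DONE)
    {
        m_xfer_done = true;              // set from the driver's DONE event
    }
}

// Start a transfer, then sleep until it completes, yielding to the SoftDevice.
static void imu_read_burst(uint8_t const * p_tx, uint8_t * p_rx, uint8_t len)
{
    m_xfer_done = false;
    APP_ERROR_CHECK(nrf_drv_spi_transfer(&m_spi, p_tx, len, p_rx, len));

    while (!m_xfer_done)
    {
        // Previously __WFE(); switching to sd_app_evt_wait() made no difference.
        (void) sd_app_evt_wait();
    }
}
```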
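And for points 5 and 7, these are roughly the sdk_config.h fragments involved (a sketch using the SDK 17.1.0 legacy-layer macro names; the numeric values are the enum values from nrf_sdm.h, and the accuracy setting shown is illustrative):

```c
// Sketch of the relevant sdk_config.h fragments (SDK 17.1.0 legacy-layer names).

// Point 5: SPI master driver with EasyDMA (i.e. SPIM underneath) via the legacy API.
#define SPI_ENABLED                   1
#define SPI1_ENABLED                  1
#define SPI1_USE_EASY_DMA             1
// NRFX_SPIM_ENABLED would be the flag for the newer nrfx driver, which I'm not using.

// Point 7: SoftDevice LFCLK source (values from nrf_sdm.h).
// Normal build: the external 32.768 kHz source (typically NRF_CLOCK_LF_SRC_XTAL = 1);
// the experiment that "fixes" the issue synthesizes LFCLK from HFCLK instead.
#define NRF_SDH_CLOCK_LF_SRC          2   // NRF_CLOCK_LF_SRC_SYNTH (was 1, XTAL)
#define NRF_SDH_CLOCK_LF_RC_CTIV      0   // only used with the internal RC source
#define NRF_SDH_CLOCK_LF_RC_TEMP_CTIV 0
#define NRF_SDH_CLOCK_LF_ACCURACY     7   // NRF_CLOCK_LF_ACCURACY_20_PPM (illustrative)
```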

I'm at a loss what to measure, test, or try at this point. Do any of you wonderful people have any ideas? I am happy to provide more detail on any point if needed.

Reply
  • Following up on my reply yesterday: issues involving suspected loss of timing for the BLE transactions can often be pinned down by adding a self-test of the external 32 kHz clock that alerts when that clock input behaves unexpectedly. This can be done by deriving a time interval from one of the RTC CC conditions (either a dedicated CC or one leveraged off an existing compare) and deriving a comparable interval from a known-good timing reference. For testing, the known-good reference can be the nRF52840 32 MHz crystal, although it has to be kept running during the tests, which increases power consumption. Another option is to derive a known interval from the external IMU, which, although it uses the same TCXO, is likely to behave differently from the nRF52840. The IMU keeps internal time, though reading that out might require some creative thinking; the datasheet does state "Timestamps attached to sensor reports". If a fixed update rate triggered by the IMU is used, simply using the packet sequence number from the IMU would give a reasonably accurate timing source that can be compared against the nRF52840 RTC, providing continuous monitoring. As an aside, make sure the 'scope probes are set to 10x and are calibrated before providing the waveform mentioned in my other reply.
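A rough sketch of the RTC-vs-HFCLK comparison described above (instance choices, interrupt priority, and tolerance are illustrative assumptions; RTC2 and TIMER1 are assumed to be free for the test):

```c
// Rough sketch of the 32 kHz self-test: measure a nominal 1 s interval with RTC2
// (clocked from the external 32.768 kHz source via LFCLK) and check it against
// TIMER1 running from HFCLK (the 32 MHz crystal). Instance choices, priority,
// and tolerance are illustrative assumptions, not a drop-in implementation.
#include "nrf.h"
#include "nrf_soc.h"
#include "nrf_log.h"

#define LFCLK_TEST_TICKS   32768UL      // one second of RTC ticks (PRESCALER = 0)
#define HFCLK_EXPECTED_US  1000000UL    // what TIMER1 should count in that second
#define HFCLK_TOLERANCE_US 500UL        // ~500 ppm alarm threshold, tune as needed

void lfclk_selftest_start(void)
{
    // Keep the 32 MHz crystal running so TIMER1 has a good reference
    // (extra current draw, so this is for test builds only).
    (void) sd_clock_hfclk_request();

    // TIMER1 as a free-running 1 MHz microsecond counter.
    NRF_TIMER1->MODE        = TIMER_MODE_MODE_Timer;
    NRF_TIMER1->BITMODE     = TIMER_BITMODE_BITMODE_32Bit;
    NRF_TIMER1->PRESCALER   = 4;        // 16 MHz / 2^4 = 1 MHz
    NRF_TIMER1->TASKS_CLEAR = 1;
    NRF_TIMER1->TASKS_START = 1;

    // RTC2 compare event every 32768 LFCLK ticks (nominally one second).
    NRF_RTC2->PRESCALER   = 0;
    NRF_RTC2->CC[0]       = LFCLK_TEST_TICKS;
    NRF_RTC2->EVTENSET    = RTC_EVTENSET_COMPARE0_Msk;
    NRF_RTC2->INTENSET    = RTC_INTENSET_COMPARE0_Msk;
    NVIC_SetPriority(RTC2_IRQn, 6);     // application-low priority under the SoftDevice
    NVIC_EnableIRQ(RTC2_IRQn);
    NRF_RTC2->TASKS_CLEAR = 1;
    NRF_RTC2->TASKS_START = 1;
}

void RTC2_IRQHandler(void)
{
    NRF_RTC2->EVENTS_COMPARE[0] = 0;
    NRF_RTC2->CC[0] = (NRF_RTC2->CC[0] + LFCLK_TEST_TICKS) & 0x00FFFFFF; // 24-bit counter

    // Capture how many microseconds the HFCLK-based counter saw in the same window.
    NRF_TIMER1->TASKS_CAPTURE[0] = 1;
    uint32_t elapsed_us = NRF_TIMER1->CC[0];
    NRF_TIMER1->TASKS_CLEAR = 1;

    uint32_t err_us = (elapsed_us > HFCLK_EXPECTED_US) ? (elapsed_us - HFCLK_EXPECTED_US)
                                                       : (HFCLK_EXPECTED_US - elapsed_us);
    if (err_us > HFCLK_TOLERANCE_US)
    {
        NRF_LOG_WARNING("LFCLK self-test: nominal 1 s window measured as %u us", elapsed_us);
    }
}
```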
