SPI transfers cause BLE connection timeout, but only on some units and especially when cold?

Question

I'm running into a wall trying to find the root cause of a problem that we only started noticing after a small production run of a custom device built around the nRF52840 (specifically the Rigado/u-blox BMD-340 module). 
 In short, we have an IMU on the board (Hillcrest/CEVA BNO085) communicating over SPI at 2 MHz, and when we actually use the IMU (reading ~16-byte bursts about 400 times/sec), the BLE connection drops with a 0x08 timeout error. If we don't communicate with the IMU, we can maintain a stable connection as long as we want. Both the IMU and the Bluetooth module are driven by the same external SiTime 32768 Hz TCXO (SIT1552AI-JE-DCC-32.768E). This signal looks perfectly clean on all units, as far as I can measure, though I may not have the right equipment to measure it adequately.
 The problem only happens on about 5% of the units we built, and not always with equal severity. Most of our devices have no issues. The ones that exhibit this problem often do so rarely, with only two that we've found so far exhibiting it every time, right away, as soon as we connect and start using the IMU. I have one of these on my desk in a test jig.
 I'm using SoftDevice S140, v7.2.0, SDK 17.1.0. In addition to the radio, I've also configured TWI(0), SPI(1), and SPI(2). Currently, only SPI(1) is actively used. SPI(1) goes to a different peripheral IC that is currently not implemented in firmware. In case it's relevant, the SPI pins in question are P1.13 (MISO), P1.14 (MOSI), and P1.15 (SCK). The CS pin for the slave device in question is P1.12.
 Let me walk you through my troubleshooting efforts. 
 1. I am certain the disconnection reason given is a supervision timeout. I have plenty of debug output in place to confirm this. Other than that error code in the disconnection event, there aren't any other helpful SoftDevice debug logs generated, even when the SD log level is set to debug (4). 
 2. I've captured multiple sniffer traces using a nRF52840 USB dongle and Wireshark, but nothing helpful came of that. All it shows is normal communication up until the peripheral simply stops transmitting, then resumes advertising as intended after a disconnection. There's no catastrophic crash, hard fault, or watchdog reset (WDT expires after 2 seconds); the firmware seems to keep humming along fine other than the loss of BLE communication, but it will happily reconnect again afterwards. 
 3. I've tried sending dummy data over the BLE connection at the same rate as what we capture and process from the IMU (~3200 bytes/sec), and the data transfer itself doesn't appear to cause any issues. I can do that all day long, and it stays connected. Further, if I simply gather data from the IMU over SPI but don't send it over the air, the connection still drops. 
 4. My non-blocking SPI transfer sets a volatile "xfer_done" boolean in the event handler, and the caller loops on "sd_app_evt_wait()" until the transfer finishes so that the SD doesn't get ignored. I was previously using "__WFE()" instead, and I was really hoping that change would fix it, but it had no effect.
 5. I've tried using both blocking and non-blocking SPI master implementations. My sdk_config.h has EASY_DMA enabled, but I'm not using the new NRFX_SPIM implementation. Should I change this? 
 6. After stumbling across this post from earlier in 2023, I tried using a heat gun on my test device to warm it up significantly (probably 40 deg. C, room temperature is more like 25 deg. C), AND IT STARTED WORKING. Once it cooled down again, the instant-timeout issue came back. This is the most interesting result so far, because it actually explains some behavior we saw but couldn't figure out--namely, the first time we really noticed the problem was during some tests involving outdoor use, where it was cooler than indoors. BUT WHY? Although I can see why temperature could affect a clock signal and therefore link stability, why would it ONLY matter if the IMU is in use via the SPI peripheral? 
 7. Changing the LFCLK source to SYNTH in sdk_config.h eliminates the problem entirely, at the expense of current consumption. This is why I still suspect that the clock has something to do with it. But I can't figure out what could be wrong.
 I'm at a loss what to measure, test, or try at this point. Do any of you wonderful people have any ideas? I am happy to provide more detail on any point if needed.
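For scale, the SPI figures quoted above (16-byte bursts, 2 MHz SCK, about 400 bursts per second) can be turned into a rough bus-load estimate. This is only a back-of-envelope sketch: the function names are made up for illustration, and it ignores CS setup time and any gaps between bytes.

```c
#include <stdint.h>

/* Time SCK is toggling for one burst, in microseconds.
 * Ignores CS setup/hold and inter-byte gaps (an assumption). */
double spi_burst_us(uint32_t bytes, uint32_t sck_hz)
{
    return (bytes * 8.0 * 1e6) / (double)sck_hz;
}

/* Fraction of each second during which SCK is active (0.0 to 1.0). */
double spi_bus_duty(uint32_t bytes, uint32_t sck_hz, uint32_t bursts_per_s)
{
    return spi_burst_us(bytes, sck_hz) * (double)bursts_per_s / 1e6;
}
```

With the numbers from the question, spi_burst_us(16, 2000000) gives 64 µs per burst and spi_bus_duty(16, 2000000, 400) gives about 2.6% of the time, so the SCK activity is brief but recurs every 2.5 ms.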

hmolesworth · Accepted Answer

Looks correct; however I would try shorting out the 1k0 series resistor and see if there is any change. A periodic check in the firmware for the contents of LFCLKSRCCOPY and LFCLKSRC and LFCLKRUN would also be a good idea .. 
 .. I should add that with a 1k0 resistor the high-impedance nRF52840 clock input pin will experience more interference from a high-speed clock source such as SPI .. 
 Edit: I looked at the BMD340 and the SCK and Osc pins are pretty close together, plus the P1.15 SCK is unsuitable for any kind of SPI clock; it is close to the radio and will affect as well as be affected by transmissions. So .. the fix - if shorting the 1k doesn't work - may well be to use an unrestricted P0.nn i/o pin for SCK (and probably MOSI and MISO as well). 
 WLCSP  Port   BMD340  Function
 =====  =====  ======  ==========================================================================================
 B9     P0.00  Pin 13  XL1. Digital I/O, analog input. General purpose I/O, connection for 32.768 kHz crystal
 B10    P0.01  Pin 14  XL2. Digital I/O, analog input. General purpose I/O, connection for 32.768 kHz crystal
 C6     P1.15  Pin 64  Digital I/O. General purpose I/O. Standard drive, low frequency I/O only. Close to radio
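The periodic check of LFCLKSRC, LFCLKSRCCOPY and LFCLKRUN suggested above could be sketched roughly as follows. This is a minimal sketch, not a tested implementation: the type and function names are hypothetical, the bit positions are taken from the nRF52840 CLOCK register description, and the raw register values are passed in as arguments so the decoding logic can be exercised off-target. On the device the caller would read them from NRF_CLOCK (for example NRF_CLOCK->LFCLKSRC), assuming that access is permitted while the SoftDevice is enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field layout as described for the nRF52840 CLOCK peripheral. */
#define LFCLK_SRC_MASK       0x3u        /* SRC field, bits 0..1 */
#define LFCLK_BYPASS_BIT     (1u << 16)  /* LFCLKSRC.BYPASS */
#define LFCLK_EXTERNAL_BIT   (1u << 17)  /* LFCLKSRC.EXTERNAL */
#define LFCLKRUN_STATUS_BIT  (1u << 0)   /* LFCLKSTART task has been triggered */

typedef enum { LF_SRC_RC = 0, LF_SRC_XTAL = 1, LF_SRC_SYNTH = 2 } lf_src_t;

/* Hypothetical snapshot type, introduced only for this sketch. */
typedef struct {
    lf_src_t configured;      /* LFCLKSRC.SRC */
    lf_src_t at_start;        /* LFCLKSRCCOPY.SRC */
    bool     bypass;          /* LFCLKSRC.BYPASS */
    bool     external;        /* LFCLKSRC.EXTERNAL */
    bool     start_triggered; /* LFCLKRUN.STATUS */
} lfclk_snapshot_t;

/* Decode raw register values into named fields. */
lfclk_snapshot_t lfclk_decode(uint32_t lfclksrc, uint32_t lfclksrccopy,
                              uint32_t lfclkrun)
{
    lfclk_snapshot_t s;
    s.configured      = (lf_src_t)(lfclksrc & LFCLK_SRC_MASK);
    s.at_start        = (lf_src_t)(lfclksrccopy & LFCLK_SRC_MASK);
    s.bypass          = (lfclksrc & LFCLK_BYPASS_BIT) != 0;
    s.external        = (lfclksrc & LFCLK_EXTERNAL_BIT) != 0;
    s.start_triggered = (lfclkrun & LFCLKRUN_STATUS_BIT) != 0;
    return s;
}

/* True if the snapshot matches the configuration the firmware expects. */
bool lfclk_matches_expected(const lfclk_snapshot_t *s, lf_src_t src,
                            bool bypass, bool external)
{
    return s->start_triggered &&
           s->configured == src && s->at_start == src &&
           s->bypass == bypass && s->external == external;
}
```

Which BYPASS/EXTERNAL combination is "expected" depends on how the TCXO is wired and configured (low-swing versus full-swing external clock), so those are left as parameters. Logging any mismatch from a slow timer would show whether the clock source ever changes behind the application's back.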

Jørgen Holmefjord · Answer

jrowberg said: A design change isn't in the cards for the immediate future, so it looks like I need to figure out the least painful way to rely on the HFCLK during connections at this point. 
 Have you checked whether the LFRC clock source eliminates the problem in the same way LFSYNT does? This is the clock source normally used in designs that do not include a crystal. The current consumption with LFRC is slightly higher than with LFXO, both because it is periodically calibrated against HFCLK and because its worse tolerance requires more window widening in connections, but it is far below the current used by LFSYNT. LFSYNT is also not well tested and is not recommended for use with the SoftDevice.
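To put a number on the window-widening cost, the ppm-dependent part of the receive window can be estimated from the two sleep clock accuracies and the time since the last anchor point. The sketch below is only illustrative: the function name is made up, it covers only the clock-drift term (a real link layer adds further margin), and the interval and ppm values used in the note afterwards are assumptions, not measurements from this design.

```c
#include <stdint.h>

/* One-sided window widening in microseconds, rounded up:
 * (central ppm + peripheral ppm) / 1e6 * time since last anchor.
 * Covers only the clock-drift term. */
uint32_t window_widening_us(uint32_t central_sca_ppm,
                            uint32_t peripheral_sca_ppm,
                            uint32_t since_anchor_us)
{
    uint64_t num = (uint64_t)(central_sca_ppm + peripheral_sca_ppm)
                   * since_anchor_us;
    return (uint32_t)((num + 999999u) / 1000000u);
}
```

For an assumed 30 ms connection interval and a 50 ppm central, a 500 ppm peripheral clock needs about 17 µs of widening per event against about 3 µs for a 20 ppm one. The extra receiver-on time is small, which fits the point that LFRC costs only slightly more than LFXO.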
