SPI transfers cause BLE connection timeout, but only on some units and especially when cold?

I'm running into a wall trying to find the root cause of a problem that we only started noticing after a small production run of a custom device built around the nRF52840 (specifically the Rigado/u-blox BMD-340 module).

In short, we have an IMU on the board (Hillcrest/CEVA BNO085) communicating over SPI at 2MHz, and when we actually use the IMU (read ~16-byte bursts about 400 times/sec), the BLE connection drops with a 0x08 timeout error. If we don't communicate with the IMU, we can maintain a stable connection as long as we want. Both the IMU and the Bluetooth module are being driven by the same external SiTime 32768Hz TCXO (SIT1552AI-JE-DCC-32.768E). This signal looks perfectly clean on all units, as far as I can measure--though I may not have the right equipment to measure adequately.
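For scale, here is a rough sketch of how much of each second the SPI bus is actually clocking data at the figures quoted above (16-byte bursts, 400 bursts/sec, 2 MHz SCK). The helper names are made up for illustration, and the estimate counts only clock cycles; it ignores chip-select setup, inter-byte gaps, and any protocol overhead the IMU adds.

```c
#include <assert.h>

/* Time on the wire for one burst, in microseconds.
 * Counts SCK cycles only (8 per byte); CS setup and gaps are ignored. */
static double spi_burst_us(unsigned bytes, double sck_hz)
{
    assert(sck_hz > 0.0);
    return (double)bytes * 8.0 / sck_hz * 1e6;
}

/* Fraction of each second (0.0 to 1.0) that the bus spends clocking data. */
static double spi_bus_duty(unsigned bytes, double bursts_per_sec, double sck_hz)
{
    return spi_burst_us(bytes, sck_hz) * bursts_per_sec / 1e6;
}
```

Calling `spi_burst_us(16, 2e6)` gives 64 us per burst, and `spi_bus_duty(16, 400.0, 2e6)` gives roughly 0.0256, i.e. the bus is busy for about 2.6% of the time under these assumptions.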

The problem only happens on about 5% of the units we built, and not always in equal severity. Most of our devices have no issues. The ones that exhibit this problem often do so rarely, with only two that we've found so far exhibiting it every time, right away, as soon as we connect and start using the IMU. I have one of these on my desk in a test jig.

I'm using SoftDevice S140, v7.2.0, SDK 17.1.0. In addition to the radio, I've also configured TWI(0), SPI(1), and SPI(2). Currently, only SPI(1) is actively used. SPI(2) goes to a different peripheral IC that is currently not implemented in firmware. In case it's relevant, the SPI pins in question are P1.13 (MISO), P1.14 (MOSI), and P1.15 (SCK). The CS pin for the slave device in question is P1.12.

Let me walk you through my troubleshooting efforts.

1. I am certain the disconnection reason given is a supervision timeout. I have plenty of debug output in place to confirm this. Other than that error code in the disconnection event, there aren't any other helpful SoftDevice debug logs generated, even when the SD log level is set to debug (4).

2. I've captured multiple sniffer traces using an nRF52840 USB dongle and Wireshark, but nothing helpful came of that. All it shows is normal communication up until the peripheral simply stops transmitting, then resumes advertising as intended after a disconnection. There's no catastrophic crash, hard fault, or watchdog reset (WDT expires after 2 seconds); the firmware seems to keep humming along fine other than the loss of BLE communication, but it will happily reconnect again afterwards.

3. I've tried sending dummy data over the BLE connection at the same rate as what we capture and process from the IMU (~3200 bytes/sec), and the data transfer itself doesn't appear to cause any issues. I can do that all day long, and it stays connected. Further, if I simply gather data from the IMU over SPI but don't send it over the air, the connection still drops.

4. My non-blocking SPI transfer uses a volatile "xfer_done" boolean in the event handler, and uses "sd_app_evt_wait()" in a while loop until the transfer finishes to ensure the SD doesn't get ignored. I was previously using "__WFE()" instead, and I was really hoping that change would fix it, but it had no effect.

5. I've tried using both blocking and non-blocking SPI master implementations. My sdk_config.h has EASY_DMA enabled, but I'm not using the new NRFX_SPIM implementation. Should I change this?

6. After stumbling across this post from earlier in 2023, I tried using a heat gun on my test device to warm it up significantly (probably 40 deg. C, room temperature is more like 25 deg. C), AND IT STARTED WORKING. Once it cooled down again, the instant-timeout issue came back. This is the most interesting result so far, because it actually explains some behavior we saw but couldn't figure out--namely, the first time we really noticed the problem was during some tests involving outdoor use, where it was cooler than indoors. BUT WHY? Although I can see why temperature could affect a clock signal and therefore link stability, why would it ONLY matter if the IMU is in use via the SPI peripheral?

7. Changing the LFCLK source to SYNTH in sdk_config.h eliminates the problem entirely, at the expense of current consumption. This is why I still suspect that the clock has something to do with it. But I can't figure out what could be wrong.
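For reference, the LFCLK change in item 7 comes down to a handful of macros in sdk_config.h. The fragment below is a sketch of the SYNTH setting, not a copy of the actual project file: the macro names are the ones the nRF5 SDK 17.x SoftDevice handler reads, but the accuracy value shown is an assumption and should match the real tolerance of whatever clock is in use.

```c
/* sdk_config.h fragment (nRF5 SDK 17.x): LFCLK settings passed to the
 * SoftDevice when it is enabled. */

/* Clock source: 0 = RC, 1 = XTAL, 2 = SYNTH (derived from HFCLK). */
#define NRF_SDH_CLOCK_LF_SRC 2

/* RC calibration intervals. Only meaningful when the source is RC,
 * so both stay at 0 for XTAL or SYNTH. */
#define NRF_SDH_CLOCK_LF_RC_CTIV 0
#define NRF_SDH_CLOCK_LF_RC_TEMP_CTIV 0

/* Clock accuracy reported to the peer: 7 = 20 ppm.
 * ASSUMPTION: placeholder value, not taken from the original project. */
#define NRF_SDH_CLOCK_LF_ACCURACY 7
```

Switching back to the external 32.768 kHz input means setting `NRF_SDH_CLOCK_LF_SRC` to 1 and leaving the two RC interval macros at 0.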

I'm at a loss what to measure, test, or try at this point. Do any of you wonderful people have any ideas? I am happy to provide more detail on any point if needed.

  • I looked at the BMD-340, and the SCK and oscillator pins are pretty close together. Also, P1.15 is unsuitable for any kind of SPI clock; it is close to the radio and will affect, as well as be affected by, transmissions. So the fix, if shorting the 1k doesn't work, may well be to use an unrestricted P0.nn I/O pin for SCK (and probably MOSI and MISO as well). (Duplicated to avoid losing text)

  • Wow, that's...really important to know, and I totally missed it during the design phase. Particularly the P1.15 clock speed limitation. Clearly, some revisions are in order if we want to follow the design guidelines more closely.

    I tried shorting the 1k series resistor, and it didn't result in any improvement. Nuts.

    Oddly though, despite this note in the datasheet, that pin seems to have no problem driving the IMU's SPI interface at 2MHz. We've got nearly a year of testing with about a hundred of these all over the place, and nobody's complained about bad data (ever) or this connection issue (until very recently).

    However, I'm still confused about why simply changing the temperature would have such a dramatic effect on whether it happens. Similarly, why does switching the LFCLK source to be synthesized from the HFCLK eliminate the issue as well? If the problem is interference between a misallocated I/O pin and the radio, shouldn't that still cause the issue to occur while the IMU is busily passing data along? Maybe the at-risk signals only concern a non-synthesized clock.

    A design change isn't in the cards for the immediate future, so it looks like I need to figure out the least painful way to rely on the HFCLK during connections at this point.

    Thanks for your help!

  • jrowberg said:
    A design change isn't in the cards for the immediate future, so it looks like I need to figure out the least painful way to rely on the HFCLK during connections at this point.

    Have you checked whether the LFRC clock source eliminates the problem in the same way LFSYNT does? This is the clock source normally used in designs that do not include a crystal. The current consumption of LFRC is slightly higher than LFXO, due to calibration against HFCLK and the window widening that its worse tolerance requires in connections, but it is far below the current used by LFSYNT. LFSYNT is also not well tested and is not recommended for use with the SoftDevice.
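To put a number on the window widening mentioned above, here is a minimal sketch of the formula from the Bluetooth Core Specification link layer: the combined sleep clock accuracy of both devices, in ppm, multiplied by the time since the last anchor point. The function name is made up, and the example figures below (a 50 ppm central, a 30 ms connection interval, 500 ppm for LFRC and 20 ppm for a crystal) are illustrative assumptions rather than values from this thread.

```c
#include <assert.h>

/* Extra listening time, in microseconds, that the peripheral adds on each
 * side of the expected anchor point to cover drift of both sleep clocks.
 * Spec formula only; implementations may add their own jitter margin. */
static double window_widening_us(double central_sca_ppm,
                                 double peripheral_sca_ppm,
                                 double time_since_anchor_us)
{
    assert(central_sca_ppm >= 0.0 && peripheral_sca_ppm >= 0.0);
    assert(time_since_anchor_us >= 0.0);
    return (central_sca_ppm + peripheral_sca_ppm) / 1e6 * time_since_anchor_us;
}
```

With those assumed figures, `window_widening_us(50.0, 500.0, 30000.0)` comes to about 16.5 us, against about 2.1 us for `window_widening_us(50.0, 20.0, 30000.0)`. The wider window costs some receive current, but it also tolerates a larger real clock error before the peripheral starts missing connection events.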

  • I had not tried the RC option, because early on (in desperate random troubleshooting) I encountered errors when switching to it, and then tried HFCLK instead with apparent success. However, I've just changed to RC, and it appears to be reliable as well. I've held a connection for 10 minutes with no data transmission problems so far, and I'll keep letting it run for an extended test. Thanks for the tip! This will be much better than HFCLK if it keeps working.

    For what it's worth, here's a data point for you concerning the synthesized-from-HFCLK source: I left our "bad" device connected and transmitting data overnight with the SYNTH config, and it stayed 100% solid, delivering over 140 MB of dummy data over BLE in about 12 hours. This is the same unit that couldn't stay connected for 1 second with the XTAL input, otherwise the same config (including running 2MHz SPI over officially unrecommended pins near the radio).
