
Central/peripheral mixed devices keep bricking/dying

I have now had two units exhibit nearly the exact same behavior, and it seems odd to me for them to be so similar.  I'm wondering if anyone else has seen something similar.  I have a two-device system.  Both devices use the same firmware, but once they have been set up by the host application, one of them (unit A) allows one peripheral and one central connection (I disallow the other peripheral connection in software) and the other (unit B) allows two peripheral connections (it never scans).  They bond to each other, connect to the phone, and play audio files via I2S.

I have now had two unit A's go bad - as in kaput - in much the same way.  They start by not connecting to the host (Android) quickly, then they start jumping to (seemingly) random memory locations after the I2S stops, then they advertise but stop connecting to both the phone and the other unit altogether and seem to get weird with the peripheral hardware (sensor reads stop working and will just hang).  Then that's it, they are bricks.

I thought it could be a powering issue - they are being powered through VDDH at 4.2V (li-po batt, fully charged) - but the voltages check out fine as 4.2V is in spec.  None of the peripheral components run on VDDH (they all communicate at 3.3V - so below the 3.9V max in the spec).  It only seems to happen with the mixed central/peripheral units so I'm wondering if something happens with the program memory or the NV used by the SD?  Does the SD do a lot of maintenance, enough that after a few weeks of constant code changes it kills that area of memory?  Has anyone else ever had a similar experience?

They are definitely getting used hard in terms of programming, easily a couple dozen times a day, sometimes full erases (i.e., flash included), sometimes not.  Like I said, it has been several weeks of this kind of use, but otherwise a clean lab without a history of ESD problems.  I'm kind of at a loss here as to what else it could be.
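
For a rough sanity check on the flash wear question, the programming rate above can be compared against the flash endurance rating. This is only a back-of-envelope sketch: the 10,000 cycle endurance is an assumed figure that should be checked against the product specification for the actual part, the function names are made up for illustration, and it takes the worst case where every programming session erases each page once.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: endurance rating in write/erase cycles per page.
 * Verify against the product specification for the actual part. */
#define ASSUMED_FLASH_ENDURANCE_CYCLES 10000u

/* Worst case: every programming session erases each page exactly once. */
static uint32_t estimated_erase_cycles(uint32_t programs_per_day, uint32_t days)
{
    return programs_per_day * days;
}

/* Share of the assumed endurance consumed so far, in whole percent
 * (rounded down). */
static uint32_t endurance_used_percent(uint32_t cycles, uint32_t endurance)
{
    assert(endurance > 0);
    return (uint32_t)(((uint64_t)cycles * 100u) / endurance);
}
```

With two dozen programmings a day over six weeks (42 days, a stand-in for "several weeks"), estimated_erase_cycles(24, 42) gives 1008 cycles, roughly a tenth of the assumed rating, so wear from programming alone would not obviously explain the failures.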

  • Hi,

    It sounds like you must have a form of memory corruption, as execution jumps to unexpected addresses. The SoftDevice does not write to flash by itself. This is probably a corruption of some data in RAM, but that could be caused by virtually anything (using uninitialized memory, stack overflow, writing to an array out of bounds, etc.). I suggest you start by inspecting your code to see if you can spot anything dubious.
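
    One common way to test the stack overflow possibility is "stack painting": fill the unused stack with a known pattern at boot, then later count how much of the pattern is still intact. The sketch below is a generic illustration, not SDK code. The function names and the fill pattern are hypothetical, and the region pointer would have to come from your own linker symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary fill pattern (assumption; any unlikely value works). */
#define STACK_PAINT_WORD 0xDEADBEEFu

/* Fill a stack region with the pattern. Call once, early at boot,
 * on the part of the stack that is not yet in use. */
static void stack_paint(uint32_t *region, size_t words)
{
    assert(region != NULL);
    for (size_t i = 0; i < words; i++) {
        region[i] = STACK_PAINT_WORD;
    }
}

/* Count untouched words starting from the lowest address. On Cortex-M the
 * stack grows downward, so index 0 is the last word to be overwritten.
 * A return value of 0 means the stack reached (or passed) its limit. */
static size_t stack_unused_words(const uint32_t *region, size_t words)
{
    assert(region != NULL);
    size_t n = 0;
    while (n < words && region[n] == STACK_PAINT_WORD) {
        n++;
    }
    return n;
}
```

    Calling stack_unused_words() periodically (or from a fault handler) gives a low-water mark. If it ever reaches zero, the stack has most likely overflowed into whatever sits below it in RAM.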

  • Hey Einar,

    I hadn't accepted this answer yet as I wanted to try some things out.  It may well be that I have memory corruption issues, but I still think there is something more that has gone wrong here.  I figured out how to resume communications with my peripherals (a relatively simple mistake I made allowing the system to sleep when I needed the GPIO pins to stay high), but on my "bricked" units, I still cannot connect to them, which I think indicates that something incorrect has happened to the microcontroller.  Specifically, I have erased them, written programs to them that do not require the SoftDevice (which work as expected) and written different SD-reliant applications to them - none of those can connect, although they can advertise.  Is this expected behavior for RAM corruption?  That the SD will no longer function properly?

  • Hi,

    SmallerPond said:
    Is this expected behavior for RAM corruption?  That the SD will no longer function properly?

    If you have arbitrary parts of the RAM corrupted, then literally anything can happen. What happens (if anything) depends entirely on which parts of memory get invalid values.

    My idea that you have corrupt memory stems from this:

    then they start jumping to (seemingly) random memory locations

    It may not be the case though.

    SmallerPond said:
    Specifically, I have erased them, written programs to them that do not require the SoftDevice (which work as expected) and written different SD-reliant applications to them - none of those can connect, although they can advertise.

    Have you done any debugging? Does the central not try to connect, or does the connection fail to be established? Do you have a sniffer trace? A debug log from the nRF?

  • Hey Einar,

    Thanks for replying.  Sorry, I think my initial post might have been confusing.  When I say the SD will no longer function properly, I mean that it will never again function properly with any application - not that it will stop functioning for now.

    I have done oodles of debugging, including packet sniffing.  Setting breakpoints doesn't turn up any errors - when the devices refuse to connect, they otherwise run fine.  In Wireshark, I see the following on devices that will advertise but no longer connect:

    ws_cap01_not_working.pcapng

    where you can see that the device just advertises without ever connecting.  The exact same code on a separate unit, connecting to the same host shows similar behavior but then connects:

    ws_cap01_working.pcapng

    In both cases I had the device on and nRF Connect open on my phone trying to connect.

    From here it is a little difficult to figure out where to go.  No matter which project is loaded on a "bad" unit, I see this same behavior - advertising but never connecting, and all other functionality (LEDs, sensors, SPI flash IC, I2S, etc.) working as expected.  The only difference between these two units is that for a while, the bad unit also functioned as a central device.  However, after it stopped connecting (which happened simultaneously in both the central and peripheral roles) it was erased and loaded with the peripheral-only code (as well as loading it with other working projects as a sanity check).  We have another unit that has started showing the same behavior.  It was used as a central and now will only connect to the J-Link intermittently (a symptom I forgot to mention).  I expect that it will stop connecting via BLE in the next day or so.

    As it stands now, I am afraid to tell the client to continue testing firmware since I am pretty sure that it will brick half of their units in the near term.  Is there something else I can try to get connections working again?  I've used nrfjprog --recover to try and reset them, but I don't see why that would be more effective than erasing via J-Link.
