High current state that persists across SW reset?

I'm experiencing a strange issue that I can't seem to figure out. We've built a location system based around the nRF52833, and we're currently running it in a pilot installation at a customer site. Tags (each consisting of an nRF52833, a LIS3DH accelerometer, which is disabled, and a CR2477 battery) are mounted on short flags with strong magnetic bases; the tags sit about 8" or so from the magnets. The flags are stuck to and pulled from heavy equipment moving around a very large shop floor. In addition to location, the tags periodically uplink their battery voltage.

What we've found is that every once in a while, a tag will enter an anomalous state in which it suddenly begins to draw a constant high current (as evidenced by the battery voltage) and dies within 10 days or so. I'd estimate the current to be around 3 mA, which is consistent with the processor not entering its low-power idle state. However, and here's the part that's confounding me, we can remotely reset these tags, and doing so does NOT cause the behavior to go away. The only way to 'fix' the issue is to physically remove the battery and put it back in. I've pored over our code, and I can see nothing that would persist across a SW reset, yet be cleared by a power cycle, and that could prevent the processor from sleeping.

Some other important notes:

  • the accelerometer is an obvious culprit--since it wouldn't be reset when the 833 is--but it's disabled, and I've also confirmed that an improperly handled interrupt does not cause the processor to remain on (the int is edge triggered, so even if it's not cleared on the accelerometer it wouldn't fire again).  There is also no mode of operation of the accelerometer itself that would draw this kind of current.
  • Normal operation continues after the error manifests, and timing is unaffected.  No resets or other anomalous behavior occur when the event starts--it just suddenly happens.
  • The issue only occurs when the tag is moved--and I believe only when the flag is taken off/put on the equipment.  This makes me think some kind of mechanical shock--the magnets are quite strong--causes the problem, but tags that I've inspected don't have any obvious damage (and replacing the battery causes them to return to normal operation)
  • The programming header pads (SCLK, SDIO, VCC, GND) are close to the battery, but none of the ones I inspected were touching it.  It may be possible for the battery to very briefly touch the pads under the right conditions.
  • I've considered other rogue interrupts, but the only ones in use are: the GPIO for the accelerometer, the SAADC (used to sample the battery voltage), the radio (I know it's not active because the code toggles the POWER register after each use and ensures all ints are cleared), the WDT, and SWI0 for the app timer.  Also: seems like a reset would clear any stray interrupts.
  • We've been completely unable to replicate the problem, and we've never run across this in any prior testing.

Through a run of ridiculously bad luck I've been unable to get my hands on a tag that is both alive and currently malfunctioning, so I haven't been able to tie into one yet.  That's my next move (unfortunately I have to wait for one to exhibit the problem, then arrange travel and fly out to do it), but in the meantime I'm curious whether anyone is aware of anything in the Nordic part that could cause this.  My current lines of thought are:

  • Sudden mechanical shock causes the battery to shift and briefly contact the programming pads, which somehow places the tag in debug mode...?
  • Same as above, but the battery either momentarily loses contact with the holder or contacts VCC/GND, which causes the PMU to enter a strange state.

I've attached an image of the malfunction occurring.  About a third of the way across, you can see the Vbat suddenly drops and then begins a steady ramp down.  The start roughly coincides with the tag moving (note the change in X/Y).  I believe in this instance the equipment the tag was on moved to a new workstation, then shortly after the flag was pulled off and moved nearby (workers have to do that occasionally) and that's when it started.  

Until I can get my hands on one that's currently malfunctioning I'm stumped, so thanks in advance for any ideas!

Mark

  • Hello,

    This is pretty strange - the device is operating normally, does not use the I2C, and also does not recover through a soft reset. Maybe it could be the position of the battery even though it seems unlikely.

    As you indicated, peripherals and interrupt states will generally not persist across soft resets. One exception is the debug interface (see Debug Interface mode), but I think you would have to be quite unlucky to get it enabled by some random clock pulses (assuming the bottom side of the coin cell can come into contact with the pads; the SWD lines have internal pull-ups).

    Do you use System OFF mode in your application? I'm asking because it's not possible to enter this mode while the debug interface is enabled.

    Best regards,

    Vidar

  • On the battery position: the bad luck I was referring to was that I received some malfunctioning tags, but they were dropped off late and sat on my porch overnight in very cold weather, which caused the batteries to droop to 1.5V.  After I brought them in the voltage rose to 2.6V on some, but they didn't kick back on.  I don't have access to the reset lines, so I very briefly shorted the battery--without shifting its position--and they came back and the current looked OK (evidenced by the battery voltage continuing to rise over time).  So if it is something related to battery position, it seems to be a transient event that in turn causes a persistent state.
    We're not using system OFF.
    The persistence across reset is really stumping me too.  That, and the fact that the tag otherwise appears to be running perfectly normally (timing, location updates, etc.), makes me feel like it's not just a bug keeping the processor awake.  And while I wouldn't say the issue is common, it's also not exceptionally rare: of the 40 or so active tags, anywhere from 1 to 5 are exhibiting it at any given moment.  Also interesting: I don't think we've seen the issue recur on any tag after its battery was replaced, which again would imply some transient triggering event.
    We do have the ability to remotely FOTA these tags, so if there's anything we can try let me know.   
    One thought: is there a way to emulate a power down reset in terms of register clearing/device state?  It might help narrow down the problem, and until we figure out the issue it would at least be nice to be able to recover tags without having to physically touch them.

  • Hi Vidar,

    Yes, I think I can have them report idle time.  As soon as I catch one malfunctioning I'll FOTA it with a version that uplinks that.

    Thanks!
    Mark

  • Hi Vidar,

    Two other questions I just thought of:

    1. Is there a way for the application firmware to know if the debug interface is active?  E.g. a status register?
    2. We were brainstorming here and one question that came up: it's possible this issue manifests when the flags and tags are placed on a rack of other flags.  When that occurs, the tag may cut through some relatively strong magnetic fields--do you know if it's possible for that to cause unexpected behavior on the processor?  I've tried replicating it here and haven't seen anything.

    Thanks!

    Mark

  • Hi Vidar,

    I've ID'd two tags that started exhibiting the issue last night.  This morning we FOTA'd both tags to a version that tracks and uplinks the percentage of time they spend in the sleep state.  Both tags are reporting that they're spending >99% of their time asleep (I realized too late that my code truncates the decimal instead of rounding, so I'm guessing it's closer to 100% than 99%).  So I don't think there's a rogue or unhandled interrupt keeping the processor awake.  Notably, the error state also persisted through the FOTA process, which entails several resets.  
    If you can think of any other register values that might help inform what's going on, please let me know ASAP and we can program the tag to uplink them before it dies--which will probably be in the next week or so.  
    I think this obviates my question about polling the state of the debug interface, but I am still curious about any possible effects from cutting through strong magnetic fields.  I think both tags got into this state when they were placed in a rack with other tags (which are mounted to the flags that have strong magnetic bases).
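
The truncation mentioned above (a reported 99% that may really be closer to 100%) can be avoided with round-to-nearest integer arithmetic, or by reporting the figure in hundredths of a percent. The following is a minimal sketch, not the code running on the tags; `sleep_ticks` and `total_ticks` are hypothetical counters (for example RTC ticks) that the application is assumed to accumulate.

```c
#include <stdint.h>

/*
 * Percentage of time spent asleep, rounded to the nearest integer.
 * sleep_ticks and total_ticks are hypothetical application counters.
 */
uint32_t sleep_percent_rounded(uint32_t sleep_ticks, uint32_t total_ticks)
{
    if (total_ticks == 0u) {
        return 0u; /* nothing measured yet */
    }
    /* 64-bit intermediate so sleep_ticks * 100 cannot overflow;
     * adding half the divisor turns truncation into rounding. */
    uint64_t scaled = (uint64_t)sleep_ticks * 100u + total_ticks / 2u;
    return (uint32_t)(scaled / total_ticks);
}

/* Same ratio in hundredths of a percent (9987 means 99.87 %). */
uint32_t sleep_centipercent(uint32_t sleep_ticks, uint32_t total_ticks)
{
    if (total_ticks == 0u) {
        return 0u;
    }
    uint64_t scaled = (uint64_t)sleep_ticks * 10000u + total_ticks / 2u;
    return (uint32_t)(scaled / total_ticks);
}
```

Uplinking the finer-grained value makes it easier to tell a tag that sleeps 99.0% of the time from one that sleeps 99.99% of the time.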

    EDIT: if somehow the debug port was activated I think the tag could still sleep, so if there is a way for the application to tell that debug is active--or a way for the application to preemptively clear it even without checking--I'd still like to know

    Thanks!
    Mark

  • Hi Mark,

    It's good to have confirmed that it does indeed enter sleep. In that case there shouldn't be any unhandled interrupt/event sources in your code, as you said.

    Regarding magnetic fields, I have not been able to find other reports where strong fields have been shown to have had adverse effects on 52 series devices. Martin answered a somewhat related question in this thread: https://devzone.nordicsemi.com/f/nordic-q-a/33295/does-the-nrf52832-have-ferromagnetic-components.

    Mark said:
    if somehow the debug port was activated I think the tag could still sleep, so if there is a way for the application to tell that debug is active--or a way for the application to preemptively clear it even without checking--I'd still like to know

    Yes, the device will be able to enter sleep (System ON mode) in debug mode, but with an elevated sleep current. I just measured it at around 1.5 mA here on my desk with an nRF52833 DK. So maybe a bit low to fully explain the discharge curve?
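
As a rough sanity check of the two current figures in this thread (about 3 mA estimated from the discharge curve, about 1.5 mA measured in debug interface mode), runtime can be approximated as capacity divided by a constant drain. This is only a sketch: the 1000 mAh used in the comment below is an assumed nominal CR2477 capacity, not a figure from the thread, and usable capacity at mA-level loads, in the cold, or on a partially depleted cell will be lower.

```c
/*
 * Rough runtime estimate in days for a constant drain.
 * capacity_mah: remaining cell capacity in mAh (assumed, e.g. 1000 for a
 *               fresh CR2477 at nominal rating)
 * current_ma:   average current draw in mA
 */
double estimated_days(double capacity_mah, double current_ma)
{
    if (current_ma <= 0.0) {
        return 0.0; /* invalid input; avoids dividing by zero */
    }
    return capacity_mah / current_ma / 24.0;
}
```

Under that assumption a fresh cell would last roughly 13.9 days at 3 mA and roughly 27.8 days at 1.5 mA, so a tag dying in about 10 days is easier to reconcile with the higher figure.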

    It can be difficult to have the FW determine whether the chip is in debug interface mode. There is the C_DEBUGEN bit in CoreDebug->DHCSR, but this register is only reset by a power-on reset (i.e. the bit remains set after the debugger has exited debug mode), so it only makes sense to check this status bit if your devices have been power-cycled after they were programmed. Also, it's not possible to exit debug interface mode programmatically.
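
A minimal sketch of such a check, assuming the standard Cortex-M layout (DHCSR at address 0xE000EDF0, C_DEBUGEN in bit 0). The function names are hypothetical. The bit test is split into a pure helper so it can be exercised off-target; with the CMSIS headers the register read would normally be written as `CoreDebug->DHCSR`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cortex-M Debug Halting Control and Status Register. */
#define DHCSR_ADDR          0xE000EDF0u
#define DHCSR_C_DEBUGEN_MSK (1u << 0)   /* C_DEBUGEN is bit 0 */

/* Pure helper: true if C_DEBUGEN is set in a raw DHCSR value. */
bool dhcsr_debug_enabled(uint32_t dhcsr_value)
{
    return (dhcsr_value & DHCSR_C_DEBUGEN_MSK) != 0u;
}

/* Target only: read DHCSR and report whether C_DEBUGEN is set.
 * Equivalent to dhcsr_debug_enabled(CoreDebug->DHCSR) with CMSIS. */
bool debug_enable_flag_is_set(void)
{
    uint32_t dhcsr = *(volatile uint32_t *)(uintptr_t)DHCSR_ADDR;
    return dhcsr_debug_enabled(dhcsr);
}
```

Because the bit can stay set after a debugger has detached, a set flag only suggests that debug was enabled at some point since the last reset that clears it.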

    I have been discussing this case with some of my colleagues to try to come up with ideas. I will let you know if we can think of anything else to try.

    Best regards,

    Vidar

  • Hi Vidar,

    It looks like we've found the culprit: the malfunctioning tags came from an early prototype run, which had an external flash IC that we've since not been populating.  The firmware wasn't doing anything with the pins--they were floating, in fact--and it appears that under certain circumstances the flash is entering into a high current state.  I'm not 100% sure of the mechanism, since it primarily occurs in a couple of locations--maybe it is something with the magnets.  In any case, we FOTA'd a version of code that sets the pins for the flash properly and the batteries on the affected tags are recovering. 
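
The thread does not say which pins were involved or what states the fix assigns to them. As an illustration only, the sketch below builds nRF52 `PIN_CNF` values for two common ways of giving a pin a defined level: a driven output (for example, holding an active-low chip-select high so the flash stays deselected) and an input with the internal pull-up enabled. `FLASH_CS_PIN` in the trailing comment is a hypothetical name.

```c
#include <stdint.h>

/* Field values of the nRF52 GPIO PIN_CNF[n] register. */
#define PIN_CNF_DIR_OUTPUT       (1u << 0)  /* DIR   = Output     */
#define PIN_CNF_INPUT_DISCONNECT (1u << 1)  /* INPUT = Disconnect */
#define PIN_CNF_PULL_PULLUP      (3u << 2)  /* PULL  = Pullup     */

/* Output pin with the input buffer disconnected. */
uint32_t pin_cnf_output(void)
{
    return PIN_CNF_DIR_OUTPUT | PIN_CNF_INPUT_DISCONNECT;
}

/* Input pin with the internal pull-up enabled, so it never floats. */
uint32_t pin_cnf_input_pullup(void)
{
    return PIN_CNF_PULL_PULLUP;
}

/*
 * Target-only usage (assumes the nRF MDK headers and a hypothetical
 * FLASH_CS_PIN on port 0):
 *
 *   NRF_P0->OUTSET = 1u << FLASH_CS_PIN;            // latch CS high first
 *   NRF_P0->PIN_CNF[FLASH_CS_PIN] = pin_cnf_output();
 */
```

Setting the output latch before switching the pin to output avoids a brief low pulse on the line.
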
    Thanks so much for your assistance; I'm planning on keeping some of your suggestions in the code (e.g. having the tags report the time spent in idle).  
    One random thing to report: I implemented the C_DEBUGEN check, and at least on the board I tried it on, the bit did clear without a power cycle after I disconnected the debugger.

    Thanks again!
    Mark

  • Hi Mark,

    I appreciate you taking the time to post this update! It's good to hear you found the problem, and that it turned out to be something you could resolve through a FW update.

    Mark said:
    I implemented the C_DEBUGEN check, and on the board I tried it on at least the bit did clear without a power-cycle after I disconnected the debugger.

    The ARM documentation states that it is not reset by a system reset, only by a power-on reset. However, when testing it here, I observed that a pin reset has the same effect. Maybe you used a pin reset when disconnecting the debugger, and not a soft reset (i.e. debug reset).

    Best regards,

    Vidar
