High current state that persists across SW reset?

I'm experiencing a strange issue that I can't seem to figure out. We've built a location system based around the nRF52833, and we're currently running this system in a pilot installation at a customer site. Tags (each consisting of an nRF52833, a LIS3DH accelerometer, which is disabled, and a CR2477 battery) are mounted on short flags with strong magnetic bases; the tags sit about 8" from the magnets. The flags are stuck to and pulled from heavy equipment moving around a very large shop floor. In addition to location, the tags periodically uplink their battery voltage. What we've found is that every once in a while, a tag will enter an anomalous state in which it suddenly begins to draw a constant high current (as evidenced by the battery voltage), and dies within 10 days or so. I'd estimate the current to be around 3mA, which is consistent with the processor not entering its low power idle state. However--and here's the part that's confounding me--we can remotely reset these tags, and doing so does NOT cause the behavior to go away. The only way to 'fix' the issue is to physically remove the battery and put it back in. I've pored over our code, and there is nothing I can see that would persist across a SW reset, yet be cleared by a power cycle, and that could prevent the processor from sleeping.
Some other important notes:

  • the accelerometer is an obvious culprit--since it wouldn't be reset when the 833 is--but it's disabled, and I've also confirmed that an improperly handled interrupt does not cause the processor to remain on (the int is edge triggered, so even if it's not cleared on the accelerometer it wouldn't fire again).  There is also no mode of operation of the accelerometer itself that would draw this kind of current.
  • Normal operation continues after the error manifests, and timing is unaffected.  No resets or other anomalous behavior occur when the event starts--it just suddenly happens.
  • The issue only occurs when the tag is moved--and I believe only when the flag is taken off/put on the equipment.  This makes me think some kind of mechanical shock--the magnets are quite strong--causes the problem, but tags that I've inspected don't have any obvious damage (and replacing the battery causes them to return to normal operation)
  • The programming header pads (SCLK, SDIO, VCC, GND) are close to the battery, but none of the ones I inspected were touching it.  It may be possible for the battery to very briefly touch the pads under the right conditions.
  • I've considered other rogue interrupts, but the only ones in use are: the GPIO for the accelerometer, the SAADC (used to sample the battery voltage), the radio (I know it's not active because the code toggles the POWER register after each use and ensures all ints are cleared), the WDT, and SWI0 for the app timer.  Also: seems like a reset would clear any stray interrupts.
  • We've been completely unable to replicate the problem, and we've never run across this in any prior testing.

Through a run of ridiculously bad luck I've been unable to get my hands on a tag that is both alive and currently malfunctioning, so I haven't been able to tie into one yet.  That's my next move (unfortunately I have to wait for one to exhibit the problem, then arrange travel and fly out to do it), but in the meantime I'm curious if anyone is aware of anything in the Nordic part that could cause this.  My current lines of thought are:

  • Sudden mechanical shock causes battery to shift and briefly contact programming pads, which somehow places tag in debug mode...?
  • same as above, but battery either momentarily loses contact with holder or contacts VCC/GND, which causes the PMU to enter a strange state

I've attached an image of the malfunction occurring.  About a third of the way across, you can see the Vbat suddenly drops and then begins a steady ramp down.  The start roughly coincides with the tag moving (note the change in X/Y).  I believe in this instance the equipment the tag was on moved to a new workstation, then shortly after the flag was pulled off and moved nearby (workers have to do that occasionally) and that's when it started.  

Until I can get my hands on one that's currently malfunctioning I'm stumped, so thanks in advance for any ideas!

Mark

  • One other note: in the units I've looked at, the battery is shifted to some extent from the holder, but there's a foam block that prevents it from moving too far.  It does imply that the tags are seeing some kind of shock--enough to shift the battery, at least.

  • Hello,

    This is pretty strange - the device is operating normally, does not use the I2C, and also does not recover through a soft reset. Maybe it could be the position of the battery even though it seems unlikely.

    As you indicated, peripherals and interrupt states will generally not persist across soft resets. One exception is the debug interface (see Debug Interface mode), but I think you would have to be quite unlucky to get it enabled by some random clock pulses (assuming the bottom side of the coin cell can come into contact with the pads - the SWD lines have internal pull-ups).

    Do you use System OFF mode in your application? I'm asking because it's not possible to enter this mode while the debug interface is enabled.

    Best regards,

    vidar

  • On the battery position: the bad luck I was referring to was that I received some malfunctioning tags, but they were dropped off late and sat on my porch overnight in very cold weather, which caused the batteries to droop to 1.5V.  After I brought them in the voltage rose to 2.6V on some, but they didn't kick back on.  I don't have access to the reset lines, so I very briefly shorted the battery--without shifting its position--and they came back and the current looked OK (evidenced by the battery voltage continuing to rise over time).  So if it is something related to battery position, it seems to be a transient event that in turn causes a persistent state.
    We're not using system OFF.
    The persistence across reset is really stumping me too.  That and the fact that the tag appears to be running perfectly normally otherwise--timing, location updates, etc.--makes me feel like it's not just a bug keeping the processor awake.  And while I wouldn't say the issue is common, it's also not exceptionally rare--out of the 40 or so active tags there are anywhere from 1 to 5 exhibiting it at any given moment.  Also interesting: I don't think we've seen any tags that have had the issue recur after replacing the batteries, which again would imply some transient triggering event.
    We do have the ability to remotely FOTA these tags, so if there's anything we can try let me know.   
    One thought: is there a way to emulate a power down reset in terms of register clearing/device state?  It might help narrow down the problem, and until we figure out the issue it would at least be nice to be able to recover tags without having to physically touch them.

  • Longshot: the symptoms are identical to those of an unhandled pending event, typically one raised by the FPU, which is not necessarily cleared by a software reset. The effect is that the call to sd_app_evt_wait() returns immediately, so the CPU never sleeps. This can be confirmed by driving a port pin low on entry to sd_app_evt_wait() and high on return; if the pin is mostly high, a pending event is preventing sleep. 3mA is suspiciously close to the MPU's 5mA or so: assuming the DC-DC converter is running, battery consumption is approx 5mA * (1.3/3), or about 2.2mA. The trouble is that this assumes you can recreate the problem in the lab.
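
Where a scope cannot be attached (for example on deployed tags), the same awake-versus-asleep check could be done in firmware by timing the wait call instead of toggling a pin. What follows is a minimal sketch, assuming a free-running 24-bit RTC tick counter is available (in an nRF5 SDK project `app_timer_cnt_get()` could supply it). The `sleep_stats_*` names are hypothetical and not part of any SDK:

```c
#include <stdint.h>

/* Assumption: the tick source is a 24-bit RTC counter, as on the nRF52. */
#define RTC_COUNTER_MASK 0x00FFFFFFu

typedef struct {
    uint32_t asleep_ticks; /* ticks spent inside the wait call */
    uint32_t total_ticks;  /* ticks covered by all recorded loop passes */
    uint32_t last_exit;    /* counter value when the previous wait returned */
} sleep_stats_t;

/* Elapsed ticks from 'from' to 'to', tolerating one counter wrap. */
static uint32_t ticks_between(uint32_t from, uint32_t to)
{
    return (to - from) & RTC_COUNTER_MASK;
}

static void sleep_stats_init(sleep_stats_t *s, uint32_t now)
{
    s->asleep_ticks = 0;
    s->total_ticks  = 0;
    s->last_exit    = now;
}

/* 'enter' and 'exit' are counter samples taken immediately before and
 * immediately after the wait call (sd_app_evt_wait() or __WFE()). */
static void sleep_stats_record(sleep_stats_t *s, uint32_t enter, uint32_t exit)
{
    s->asleep_ticks += ticks_between(enter, exit);
    s->total_ticks  += ticks_between(s->last_exit, exit);
    s->last_exit     = exit;
}

/* Share of recorded time spent asleep, as a whole percentage (rounded down). */
static uint32_t sleep_stats_percent_asleep(const sleep_stats_t *s)
{
    if (s->total_ticks == 0) {
        return 0;
    }
    return (uint32_t)(((uint64_t)s->asleep_ticks * 100u) / s->total_ticks);
}
```

On target, the percentage could be uplinked alongside the battery voltage. A healthy tag should report a value close to 100, while a value near 0 would point to a pending event that makes the wait call return immediately. The 32-bit accumulators overflow after roughly 36 hours at 32768 ticks per second, so they would need to be reset after each report.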

    There is a secret power on/off register for the MPU (I forget the address but could find it if needed), which clears the pending events, or the pending events can just be cleared on entry to sleep (can check if set or just don't care, clear anyway). Quite how the pending event got set is irrelevant, though if you are using the hardware floating point it may be some extreme calculation caused by physical shock on the accelerometer. If not using the h/w FPU then some other peripheral could do the same, though most peripherals are cleared by a soft reset (not FPU).

    // Floating-point Status Control Register - Cortex-M4 FPU
    // ======================================
    // 7 - IDC Input Denormal cumulative exception bit, see bits [4:0].
    // 6 - Reserved
    // 5 - Reserved
    //   Bits [4:0] are the cumulative exception bits for floating-point exceptions, see also bit [7].
    //   Each of these bits is set to 1 to indicate that the corresponding exception has occurred
    //   since 0 was last written to it
    // 4 - IXC Inexact cumulative exception
    // 3 - UFC Underflow cumulative exception
    // 2 - OFC Overflow cumulative exception
    // 1 - DZC Division by Zero cumulative exception
    // 0 - IOC Invalid Operation cumulative exception
    // Set bit 7 and bits 4..0 in the mask to one (0x ...00 1001 1111)
    #define FPU_EXCEPTION_MASK 0x0000009F
    
        // Clear any exceptions and PendingIRQ from the FPU unit
        __set_FPSCR(__get_FPSCR()  & ~(FPU_EXCEPTION_MASK));
        __DSB();
        NVIC_ClearPendingIRQ(FPU_IRQn);
        __DSB();
        sd_app_evt_wait();
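
To find out which exception actually fired, the flags could be captured before they are cleared and reported with the next uplink. This is a minimal sketch of the bit handling written as plain functions so it can be checked on a host build; on target the argument would come from `__get_FPSCR()`. The helper names are hypothetical:

```c
#include <stdint.h>

/* Same mask as above: bit 7 (IDC) plus bits 4..0 (IXC, UFC, OFC, DZC, IOC). */
#define FPSCR_EXCEPTION_MASK 0x0000009Fu

/* Cumulative exception flags currently set in an FPSCR value. */
static uint32_t fpscr_pending_exceptions(uint32_t fpscr)
{
    return fpscr & FPSCR_EXCEPTION_MASK;
}

/* The FPSCR value to write back: exception flags cleared, other fields kept. */
static uint32_t fpscr_with_exceptions_cleared(uint32_t fpscr)
{
    return fpscr & ~FPSCR_EXCEPTION_MASK;
}
```

A non-zero result from `fpscr_pending_exceptions()` just before sleep would indicate that the FPU had latched an exception since the last clear.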

    Edit: There is a sdk_config.h setting which might do this handling for you, if needed (I haven't tried it):

    NRF_PWR_MGMT_CONFIG_FPU_SUPPORT_ENABLED  - Enables FPU event cleaning
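
This flag appears to take effect only when the application sleeps through the SDK power management module (`nrf_pwr_mgmt_run()`); code that calls `__WFE()` or `sd_app_evt_wait()` directly would still need the manual clear shown above. Assuming an nRF5 SDK project, the entry in sdk_config.h would look roughly like this:

```c
/* sdk_config.h (nRF5 SDK), power management section */
#ifndef NRF_PWR_MGMT_CONFIG_FPU_SUPPORT_ENABLED
#define NRF_PWR_MGMT_CONFIG_FPU_SUPPORT_ENABLED 1
#endif
```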

  • Thanks for the suggestions!  I do have the fpu irq clear before __WFE() in there already:

    void _fpu_irq_clr(void)
    {
        uint32_t fpscr_reg = __get_FPSCR();
    
        __set_FPSCR(fpscr_reg & ~(0x0000009F));
        (void) __get_FPSCR();
        NVIC_ClearPendingIRQ(FPU_IRQn);
    }

    I don't have the __DSB() instructions though--are they required?
    The MPU reset/event clear register you mentioned would be great--I can add it and FOTA a tag when it malfunctions, and if that recovers it that would be an indicator that an unhandled event is keeping the processor awake.
