URGENT: nRF52840 hanging up after working for some time

Hi,

THIS IS AN URGENT REQUEST. PLEASE HELP ASAP.

After deploying our sensors, which are based on the nRF52840, run a CoAP client on OpenThread and are powered by a 3V 2032 lithium coin cell, we have noticed that two out of about 50 sensors have stopped sending data to the host. One of them we know has hung up: it gives no response to a button push and no LED indication. The other shows the same symptoms, but this is not yet confirmed by the client. The issue is not immediate: some take a couple of weeks, some a couple of days. We have also seen two of our sensors, which are being tested in our lab, hang up. We don't have an external watchdog, and I am not sure whether the internal watchdog is enabled.

Once the sensor hangs up, you can make it come back alive by power cycling. 

We have an external 32k oscillator in our design, which was seen to stop after some time, probably because the capacitor loading was not correct. So we reverted to the internal oscillator, which seemed to solve the issue. That is how these sensors have been working so far and how they were released to customers. But the external oscillator and its two loading caps are still on the PCB. I don't know if this can be an issue when we use the HFINT internal oscillator.

We have enabled the internal oscillator by the following in the prj.conf.

CONFIG_CLOCK_CONTROL=y
CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y
CONFIG_CLOCK_CONTROL_NRF_K32SRC_XTAL=n

We have not done any calibrations for the HFINT. 

So these are my questions.

1. What possible causes could be there for the hang-up of nRF52840? As mentioned, it is powered by a 3V coin cell, which is not discharged and running off the internal HFINT oscillator. 

2. Can the 32k external oscillator being present in the sensor affect the stability of HFINT oscillator?

3. Is the way in which we have enabled the internal oscillator good enough? Do we need to do any calibrations for it?

4. Can the internal WDT restart the HFINT oscillator, if it has stopped for whatever reason?

5. How to enable, load and hit the internal WDT from application FW?

Cheers,

Kaushalya

  • Hi Kaushalya,

    Just a FYI first:

    • HFINT is the internal high frequency clock, and has nothing to do with the 32k clock
    • LFRC is the internal 32k low frequency clock

    Your questions:

    1. What possible causes could be there for the hang-up of nRF52840? As mentioned, it is powered by a 3V coin cell, which is not discharged and running off the internal HFINT oscillator. 
    • As   mentions, voltage dips can lead to the device being stuck in a reset loop
    • Or it could be software related. How has the device been configured to handle software asserts? Is it resetting on assert? If not, it will be stuck and not recover when an assert occurs.

    2. Can the 32k external oscillator being present in the sensor affect the stability of HFINT oscillator?

    No, it will not affect the stability of the LFRC oscillator

    3. Is the way in which we have enabled the internal oscillator good enough? Do we need to do any calibrations for it?

    Calibration should be automatically set when you enable LFRC in the configs. If you look through the build/zephyr/.config file you can check if calibration has been enabled or not. This file shows the status for all configs after build.

    4. Can the internal WDT restart the HFINT oscillator, if it has stopped for whatever reason?

    The WDT must be enabled. It will reset the chip if a hardfault/CPU lockup has occurred.

    5. How to enable, load and hit the internal WDT from application FW?

    Here is a sample that shows how to use the WDT:
    https://github.com/nrfconnect/sdk-zephyr/tree/main/samples/drivers/watchdog

    API:
    https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/zephyr/hardware/peripherals/watchdog.html
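
As a rough illustration of what "loading" the WDT means at the register level (a sketch, not taken from the sample above): on the nRF52840 the watchdog counts down from the CRV reload value at 32.768 kHz, so timeout = (CRV + 1) / 32768 seconds. The helper name `wdt_crv_from_ms` is made up for this example. With the Zephyr watchdog API you would normally pass the timeout in milliseconds and leave the register value to the driver.

```c
#include <assert.h>
#include <stdint.h>

#define WDT_CLOCK_HZ 32768u /* the WDT counts 32.768 kHz LFCLK ticks */
#define WDT_CRV_MIN  0xFu   /* smallest reload value the CRV register accepts */

/* Hypothetical helper: convert a timeout in milliseconds to a CRV value.
 * timeout [s] = (CRV + 1) / 32768  =>  CRV = timeout_ms * 32768 / 1000 - 1 */
static uint32_t wdt_crv_from_ms(uint32_t timeout_ms)
{
    uint64_t ticks = ((uint64_t)timeout_ms * WDT_CLOCK_HZ) / 1000u;

    /* Very short timeouts are clamped to the minimum reload value. */
    if (ticks <= WDT_CRV_MIN) {
        return WDT_CRV_MIN;
    }

    /* CRV is a 32-bit register, so longer timeouts cannot be represented. */
    assert(ticks - 1u <= UINT32_MAX);
    return (uint32_t)(ticks - 1u);
}
```

For example, a 4 s timeout works out to CRV = 131071.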

    Questions for you:

    1. Has the failure occurred multiple times on the same device?
    2. Have you been able to reproduce this on the returned devices?
    3. Does it recover after a power on reset?
    4. When the device is in this lockup state, are you able to probe voltage levels on different pins and post the results here? Pins of interest are DEC1, DEC4 and nRESET, as well as VDD.
    5. Is pin-reset enabled (Look for this in the build/zephyr/.config file: CONFIG_GPIO_AS_PINRESET)? If so, can you try to disable it and see if the lockup still occurs?
    6. How is a software assert being handled? Is it resetting? More specifically is RESET_ON_FATAL_ERROR being set?
    7. Are you able to connect a debugger and see where the program is when the device stops working? You can use nrfjprog --readregs to read out the relevant registers. Note that starting a debug session may reset the device, unless you choose to connect to a running target. So I suggest using nrfjprog --readregs, which will not reset the device.
    8. Can you post the current profile (.ppk file) of the device when it is in this lockup state?

    Best regards,
    Stian

  • Hi Stian,

    Many thanks for your reply, much appreciated.

    To answer your questions.

    1. No, this is a super rare event. We have seen three sensors do this in the lab. None of them has shown the behavior again. Having said that, we restarted two devices last week and one this week, so we don't know if it may pop up again in the future. We are observing these three.

    There was also one reported from the field about two weeks back. That one has also been working so far.

    2. No, we haven't. We have now implemented a WDT using the task wdt API with HW fallback enabled.

    3. Yes, it recovers after a power cycle.

    4. I have one sensor in this state that I am keeping without a power cycle for now. The voltages are as follows.

    VDD : 3.02V (No drops detected on the oscilloscope; I don't know if any tx is happening)

    nRESET : 3.02V (No drops detected)

    DEC1 & DEC4 : I am using Raytac MDBT50Q module, so these signals are not exposed for measurements.

    5. CONFIG_GPIO_AS_PINRESET is enabled. Do you reckon a false reset is happening on nRESET? If we still get this issue after enabling the WDT, I will try this.

    6. 'CONFIG_RESET_ON_FATAL_ERROR' in .config file is enabled. I guess this means our sensor would reset for FW asserts. 

    7. When I tried this on a different sensor, it seemed like the sensor got reset. Is there a way to read the registers without resetting the SoC? I have only one sensor in this state and the issue is not reproducible as of yet.

    8. Unfortunately, to connect the profiler, I need to disconnect the onboard battery, which will take the sensor out of this state. I tried powering another sensor from the profiler with the sensor battery on board and then removing the sensor battery very carefully, but it reset the sensor, no matter how safely I tried to do it.

    If there are no other tests you want me to carry out on this locked-up sensor, I will try to read the registers as in Step 7.

    Thanks,

    Kaushalya

  • Hi, thanks for the answers.

    If you are measuring a constant 3V on the failing device, I doubt that we are looking at a brown out reset (BOR) loop or similar. At 3V the device will recover, and you would have seen VDD dips below BOR threshold if the battery was not able to supply enough current for the boot sequence (i.e. reset loop). The only thing I can think of is that the device ended up in a BOR loop and because of that enters an unresponsive state, where it does not consume much current, so that the battery recovers back to 3V, but the device is still unresponsive.

    The comment regarding nRESET was to check if the internal pullup resistor had been enabled, which it is, according to your measurements. But I would still like you to disable pin reset at some point, to see if that changes anything. So please try this after the WDT test.

    The nrfjprog --readregs should not reset the device. I think that should be the next debugging step.

    You are also welcome to send a couple of devices to our lab. I understand that it takes a long time to reproduce, so I guess you want to keep the unresponsive devices. But if you want, you can send me one of these unresponsive devices, and I can take a look, or you can send one that has not yet failed, and I can leave it running and see if it fails. It's up to you. I will send you a PM with the address.
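
One way to tell these cases apart once a device has restarted is the RESETREAS register in the POWER peripheral. The sketch below decodes a raw value using the bit positions from the nRF52840 product specification. If no bit is set, the reset came from the on-chip reset generator, which means a power-on or brownout reset. The function and table names are invented for this example, and how the raw value is obtained on target is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* RESETREAS bit positions on the nRF52840. */
static const struct {
    uint32_t mask;
    const char *name;
} resetreas_bits[] = {
    { 1u << 0,  "reset pin" },
    { 1u << 1,  "watchdog" },
    { 1u << 2,  "soft reset" },
    { 1u << 3,  "CPU lockup" },
    { 1u << 16, "wake from System OFF (GPIO)" },
    { 1u << 17, "wake from System OFF (LPCOMP)" },
    { 1u << 18, "wake from System OFF (debug interface)" },
    { 1u << 19, "wake from System OFF (NFC)" },
    { 1u << 20, "wake from System OFF (VBUS)" },
};

/* Hypothetical helper: write a comma-separated description of the flagged
 * reset sources into buf and return how many sources were flagged. */
static int resetreas_describe(uint32_t resetreas, char *buf, size_t len)
{
    int count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(resetreas_bits) / sizeof(resetreas_bits[0]); i++) {
        if (resetreas & resetreas_bits[i].mask) {
            if (count > 0) {
                strncat(buf, ", ", len - strlen(buf) - 1);
            }
            strncat(buf, resetreas_bits[i].name, len - strlen(buf) - 1);
            count++;
        }
    }

    /* No flag set: reset came from the on-chip reset generator. */
    if (count == 0) {
        strncat(buf, "power-on or brownout reset", len - strlen(buf) - 1);
    }
    return count;
}
```

The register is cumulative until cleared, so the firmware would need to clear it after reading for the next boot to show only the latest cause. With the WDT now enabled, a set watchdog bit after a restart would suggest the WDT caught a hang.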

  • Hi Stian,

    Ok, so the sequence of events you suggest is like this:

    1. The SoC enters a BOR and doesn't recover, or hangs up.

    2. Due to this hang-up, no current drawn from the coin cell.

    3. Due to no current draw, the coin cell voltage jumps back to 3V. 

    4. Because the SoC is in a hang-up state, the coin cell remains at 3V, which we see now.

    So if this is what happened, is it possible to recover when power cycled? If the battery is discharged, the internal resistance should have increased to a level which cannot sustain the current draw from the SoC isn't it? So if not immediate, we should see another hang-up quite soon from the same device. (we didn't replace batteries in these devices)

    When I tried "nrfjprog --readregs" on a working device, it seemed to reset, even after I disconnected the nRESET line from the nRF52840. I dug a bit deeper into this. I downloaded the original release fw version to a sensor powered via a profiler and monitored the current draw.

    As you can see, after the command is executed, it does seem to affect the current draw and the device seems locked up. The reset I was seeing earlier was due to the WDT. So I have not done this on the hang-up sensor I have, which we may ship to you for further analysis, as you suggest. If you  have any thoughts on this, please let me know.

    I am thinking of taking the battery from another sensor, which hung up earlier but is now working after power cycling, and running a test to compare its IR with that of a brand new coin cell. I will keep you posted.
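
For reference, the IR comparison comes down to simple arithmetic: measure the open-circuit voltage, apply a known load, and measure again. A minimal sketch follows, with made-up function names. The numbers in the usage note are purely illustrative and are not measurements from these sensors.

```c
#include <assert.h>

/* Internal resistance from an unloaded and a loaded voltage reading:
 * R_int = (V_open - V_loaded) / I_load */
static double cell_internal_resistance_ohm(double v_open, double v_loaded,
                                           double i_load_a)
{
    assert(i_load_a > 0.0); /* a load current is needed to see any drop */
    return (v_open - v_loaded) / i_load_a;
}

/* Terminal voltage the SoC would see for a given current draw:
 * V_loaded = V_open - R_int * I_load */
static double cell_loaded_voltage(double v_open, double r_int_ohm,
                                  double i_load_a)
{
    return v_open - r_int_ohm * i_load_a;
}
```

As an illustration, a cell that reads 3.0 V open and 2.7 V under a 10 mA load has about 30 ohm of internal resistance. A cell with 100 ohm would sag to about 1.5 V under a 15 mA peak, even though it still reads 3.0 V when idle.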

    Cheers,

    Kaushalya

  • kaushalyasat said:
    So if this is what happened, is it possible to recover when power cycled? If the battery is discharged, the internal resistance should have increased to a level which cannot sustain the current draw from the SoC isn't it? So if not immediate, we should see another hang-up quite soon from the same device. (we didn't replace batteries in these devices)

    Yes, I agree. So not likely the cause. (But I don't think we should rule out anything at this point)

    kaushalyasat said:
    As you can see, after the command is executed, it does seem to affect the current draw and the device seems locked up. The reset I was seeing earlier was due to the WDT. So I have not done this on the hang-up sensor I have, which we may ship to you for further analysis, as you suggest. If you  have any thoughts on this, please let me know.

    So after issuing the nrfjprog --readregs command, the chip will enter debug mode, and the CPU will halt. Hence the change in current consumption. But it should not do a reset. It will connect to the running target, halt the CPU, read out the registers, then print them to the screen. Not sure if it is possible to resume the CPU and exit debug mode again without resetting, but you should already have the relevant register information printed to the screen.

    You can try to resume the CPU with nrfjprog --run.
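
When looking at the register dump, the xPSR value is a quick first check: its low 9 bits (IPSR) hold the number of the exception the CPU was executing when it was halted, using the standard Cortex-M numbering. A small sketch of that decoding follows. The function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* IPSR occupies bits [8:0] of xPSR on Cortex-M. */
static uint32_t xpsr_exception_number(uint32_t xpsr)
{
    return xpsr & 0x1FFu;
}

/* Map a Cortex-M4 exception number to a readable name. */
static const char *exception_name(uint32_t n)
{
    switch (n) {
    case 0:  return "Thread mode (no exception active)";
    case 2:  return "NMI";
    case 3:  return "HardFault";
    case 4:  return "MemManage";
    case 5:  return "BusFault";
    case 6:  return "UsageFault";
    case 11: return "SVCall";
    case 12: return "DebugMonitor";
    case 14: return "PendSV";
    case 15: return "SysTick";
    default: return n >= 16 ? "External interrupt" : "Reserved";
    }
}

/* External interrupts start at exception 16; returns -1 for system exceptions. */
static int external_irq_number(uint32_t n)
{
    assert(n <= 0x1FFu);
    return n >= 16 ? (int)n - 16 : -1;
}
```

A value of 3 would mean the CPU is sitting in the HardFault handler, while 0 means it was halted in normal thread code, in which case PC and LR are the registers to look up in the map file.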

  • Hi Stian,

    Ok, I see. Unfortunately the two hung-up sensors are already packaged to be sent to you, so I can't run this anymore. You can try it when you get the devices, hopefully soon. I have the other sensors which showed this behavior with me, but they now run the WDT fw. So I don't know if the WDT will heal the issue.

    I will do an IR test on the batteries from the failed ones to see how bad the IR is.

    Cheers,

    Kaushalya

  • Hello,

    I was able to power the devices from a power analyzer without resetting them. And I can clearly see that they are in system ON idle, consuming around 2 uA and doing the LFRC clock calibration every 4 seconds. So everything looks normal. Except that it is not advertising.

    I don't think there are any reasons to suspect any battery issues, or any hardware issues, as the CPU is waking up to calibrate the LFRC, which means that the LFRC, RTC, CPU interrupts, etc, are working as normal. So this looks to me like a software issue.
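
For anyone wanting to check such a periodic wake-up in an exported current trace, one rough approach is to count rising threshold crossings and average the time between them. This is only a sketch: the function name, the threshold and the sample layout are assumptions, and the Power Profiler export format is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Mean time in seconds between rising threshold crossings in a uniformly
 * sampled current trace. Returns -1.0 if fewer than two spikes are found. */
static double mean_spike_period_s(const double *current_ua, size_t n,
                                  double threshold_ua, double sample_period_s)
{
    size_t first = 0, last = 0, spikes = 0;

    assert(current_ua != NULL && sample_period_s > 0.0);

    for (size_t i = 1; i < n; i++) {
        /* A spike starts where the trace goes from below to at/above threshold. */
        if (current_ua[i - 1] < threshold_ua && current_ua[i] >= threshold_ua) {
            if (spikes == 0) {
                first = i;
            }
            last = i;
            spikes++;
        }
    }

    if (spikes < 2) {
        return -1.0;
    }
    return (double)(last - first) * sample_period_s / (double)(spikes - 1);
}
```

A trace idling at a few uA with a short burst every 4 s would give a result close to 4.0, matching the calibration interval described above.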

    I will continue to investigate, but it would be nice if you could share your software project, so I can build the code here. Maybe zip it and send it in a PM or similar. Thanks.

    Best regards,
    Stian

  • Hi Stian,

    All your support is much appreciated. That is good news. How can we know it's doing the LFRC clock calibration? Just by looking at the current draw every 4 sec?

    To send the project, by PM, you mean as a personal message ?

    Cheers,

    Kaushalya
