URGENT: nRF52840 hangs after working for some time

Hi,

THIS IS AN URGENT REQUEST. PLEASE HELP ASAP.

After deploying our sensors, which are based on the nRF52840 running a CoAP client on OpenThread and are powered by a 3 V 2032 lithium coin cell, we noticed that two out of about 50 sensors stopped sending data to the host. One of them is confirmed hung: no response to a button press and no LED indication. The other shows the same symptoms, but the client has not confirmed it yet. The failure is not immediate: on some sensors it takes a couple of weeks, on others a couple of days. Two of the sensors under test in our lab have hung as well. We don't have an external watchdog, and we are not sure whether the internal watchdog is enabled.

Once a sensor hangs, power cycling brings it back.

Our design has an external 32k crystal oscillator, which stopped after some time, probably because its load capacitors were not correct. We therefore reverted to the internal oscillator, which seemed to solve the issue, and that is how these sensors have been running and how they were released to customers. However, the external oscillator and its two load capacitors are still on the PCB. I don't know whether this can be an issue when we use the HFINT internal oscillator.

We enabled the internal oscillator with the following settings in prj.conf:

CONFIG_CLOCK_CONTROL=y
CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y
CONFIG_CLOCK_CONTROL_NRF_K32SRC_XTAL=n

We have not done any calibration of the HFINT.

So these are my questions.

1. What could cause the nRF52840 to hang? As mentioned, it is powered by a 3 V coin cell that is not discharged, and it runs off the internal HFINT oscillator.

2. Can the presence of the external 32k oscillator on the board affect the stability of the HFINT oscillator?

3. Is the way we have enabled the internal oscillator good enough? Do we need to calibrate it?

4. Can the internal WDT restart the HFINT oscillator if it has stopped for whatever reason?

5. How do we enable, configure (load) and feed (hit) the internal WDT from the application firmware?

Cheers,

Kaushalya

  • Hi Kaushalya,

    Just an FYI first:

    • HFINT is the internal high-frequency clock and has nothing to do with the 32 kHz clock
    • LFRC is the internal 32 kHz low-frequency RC clock

    Your questions:

    1. What possible causes could be there for the hang-up of nRF52840? As mentioned, it is powered by a 3V coin cell, which is not discharged and running off the internal HFINT oscillator. 
    • As   mentions, voltage dips can lead to the device being stuck in a reset loop
    • Or it could be software-related. How has the device been configured to handle software asserts? Is it resetting on assert? If not, it will be stuck and will not recover when an assert occurs.

    2. Can the 32k external oscillator being present in the sensor affect the stability of HFINT oscillator?

    No, it will not affect the stability of the LFRC oscillator.

    3. Is the way in which we have enabled the internal oscillator good enough? Do we need to do any calibrations for it?

    Calibration should be enabled automatically when you select LFRC in the configuration. You can check whether calibration is enabled in the build/zephyr/.config file, which shows the final value of every configuration option after the build.
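    One way to do that check without reading the whole file by eye is a small host-side helper that looks up individual options in the text of build/zephyr/.config. The minimal C sketch below relies only on the .config format itself (CONFIG_FOO=value for set options, "# CONFIG_FOO is not set" for disabled booleans). The exact calibration option names vary between SDK versions; CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC_CALIBRATION and CONFIG_CLOCK_CONTROL_NRF_CALIBRATION_PERIOD are assumptions based on recent Zephyr trees, so grep your own .config for CALIBRATION first. The helper name kconfig_lookup is hypothetical. The same lookup works for CONFIG_WATCHDOG, CONFIG_RESET_ON_FATAL_ERROR or CONFIG_GPIO_AS_PINRESET.

```c
#include <string.h>

/* Host-side helper for inspecting build/zephyr/.config.
 * Which option names to query is up to the caller; the calibration
 * option names differ between SDK versions (assumption, check your file). */
enum kconfig_state {
    KCONFIG_MISSING,  /* option does not appear in the file at all */
    KCONFIG_DISABLED, /* "# CONFIG_FOO is not set" (or an explicit =n) */
    KCONFIG_ENABLED   /* "CONFIG_FOO=y" or any other assigned value */
};

/* Look up one option by its full name. Matching is exact, so
 * "CONFIG_WATCHDOG" does not match "CONFIG_WATCHDOG_LOG_LEVEL=3". */
enum kconfig_state kconfig_lookup(const char *config_text, const char *name)
{
    static const char not_set[] = " is not set";
    const size_t name_len = strlen(name);
    const size_t not_set_len = sizeof(not_set) - 1;
    const char *line = config_text;

    while (line != NULL && *line != '\0') {
        const char *next = strchr(line, '\n');
        size_t line_len = next ? (size_t)(next - line) : strlen(line);

        /* Assigned option: CONFIG_FOO=<value> */
        if (line_len > name_len && strncmp(line, name, name_len) == 0 &&
            line[name_len] == '=') {
            if (line_len == name_len + 2 && line[name_len + 1] == 'n') {
                return KCONFIG_DISABLED;
            }
            return KCONFIG_ENABLED;
        }

        /* Disabled boolean: # CONFIG_FOO is not set */
        if (line_len == 2 + name_len + not_set_len &&
            strncmp(line, "# ", 2) == 0 &&
            strncmp(line + 2, name, name_len) == 0 &&
            strncmp(line + 2 + name_len, not_set, not_set_len) == 0) {
            return KCONFIG_DISABLED;
        }

        line = next ? next + 1 : NULL;
    }
    return KCONFIG_MISSING;
}
```

    To use it on a real build, read build/zephyr/.config into a NUL-terminated buffer (for example with fopen/fread) and pass that buffer as config_text.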

    4. Can the internal WDT restart the HFINT oscillator, if it has stopped for whatever reason?

    The WDT must be enabled first. It does not restart an individual oscillator; if it is not fed in time (for example after a hard fault or CPU lockup), it resets the whole chip.

    5. How to enable, load and hit the internal WDT from application FW?

    Here is a sample that shows how to use the WDT:
    https://github.com/nrfconnect/sdk-zephyr/tree/main/samples/drivers/watchdog

    API:
    https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/zephyr/hardware/peripherals/watchdog.html
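    The sample above drives the hardware WDT directly. To make the per-task idea concrete, here is a minimal host-side model in C of a task watchdog: each task registers a deadline and must check in before it expires, and a supervisor treats any missed deadline as a reason to reset. This is an illustrative sketch, not Zephyr's code; the names soft_wdt_add, soft_wdt_feed and soft_wdt_expired are hypothetical. In firmware, Zephyr's task watchdog (task_wdt_init(), task_wdt_add(), task_wdt_feed()) provides this behaviour, and CONFIG_TASK_WDT_HW_FALLBACK backs it with the hardware WDT in case the whole CPU hangs and the software check itself stops running.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of a per-task watchdog; not Zephyr's task_wdt code. */
#define SOFT_WDT_MAX_CHANNELS 4

struct soft_wdt_channel {
    bool in_use;
    uint32_t period_ms;    /* longest allowed gap between check-ins */
    uint32_t last_feed_ms; /* time of the most recent check-in */
};

struct soft_wdt {
    struct soft_wdt_channel ch[SOFT_WDT_MAX_CHANNELS];
};

/* Register a task with its deadline; returns a channel id, or -1 if full. */
int soft_wdt_add(struct soft_wdt *w, uint32_t period_ms, uint32_t now_ms)
{
    for (int i = 0; i < SOFT_WDT_MAX_CHANNELS; i++) {
        if (!w->ch[i].in_use) {
            w->ch[i].in_use = true;
            w->ch[i].period_ms = period_ms;
            w->ch[i].last_feed_ms = now_ms;
            return i;
        }
    }
    return -1;
}

/* Called by each task from its own loop to show it is still making progress. */
void soft_wdt_feed(struct soft_wdt *w, int id, uint32_t now_ms)
{
    if (id >= 0 && id < SOFT_WDT_MAX_CHANNELS && w->ch[id].in_use) {
        w->ch[id].last_feed_ms = now_ms;
    }
}

/* Returns the id of the first channel that missed its deadline, or -1.
 * The unsigned subtraction stays correct when the millisecond counter wraps. */
int soft_wdt_expired(const struct soft_wdt *w, uint32_t now_ms)
{
    for (int i = 0; i < SOFT_WDT_MAX_CHANNELS; i++) {
        const struct soft_wdt_channel *c = &w->ch[i];
        if (c->in_use && (uint32_t)(now_ms - c->last_feed_ms) > c->period_ms) {
            return i;
        }
    }
    return -1;
}
```

    A supervising loop would call soft_wdt_expired() periodically and, when it returns a channel id, trigger a reset (or simply stop feeding the hardware WDT). Feeding per task rather than from one main loop means a single stuck thread, such as a blocked CoAP or sensor task, is still caught.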

    Questions for you:

    1. Has the failure occurred multiple times on the same device?
    2. Have you been able to reproduce this on the returned devices?
    3. Does it recover after a power on reset?
    4. When the device is in this lockup state, are you able to probe the voltage levels on different pins and post the results here? Pins of interest are DEC1, DEC4 and nRESET, as well as VDD.
    5. Is pin reset enabled? (Look for CONFIG_GPIO_AS_PINRESET in the build/zephyr/.config file.) If so, can you try disabling it and see whether the lockup still occurs?
    6. How are software asserts handled? Do they reset the device? More specifically, is RESET_ON_FATAL_ERROR set?
    7. Are you able to connect a debugger and see where the program is when the device stops working? You can use nrfjprog --readregs to read out the relevant registers. Note that starting a debug session may reset the device unless you choose to connect to a running target, so I suggest using nrfjprog --readregs, which will not reset the device.
    8. Can you post the current profile (.ppk file) of the device when it is in this lockup state?
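    Related to questions 5 and 6: once a watchdog and reset-on-fatal-error are in place, a hang or fatal error turns into a reset rather than a lockup, and the reset cause tells you which mechanism fired. The nRF52840 records this in the POWER peripheral's RESETREAS register, whose flags accumulate until cleared by writing 1s back. Below is a small host-side decoder in C for a RESETREAS value, with bit positions taken from the nRF52840 Product Specification; the function name decode_resetreas is hypothetical. In firmware you would read NRF_POWER->RESETREAS (or use Zephyr's hwinfo_get_reset_cause()) early at boot, log or store the value, and then clear it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* RESETREAS bit positions, from the nRF52840 Product Specification (POWER chapter). */
static const struct {
    uint32_t mask;
    const char *name;
} reset_reasons[] = {
    { 1u << 0, "RESETPIN" }, /* reset from the nRESET pin */
    { 1u << 1, "DOG" },      /* watchdog timeout */
    { 1u << 2, "SREQ" },     /* soft reset requested by firmware (e.g. NVIC_SystemReset) */
    { 1u << 3, "LOCKUP" },   /* CPU lockup */
    { 1u << 16, "OFF" },     /* wake from System OFF by GPIO DETECT */
    { 1u << 17, "LPCOMP" },  /* wake from System OFF by LPCOMP */
    { 1u << 18, "DIF" },     /* wake from System OFF by entering debug interface mode */
    { 1u << 19, "NFC" },     /* wake from System OFF by NFC field detect */
    { 1u << 20, "VBUS" },    /* wake from System OFF by VBUS rising */
};

/* Writes a comma-separated list of the flagged reset reasons into buf.
 * A value of 0 means no flag is set: power-on reset or brown-out reset. */
void decode_resetreas(uint32_t resetreas, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0) {
        return;
    }
    buf[0] = '\0';
    if (resetreas == 0) {
        snprintf(buf, len, "POR/BOR");
        return;
    }
    for (size_t i = 0; i < sizeof(reset_reasons) / sizeof(reset_reasons[0]); i++) {
        if ((resetreas & reset_reasons[i].mask) == 0) {
            continue;
        }
        int n = snprintf(buf + used, len - used, "%s%s", used ? "," : "", reset_reasons[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            return; /* buffer too small: keep the truncated, NUL-terminated text */
        }
        used += (size_t)n;
    }
}
```

    For example, DOG after a hang points to a stuck task, SREQ to a software-requested reset (which is typically how a reset on a fatal error shows up), RESETPIN to activity on the nRESET pin, and no flags at all to a power-on or brown-out reset.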

    Best regards,
    Stian

  • Hi Stian,

    Many thanks for your reply, much appreciated.

    To answer your questions.

    1. No, this is a very rare event. We have seen three sensors do this in the lab, and none of them has shown the behavior again. That said, we restarted two devices last week and one this week, so we don't know whether it will reappear. We are observing these three.

    One failure was also reported from the field about two weeks ago; that sensor has been working since.

    2. No, we haven't. We have now implemented a WDT using the task WDT API with the hardware fallback enabled.

    3. Yes, it recovers after a power cycle.

    4. I am keeping one sensor in this state without power cycling it. These are the voltages:

    VDD: 3.02 V (no drops detected on the oscilloscope; I don't know whether any TX is happening)

    nRESET: 3.02 V (no drops detected)

    DEC1 & DEC4: I am using a Raytac MDBT50Q module, so these signals are not exposed for measurement.

    5. CONFIG_GPIO_AS_PINRESET is enabled. Do you reckon a false reset is happening on nRESET? If we still see this issue after enabling the WDT, I will try this.

    6. CONFIG_RESET_ON_FATAL_ERROR is enabled in the .config file. I guess this means our sensor will reset on firmware asserts.

    7. When I tried this on a different sensor, the sensor seemed to get reset. Is there a way to read the registers without resetting the SoC? I have only one sensor in this state, and we have not been able to recreate the issue yet.

    8. Unfortunately, to connect the profiler I need to disconnect the onboard battery, which takes the sensor out of this state. I tried powering another sensor from the profiler with its battery still on board and then removing the battery very carefully, but it reset the sensor no matter how carefully I did it.

    If there are no other tests you want me to carry out on this locked-up sensor, I will try to read the registers as in Step 7.

    Thanks,

    Kaushalya

  • Kaushalya,

    You should be able to move the DevZone ticket to 'Private'; this will allow you to post more sensitive or confidential material, such as your code, directly to the Nordic team. Regards.

  • Thanks Wendell, I tried looking for a way to make this private but couldn't find one. I can move my original post, but there is no 'Private' section in the list. Could you please elaborate a bit?

    Thanks,

    Kaushalya

  • Hi Stian,

    I have created a private ticket with the same name that includes the FW. I don't know how to share it with you. If you can't access it, please let me know how to add you to that ticket.

    Cheers,

    Kaushalya 

  • Hi, I think I'm the only one who can make this ticket private. Anyway, now that you have shared the code in the other ticket, I can get it from there, and we can keep this one public. I can access the ticket.

    kaushalyasat said:
    How could we know its doing the LFRC clock calibration? Just by looking at the 4 sec current draws?

    Yes: the 4-second interval, plus the length and current of each calibration event. Debugging using the current consumption profile is very useful, as you can see exactly what is going on (as long as you know what to look for, of course).
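    If you export the current trace as an array of samples, the 4-second pattern can also be checked programmatically. Below is a minimal sketch in C, assuming a fixed sample period and a single current threshold that separates calibration events from the sleep baseline; the threshold, tolerance and the names find_rising_edges and gaps_match_period are illustrative assumptions. It accepts gaps that are whole multiples of the period because, as we understand the Zephyr clock driver, a calibration opportunity can be skipped when the temperature has not changed. Other periodic activity (for example OpenThread data polling) can also cross the threshold, so look at event length and amplitude too.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Find upward crossings of threshold_ua in a current trace.
 * samples: current in microamps, one sample every sample_period_us.
 * Stores the timestamp (microseconds from the first sample) of each crossing
 * in edges_us and returns how many were found (at most max_edges).
 * The threshold is an illustrative assumption; pick it from your own trace. */
size_t find_rising_edges(const float *samples, size_t n, uint32_t sample_period_us,
                         float threshold_ua, uint64_t *edges_us, size_t max_edges)
{
    size_t count = 0;
    for (size_t i = 1; i < n && count < max_edges; i++) {
        if (samples[i - 1] < threshold_ua && samples[i] >= threshold_ua) {
            edges_us[count++] = (uint64_t)i * sample_period_us;
        }
    }
    return count;
}

/* True if every gap between consecutive edges is close (within tol_us) to a
 * whole multiple of expected_us. Multiples > 1 allow for skipped events. */
bool gaps_match_period(const uint64_t *edges_us, size_t count,
                       uint64_t expected_us, uint64_t tol_us)
{
    if (count < 2 || expected_us == 0) {
        return false;
    }
    for (size_t i = 1; i < count; i++) {
        uint64_t gap = edges_us[i] - edges_us[i - 1];
        uint64_t k = (gap + expected_us / 2) / expected_us; /* nearest multiple */
        if (k == 0) {
            return false; /* much shorter than one period: something else */
        }
        uint64_t target = k * expected_us;
        uint64_t err = gap > target ? gap - target : target - gap;
        if (err > tol_us) {
            return false;
        }
    }
    return true;
}
```

    For the 4-second interval mentioned above, call gaps_match_period() with expected_us = 4000000 and a tolerance of a few tens of milliseconds.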

  • Hi Stian,

    Got it. Please let me know what you think about the code. It is based on the CoAP client example, so some sections we don't need are still in the code. I don't know whether those could cause problems like this.

    I have another sensor in the lab that has hung: no LED response to a button press and no radio communication. When you connected your power analyzer, how did you do it without disrupting power to the module? I have tried powering it in parallel and then removing the battery, but that caused a POR every time.

    Cheers,

    Kaushalya
