This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

nRF52840 logging crashes in FreeRTOS

Recently I have encountered a software hanging issue on one of our products. It was based on nRF52840, with FreeRTOS (Thread SDK v1.0). The issue was apparently related to NRF_LOG_* stuff - if I added lots of logs, software would hang very quickly (whole FreeRTOS seems to be hanging - not just one task). If I have moderate amount of logs, crashes were random and rare. After a bit, I managed to reproduce this problem on PCA10056 board (to rule out our hardware), and then on one of the examples provided within the SDK.

To reproduce the issue:

- download fresh Thread SDK 1.0

- replace examples/thread/freertos_coap_server/main.c with https://gist.github.com/Nopik/6efc683c0f5cfd709302261b81ccee4d

- make

- nrfjprog -f NRF52 --chiperase --program _build/nrf52840_xxaa.hex

- connect RTT viewer (I'm using JLink + JLinkRTTClient)

- run CPU (I'm using just 'r' in JLink)

- some logs should appear in large amounts on RTT, leds should be blinking

- if leds are still blinking after 10-20 seconds, reset CPU and start it again (I'm using just 'r' and 'g' in JLink)

Expected behavior: CPU never crashes, example runs indefinitely

Observed behavior: CPU halts very often. Sometimes after 7 logs, sometimes after 9500 of them. On my setup, I just started CPU 10 times, 3 times it was running for at least 20 seconds (and I did reset manually), 4 times it hung after 7 debugs, 3 times it hung after few seconds (logs number 9713, 8529, and 11639, respectively).

It seems, that the hangs still occur, even if we remove NRF_LOG_INFO() from led task. They stop occurring only if I remove log flushing task as well.

I tested it on MacOS and Linux machines, on 2 different PCA10056 boards, also on custom software running on custom board. Interestingly, we've been developing our software for quite long time, and been using logging a lot - and this is first time we've encountered instability like this.

On our software, if deferred logs are turned off, situation is much worse, usually crashes are much more frequent.

Is this known bug? Any fix recommended?

  • Hello,

    Sorry for the late reply. I have been trying to reproduce this. That is, I have managed to reproduce it, but not while debugging, so I am having some problems to find the reason. I just wanted to let you know that your ticket is being handled. I will get back to you as soon as I know something more.

    Best regards,

    Edvin

  • Ok, thanks. Hopefully you'll be able to pinpoint the reason soon. Good to know, that you've reproduced the issue.

  • Hello,

    I had a colleague look at this (who has worked with our FreeRTOS stack). 

    I did manage to reproduce your issue, as I mentioned, but not nearly as often as you report. Maybe one in 100 times or so, so it took quite a while. My go-to was to just restart it unless LED3 didn't blink at all.

    However, my colleague suggested some changes in the FreeRTOSConfig.h, and I haven't been able to reproduce it after this.

    Could you please try to replace your FreeRTOSConfig.h file with the attached one, and see whether you are able to reproduce it with this version?

    FreeRTOSConfig.h

    Best regards,

    Edvin

  • So, the problem is still there, though it seems to be much more buried. When I got your reply, I started with replicating the bug on my side, and indeed - it now happens to me much less often (more like 1 in 100, as you mentioned). I'm using the same bin file and same hardware as before, which is quite puzzling. Yet, the problem originates from our proprietary software (which is using FreeRTOS and Thread), on custom board.

    Earlier I've pinpointed the problem to the log output - when I've been putting tons of logs (just for troubleshooting purposes, stuff like `for( int i=0;i<100;i++ ) NRF_LOG("foo");`, the code was crashing almost instantly. Then, I reproduced it on PCA10056 and almost clean example, so that is what I reported. I will try to investigate it again on our code version.

    In meantime, I tried to speed up the crashes in the example you've been working on. Since often code was crashing quickly after start, I did something simple: if task looped for some time, then reset. I installed wire between some random gpio (P1.11 in my case) and P0.18 on my PCA10056, and added to the main task1 something like this:

    ```

    if( cnt > 20 ) {
      nrf_gpio_cfg_output(RESET_PIN);
      nrf_gpio_pin_clear(RESET_PIN);
    }

    ```

    i.e. reset after roughly 100 logs.

    The results were just stunning, so different, that I'm afraid that I've uncovered some other problem.

    When I run this code, it runs for 13-14 seconds (100% constant), then crashes - by corrupting flash. After it crashes, it wont boot up anymore, until I do `nrfjprog -e` and flash the code again. UICR is intact, `nrfjprog --eraseuicr` is not enough to recover. When I dumped flash before and after crash, indeed, some parts of it (but not the code itself) are overwritten by some repeating pattern. 

    I'm not sure if the reset method is valid (though with external watchdog it is very similar to what would happen), but whatever happens, flash content shouldn't be corrupted.

    If I extend the timer and reset only after `cnt` reaches 100 or 1000, it just happens slower, but also corrupts the flash.

    Can you confirm, that you can reproduce this problem? My current `main.c` is at: https://gist.github.com/Nopik/d59d269032fa5fe4e3b2af5001dd4307 Is this reset method valid? I suspect that electrically the reset impulse is simply too short and causes some misbehavior, or something.

    PS. I've seen that Thread SDK v2.0 was released today, I will try it as well, and let you know.

Related