Application runs free of NRF_ERROR_RESOURCES 9 times out of 10, then once in a while NRF_ERROR_RESOURCES is massively present from the beginning

To the kind attention of Nordic support team,

I'm testing a FreeRTOS project with the SoftDevice and radio notifications. A constant number of notifications is queued before the start of the connection interval and sent during the connection interval itself. It works very well and I get a high speed data stream. Every time I get a sporadic NRF_ERROR_RESOURCES, the feedback mechanism based on BLE_GATTS_EVT_HVN_TX_COMPLETE kicks in, and the resource error disappears after a while.

Everything works fine: about 9 executions of the program out of 10 are really stable and free of NRF_ERROR_RESOURCES. If I reset (Ctrl+Shift+F5 using Segger), it seems that once in a while NRF_ERROR_RESOURCES is massively present from the beginning of the connection, and it never goes away. Only reducing the number of queued notifications helps.

But why should the number of notifications need to be reduced only once in a while? Does all this sound to you like a problem in the application, or could something be changing in the connection? I thought about the master forcing a different connection interval than the desired one. But using BLE_GAP_EVT_CONN_PARAM_UPDATE_REQUEST, I have no evidence for now that this behavior is due to a change in connection interval timing. I attached SystemView files to the project, and in the next days I will possibly be able to post more about this issue (I'm also going to use the Nordic sniffer). A quick opinion from your experts would be very much appreciated, as would any debug strategy you would recommend.

Thank you in advance, best regards.

  • Hi Einar, I was using https://github.com/jimmywong2003/nrf52-ble-range-estimator as an example of how to properly monitor radio events such as NRF_RADIO->EVENTS_TXREADY and NRF_RADIO->EVENTS_CRCOK. I got the same results both on my custom board and on the pca10056. Still, I have the feeling that on my custom board something sometimes disturbs SoftDevice activity from the very beginning, so that it does not work properly and the tx buffers cannot empty themselves smoothly (once in a while the number of buffered notifications approaches the reserved slots, gatts_conn_cfg.hvn_tx_queue_size, and NRF_ERROR_RESOURCES starts to be returned). Is there any event register among the peripheral resources used by the SoftDevice that could be checked to identify whether the SoftDevice is working properly? Or whether something went wrong during initialization? For example, I'm monitoring NRF_CLOCK->EVENTS_DONE on my custom board when the SoftDevice appears not to work smoothly, and the recalibration events seem to be ok. Is there something I could check about the rtc0 status? timer0? Again, I cannot see this SoftDevice tx buffer malfunction when running the very same test program on a standard pca10056, which is why I'm inclined to think it could be a hw issue.

    Best regards

  • Hi,

    astella said:
    I had the idea, as also suggested in one of your Nordic threads, to test my program on a regular pca10056 and not on my custom hardware. It seems that there could be some activity from my sensors that is disturbing the antenna. When executing on the pca10056, the SoftDevice tx buffer fills up, but it never has a hard time emptying itself. On my hardware, a lot of noise could be the root cause of retransmissions and unexploited connection intervals. I got these two screenshots using the Nordic power profiler:

    It is interesting that you see a difference in the DK and your custom HW. But that could be explained by different things. Perhaps you are also not doing the same thing when running on the DK, because of lack of some external components? What are the differences when you run on the DK compared to your custom HW? The saw-tooth current consumption pattern here is eye catching - do you have any idea what it comes from?

    astella said:
    Is there any other debug technique you could please suggest in order to 100% validate this feeling? 

    If this is noise related (which it could be, but it is not the first thing I would think of), then you would see that from a sniffer trace. This is one of the reasons I asked you about that before, but there are also other things we might see from that.

    astella said:
    Is there any event register among the peripheral resources used by the SoftDevice that could be checked to identify whether the SoftDevice is working properly?

    Not really. Also, it is difficult to confirm that it is properly working. But the SoftDevice is very well tested, so without a strong indication that there is an issue with the SoftDevice, that would be one of the last things to consider. There are a lot of reasons why you may not be able to push as much data as you want.

    astella said:
    Again, I cannot see this SoftDevice tx buffer malfunction when running the very same test program on a standard pca10056, which is why I'm inclined to think it could be a hw issue.

    Does the firmware also behave the same, or differently because of some external components? Can you describe your HW in a bit more detail? Also, can you describe what your firmware does other than sending notifications?

    I suggest the following next steps:

    1. Make a sniffer trace. Does that tell you something? For instance about retransmissions (which always happen in the next connection event), or something else? What about the MD (more data) bit? Is that set as expected?
    2. Is the nRF doing something else, perhaps more in some situations that could prevent the SoftDevice from doing as much BLE activity as expected, like flash operations? What if you test without these activities? If you comment out most of your code except from sending dummy data, and gradually include more, perhaps you can quickly experimentally see which parts could be related. If so, what are those?
  • Hi,

    astella said:
    Is it possible for you to please share a patch for the ble_app_hids_mouse program that sends notifications in a loop, with the modifications from https://jimmywongiot.com/2021/05/14/how-to-configure-the-number-of-packets-per-every-ble-connection-interval/ according to what you consider best practice using the latest sdk?

    I am not sure what you are after or why this post is relevant for this issue (where something causes a bunch of retransmissions), but I also do not know much of your code. To send as effectively as possible, you basically try to send as many notifications as possible in a loop, and when there is no room for more, you wait for a BLE_GATTS_EVT_HVN_TX_COMPLETE before you continue. That is all there is to it. The ble_app_hids_keyboard (examples/ble_peripheral/ble_app_hids_keyboard/main.c) project already uses the BLE_GATTS_EVT_HVN_TX_COMPLETE event in a similar manner, where it processes the buffer on every BLE_GATTS_EVT_HVN_TX_COMPLETE event, so that data is sent as fast as possible.
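    The pattern described above (send until there is no room, then continue on BLE_GATTS_EVT_HVN_TX_COMPLETE) can be sketched as a small host-side simulation. This is only an illustration under stated assumptions: `sim_hvx()` and `sim_tx_complete()` are hypothetical stand-ins for `sd_ble_gatts_hvx()` and the TX-complete event, `SIM_TX_QUEUE_SIZE` plays the role of `hvn_tx_queue_size`, and the error codes are placeholders, not the real SoftDevice values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names or values come from the SoftDevice API. */
#define SIM_SUCCESS         0u
#define SIM_ERROR_RESOURCES 1u  /* plays the role of NRF_ERROR_RESOURCES */
#define SIM_TX_QUEUE_SIZE   4u  /* plays the role of hvn_tx_queue_size   */

typedef struct {
    uint32_t queued;      /* notifications currently held by the simulated stack */
    uint32_t sent_total;  /* notifications accepted by the stack so far          */
} sim_stack_t;

typedef struct {
    uint32_t pending;     /* notifications the application still wants to send */
} app_t;

/* Stand-in for sd_ble_gatts_hvx(): accepts a notification until the queue is full. */
static uint32_t sim_hvx(sim_stack_t *s)
{
    if (s->queued >= SIM_TX_QUEUE_SIZE) {
        return SIM_ERROR_RESOURCES;
    }
    s->queued++;
    s->sent_total++;
    return SIM_SUCCESS;
}

/* Stand-in for a TX-complete event that reports `count` transmitted notifications. */
static void sim_tx_complete(sim_stack_t *s, uint32_t count)
{
    s->queued -= (count > s->queued) ? s->queued : count;
}

/* Push pending notifications until the stack has no more room.
 * Returns how many were accepted in this call. */
static uint32_t app_fill_queue(app_t *app, sim_stack_t *s)
{
    uint32_t accepted = 0;
    while (app->pending > 0) {
        if (sim_hvx(s) != SIM_SUCCESS) {
            break;  /* no room: stop and wait for the next TX-complete event */
        }
        app->pending--;
        accepted++;
    }
    return accepted;
}

/* What the application would do from its TX-complete event handler:
 * the stack has freed slots, so refill the queue immediately. */
static uint32_t app_on_tx_complete(app_t *app, sim_stack_t *s, uint32_t count)
{
    sim_tx_complete(s, count);
    return app_fill_queue(app, s);
}
```

    In a real application the refill would be triggered from the BLE event handler, and a "no room" return is treated as the normal signal to stop and wait, not as a failure.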

    astella said:
    If there is any interference, why should it be present only once in a while, and then last indefinitely for the duration of the troubled connection?

    I do not have a good explanation for that, and it is not expected. I suspect your application ends up in a bad state, but I cannot say more as I have not seen it. It should be possible to learn about this by debugging.

    astella said:
    I would like to understand if, once in a while, there is some sort of clock drift, that is causing this troubling communication. Is there any way you would recommend in order to check this?

    That is a good point, LF clock issues could very well be the root cause here. Which LF clock source do you use, and how is it configured (what are the NRF_SDH_CLOCK_LF_* defines set to in your sdk_config.h)? Are you able to reproduce this if you set NRF_SDH_CLOCK_LF_SRC to 2? This will use an LF clock synthesized from the HF clock, so it will give a high current consumption, but it would be interesting to know if that resolves the issue. If so, we need to look more into your LF clock.
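    As a sketch of the diagnostic setting suggested above, the sdk_config.h fragment could look like the following. Only NRF_SDH_CLOCK_LF_SRC 2 comes from this thread. Setting the two RC calibration intervals to 0 for a non-RC source is my understanding of the SoftDevice clock configuration and should be checked against your SDK version, and the accuracy define is intentionally left out because it depends on your HF crystal.

```c
/* sdk_config.h fragment, diagnostic build only (higher current consumption). */

/* 2 = LF clock synthesized from the HF clock, as suggested in this thread. */
#define NRF_SDH_CLOCK_LF_SRC 2

/* Assumption: the RC calibration intervals only apply to the RC source
 * and are expected to be 0 for any other source. Verify for your SDK. */
#define NRF_SDH_CLOCK_LF_RC_CTIV 0
#define NRF_SDH_CLOCK_LF_RC_TEMP_CTIV 0

/* NRF_SDH_CLOCK_LF_ACCURACY is not shown here: choose the value that
 * matches the tolerance of your HF crystal. */
```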

  • Hi Einar, thank you for your suggestion. For the moment it is not solving the issue, though. I wanted to ask whether, in your opinion, any imprecision in the main quartz load capacitance could give such intermittent behavior.

    Best regards

  • Yes, it could. If the crystal is incorrectly loaded, the frequency will be off. Also, if you have configured the SoftDevice incorrectly, specifying a better clock source accuracy than you actually have, that could give similar problems.

    Can you share your clock configuration (the NRF_SDH_CLOCK_* defines from your sdk_config.h)? Also, which LF clock source do you use? If a crystal, can you let me know which exact crystal you are using and the value of your load caps?

    The above was about the LF clock source, which is normally the main suspect in such cases. It is theoretically possible that there is an issue with your HF crystal as well, but I would expect that to manifest itself differently. It could be worth checking anyway, though (just verifying that the load caps are calculated as explained in General PCB design guidelines for nRF52 series).

  • Hi Einar,

    Thank you for your suggestions. We are not using an external xtal as the 32.768 kHz oscillator, so we are setting

    NRF_SDH_CLOCK_LF_SRC 0

    NRF_SDH_CLOCK_LF_RC_CTIV 16

    NRF_SDH_CLOCK_LF_RC_TEMP_CTIV 16 // other values tested

    NRF_SDH_CLOCK_LF_ACCURACY 1

    The SoftDevice specification briefly mentions a timing warning about the HFXO startup time. We will be doing such measurements as well. I'm curious about what "not correct operation" would mean, though.

    This is our main quartz. According to your suggestion, C21 and C22 should be

    C21,22 = 2*Cl - Cpcb - Cpin = 2 * 10 - Cpcb - 3

    Supposing a "similar layout to Nordic layout" and Cpcb = 1 pF:

    2 * 10 - 3 - 1 = 16 pF

    We have 12 pF mounted for the moment. We will try to do RF measurements to understand whether Cload trimming is needed.
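    The arithmetic above can be checked with a small sketch. It uses the formula and the values quoted in this post (Cl = 10 pF, Cpin = 3 pF, Cpcb = 1 pF assumed). The second helper is my own addition: it inverts the same formula for two equal capacitors to estimate the load the crystal actually sees with the mounted parts. Both function names are hypothetical.

```c
#include <assert.h>

/* External load capacitor per pin, in pF: C = 2*Cl - Cpcb - Cpin
 * (formula as quoted in this thread, two equal capacitors assumed). */
static double load_cap_pf(double cl_pf, double c_pcb_pf, double c_pin_pf)
{
    return 2.0 * cl_pf - c_pcb_pf - c_pin_pf;
}

/* Inverse of the formula above: effective load seen by the crystal, in pF,
 * for a given mounted capacitor value: Cl_eff = (C + Cpcb + Cpin) / 2. */
static double effective_load_pf(double c_mounted_pf, double c_pcb_pf, double c_pin_pf)
{
    return (c_mounted_pf + c_pcb_pf + c_pin_pf) / 2.0;
}
```

    With these assumptions, `load_cap_pf(10, 1, 3)` gives 16 pF, and `effective_load_pf(12, 1, 3)` gives 8 pF, so the mounted 12 pF parts would load the crystal with about 8 pF where 10 pF is specified.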

    Best regards

  • Hi,

    The config looks roughly OK, but the recommendation is to set NRF_SDH_CLOCK_LF_RC_TEMP_CTIV to 2 when using LFRC in order to do calibration often enough even if the temperature has not changed.
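    Put together with the values posted earlier in this thread, the LFRC configuration with this recommendation applied would look roughly like the fragment below. Only NRF_SDH_CLOCK_LF_RC_TEMP_CTIV changes. The other three values are copied from the post above, not newly recommended.

```c
/* sdk_config.h fragment: internal RC oscillator as LF clock source. */
#define NRF_SDH_CLOCK_LF_SRC 0           /* RC oscillator, as posted in this thread  */
#define NRF_SDH_CLOCK_LF_RC_CTIV 16      /* calibration interval, as posted          */
#define NRF_SDH_CLOCK_LF_RC_TEMP_CTIV 2  /* changed from 16 to the recommended 2     */
#define NRF_SDH_CLOCK_LF_ACCURACY 1      /* as posted in this thread                 */
```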

    There is little in the described behavior that points to an issue with the HFCLK, but as you write, the load caps should be about 16 pF in your case, so that should be fixed anyway.

    astella said:
    doing such measurements as well. I'm curious about what "not correct operation" would mean, though. 

    This is mostly about timing. For instance, the Bluetooth stack needs to enable the HFXO a certain amount of time before using the radio so that the frequency is not off etc.
