This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Softdevice Assert at PC=0x15810 (S132 7.2.0) / RTC clock drift when using timeslot

I'm trying to narrow down the cause of a Softdevice assertion happening in S132 7.2.0 at PC=0x15810.

We set up a proprietary RF project which utilises parts of the SDK for Mesh (specifically, the timeslot implementation and bearer_handler) because it provides a safe base to run high performance timeslot applications on. Unfortunately I do have one device which runs into a softdevice assertion at instruction 0x15810. I feel that it is a timing issue - maybe the device is operating at the outer limits of the clock accuracy, because while the issue appears sporadically on Development Kits or other devices, this specific device does trigger it quite often.

  • What exact assertion fails when at PC=0x15810?
  • Does the Softdevice shut down TIMER0 before doing this test or after an assertion fails?
  • Are timing assertions made by the softdevice based on RTC0?
  • Is there a reason why TIMER0 in the mesh stack is running in 24-bit mode as opposed to 32-bit mode?

Any help is greatly appreciated.

EDIT: In the meantime I think I found the cause of the issue. Assuming that the timing assertions by the Softdevice are done using RTC0, there seems to be a rather large discrepancy between the RTC0 timing and TIMER0 timing. After 9'999'249us on TIMER0 pass, RTC0 has counted 10'000'732us, so they're almost 1ms apart!

The device in question is running the LFCLK from the RC oscillator and we do usually have BLE deactivated. I did assume that the softdevice takes care of adjusting for clock drift, but could it be that I have to somehow take care of this manually?

EDIT2: Note that - as we're using the nRF SDK for Mesh as a codebase - when calculating the available time on the timeslot, we should already account for clock drift per the following calculation:

(p_timeslot->length_us * (m_lfclk_ppm + HFCLOCK_PPM_WORST_CASE)) / 1000000;

EDIT3: I previously wrote that we "do usually have BLE deactivated". What I actually mean by this is that most of the time the device is neither connected nor is it currently advertising. So there is no BLE activity to schedule by the softdevice. Timeslots are always active, though.

Parents
  • Hi,

    The assert at 0x15810 is because the SoftDevice got an unexpected raio interrupt, which is typically because the application used the radio outside of the timeslot. Regarding LFCLK calibration that is handled automatically by the SoftDevice. And in any case RTC drift should not be relevant here, as there is just a single low frequency clock source in the nRF, so even if there was a significant drift, the app and SoftDevice would be drifting "together".

    Looking at your timeslot length calculation, it seems like you add the worst case drift value to the duration. Is that intentional? Should it not be subtracted?

  • Hi Einar,

    Besides the issue with the discrepancies between RTC0 time and TIMER4 time, I noticed that TIMER0 is shut down while control has not yet returned to the softdevice.

    Just to detail the setup: The issue seems to always occur at the same point in software. While a timeslot is granted, I have a timer running at highest IRQ priority, triggering a radio interrupt via software (NVIC_SetPendingIRQ(RADIO_IRQn)). This is used for frequency hopping and implemented like this for easier handling of the radio control state machine in our proprietary receiver.

    The softdevice seems to assert after this interrupt has fired. So the cause of the assert that you have given would make sense. BUT at the point the assertion occurs, the timeslot has not ended!  It was still ongoing. I added a few checks into the above IRQ handler and noticed that at the point where the assert would occur, the timeslot *should* have already ended, but for some reason TIMER0 which should be in charge of ending and cleaning up the timeslot was shutdown. I.e. the CC and INTENSET registers are cleared and when I try to manually capture the current timer value using the debuger (setting TASKS_CAPTURE[n] to 1), nothing happens. I do have no code or PPI whatsoever that would shut down TIMER0 in my software and it is only used for controlling timeslot duration - and sometimes capturing its value. So it has to be shut down by the softdevice at some point.

    Could you imagine when and why TIMER0 could get shutdown by the softdevice?

    Also, asking again; does the assert 0x15810 also fail if the timeslot during which the Radio IRQ is invoked went on for too long?

    Thanks for your help.

  • Hi Einar,

    I tried to subsequently increase the TIMESLOT_END_SAFETY_MARGIN_US until either not observing any Softdevice Asserts or reaching 1000us. Unfortunately, I did always observe the described softdevice assert at some point. It looks like increasing it reduces the probability of the asserts' occurence, though which is at least something.

    Still hoping for a definitive solution!

    Thank you & best regards,

    -mike

  • Hi Mike,

    This is a tricky one it seems. I cannot promise anything at this point but I have forwarded this to the team. I will let you know if they make any progress.

    Einar

  • Hi Einar,

    While trying to reproduce another timeslot-related issue (Softdevice event handler continuously called with NRF_EVT_RADIO_BLOCKED), I found another angle by which I seem to be able to reproduce the same basic issue.

    Hung Bui, the support engineer in the linked devzone entry, convinced me to try to reproduce the other issue based on his nRF52 BLE - ESB Timeslot example. Doing this, I managed to reproduce this issue here, wehere the softdevice event handler unexpectedly gets called with NRF_EVT_RADIO_SESSION_IDLE on an nRF52DK (PCA10040).

    Steps to reproduce with his example:

    1. Get the linked example to run as per its instructions.
    2. Attach the patch attached to this comment
    3. Run the example
    4. Connect a BLE central to the peripheral (advertising as "Nordic_UART").

    This will result in NRF_EVT_RADIO_SESSION_IDLE being invoked within a few milliseconds after establishing the BLE connection.

    The patch contains the following changes:

    • Added and enabled GPIOs for debugging:
      • P0.03 - Signals whether device is in timeslot
      • P0.04 - High while in radio callback (signal ID is indicated by repeated toggles)
      • P0.28 - Set when callback_action set to NRF_RADIO_SIGNAL_CALLBACK_ACTION_EXTEND, cleared on EXTEND_SUCCEEDED or EXTEND_FAILED
      • P0.29 - High while serving SIGNAL_TYPE_TIMER0
      • P0.30 - High while in softdevice signal handler (signal ID is indicated by repeated toggles)
      • P0.31 - High while serving TIMER0 interrupt where timeslot is ended.
      • P0.25 - Indicates EXTEND_SUCCEEDED.
    • Reduced the length_us parameter in the radio request to 3800 in case a timeslot is BLOCKED or CANCELED.

    example-uart-esb-timeslot.patch

    So, the issue occurs immediately if the length_us of the requested timeslot is reduced from 5000 to 3800. I did not play around with the values any further!

    I hope this may help you in getting to the cause of this issue! In the meantime, I'll see if increasing the minimum length requested may resolve it in may application!

    Best regards,

    -mike

  • Hi Mike,

    Thanks for the update. I will forward the information to the team.

  • Hi Einar,

    Thanks. Just for the sake of completeness, I also attached the Saleae Logic 2 traces I screenshotted further up. The first one is from our application, the second one from the mesh example.

    softdevice_assert_logic_traces.zip

    Additionally, I'd like to ask what the Softdevice assert at PC=0x24f24 indicates? I assume this is the one that gets raised when the timeslot was ended too late? I very rarely see this one (as opposed to the "regular" one at PC=0x15810) and assume it can be averted by the workaround suggested above, increasing TIMESLOT_END_SAFETY_MARGIN_US.

    Looking forward to the results of the team & thanks for your support!

    -mike

Reply
  • Hi Einar,

    Thanks. Just for the sake of completeness, I also attached the Saleae Logic 2 traces I screenshotted further up. The first one is from our application, the second one from the mesh example.

    softdevice_assert_logic_traces.zip

    Additionally, I'd like to ask what the Softdevice assert at PC=0x24f24 indicates? I assume this is the one that gets raised when the timeslot was ended too late? I very rarely see this one (as opposed to the "regular" one at PC=0x15810) and assume it can be averted by the workaround suggested above, increasing TIMESLOT_END_SAFETY_MARGIN_US.

    Looking forward to the results of the team & thanks for your support!

    -mike

Children
  • Hi Mike,

    Thanks for the traces.

    Regarding the assert at PC=0x24f24 that can happen if the application used too much time in the callback handler as the SoftDevice timing may be thrown off.

    Einar

  • Hi Einar,

    I looked a bit more into the behaviour of the device when this issue occurs - and tried to implement a workaround, too.

    Tracing RTC0->EVENTS_COMPARE[1]

    I did trace RTC0->EVENTS_COMPARE[1] via PPI with a logic analyzer, because this seems to be the timing reference the softdevice uses to assert timeslots are ended in time.

    It looks like the compare event fires significantly earlier than normal when the unexpected behaviour occurs even though the timeslot was extended to the exact same duration as every other timeslot before.

    I noticed that during undisrupted/regular operation, the period of RTC0->EVENTS_COMPARE[1] is in the range of 10'000'457us to 10'000'486us, with a standard deviation of ~28us, whereas the compare-event that occurs right before the unexpeced IDLE event fires is raised after e.g. 9'999'971us, so 486us earlier than the earliest of the other compare events, which is an order of magnitude more.

    It looks like NRF_EVT_RADIO_SESSION_IDLE is then invoked within around 60us of RTC0->EVENTS_COMPARE[1], so lookis like this is triggered from the RTC0 COMPARE1 interrupt handler.

    See the following logic analyzer trace (attached as Saleae Logic 2 data file, too).

    sd-assert-rtc-debug-1.zip


    Tracing RTC0->CC[1]
    I also looked at how the RTC0->CC[1] value changes when timeslots are extended. The nRF SDK for Mesh timeslot extension algorithm initially (when invoking sd_radio_request()) tries to reserve a "short" timeslot of 14000us duration and then tries to match the previous length by extending it up to 10'000'000us (without BLE activity this will be achieved).

    I tried to see whether there were any anomalies when calculating the CC[1] value for the RTC, but did not observe any. When receiving NRF_RADIO_CALLBACK_SIGNAL_TYPE_START, the difference between NRF_RTC0->COUNTER and NRF_RTC0->CC[1] is consistently at 459, corresponding to around 14'000us. After extending to 10s (NRF_RADIO_CALLBACK_SIGNAL_TYPE_EXTEND_SUCCEED), the difference is consistently at 327681, corresponding to ~10'000'000us. This is no different for the timeslot in which the observed issue occurs.

    Looking at another logic analyzer trace where I also traced the LFCLK, (RTC0 TICK event via PPI), I came to the conclusion that the reason of the early compare event apparently seems to be RC clock drift that is larger than assumed.

    The amount of ticks counted from about the time the timeslot extension concluded up to the point the RTC0 COMPARE1 event fires was consistently higher for the period after which NRF_EVT_RADIO_SESSION_IDLE is invoked, while the measured duration was consistently significantly shorter.

    I attached the Logic trace (recorded with the below workaround active, so there are no Softdevice asserts happening...)

    sd-assert-measure-rtc-tick.zip

    Workaround

    I figured - since I can actually detect the unexpected NRF_EVT_RADIO_SESSION_IDLE, I could try a workaround where I just force stop all my timeslot activity (while disabling the MWU region that watches RADIO and TIMER0...) and re-order a new timeslot. It seems to be a viable workaround since in our application this issue seems to occur only or at least mostly while there is no BLE activity (advertising or connection) at all.

    A bug in the timeslot_timer prevented the workaround to work from the get-go, but once I figured that out, it started to look promising.

    I do not include the code here, since it's probably a bad idea to actually use it...

    The actual solution

    As it turns out, the actual solution to the problem seems to be way simpler and less hacky than my workaround. It looks like for some reason the "normal" values of NRF_SDH_CLOCK_LF_RC_CTIV (16, corresponding to 4 seconds) and NRF_SDH_CLOCK_LF_RC_TEMP_CTIV (2, corresponding to force recalibration every second interval - so I assume 8 seconds), seem to not result in an accurate enough LFCLK. I reduced NRF_SDH_CLOCK_LF_RC_CTIV to 4 (corresponding to every second) and NRF_SDH_CLOCK_LF_RC_TEMP_CTIV to 0 (corresponding to recalibration at every interval).

    So far, this has been running stable even on the "wonkier" devices. So two questions remain:

    1. How do I determine the "correct" values for NRF_SDH_CLOCK_LF_RC_CTIV and NRF_SDH_CLOCK_LF_RC_TEMP_CTIV? Why were the defaults not appropriate?
    2. Why does the softdevice not assert once it knows that timeslot is actually overstretched, but only fire the IDLE event? The COMPARE0-IRQ of RTC0 is enabled and active, after all so the possibility for the assert is there! This looks like unexpected behaviour by the softdevice.

    EDIT:

    3. How is the RC oscillator calibrated? Does this run in hardware entirely or is there code running, taking care of this? Is this running in interrupt handlers and at what priority? With As far as I understand , a temperature measurement is required if NRF_SDH_CLOCK_LF_RC_TEMP_CTIV is >= 2. Would be great if you could elaborate on this!

  • Hi Mike,

    These are very interesting findings.

    m.wagner said:
    How do I determine the "correct" values for NRF_SDH_CLOCK_LF_RC_CTIV and NRF_SDH_CLOCK_LF_RC_TEMP_CTIV? Why were the defaults not appropriate?

    This is a good question. I have not found any reports of similar issues with the LFRC. Using regular calibration it should stay within +-500 ppm. If there is an issue here I would expect many other customers seeing a set of other issues as well (for instance missing BLE timing etc.), so it is a bit strange. We have found the default configuration, which you have described, to be good enough to not cause problems while not calibrating too often (in order to save power). I am not able to explain why this does not seem to be the case here. If it was just one device then I would suspect an issue with that specific device, but that doe snot seem to be the case her based on earlier discussions - though there are variations. 

    m.wagner said:
    Why does the softdevice not assert once it knows that timeslot is actually overstretched, but only fire the IDLE event? The COMPARE0-IRQ of RTC0 is enabled and active, after all so the possibility for the assert is there! This looks like unexpected behaviour by the softdevice.

    I see your point, but that is not in line with how the timeslot API is designed. The application does not signal that the timeslot has ended (sd_radio_session_close() is only called when a session ends, i.e. when the app no longer needs timeslots), so the SoftDevice does not know that the app did not respect the time until something unexpectedly happens when it should not.

    m.wagner said:
    How is the RC oscillator calibrated? Does this run in hardware entirely or is there code running, taking care of this? Is this running in interrupt handlers and at what priority? With As far as I understand , a temperature measurement is required if NRF_SDH_CLOCK_LF_RC_TEMP_CTIV is >= 2. Would be great if you could elaborate on this!

    Calibration is done in hardware. The SoftDevice simply starts the HFXO and triggers the CAL task (NRF_CLOCK_TASK_CALL) to trigger calibration and HW does the rest.

    Einar

  • Hi Einar

    I have not found any reports of similar issues with the LFRC. Using regular calibration it should stay within +-500 ppm. If there is an issue here I would expect many other customers seeing a set of other issues as well (for instance missing BLE timing etc.), so it is a bit strange. We have found the default configuration, which you have described, to be good enough to not cause problems while not calibrating too often (in order to save power). I am not able to explain why this does not seem to be the case here. If it was just one device then I would suspect an issue with that specific device, but that doe snot seem to be the case her based on earlier discussions - though there are variations. 

    So, unfortunately I'm not sure what to make of this. It looks like there is something wrong, but what?

    • Could there be an issue with the custom module or environment? I can't imagine that there are temperature variations > 0.5°C within 4s - that would really be unexpected, but I'm not too knowledgeable about hardware these days...
    • We did not observe any unexpected behaviour during BLE connections - which sometimes are established for rather long durations! In fact, it looked like the application runs more stable while BLE is active.
    • As I said, it looks like PCA20020 shows this same occasional behaviour from time to time as well!

    EDIT: I just reverted back to the recommended original LFCLK configuration and traced NRF_CLOCK->EVENTS_DONE and it does not look like any temperature drifts are detected, the calibration is done at the 8s interval throughout!

    I see your point, but that is not in line with how the timeslot API is designed. The application does not signal that the timeslot has ended (sd_radio_session_close() is only called when a session ends, i.e. when the app no longer needs timeslots), so the SoftDevice does not know that the app did not respect the time until something unexpectedly happens when it should not.

    I'm not sure I agree entirely - at least from my "black box perspective": the softdevice still signals NRF_EVT_RADIO_SESSION_IDLE at this point, so it does react somehow at the time the timeslot should have ended. So it looks like it should know that the application did not respect the time.

    But if this is normal behaviour, I think it would be great if it were somehow documented! I don't think I've seen this documented in the Softdevice specification.

    Of course the SD would assert as soon as the application accesses one of the MWU-protected peripherals or something similar.

    Calibration is done in hardware. The SoftDevice simply starts the HFXO and triggers the CAL task (NRF_CLOCK_TASK_CALL) to trigger calibration and HW does the rest.

    Sorry, I was not clear with my question. I'm aware of the NRF_CLOCK->TASK_CAL, but I was more trying to find out when and how this is triggered (through software? PPI? from what interrupt priority? Is there any difference in handling if temperature variation triggers the calibration?

    I'm just trying to figure out if there are any negative impacts if the LF-clock is calibrated more often besides power consumption. (We are not running on battery and have the RADIO receiver enabled almost all of the time - so power consumption is not an issue.)

    Sorry for the unending amount of questions, but this whole issue does bother us quite a bit, as we still lack understanding in cause and details. I'm really happy to have found the workaround with more frequent calibration, thogh, it's quite relieving to be honest, as it looks like even the most "sensitive" device we have around is now running rock-solid.

    Thank you for your patience & support. Best regards,

    -mike

  • Hi Mike,

    m.wagner said:
    So, unfortunately I'm not sure what to make of this. It looks like there is something wrong, but what?

    It seems clear that this is clock related, but not exactly how. Did you by any chance test with NRF_CLOCK_LF_SRC_SYNTH? In that case the LF clock and HF clock would use the same clock source so there will not be any drift. Perhaps it is somehow related to erratum 146 though I do not immediately see how.

    Update: There is a confirmed bug in the SoftDevice where it does not account for drift between HF clock and LF clock in the internal scheduler. I do not immediately see how it relates to this specific case as the SoftDevice only use the RTC and no timer in timeslots, so it would be up to the application to handle this drift, which it does. There may be something here, though.

    m.wagner said:
    I'm not sure I agree entirely - at least from my "black box perspective": the softdevice still signals NRF_EVT_RADIO_SESSION_IDLE at this point, so it does react somehow at the time the timeslot should have ended. So it looks like it should know that the application did not respect the time.

    Yes, you are right. It is the applications responsibility to clean up so that it have not configured any interrupts that will happen after the end of the timeslot on the restricted peripherals (like TIMER0). I agree that is not clearly stated though.

    m.wagner said:
    I was more trying to find out when and how this is triggered (through software? PPI? from what interrupt priority? Is there any difference in handling if temperature variation triggers the calibration?

    The decision on whether to calibrate or not depends on configuration as discussed. In the implementation a calibration timer fires regularly and the temperature is checked. If temperature has changed more than the threshold, calibration will be performed. If not a check if calibration has been skipped for too long is done, and if it is, calibration is performed. The actual calibration is done in the same way regardless of why it was triggered.

    The calibration itself consists of very little code, most importantly these steps:

    1. Request the HFXO clock
    2. Check if HW clock is running (it takes some time to start). If it is continue, if not configure an interrupt on the HFCLOCK started event so that calibration can start when clock is ready
    3. Apply first part of workaround for erratum 192.
    4. Trigger NRF_CLOCK->TASKS_CAL
    5. Wait for calibration to finish signaled by an NRF_CLOCK->EVENTS_DONE.
    6. Do last part of erratum 192 workaround
    7. release HFXO clock

    Another thing... When initializing the mesh stack with mesh_stack_init(), you pass a mesh_stack_init_params_t instance which has a member core.lfclksrc. Is this correctly configured, with the correct accuracy?

    Einar

Related