This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Softdevice Assert at PC=0x15810 (S132 7.2.0) / RTC clock drift when using timeslot

I'm trying to narrow down the cause of a Softdevice assertion happening in S132 7.2.0 at PC=0x15810.

We set up a proprietary RF project which utilises parts of the SDK for Mesh (specifically, the timeslot implementation and bearer_handler) because it provides a safe base to run high performance timeslot applications on. Unfortunately I do have one device which runs into a softdevice assertion at instruction 0x15810. I feel that it is a timing issue - maybe the device is operating at the outer limits of the clock accuracy, because while the issue appears sporadically on Development Kits or other devices, this specific device does trigger it quite often.

  • What exact assertion fails when at PC=0x15810?
  • Does the Softdevice shut down TIMER0 before doing this test or after an assertion fails?
  • Are timing assertions made by the softdevice based on RTC0?
  • Is there a reason why TIMER0 in the mesh stack is running in 24-bit mode as opposed to 32-bit mode?

Any help is greatly appreciated.

EDIT: In the meantime I think I found the cause of the issue. Assuming that the timing assertions by the Softdevice are done using RTC0, there seems to be a rather large discrepancy between the RTC0 timing and TIMER0 timing. After 9'999'249us on TIMER0 pass, RTC0 has counted 10'000'732us, so they're almost 1ms apart!

The device in question is running the LFCLK from the RC oscillator and we do usually have BLE deactivated. I did assume that the softdevice takes care of adjusting for clock drift, but could it be that I have to somehow take care of this manually?

EDIT2: Note that - as we're using the nRF SDK for Mesh as a codebase - when calculating the available time on the timeslot, we should already account for clock drift per the following calculation:

(p_timeslot->length_us * (m_lfclk_ppm + HFCLOCK_PPM_WORST_CASE)) / 1000000;

EDIT3: I previously wrote that we "do usually have BLE deactivated". What I actually mean by this is that most of the time the device is neither connected nor is it currently advertising. So there is no BLE activity to schedule by the softdevice. Timeslots are always active, though.

Parents
  • Hi,

    The assert at 0x15810 is because the SoftDevice got an unexpected raio interrupt, which is typically because the application used the radio outside of the timeslot. Regarding LFCLK calibration that is handled automatically by the SoftDevice. And in any case RTC drift should not be relevant here, as there is just a single low frequency clock source in the nRF, so even if there was a significant drift, the app and SoftDevice would be drifting "together".

    Looking at your timeslot length calculation, it seems like you add the worst case drift value to the duration. Is that intentional? Should it not be subtracted?

  • Hi Einar,

    Besides the issue with the discrepancies between RTC0 time and TIMER4 time, I noticed that TIMER0 is shut down while control has not yet returned to the softdevice.

    Just to detail the setup: The issue seems to always occur at the same point in software. While a timeslot is granted, I have a timer running at highest IRQ priority, triggering a radio interrupt via software (NVIC_SetPendingIRQ(RADIO_IRQn)). This is used for frequency hopping and implemented like this for easier handling of the radio control state machine in our proprietary receiver.

    The softdevice seems to assert after this interrupt has fired. So the cause of the assert that you have given would make sense. BUT at the point the assertion occurs, the timeslot has not ended!  It was still ongoing. I added a few checks into the above IRQ handler and noticed that at the point where the assert would occur, the timeslot *should* have already ended, but for some reason TIMER0 which should be in charge of ending and cleaning up the timeslot was shutdown. I.e. the CC and INTENSET registers are cleared and when I try to manually capture the current timer value using the debuger (setting TASKS_CAPTURE[n] to 1), nothing happens. I do have no code or PPI whatsoever that would shut down TIMER0 in my software and it is only used for controlling timeslot duration - and sometimes capturing its value. So it has to be shut down by the softdevice at some point.

    Could you imagine when and why TIMER0 could get shutdown by the softdevice?

    Also, asking again; does the assert 0x15810 also fail if the timeslot during which the Radio IRQ is invoked went on for too long?

    Thanks for your help.

  • Hi Mike,

    I am sorry this has take some time. I have talked to the mesh team and they are looking into this. I will let you know when they make some progress or have some interesting observations to share.

    Einar

  • Hi Mike,

    The team has looked more into this but have not got to the bottom of it. They suggest that you can try to they increase TIMESLOT_END_SAFETY_MARGIN_US in steps of 100 us upto 1000 us.

    Einar

  • Hi Einar,

    I tried to subsequently increase the TIMESLOT_END_SAFETY_MARGIN_US until either not observing any Softdevice Asserts or reaching 1000us. Unfortunately, I did always observe the described softdevice assert at some point. It looks like increasing it reduces the probability of the asserts' occurence, though which is at least something.

    Still hoping for a definitive solution!

    Thank you & best regards,

    -mike

  • Hi Mike,

    This is a tricky one it seems. I cannot promise anything at this point but I have forwarded this to the team. I will let you know if they make any progress.

    Einar

  • Hi Einar,

    While trying to reproduce another timeslot-related issue (Softdevice event handler continuously called with NRF_EVT_RADIO_BLOCKED), I found another angle by which I seem to be able to reproduce the same basic issue.

    Hung Bui, the support engineer in the linked devzone entry, convinced me to try to reproduce the other issue based on his nRF52 BLE - ESB Timeslot example. Doing this, I managed to reproduce this issue here, wehere the softdevice event handler unexpectedly gets called with NRF_EVT_RADIO_SESSION_IDLE on an nRF52DK (PCA10040).

    Steps to reproduce with his example:

    1. Get the linked example to run as per its instructions.
    2. Attach the patch attached to this comment
    3. Run the example
    4. Connect a BLE central to the peripheral (advertising as "Nordic_UART").

    This will result in NRF_EVT_RADIO_SESSION_IDLE being invoked within a few milliseconds after establishing the BLE connection.

    The patch contains the following changes:

    • Added and enabled GPIOs for debugging:
      • P0.03 - Signals whether device is in timeslot
      • P0.04 - High while in radio callback (signal ID is indicated by repeated toggles)
      • P0.28 - Set when callback_action set to NRF_RADIO_SIGNAL_CALLBACK_ACTION_EXTEND, cleared on EXTEND_SUCCEEDED or EXTEND_FAILED
      • P0.29 - High while serving SIGNAL_TYPE_TIMER0
      • P0.30 - High while in softdevice signal handler (signal ID is indicated by repeated toggles)
      • P0.31 - High while serving TIMER0 interrupt where timeslot is ended.
      • P0.25 - Indicates EXTEND_SUCCEEDED.
    • Reduced the length_us parameter in the radio request to 3800 in case a timeslot is BLOCKED or CANCELED.

    example-uart-esb-timeslot.patch

    So, the issue occurs immediately if the length_us of the requested timeslot is reduced from 5000 to 3800. I did not play around with the values any further!

    I hope this may help you in getting to the cause of this issue! In the meantime, I'll see if increasing the minimum length requested may resolve it in may application!

    Best regards,

    -mike

Reply
  • Hi Einar,

    While trying to reproduce another timeslot-related issue (Softdevice event handler continuously called with NRF_EVT_RADIO_BLOCKED), I found another angle by which I seem to be able to reproduce the same basic issue.

    Hung Bui, the support engineer in the linked devzone entry, convinced me to try to reproduce the other issue based on his nRF52 BLE - ESB Timeslot example. Doing this, I managed to reproduce this issue here, wehere the softdevice event handler unexpectedly gets called with NRF_EVT_RADIO_SESSION_IDLE on an nRF52DK (PCA10040).

    Steps to reproduce with his example:

    1. Get the linked example to run as per its instructions.
    2. Attach the patch attached to this comment
    3. Run the example
    4. Connect a BLE central to the peripheral (advertising as "Nordic_UART").

    This will result in NRF_EVT_RADIO_SESSION_IDLE being invoked within a few milliseconds after establishing the BLE connection.

    The patch contains the following changes:

    • Added and enabled GPIOs for debugging:
      • P0.03 - Signals whether device is in timeslot
      • P0.04 - High while in radio callback (signal ID is indicated by repeated toggles)
      • P0.28 - Set when callback_action set to NRF_RADIO_SIGNAL_CALLBACK_ACTION_EXTEND, cleared on EXTEND_SUCCEEDED or EXTEND_FAILED
      • P0.29 - High while serving SIGNAL_TYPE_TIMER0
      • P0.30 - High while in softdevice signal handler (signal ID is indicated by repeated toggles)
      • P0.31 - High while serving TIMER0 interrupt where timeslot is ended.
      • P0.25 - Indicates EXTEND_SUCCEEDED.
    • Reduced the length_us parameter in the radio request to 3800 in case a timeslot is BLOCKED or CANCELED.

    example-uart-esb-timeslot.patch

    So, the issue occurs immediately if the length_us of the requested timeslot is reduced from 5000 to 3800. I did not play around with the values any further!

    I hope this may help you in getting to the cause of this issue! In the meantime, I'll see if increasing the minimum length requested may resolve it in may application!

    Best regards,

    -mike

Children
  • Hi Mike,

    Thanks for the update. I will forward the information to the team.

  • Hi Einar,

    Thanks. Just for the sake of completeness, I also attached the Saleae Logic 2 traces I screenshotted further up. The first one is from our application, the second one from the mesh example.

    softdevice_assert_logic_traces.zip

    Additionally, I'd like to ask what the Softdevice assert at PC=0x24f24 indicates? I assume this is the one that gets raised when the timeslot was ended too late? I very rarely see this one (as opposed to the "regular" one at PC=0x15810) and assume it can be averted by the workaround suggested above, increasing TIMESLOT_END_SAFETY_MARGIN_US.

    Looking forward to the results of the team & thanks for your support!

    -mike

  • Hi Mike,

    Thanks for the traces.

    Regarding the assert at PC=0x24f24 that can happen if the application used too much time in the callback handler as the SoftDevice timing may be thrown off.

    Einar

  • Hi Einar,

    I looked a bit more into the behaviour of the device when this issue occurs - and tried to implement a workaround, too.

    Tracing RTC0->EVENTS_COMPARE[1]

    I did trace RTC0->EVENTS_COMPARE[1] via PPI with a logic analyzer, because this seems to be the timing reference the softdevice uses to assert timeslots are ended in time.

    It looks like the compare event fires significantly earlier than normal when the unexpected behaviour occurs even though the timeslot was extended to the exact same duration as every other timeslot before.

    I noticed that during undisrupted/regular operation, the period of RTC0->EVENTS_COMPARE[1] is in the range of 10'000'457us to 10'000'486us, with a standard deviation of ~28us, whereas the compare-event that occurs right before the unexpeced IDLE event fires is raised after e.g. 9'999'971us, so 486us earlier than the earliest of the other compare events, which is an order of magnitude more.

    It looks like NRF_EVT_RADIO_SESSION_IDLE is then invoked within around 60us of RTC0->EVENTS_COMPARE[1], so lookis like this is triggered from the RTC0 COMPARE1 interrupt handler.

    See the following logic analyzer trace (attached as Saleae Logic 2 data file, too).

    sd-assert-rtc-debug-1.zip


    Tracing RTC0->CC[1]
    I also looked at how the RTC0->CC[1] value changes when timeslots are extended. The nRF SDK for Mesh timeslot extension algorithm initially (when invoking sd_radio_request()) tries to reserve a "short" timeslot of 14000us duration and then tries to match the previous length by extending it up to 10'000'000us (without BLE activity this will be achieved).

    I tried to see whether there were any anomalies when calculating the CC[1] value for the RTC, but did not observe any. When receiving NRF_RADIO_CALLBACK_SIGNAL_TYPE_START, the difference between NRF_RTC0->COUNTER and NRF_RTC0->CC[1] is consistently at 459, corresponding to around 14'000us. After extending to 10s (NRF_RADIO_CALLBACK_SIGNAL_TYPE_EXTEND_SUCCEED), the difference is consistently at 327681, corresponding to ~10'000'000us. This is no different for the timeslot in which the observed issue occurs.

    Looking at another logic analyzer trace where I also traced the LFCLK, (RTC0 TICK event via PPI), I came to the conclusion that the reason of the early compare event apparently seems to be RC clock drift that is larger than assumed.

    The amount of ticks counted from about the time the timeslot extension concluded up to the point the RTC0 COMPARE1 event fires was consistently higher for the period after which NRF_EVT_RADIO_SESSION_IDLE is invoked, while the measured duration was consistently significantly shorter.

    I attached the Logic trace (recorded with the below workaround active, so there are no Softdevice asserts happening...)

    sd-assert-measure-rtc-tick.zip

    Workaround

    I figured - since I can actually detect the unexpected NRF_EVT_RADIO_SESSION_IDLE, I could try a workaround where I just force stop all my timeslot activity (while disabling the MWU region that watches RADIO and TIMER0...) and re-order a new timeslot. It seems to be a viable workaround since in our application this issue seems to occur only or at least mostly while there is no BLE activity (advertising or connection) at all.

    A bug in the timeslot_timer prevented the workaround to work from the get-go, but once I figured that out, it started to look promising.

    I do not include the code here, since it's probably a bad idea to actually use it...

    The actual solution

    As it turns out, the actual solution to the problem seems to be way simpler and less hacky than my workaround. It looks like for some reason the "normal" values of NRF_SDH_CLOCK_LF_RC_CTIV (16, corresponding to 4 seconds) and NRF_SDH_CLOCK_LF_RC_TEMP_CTIV (2, corresponding to force recalibration every second interval - so I assume 8 seconds), seem to not result in an accurate enough LFCLK. I reduced NRF_SDH_CLOCK_LF_RC_CTIV to 4 (corresponding to every second) and NRF_SDH_CLOCK_LF_RC_TEMP_CTIV to 0 (corresponding to recalibration at every interval).

    So far, this has been running stable even on the "wonkier" devices. So two questions remain:

    1. How do I determine the "correct" values for NRF_SDH_CLOCK_LF_RC_CTIV and NRF_SDH_CLOCK_LF_RC_TEMP_CTIV? Why were the defaults not appropriate?
    2. Why does the softdevice not assert once it knows that timeslot is actually overstretched, but only fire the IDLE event? The COMPARE0-IRQ of RTC0 is enabled and active, after all so the possibility for the assert is there! This looks like unexpected behaviour by the softdevice.

    EDIT:

    3. How is the RC oscillator calibrated? Does this run in hardware entirely or is there code running, taking care of this? Is this running in interrupt handlers and at what priority? With As far as I understand , a temperature measurement is required if NRF_SDH_CLOCK_LF_RC_TEMP_CTIV is >= 2. Would be great if you could elaborate on this!

  • Hi Mike,

    These are very interesting findings.

    m.wagner said:
    How do I determine the "correct" values for NRF_SDH_CLOCK_LF_RC_CTIV and NRF_SDH_CLOCK_LF_RC_TEMP_CTIV? Why were the defaults not appropriate?

    This is a good question. I have not found any reports of similar issues with the LFRC. Using regular calibration it should stay within +-500 ppm. If there is an issue here I would expect many other customers seeing a set of other issues as well (for instance missing BLE timing etc.), so it is a bit strange. We have found the default configuration, which you have described, to be good enough to not cause problems while not calibrating too often (in order to save power). I am not able to explain why this does not seem to be the case here. If it was just one device then I would suspect an issue with that specific device, but that doe snot seem to be the case her based on earlier discussions - though there are variations. 

    m.wagner said:
    Why does the softdevice not assert once it knows that timeslot is actually overstretched, but only fire the IDLE event? The COMPARE0-IRQ of RTC0 is enabled and active, after all so the possibility for the assert is there! This looks like unexpected behaviour by the softdevice.

    I see your point, but that is not in line with how the timeslot API is designed. The application does not signal that the timeslot has ended (sd_radio_session_close() is only called when a session ends, i.e. when the app no longer needs timeslots), so the SoftDevice does not know that the app did not respect the time until something unexpectedly happens when it should not.

    m.wagner said:
    How is the RC oscillator calibrated? Does this run in hardware entirely or is there code running, taking care of this? Is this running in interrupt handlers and at what priority? With As far as I understand , a temperature measurement is required if NRF_SDH_CLOCK_LF_RC_TEMP_CTIV is >= 2. Would be great if you could elaborate on this!

    Calibration is done in hardware. The SoftDevice simply starts the HFXO and triggers the CAL task (NRF_CLOCK_TASK_CALL) to trigger calibration and HW does the rest.

    Einar

Related