I'm trying to narrow down the cause of a Softdevice assertion happening in S132 7.2.0 at PC=0x15810.
We set up a proprietary RF project which utilises parts of the SDK for Mesh (specifically, the timeslot implementation and bearer_handler) because it provides a safe base to run high performance timeslot applications on. Unfortunately I do have one device which runs into a softdevice assertion at instruction 0x15810. I feel that it is a timing issue - maybe the device is operating at the outer limits of the clock accuracy, because while the issue appears sporadically on Development Kits or other devices, this specific device does trigger it quite often.
Any help is greatly appreciated.
EDIT: In the meantime I think I found the cause of the issue. Assuming that the timing assertions by the Softdevice are done using RTC0, there seems to be a rather large discrepancy between the RTC0 timing and TIMER0 timing. After 9'999'249us have passed on TIMER0, RTC0 has counted 10'000'732us, so they're roughly 1.5ms apart!
The device in question is running the LFCLK from the RC oscillator and we do usually have BLE deactivated. I did assume that the softdevice takes care of adjusting for clock drift, but could it be that I have to somehow take care of this manually?
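To put the measured discrepancy in perspective, it can help to express it in ppm. A minimal sketch using only the two readings quoted above; the helper name `clock_offset_ppm` is made up for illustration and is not part of any SDK:

```c
#include <assert.h>
#include <stdint.h>

/* Relative offset of the RTC0-derived time against the TIMER0 time, in ppm.
 * Positive means RTC0 (LFCLK) counted more time than TIMER0 (HFCLK).
 * Hypothetical helper for illustration, not an SDK function. */
static double clock_offset_ppm(uint32_t rtc_us, uint32_t timer_us)
{
    assert(timer_us != 0u);  /* avoid dividing by zero */
    double diff_us = (double)rtc_us - (double)timer_us;
    return diff_us * 1e6 / (double)timer_us;
}
```

With the readings above, `clock_offset_ppm(10000732u, 9999249u)` comes out at roughly 148 ppm, i.e. a difference of 1483us over about 10 s. That figure can then be compared against the accuracy configured for the LFCLK source.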
EDIT2: Note that - as we're using the nRF SDK for Mesh as a codebase - when calculating the available time on the timeslot, we should already account for clock drift per the following calculation:
(p_timeslot->length_us * (m_lfclk_ppm + HFCLOCK_PPM_WORST_CASE)) / 1000000;
EDIT3: I previously wrote that we "do usually have BLE deactivated". What I actually mean by this is that most of the time the device is neither connected nor is it currently advertising. So there is no BLE activity to schedule by the softdevice. Timeslots are always active, though.
The team has looked into this further but has not got to the bottom of it. They suggest that you try increasing TIMESLOT_END_SAFETY_MARGIN_US in steps of 100 us, up to 1000 us.
The assert at 0x15810 is because the SoftDevice got an unexpected radio interrupt, which is typically because the application used the radio outside of the timeslot. Regarding LFCLK calibration, that is handled automatically by the SoftDevice. And in any case RTC drift should not be relevant here, as there is just a single low frequency clock source in the nRF, so even if there was a significant drift, the app and SoftDevice would be drifting "together".
Looking at your timeslot length calculation, it seems like you add the worst case drift value to the duration. Is that intentional? Should it not be subtracted?
Besides the issue with the discrepancies between RTC0 time and TIMER4 time, I noticed that TIMER0 is shut down while control has not yet returned to the softdevice.
Just to detail the setup: The issue seems to always occur at the same point in software. While a timeslot is granted, I have a timer running at highest IRQ priority, triggering a radio interrupt via software (NVIC_SetPendingIRQ(RADIO_IRQn)). This is used for frequency hopping and implemented like this for easier handling of the radio control state machine in our proprietary receiver.
The softdevice seems to assert after this interrupt has fired. So the cause of the assert that you have given would make sense. BUT at the point the assertion occurs, the timeslot has not ended! It was still ongoing. I added a few checks into the above IRQ handler and noticed that at the point where the assert would occur, the timeslot *should* have already ended, but for some reason TIMER0, which should be in charge of ending and cleaning up the timeslot, had been shut down. I.e. the CC and INTENSET registers are cleared and when I try to manually capture the current timer value using the debugger (setting TASKS_CAPTURE[n] to 1), nothing happens. I have no code or PPI whatsoever in my software that would shut down TIMER0, and it is only used for controlling timeslot duration - and sometimes capturing its value. So it has to be shut down by the softdevice at some point.
Could you imagine when and why TIMER0 could get shut down by the softdevice?
Also, asking again; does the assert 0x15810 also fail if the timeslot during which the Radio IRQ is invoked went on for too long?
Thanks for your help.
m.wagner said: Concerning drift: But the discrepancy appears between the HFCLK and LFCLK, no? TIMER0 runs off the HFCLK and is used to end the timeslot in time. The softdevice uses RTC0 which runs off the LFCLK to assert that the timeslot is ended in time. Am I wrong here? How could you explain the discrepancy I observe?
Yes, if you use a timer peripheral to time the time slot duration, that is correct. I did not consider that, as you would typically use a low power timer for such applications (like the SoftDevice, which uses RTC0). But with a timer, then yes, you would use independent clock sources. The LFCLK accuracy would in the worst case be 500 ppm (calibrated LFRC), so as long as you take that into account it should not be a problem (see this note).
m.wagner said: Besides the issue with the discrepancies between RTC0 time and TIMER4 time, I noticed that TIMER0 is shut down while control has not yet returned to the softdevice.
This can only happen after the time slot has ended.
m.wagner said: Could you imagine when and why TIMER0 could get shut down by the softdevice?
If the app has overstayed the timeslot, that could happen (SoftDevice would run again from an RTC0 interrupt).
m.wagner said: Also, asking again; does the assert 0x15810 also fail if the timeslot during which the Radio IRQ is invoked went on for too long?
If I understand the question correctly, then indirectly, yes - if you are using the radio after the time slot has elapsed. If on the other hand the question is whether you get this assert while the IRQ is still being processed (the interrupt occurred before the end of the timeslot, but the ISR takes a long time), then no - in that case you would not get this assert.
Thanks for your response. I just re-calculated the drift and realised that the discrepancy I mentioned above should be well within the margins for drift that I (or the nRF Mesh SDK team...) have accounted for.
So, for example in my current measurement, 327'716 ticks elapsed on RTC0 while 9'999'746 us were measured with TIMER0.
t_diff = rtc_ticks * 1000000 / 32768 - timer0_us
       = 327716 * 1000000 / 32768 - 9999746
       ≈ 1352us
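The same conversion can be written as a small C helper. This is only a sketch: the function names are made up for illustration, and it assumes RTC0 ticks at 32768 Hz (prescaler 0), as in the calculation above.

```c
#include <assert.h>
#include <stdint.h>

#define LFCLK_HZ 32768u  /* assumed RTC0 tick rate (prescaler 0) */

/* Convert RTC ticks to microseconds, rounding down.
 * A 64-bit intermediate is used because ticks * 1000000 no longer fits
 * in 32 bits for more than 4294 ticks. */
static uint32_t rtc_ticks_to_us(uint32_t ticks)
{
    assert(ticks <= 0x00FFFFFFu);  /* the RTC COUNTER register is 24 bits wide */
    return (uint32_t)(((uint64_t)ticks * 1000000u) / LFCLK_HZ);
}

/* Signed difference between RTC-derived time and TIMER0 time, in us.
 * Positive means the RTC counted more time than TIMER0. */
static int32_t rtc_timer_diff_us(uint32_t rtc_ticks, uint32_t timer0_us)
{
    return (int32_t)((int64_t)rtc_ticks_to_us(rtc_ticks) - (int64_t)timer0_us);
}
```

For the measurement above, `rtc_timer_diff_us(327716u, 9999746u)` evaluates to 1352.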
With the requested timeslot duration of 10'000'000us, and the formula to calculate the end time, we get a maximum drift of...
t_drift = end_margin + overhead + ((t_length * (lfclk_ppm + hfclk_ppm)) / 1000000)
        = 100 + 20 + ((10000000 * (500 + 40)) / 1000000)
        = 5520us
So the above 1352us are well within the 5520us.
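As a cross-check, the budget calculation can be sketched in C. The macro and function names below are placeholders, not the identifiers used in the nRF SDK for Mesh, and the sketch assumes that the allowance is subtracted from the requested slot length to get the point at which TIMER0 should start the cleanup.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the calculation above; the macro names are placeholders. */
#define END_MARGIN_US         100u
#define END_OVERHEAD_US        20u
#define HFCLK_PPM_WORST_CASE   40u

/* Total time reserved at the end of a timeslot: fixed margin, fixed overhead,
 * and worst-case drift between LFCLK and HFCLK over the slot length. */
static uint32_t end_allowance_us(uint32_t length_us, uint32_t lfclk_ppm)
{
    /* 64-bit product: 10000000 * 540 = 5.4e9 would wrap in 32-bit arithmetic. */
    uint64_t drift_us =
        ((uint64_t)length_us * (lfclk_ppm + HFCLK_PPM_WORST_CASE)) / 1000000u;
    return END_MARGIN_US + END_OVERHEAD_US + (uint32_t)drift_us;
}

/* Latest TIMER0 value (us since slot start) at which cleanup should begin,
 * assuming the allowance is subtracted from the requested length. */
static uint32_t latest_end_us(uint32_t length_us, uint32_t lfclk_ppm)
{
    uint32_t allowance_us = end_allowance_us(length_us, lfclk_ppm);
    assert(allowance_us < length_us);
    return length_us - allowance_us;
}
```

With a 10'000'000us slot and 500 ppm LFCLK accuracy, `end_allowance_us` returns 5520 and `latest_end_us` returns 9994480. One thing worth checking in the real code: if the same product were evaluated in 32-bit unsigned arithmetic, it would wrap to 1105032704 and give a drift term of 1105us instead of 5400us. Whether that applies to the expression quoted in EDIT2 depends on the operand types, which are not shown in this thread.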
What seems to be the issue is that for some reason the TIMER0 IRQ that should clean up and end the timeslot is never invoked.
Software event tracing showed that before the crash occurs, the radio signal handler is only ever invoked when triggered with the method described above. For clarification, I drew up what seems to happen. (sorry for the scribbled text, hope it's readable)
So at the point the SD asserts, in the example I'm looking at now, 1019us elapsed between when the TIMER0 IRQ handler should have been invoked and the measurement I made in the TIMER4 IRQ handler.
What is weird is that the TIMER0 IRQ handler is never invoked and at the point the softdevice asserts, the drift of RTC0 and TIMER0 seems to be well within the maximum allowed range.
There must be an explanation why the TIMER0 interrupt does not occur, but I cannot say what it is. You wrote that you have a high priority timer for frequency hopping (TIMER4?). Could this ISR be somehow running constantly, blocking processing the interrupt from TIMER0?
The TIMER4 interrupt is invoked periodically (roughly every 3000us, timer started in the RADIO IRQ handler), but all the handler does is clear the timer's compare event register, check for a radio address event and, if no radio address event is set, trigger a radio interrupt via NVIC_SetPendingIRQ(RADIO_IRQn).
(I did insert some more code to debug this issue, namely storing timer and rtc captures in a variable)
I do assume the handler runs fast enough that its run-time should be well within the 100us margin that is accounted for in TIMER0. In the above example, 1019us passed between when TIMER0 IRQ should have been active and TIMER4 IRQ was active. At the time TIMER4 was served, TIMER0 was probably shut down.
I'll have to check with the logic analyzer to get more detailed outputs. As stated before, the issue seems to occur more often with some devices than others...
m.wagner said: As stated before, the issue seems to occur more often with some devices than others...
It is interesting that there is variation from device to device... This may be a stupid question, as you would not get the radio working without it, but could it be that you for some reason are not using the HFXO in the timeslot, but are instead using the HFINT? That has a very large frequency tolerance (<±6 %). It is probably a long shot, but it could be worth double-checking.