I'm trying to narrow down the cause of a Softdevice assertion happening in S132 7.2.0 at PC=0x15810.
We set up a proprietary RF project which utilises parts of the SDK for Mesh (specifically, the timeslot implementation and bearer_handler) because it provides a safe base to run high performance timeslot applications on. Unfortunately I do have one device which runs into a softdevice assertion at instruction 0x15810. I feel that it is a timing issue - maybe the device is operating at the outer limits of the clock accuracy, because while the issue appears sporadically on Development Kits or other devices, this specific device does trigger it quite often.
Any help is greatly appreciated.
EDIT: In the meantime I think I found the cause of the issue. Assuming that the timing assertions by the Softdevice are done using RTC0, there seems to be a rather large discrepancy between the RTC0 timing and TIMER0 timing. After 9'999'249us pass on TIMER0, RTC0 has counted 10'000'732us, so they're 1'483us (almost 1.5ms) apart!
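To put the two readings into the same unit as the clock accuracy figures, the offset can be expressed in ppm. The following is only a minimal sketch for a host-side check; the function name `relative_drift_ppm` is made up for illustration and is not part of any SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Relative offset of a measured interval against a reference interval,
 * in parts per million. A positive result means the measured clock
 * counted more time than the reference clock over the same period.
 * Hypothetical helper, not an SDK function. */
static int32_t relative_drift_ppm(uint32_t reference_us, uint32_t measured_us)
{
    assert(reference_us != 0u);

    /* 64-bit intermediate so that diff * 1e6 cannot overflow. */
    int64_t diff_us = (int64_t)measured_us - (int64_t)reference_us;
    return (int32_t)((diff_us * 1000000) / (int64_t)reference_us);
}
```

With TIMER0 as the reference, `relative_drift_ppm(9999249u, 10000732u)` evaluates to 148, i.e. RTC0 ran about 148 ppm ahead of TIMER0 over those 10 seconds.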
The device in question is running the LFCLK from the RC oscillator and we do usually have BLE deactivated. I did assume that the softdevice takes care of adjusting for clock drift, but could it be that I have to somehow take care of this manually?
EDIT2: Note that - as we're using the nRF SDK for Mesh as a codebase - when calculating the available time on the timeslot, we should already account for clock drift per the following calculation:
(p_timeslot->length_us * (m_lfclk_ppm + HFCLOCK_PPM_WORST_CASE)) / 1000000;
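To make the arithmetic concrete, here is a minimal host-side sketch of that calculation. It assumes the result is a margin that gets subtracted from the timeslot length together with a fixed safety margin. `EXAMPLE_HFCLOCK_PPM_WORST_CASE` and its value of 40, as well as both function names, are assumptions for illustration and not taken from the mesh SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical value, for illustration only. */
#define EXAMPLE_HFCLOCK_PPM_WORST_CASE 40u

/* Time to reserve for combined LFCLK + HFCLK drift over the slot. */
static uint32_t drift_margin_us(uint32_t length_us, uint32_t lfclk_ppm)
{
    assert(lfclk_ppm <= 1000000u);

    /* 64-bit intermediate: length_us * ppm can exceed UINT32_MAX. */
    uint64_t scaled = (uint64_t)length_us *
                      (lfclk_ppm + EXAMPLE_HFCLOCK_PPM_WORST_CASE);
    return (uint32_t)(scaled / 1000000u);
}

/* Slot time left for the application after subtracting both margins. */
static uint32_t usable_length_us(uint32_t length_us,
                                 uint32_t lfclk_ppm,
                                 uint32_t safety_margin_us)
{
    uint32_t reserved = drift_margin_us(length_us, lfclk_ppm) + safety_margin_us;
    return (reserved >= length_us) ? 0u : (length_us - reserved);
}
```

Under these assumptions, `drift_margin_us(10000000u, 500u)` returns 5400, so a 10 s slot with a 500 ppm LFCLK would reserve 5.4 ms for drift. The 64-bit intermediate is a choice of this sketch: the product 10'000'000 * 540 does not fit into 32 bits. Whether that matters for the original expression depends on the operand types used there.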
EDIT3: I previously wrote that we "do usually have BLE deactivated". What I actually mean by this is that most of the time the device is neither connected nor is it currently advertising. So there is no BLE activity to schedule by the softdevice. Timeslots are always active, though.
m.wagner said: Concerning drift: But the discrepancy appears between the HFCLK and LFCLK, no? TIMER0 runs off the HFCLK and is used to end the timeslot in time. The softdevice uses RTC0 which runs off…
The team has looked more into this but has not got to the bottom of it. They suggest that you try increasing TIMESLOT_END_SAFETY_MARGIN_US in steps of 100 us up to 1000 us.
The assert at 0x15810 is because the SoftDevice got an unexpected radio interrupt, which is typically because the application used the radio outside of the timeslot. Regarding LFCLK calibration, that is handled automatically by the SoftDevice. And in any case RTC drift should not be relevant here, as there is just a single low frequency clock source in the nRF, so even if there was a significant drift, the app and SoftDevice would be drifting "together".
Looking at your timeslot length calculation, it seems like you add the worst case drift value to the duration. Is that intentional? Should it not be subtracted?
Besides the issue with the discrepancies between RTC0 time and TIMER4 time, I noticed that TIMER0 is shut down while control has not yet returned to the softdevice.
Just to detail the setup: The issue seems to always occur at the same point in software. While a timeslot is granted, I have a timer running at highest IRQ priority, triggering a radio interrupt via software (NVIC_SetPendingIRQ(RADIO_IRQn)). This is used for frequency hopping and implemented like this for easier handling of the radio control state machine in our proprietary receiver.
The softdevice seems to assert after this interrupt has fired. So the cause of the assert that you have given would make sense. BUT at the point the assertion occurs, the timeslot has not ended! It was still ongoing. I added a few checks into the above IRQ handler and noticed that at the point where the assert would occur, the timeslot *should* have already ended, but for some reason TIMER0, which should be in charge of ending and cleaning up the timeslot, had been shut down. I.e. the CC and INTENSET registers are cleared and when I try to manually capture the current timer value using the debugger (setting TASKS_CAPTURE[n] to 1), nothing happens. I do have no code or PPI whatsoever that would shut down TIMER0 in my software and it is only used for controlling timeslot duration - and sometimes capturing its value. So it has to be shut down by the softdevice at some point.
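The kind of check added to the IRQ handler can be sketched roughly as follows. This is a host-testable approximation, not the actual handler: `radio_irq_allowed` and its parameters are hypothetical names, and it assumes the elapsed time is measured from the start of the timeslot (e.g. a captured TIMER0 value).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decide whether it is still safe to pend the radio interrupt.
 * elapsed_us:     time since the timeslot started (assumed to come from
 *                 a TIMER0 capture in the real handler)
 * slot_length_us: granted timeslot length
 * guard_us:       time kept free at the end of the slot for cleanup
 * Hypothetical helper, not an SDK function. */
static bool radio_irq_allowed(uint32_t elapsed_us,
                              uint32_t slot_length_us,
                              uint32_t guard_us)
{
    /* A guard as long as the slot would leave no usable time at all. */
    assert(guard_us < slot_length_us);

    return elapsed_us < (slot_length_us - guard_us);
}
```

In the real handler, the call to NVIC_SetPendingIRQ(RADIO_IRQn) would only be made when this returns true. Note that such a check relies on the timer still running; if TIMER0 has been stopped, the captured value no longer advances and the check cannot detect an overrun.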
Could you imagine when and why TIMER0 could get shut down by the softdevice?
Also, asking again: does the assert at 0x15810 also fire if the timeslot during which the radio IRQ is invoked went on for too long?
Thanks for your help.
The mesh team has started to look at it, but they have not made any interesting observations yet nor found a solution. I will let you know when they make progress or if they have some time estimates.
Okay, thanks for the feedback!
Just a small update: I'm not 100% certain, but it looks like the issue occurs more often on PCA20020 (Thingy:52) than on PCA10040 (nRF52 DK). I did only notice this when testing our application firmware; did not test with the mesh example I provided.
(Could the core issue possibly be related to the accuracy of the 32MHz crystal in use? PCA10040 has a 10ppm crystal, PCA20020 a 40ppm crystal and as I mentioned earlier, our custom module uses a 30ppm crystal. With both "lower accuracy" devices showing said behaviour more frequently.)
Hope this helps in narrowing down the issue.
Thank you for that input, that could be relevant. I have forwarded it to the mesh team.
It's been 10 days since I heard from you and I thought I should ask if there are any new findings or estimates as to when there could be!
The issue is somewhat urgent as it concerns a product which is running in pre-production testing and the frequent resets - all due to the described softdevice asserts - seriously impact the usability of the product.
Thank you for your help & best regards,
I am sorry this has taken some time. I have talked to the mesh team and they are looking into this. I will let you know when they make some progress or have some interesting observations to share.
I tried to successively increase TIMESLOT_END_SAFETY_MARGIN_US until either not observing any Softdevice asserts or reaching 1000us. Unfortunately, I did always observe the described softdevice assert at some point. It looks like increasing it reduces the probability of the asserts' occurrence, though, which is at least something.
Still hoping for a definitive solution!
Thank you & best regards,
This is a tricky one it seems. I cannot promise anything at this point but I have forwarded this to the team. I will let you know if they make any progress.
While trying to reproduce another timeslot-related issue (Softdevice event handler continuously called with NRF_EVT_RADIO_BLOCKED), I found another angle by which I seem to be able to reproduce the same basic issue.
Hung Bui, the support engineer in the linked devzone entry, convinced me to try to reproduce the other issue based on his nRF52 BLE - ESB Timeslot example. Doing this, I managed to reproduce this issue here, where the softdevice event handler unexpectedly gets called with NRF_EVT_RADIO_SESSION_IDLE on an nRF52DK (PCA10040).
Steps to reproduce with his example:
This will result in NRF_EVT_RADIO_SESSION_IDLE being invoked within a few milliseconds after establishing the BLE connection.
The patch contains the following changes:
So, the issue occurs immediately if the length_us of the requested timeslot is reduced from 5000 to 3800. I did not play around with the values any further!
I hope this may help you in getting to the cause of this issue! In the meantime, I'll see if increasing the minimum length requested may resolve it in my application!