Softdevice Assert at PC=0x15810 (S132 7.2.0) / RTC clock drift when using timeslot

I'm trying to narrow down the cause of a Softdevice assertion happening in S132 7.2.0 at PC=0x15810.

We set up a proprietary RF project which utilises parts of the SDK for Mesh (specifically, the timeslot implementation and bearer_handler) because it provides a safe base to run high performance timeslot applications on. Unfortunately I do have one device which runs into a softdevice assertion at instruction 0x15810. I feel that it is a timing issue - maybe the device is operating at the outer limits of the clock accuracy, because while the issue appears sporadically on Development Kits or other devices, this specific device does trigger it quite often.

What exact assertion fails when at PC=0x15810?
Does the Softdevice shut down TIMER0 before doing this test or after an assertion fails?
Are timing assertions made by the softdevice based on RTC0?
Is there a reason why TIMER0 in the mesh stack is running in 24-bit mode as opposed to 32-bit mode?

Any help is greatly appreciated.

EDIT: In the meantime I think I found the cause of the issue. Assuming that the timing assertions by the Softdevice are done using RTC0, there seems to be a rather large discrepancy between the RTC0 timing and TIMER0 timing. After 9'999'249us have passed on TIMER0, RTC0 has counted 10'000'732us, so they're almost 1.5ms apart!
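To put the two readings in relative terms, the gap can be expressed in parts per million. The following is only a minimal sketch: `drift_ppm` is a hypothetical helper, not part of the SoftDevice or the nRF SDK for Mesh, and it assumes the TIMER0 reading is taken as the reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: relative drift of the RTC reading against the
 * TIMER reading, in parts per million. A positive result means the RTC
 * counted more time than the TIMER. Integer division rounds toward zero. */
static int32_t drift_ppm(uint32_t timer_us, uint32_t rtc_us)
{
    assert(timer_us > 0); /* the reference interval must be non-zero */

    /* 64-bit arithmetic so that the scaling by 1'000'000 cannot overflow. */
    int64_t diff_us = (int64_t)rtc_us - (int64_t)timer_us;
    return (int32_t)((diff_us * 1000000) / (int64_t)timer_us);
}
```

With the readings above, `drift_ppm(9999249, 10000732)` corresponds to a difference of 1'483us over roughly 10 seconds, i.e. about 148 ppm.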

The device in question is running the LFCLK from the RC oscillator and we do usually have BLE deactivated. I did assume that the softdevice takes care of adjusting for clock drift, but could it be that I have to somehow take care of this manually?

EDIT2: Note that - as we're using the nRF SDK for Mesh as a codebase - when calculating the available time on the timeslot, we should already account for clock drift per the following calculation:

(p_timeslot->length_us * (m_lfclk_ppm + HFCLOCK_PPM_WORST_CASE)) / 1000000;
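As a rough illustration of how such a margin could be applied, here is a sketch under stated assumptions. The function names and the parameter values in the usage note are hypothetical and are not taken from timeslot.c. It also assumes that the margin is meant to be subtracted from the requested length, which the expression above does not show by itself.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the expression above: the worst-case
 * drift over the timeslot, in microseconds. The multiplication uses a
 * 64-bit intermediate, which is a choice made for this sketch. */
static uint32_t drift_margin_us(uint32_t length_us,
                                uint32_t lfclk_ppm,
                                uint32_t hfclk_ppm)
{
    uint64_t scaled = (uint64_t)length_us * ((uint64_t)lfclk_ppm + hfclk_ppm);
    return (uint32_t)(scaled / 1000000u);
}

/* Assumption: the margin shortens the usable part of the timeslot.
 * As long as the combined ppm value stays below 1'000'000, the margin
 * cannot exceed length_us, so the subtraction does not wrap. */
static uint32_t usable_length_us(uint32_t length_us,
                                 uint32_t lfclk_ppm,
                                 uint32_t hfclk_ppm)
{
    return length_us - drift_margin_us(length_us, lfclk_ppm, hfclk_ppm);
}
```

For a 10 s timeslot with assumed values of 500 ppm for the LFCLK and 40 ppm for the HFCLK, the margin comes out as 5'400us and the usable length as 9'994'600us. The 64-bit intermediate matters for such inputs, since the product 10'000'000 * 540 does not fit into 32 bits.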

EDIT3: I previously wrote that we "do usually have BLE deactivated". What I actually mean by this is that most of the time the device is neither connected nor is it currently advertising. So there is no BLE activity to schedule by the softdevice. Timeslots are always active, though.

0 Einar Thorsrud over 4 years ago

Hi,

The assert at 0x15810 is because the SoftDevice got an unexpected radio interrupt, which is typically because the application used the radio outside of the timeslot. Regarding LFCLK calibration, that is handled automatically by the SoftDevice. And in any case RTC drift should not be relevant here, as there is just a single low frequency clock source in the nRF, so even if there was a significant drift, the app and SoftDevice would be drifting "together".

Looking at your timeslot length calculation, it seems like you add the worst case drift value to the duration. Is that intentional? Should it not be subtracted?

0 m.wagner over 4 years ago in reply to Einar Thorsrud

Hi Einar,

Besides the issue with the discrepancies between RTC0 time and TIMER4 time, I noticed that TIMER0 is shut down while control has not yet returned to the softdevice.

Just to detail the setup: The issue seems to always occur at the same point in software. While a timeslot is granted, I have a timer running at highest IRQ priority, triggering a radio interrupt via software (NVIC_SetPendingIRQ(RADIO_IRQn)). This is used for frequency hopping and implemented like this for easier handling of the radio control state machine in our proprietary receiver.

The softdevice seems to assert after this interrupt has fired. So the cause of the assert that you have given would make sense. BUT at the point the assertion occurs, the timeslot has not ended! It was still ongoing. I added a few checks into the above IRQ handler and noticed that at the point where the assert would occur, the timeslot *should* have already ended, but for some reason TIMER0, which should be in charge of ending and cleaning up the timeslot, was shut down. I.e. the CC and INTENSET registers are cleared and when I try to manually capture the current timer value using the debugger (setting TASKS_CAPTURE[n] to 1), nothing happens. I have no code or PPI whatsoever that would shut down TIMER0 in my software and it is only used for controlling timeslot duration - and sometimes capturing its value. So it has to be shut down by the softdevice at some point.

Can you think of when and why TIMER0 could get shut down by the softdevice?

Also, asking again; does the assert 0x15810 also fail if the timeslot during which the Radio IRQ is invoked went on for too long?

Thanks for your help.

0 Einar Thorsrud over 4 years ago in reply to m.wagner

Hi,

There must be an explanation why the TIMER0 interrupt does not occur, but I cannot say what it is. You wrote that you have a high priority timer for frequency hopping (TIMER4?). Could this ISR be somehow running constantly, blocking processing the interrupt from TIMER0?

0 m.wagner over 4 years ago in reply to Einar Thorsrud

The TIMER4 interrupt is invoked periodically, (roughly every 3000us, timer started in RADIO IRQ handler), but all the handler does is clear the timer's compare event register, check for a radio address event and on the condition that no radio address event is set, trigger a radio interrupt via NVIC_SetPendingIRQ(RADIO_IRQn).
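The handler logic described above can be sketched roughly as follows. This is not the actual application code: the register structs and the `radio_irq_pending` flag are hypothetical stand-ins so that the snippet compiles without the nRF headers. On real hardware the handler would operate on NRF_TIMER4 and NRF_RADIO and call NVIC_SetPendingIRQ(RADIO_IRQn) instead of setting the flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the TIMER4 and RADIO event registers. */
typedef struct { volatile uint32_t EVENTS_COMPARE[6]; } mock_timer_t;
typedef struct { volatile uint32_t EVENTS_ADDRESS; } mock_radio_t;

static mock_timer_t mock_timer4;
static mock_radio_t mock_radio;

/* Stands in for the pending state that NVIC_SetPendingIRQ(RADIO_IRQn)
 * would set on the real device. */
static bool radio_irq_pending;

/* Sketch of the frequency hopping timer handler. */
static void hop_timer_irq_handler(void)
{
    /* Clear the compare event so the interrupt does not retrigger. */
    mock_timer4.EVENTS_COMPARE[0] = 0;

    /* Only force a radio interrupt if no address event has been seen,
     * i.e. no packet reception appears to be in progress. */
    if (mock_radio.EVENTS_ADDRESS == 0)
    {
        radio_irq_pending = true;
    }
}
```

Note that nothing in this handler checks whether a timeslot is still active; it relies on the timer being stopped when the timeslot is cleaned up.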

(I did insert some more code to debug this issue, namely storing timer and rtc captures in a variable)

I do assume the handler runs fast enough that its run-time should be well within the 100us margin that is accounted for in TIMER0. In the above example, 1019us passed between when TIMER0 IRQ should have been active and TIMER4 IRQ was active. At the time TIMER4 was served, TIMER0 was probably shut down.

I'll have to check with the logic analyzer to get more detailed outputs. As stated before, the issue seems to occur more often with some devices than others...

0 Einar Thorsrud over 4 years ago in reply to m.wagner

Hi,

m.wagner said:
As stated before, the issue seems to occur more often with some devices than others...

It is interesting that there is variation from device to device... This may be a stupid question, as you would not get the radio working without it, but could it be that you for some reason are not using the HFXO in the timeslot, but instead use the HFINT? That has a very large frequency tolerance (<±6 %). It is probably a long shot, but it could be worth double-checking.

0 m.wagner over 4 years ago in reply to Einar Thorsrud
Hi Einar,

No, it's definitely running on the HFXO.

In the meantime, I finally managed to look into it with a logic analyzer and could make an additional important finding: The issue occurs after the SoftDevice Event IRQ Handler has been invoked with NRF_EVT_RADIO_SESSION_IDLE.

This is a trace of the event. You see that around the time the TIMER0 interrupt should fire via the signal handler, instead the SoftDevice Event IRQ is invoked with NRF_EVT_RADIO_SESSION_IDLE.

Since this is unexpected and regular timeslot cleanup is omitted, my TIMER4 is still running on priority 0 and firing shortly after, triggering a RADIO IRQ via software.

The zoom in on the final part of the trace shows that the softdevice asserts immediately after said timer interrupt (confirming my previous assumptions):

When the interrupt causing the assertion is not running, I observe that... nothing happens. Nothing pertaining to timeslot activity is going on (see following trace):

Now I do not expect this event while a timeslot is ongoing - or better yet: the mesh team does not expect this event while a timeslot is ongoing. As I mentioned, the timeslot session is handled by a barely modified timeslot.c from nRF SDK for Mesh. The event handling:

    case NRF_EVT_RADIO_SESSION_IDLE:
        /* As the SD events aren't handled atomically, we could potentially run into this scenario:
         * - stop the timeslot -> SESSION_IDLE event pending, session_state is OPEN
         * - start the timeslot again from the same context -> session_state is RUNNING
         * - The SESSION_IDLE event is handled.
         * Closing the radio session now leads to unexpected behavior, as the user restarted
         * the timeslot already, expecting it to remain active.
         */
        if (m_current_timeslot.session_state == TS_SESSION_STATE_OPEN)
        {
            ASSERT_NRF_ERROR_CODE(sd_radio_session_close(), TS_CORE_ERR_SESSION_CLOSE);
        }
        break;

The variable m_current_timeslot.session_state is set to TS_SESSION_STATE_RUNNING at this point, though - so nothing happens.
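The effect of that branch can be modelled in isolation. The following is a simplified sketch, not the mesh implementation: `on_session_idle`, `idle_action_t` and the `TS_SESSION_STATE_CLOSED` member are assumptions made for illustration, and the real handler calls sd_radio_session_close() rather than returning a value.

```c
/* Simplified model of the session states. OPEN and RUNNING appear in
 * the handler above; CLOSED is an assumption for this sketch. */
typedef enum
{
    TS_SESSION_STATE_CLOSED,
    TS_SESSION_STATE_OPEN,
    TS_SESSION_STATE_RUNNING
} ts_session_state_t;

/* Hypothetical result type describing what the handler would do. */
typedef enum
{
    IDLE_ACTION_CLOSE_SESSION, /* would call sd_radio_session_close() */
    IDLE_ACTION_IGNORED        /* event is dropped without any cleanup */
} idle_action_t;

/* Mirrors the condition in the SESSION_IDLE case: only an OPEN session
 * is closed, every other state falls through to the break. */
static idle_action_t on_session_idle(ts_session_state_t state)
{
    if (state == TS_SESSION_STATE_OPEN)
    {
        return IDLE_ACTION_CLOSE_SESSION;
    }
    return IDLE_ACTION_IGNORED;
}
```

In this model, `on_session_idle(TS_SESSION_STATE_RUNNING)` yields `IDLE_ACTION_IGNORED`, which matches the observation that the event passes without any timeslot cleanup taking place.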

A search of the DevZone also led me to a user that seemingly had a similar issue with the actual nRF SDK for Mesh running:

SOFTDEVICE: ASSERTION FAILED while running mesh.

Unfortunately this was not followed up.

To me it looks like some sort of race condition occurring in the softdevice. I'm not entirely sure what to make of the NRF_EVT_RADIO_SESSION_IDLE event in any case.

0 m.wagner over 4 years ago in reply to Einar Thorsrud
Hi Einar,

In the meantime I was able to reproduce this issue with nRF SDK for Mesh 4.2.0 and a slightly modified beaconing example.

I attached a patch file which will apply the modifications I made.

mesh_sd_assert_proof.patch

Summary of my changes:

Adapted to my target hardware (disabled LEDs, buttons, use LF_SRC = RC)

Disabled beacon advertising and provisionee

Enabled and slightly modified the debug pins from core/debug_pins.h

The debug pins used in this configuration are:

P0.00 - (LF XTAL pin) - Signals whether device is in timeslot

P0.01 - (LF XTAL pin) - High while in timeslot signal handler

P0.06 - High while in BEARER_ACTION_TIMER_IRQHandler()

P0.08 - High while in app_error_fault_handler()

P0.09 - (NFCT pin) - High while TIMER0 IRQ is being served

P0.10 - (NFCT pin) - High while in timeslot.c's softdevice event handler

P0.30 - Toggles on TIMER0->EVENTS_COMPARE[0]

P0.31 - Set to high by RADIO->EVENTS_READY, to low by RADIO->EVENTS_DISABLED.

As you can see in the *.patch file attached, I ran the beaconing_nrf52832_xxAA_s132_7_0_1.emProject example in Segger Embedded Studio.

The device I tested with is a custom-made/proprietary RF module with an nRF52832-QFAA, no LF crystal and a Murata XRCMD32M000FXP53R0 30ppm HF-XTAL.

Below you can see the corresponding trace I made with the logic analyser. As in the previous traces, around the point where the TIMER0->EVENTS_COMPARE[0] should fire, the softdevice signal handler is invoked with event NRF_EVT_RADIO_SESSION_IDLE. The next time the radio interrupt is triggered or something alike (the mesh's scanner module is still running at this point), the example crashes with Softdevice assert: 88044.

Concerning the frequency of this occurrence: On this particular device I tested with, it usually seems to occur within the first 10 minutes of run time or so. On nRF52 DK devices (also configured to use LF_SRC = RC!), it seems to happen less frequently. Maybe once every couple of hours or so... So far, I have not been able to properly trace this on an nRF52 DK.

It would be great if you could investigate this issue in more detail, as the smooth functioning of this timeslot code is essential to our application.

0 Einar Thorsrud over 4 years ago in reply to m.wagner

Hi,

Thank you for all the details. I have not made any progress on this today, but I will ask the mesh team to look at it and let you know what they find.

0 m.wagner over 4 years ago in reply to Einar Thorsrud

Hi Einar,

Do you have any news concerning this issue?

Regards

-mike

0 Einar Thorsrud over 4 years ago in reply to m.wagner

Hi Mike,

The mesh team has started to look at it, but they have not made any interesting observations yet nor found a solution. I will let you know when they make progress or if they have some time estimates.

Einar

0 m.wagner over 4 years ago in reply to Einar Thorsrud

Okay, thanks for the feedback!

0 m.wagner over 4 years ago in reply to Einar Thorsrud

Hi Einar,

Just a small update: I'm not 100% certain, but it looks like the issue occurs more often on PCA20020 (Thingy:52) than on PCA10040 (nRF52 DK). I did only notice this when testing our application firmware; did not test with the mesh example I provided.

(Could the core issue possibly be related to the accuracy of the 32MHz crystal in use? PCA10040 has a 10ppm crystal, PCA20020 a 40ppm crystal and as I mentioned earlier, our custom module uses a 30ppm crystal. With both "lower accuracy" devices showing said behaviour more frequently.)

Hope this helps in narrowing down the issue.

Regards,

-mike