This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Sporadic Notification Bursts

Hi everybody,

we have a remote control based on the nRF52832 and are experiencing a wierd issue with notifications.

We are using a custom 6 byte payload keymap characteristic in a custom service, that uses notifications to send keypresses to the host. Usually this works fine. However when the remote is left idle for a couple of minutes, sporadically the next few notifications are bursted several times. This not an issue of our code calling the sd_ble_gatts_hvx funtion too often (we tested guarding it with app timer checks). If this issue occurs then the whole timeslot is filled with identical notifications, with the more data flag being set. Sometimes the issue is resolved with the next notification (key release), sometimes it continues for the next few button presses too, sometimes the release code is lost.

You can see one of these bursts as captured by a PCA10056 sniffer in Wireshark. Note that I have never seen one of the bursts before a LL_PING_REQ, but a ping does not reliably trigger this behaviour (in fact there are several pings, followed by normal notifications just before in this log).

Our initial project is based on the SDK 13.0.0 and the Bluetooth Developer Studio example using Softdevice 4.0.5. For debugging the issue the project was ported to SDK 14.2.0 with SD 5.1.0. We are using GNU Tools ARM Embedded/5.4 2016q2 on Windows 10 x64, but have tried the default GNU Tools ARM Embedded/4.9 2015q3 as well. Neither the SDK/SD nor the compiler switch changed the behaviour. It occurs connected to our custom central (nRF52832) as well as connected to an Android 9 Smartphone running nRF Connect. It does not seem to be a signal strength issue, since it occurs even with central reported RSSIs of -58dBm.

We do use a few non-standard softdevice settings for longer MTUs and event lengths due to other data heavy services, but since the problem occurs even without them those should not interfere here.

I have attached a minimal project, that shows this behaviour on the PCA10040 board. The only services used are the battery levels and the remote control service. It needs a plain SDK 14.2.0 and SD 5.1.0. Place the projects dir in the SDK root and replace the SDK bsp.h with the one from the archive (adds custom events). 

Connect to the "Remote PCA10040" device, enable service 0xdb60's 0xdb61 characteristics notifications and press the buttons. On push each button transmits a 6 byte bitmap that resembles our remote control. This message is repeated every 100ms until the button is released, at which point an empty bitmap is transmitted.

demo_minimal.zip

Could you please help us fixing this wierd issue? We are really running out of ideas what might cause this..

Thank you!

Jann

Parents
  • Hello Jann,

    Is it possible to send the .pcapng sniffertrace?

    Is the project suppose to send continuous notifications as long as you hold a button? Because it does that, but not at the rate that it does in your screenshot.

    I tried to replicate the issue using SDK14.2.0, S132 v5.1.0 andGNU Tools ARM Embedded 5.4 2016q2, but I have not been able to reproduce what you report. Is there any tricks? Any particular pattern I need to do to reproduce it? What do you connect to? nRF Connect? On Mobile or Desktop?

    This not an issue of our code calling the sd_ble_gatts_hvx funtion too often

    We will see :)

    Best regards,

    Edvin

  • Hi Edvin,

    thank you for your reply!

    Sure, I attached a trace for you. It was made with the PCA10040 demo, connected to an Android Smartphone with nRF Connect (just filtered out empty PDUs). The screenshot is from the actual remote control, the demo is just an approximation to replicate the behavior. The Demo outputs bitmaps at 100ms interval until the button is released, then instantly a single empty bitmap, the remote is a few ms different in regards of timing. The huge amount of notifications directly following each other with "more data" set to true in the screenshot however are the problem I am talking about. More data is only supposed to be able to be true rarely on a release code when button release and the scheduled 100ms retransmission happen to overlap.

    In the pcapng file you can see that after setup the first notifications up to number 137 are ok, there are a few set bitmasks as long as a button is held, then an empty one. Then the board is just left alone for a few minutes until at no 142 a button is pressed again and a notification burst is sent. The empty release code notification is also lost here. Starting a no 200 notifications work properly again.

    Thank you & best regards,

    Jann

    pca10040remote-filt.pcapng

  • Hello Jann,

    I see, and thank you for the pcapng file. 

    Do you have a project that has logging enabled that also replicate this issue?

  • Hello Edvin,

    I've attached the demo project with enabled log for you.

    There is a log_info call on every button press, release and repeat, as well as in the called rcu_update_keys function that takes care of transmission. Interestingly if a burst occurs the log shows a single button down and release event, but a multitude of repeats and transmissions are logged. It seems as if the repeat app timer goes berserk. Judging from the time it takes until the issue occurs it might be coming from the RTC overflow. And since we used app timer routines for guarding off the hvx call that explains why it didn't work out either.

    Are there any additional precautions to take with repeated app timers? Should we rather use single shot timers and restart them inside the handler? Or is it recommended to use the app timer in scheduler mode?

    Thank you & kind regards,

    Jann

    demo-minimal-log.zip

Reply
  • Hello Edvin,

    I've attached the demo project with enabled log for you.

    There is a log_info call on every button press, release and repeat, as well as in the called rcu_update_keys function that takes care of transmission. Interestingly if a burst occurs the log shows a single button down and release event, but a multitude of repeats and transmissions are logged. It seems as if the repeat app timer goes berserk. Judging from the time it takes until the issue occurs it might be coming from the RTC overflow. And since we used app timer routines for guarding off the hvx call that explains why it didn't work out either.

    Are there any additional precautions to take with repeated app timers? Should we rather use single shot timers and restart them inside the handler? Or is it recommended to use the app timer in scheduler mode?

    Thank you & kind regards,

    Jann

    demo-minimal-log.zip

Children
  • Hmm nope, switching to a single shot timer doesn't help. As soon as I start the 100ms timer it just keeps firing away. Even force-stopping it before restarting (in single shot mode) doesn't help.. As if the timer op queue was all filled up with calls to the repeat function..

    €: I've now added an output of app_timer_cnt_get every time the repeat timer callback is executed and I see that once the issue occurs the time between the calls drop from the requested 100ms to ~2ms. Also it does not seem to be wraparound related, as it started at a count of about 0x00fc0000 and went on well past the overflow upto about 0x00050700. Next wraparound didnt show the issue at all.

    I tried decoupling the TX from the timer interrupt by just setting a flag and evaluating that in the main loop, but with 2ms between calls that of course didnt help. 

  • I've now switched the app timer to the scheduler based version and here it gets kind of interesting: Normally this works without problems, but when the issue occurs, the timeout function is called endlessly (with the non-scheduler approach the timeout callback is called a few times and then stops on its own, presumably when the queue is full). Eventually triggering a timer start or stop by pressing a button stops it, but not always. 

    Suffice to say I am pretty confused by now :)

    €: So it's not the queue being filled up. Turned scheduler off again, built the demo with the profiler on and printed the app timer queue utilization with every timeout and it never rises over 3. Also built the app timer with the keep rtc active flag, but no dice.

  • Hello Jann,

    Sorry for the late reply. I was out of office on Friday. 

    I am having some issues replicating the issue. I did manage to reproduce it a couple of times with the project that you sent me last week, but not with the new  one.

    However, I recall that there was a bug in the app_timer at some point in time. I remembered it when you said "the timeout function is called endlessly". I will look for more details regarding this bug, but can you check one thing for me:

    I see that you have (by default) only one app_timer running while you are not pressing any buttons, which is the battery timer, is that correct?

    The bug was present while there only was one active timer, which had a long timeout (longer than half of the app_timer maximum count value, that is 24 bits = 0xFFFFFF = 16 777 215.

    I believe the app_timer is capable of having timeouts longer thant he maximum count value (it wraps around a couple of times if it needs to), but the bug is related to the fact that there is only one active timer with a timeout of over half of this.

    So, to work around this, you have two options. Either, you can increase the prescaler for the app_timer, so that the BATTERY_LEVEL_MEAS_INTERVAL will be less than 0xFFFFFF/2, or you can create a dummy app_timer instance. A timer that times out every minute or so, that doesn't do anything. It just has an empty timeout handler.

    Can you test this for me, and see if that solves the issue?

    Best regards,

    Edvin

  • Hi Edvin,

    not a problem, I've been on vacation the last week and just got back into the office today myself - hence my late reply.

    Yes, there is only the battery measurement timer running permanently, all other timers are stopped in idle mode.

    Thank you for the hint with the app_timer bug. I've now added a continuous idle timer at 0x7fff00 ticks and so far the issue has not occurred again in the test setup. If this does the trick here I'll add the timer to our main project and see if that fixes the issue for good. I'll let you know how it went!

    Thank you & best wishes,

    Jann

  • Hello Jann,

    That is good news. 

    Yes. I believe this bug has been present in some SDK versions, but a bit difficult to remember exactly which. I believe it came, was fixed, and came back again at some point before being fixed again ...

    Let me know if it doesn't solve the issue. 

    Best regards, 

    Edvin

Related