This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

disconnecting while operations are in progress never gives BLE_GAP_EVT_DISCONNECTED event

2020-01-24-092119EST-ProductStoppedGettingEventsFromNordicDK.txtImprivataTestNordicEventsNotReceived.zipCalls_to_pc_ble_driver.cpp0285.2020-02-24-TestProgramUploadedToNordicSupport.zipFeb25TestProgramUploadedToNordicSupport.zipImprivata_bgTestApp.zipbgSDKTestAppMay4.zip2020-05-05-035347-NordicDK_USB840M_200505_ClockInternal_2in1.hex.txt.txtbgSDKTestAppMay6.zipI’m developing an application based on pc-ble-driver to talk to an nRF52840-based dongle (from Fanstel).

I’m having trouble disconnecting cleanly when a connection has operations in progress.  For example, I call ‘sd_ble_gattc_write’, which returns NRF_SUCCESS, but I don’t receive event BLE_GATTC_EVT_WRITE_RSP (after waiting 60 seconds), so I decide to disconnect. When this happens, sd_ble_gap_disconnect returns NRF_SUCCESS, but I do not receive BLE_GAP_EVT_DISCONNECTED even after waiting 30 seconds. The connection supervision timeout is 4 seconds.  What could cause the disconnect to not generate any BLE_GAP_EVT_DISCONNECTED event?

What I’m trying to accomplish here: if a connection is not responsive, I want to end that connection, without disturbing other connections I have open.

Thanks,

Paul Bradford

Parents
  • I have attached a new test program Imprivata_bgTestApp.zip. There is a file called README with instructions, and a description of the problem.   I suggest you try this on a Dell 64-bit Windows 10 laptop that has not had Nordic dev tools installed on it. Usually the problem occurs with in an hour or two.

    This test program reproduces the problem on two 64-bit Windows 10 laptops and does not reproduce this on one 64-bit Windows 10 laptop. I don't know what difference between these laptops causes two to fail and one to succeed. All have the same driver usbser.sys 10.0.18362.1 winbuild 160101.0800. Problem has occurred when plugged into USB 2.0 and USB 3.0 ports.

    Failing laptops: both Dell, one Windows 10 version 1903 build 18362.657, another Windows 10 version 1909 build 18363.752 (the latest public Windows 10 build)

    Laptop that does not reproduce the problem: Lenovo ThinkPad, Windows 10 version 1903 build 18362.657 (exactly the same as one failing laptop). This one has had Nordic tools like nRF Connect installed. One consequence is that when I plugged in an nRF52840 it would show in Device Manager as 'nRF Connect USB CDC ACM (COMx)'. To make it like my other laptops that do not have Nordic tools, I uninstalled the driver so that the nRF52840 would show up as 'USB Serial Device (COMx)'

  • I have had the new application running on my PC with 5 parallel devices for 8 hours yesterday and 4 hours today. So far, I have not been able to reproduce the issue. I do have another PC that does not have Nordic tools installed, but this is not a Dell PC. I will try to setup the test on that one next. It's a bit harder to find equipment for testing this with the limited factors where you have managed to reproduce the issue, due most of us working from home-office due to the Corona situation. Is it still a requirement to have heavy Bluetooth activity around the device in order to reproduce the issue? This could also be a problem to facilitate during home-office.

  • What is the next step now?

    We can't make any progress on using the nRF52840 until this is fixed, or you find some way for us to avoid the problem.

    If you are trying to reproduce it with the connectivity firmware logging enabled, I already posted that on January 24, and it didn't show any useful information.

  • Hi Paul,

    The point of reproducing it with logging enabled is not just to get similar logs to what you posted on January 24, but to add more logging to the application in order to find where the issue occurs.

    I can now consistently reproduce this issue on the same computer, with two different phones, with UART logging enabled. This means that I should be able to add more logging next week, and hopefully get some more details on why this is happening. It's a bit strange that I'm not able to reproduce with RTT logging enabled, but UART will have to do for now. Unfortunately, our SDK developers are still not able to reproduce it on their end, so, for now, we only have my setup to debug with.

    Have a nice weekend.

    Best regards,
    Jørgen

  • Hi,

    I ran the application many times over the last few days and from studying the logs it seems that this issue occurs right after the application/connectivity FW starts receiving notifications from the connected phone. It could be a coincidence, but I see the same pattern in all the log-files I have checked. Is there any way you could disable the notifications in the phone application, in order to verify if that is what's causing the issue?

    Another customer have reported similar issues with pc-ble-driver, but he was only able to reproduce this with sd_api_v3, not with 6 like you are using. However, it has some other similarities with this issue (I was not able to reproduce his issue when running J-Link RTT Viewer, same as here).

    Best regards,
    Jørgen

  • I agree it seems that the problem is related to receiving notifications. In my reply on January 22 I wrote "The failure to receive events always occurs after I have received notifications for a characteristic."

    I've done an experiment that is much simpler than altering the phone app. The notifications from the phone app are triggered by the Windows app writing to a particular characteristic on the phone. I modified the Windows app so that it (optionally) does not write to that characteristic. With that option enabled, the Windows app runs for many hours without failing (about 18 hours so far). By contrast when I start it without that option (and thus with notifications received), it fails in less than an hour each time.

  • Hi Paul,

    Today we managed to track down what may be a potential race condition in the connectivity firmware. When the scheduler queue gets close to full (above 75%), the processing of events from the softdevice is suspended in order to leave room for serial events from USB, etc. This happens in ser_conn_handlers.c:191. When the queue is processed and free space is below 50%, the processing of softdevice events will continue (main.c:376). What happens when this issue occurs seems to be that the suspend function is called twice in a row, with a call to resume right after the second suspend:

    [00:03:16.456,695] <info> app: nrf_sdh_suspend
    [00:03:16.692,565] <info> app: nrf_sdh_suspend
    [00:03:16.692,596] <info> app: nrf_sdh_resume

    The solution we now are testing is to incorporate the resume section into a critical region:

    CRITICAL_REGION_ENTER();
    if (nrf_sdh_is_suspended())
    {
        // Resume pulling new events if queue utilization drops below 50%.
        if (app_sched_queue_space_get() > (SER_CONN_SCHED_QUEUE_SIZE >> 1))
        {
            nrf_sdh_resume();
        }
    }
    CRITICAL_REGION_EXIT();

    The critical region will prevent the application from processing other application events during this region.

    I have had this workaround running with a clean connectivity FW for 3 hours without hitting the issue now (typically issue happens within 5-15 minutes on my setup).

    Best regards,
    Jørgen

Reply
  • Hi Paul,

    Today we managed to track down what may be a potential race condition in the connectivity firmware. When the scheduler queue gets close to full (above 75%), the processing of events from the softdevice is suspended in order to leave room for serial events from USB, etc. This happens in ser_conn_handlers.c:191. When the queue is processed and free space is below 50%, the processing of softdevice events will continue (main.c:376). What happens when this issue occurs seems to be that the suspend function is called twice in a row, with a call to resume right after the second suspend:

    [00:03:16.456,695] <info> app: nrf_sdh_suspend
    [00:03:16.692,565] <info> app: nrf_sdh_suspend
    [00:03:16.692,596] <info> app: nrf_sdh_resume

    The solution we now are testing is to incorporate the resume section into a critical region:

    CRITICAL_REGION_ENTER();
    if (nrf_sdh_is_suspended())
    {
        // Resume pulling new events if queue utilization drops below 50%.
        if (app_sched_queue_space_get() > (SER_CONN_SCHED_QUEUE_SIZE >> 1))
        {
            nrf_sdh_resume();
        }
    }
    CRITICAL_REGION_EXIT();

    The critical region will prevent the application from processing other application events during this region.

    I have had this workaround running with a clean connectivity FW for 3 hours without hitting the issue now (typically issue happens within 5-15 minutes on my setup).

    Best regards,
    Jørgen

Children
Related