This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

disconnecting while operations are in progress never gives BLE_GAP_EVT_DISCONNECTED event

2020-01-24-092119EST-ProductStoppedGettingEventsFromNordicDK.txtImprivataTestNordicEventsNotReceived.zipCalls_to_pc_ble_driver.cpp0285.2020-02-24-TestProgramUploadedToNordicSupport.zipFeb25TestProgramUploadedToNordicSupport.zipImprivata_bgTestApp.zipbgSDKTestAppMay4.zip2020-05-05-035347-NordicDK_USB840M_200505_ClockInternal_2in1.hex.txt.txtbgSDKTestAppMay6.zipI’m developing an application based on pc-ble-driver to talk to an nRF52840-based dongle (from Fanstel).

I’m having trouble disconnecting cleanly when a connection has operations in progress.  For example, I call ‘sd_ble_gattc_write’, which returns NRF_SUCCESS, but I don’t receive event BLE_GATTC_EVT_WRITE_RSP (after waiting 60 seconds), so I decide to disconnect. When this happens, sd_ble_gap_disconnect returns NRF_SUCCESS, but I do not receive BLE_GAP_EVT_DISCONNECTED even after waiting 30 seconds. The connection supervision timeout is 4 seconds.  What could cause the disconnect to not generate any BLE_GAP_EVT_DISCONNECTED event?

What I’m trying to accomplish here: if a connection is not responsive, I want to end that connection, without disturbing other connections I have open.

Thanks,

Paul Bradford

Parents
  • I have attached a new test program Imprivata_bgTestApp.zip. There is a file called README with instructions, and a description of the problem.   I suggest you try this on a Dell 64-bit Windows 10 laptop that has not had Nordic dev tools installed on it. Usually the problem occurs with in an hour or two.

    This test program reproduces the problem on two 64-bit Windows 10 laptops and does not reproduce this on one 64-bit Windows 10 laptop. I don't know what difference between these laptops causes two to fail and one to succeed. All have the same driver usbser.sys 10.0.18362.1 winbuild 160101.0800. Problem has occurred when plugged into USB 2.0 and USB 3.0 ports.

    Failing laptops: both Dell, one Windows 10 version 1903 build 18362.657, another Windows 10 version 1909 build 18363.752 (the latest public Windows 10 build)

    Laptop that does not reproduce the problem: Lenovo ThinkPad, Windows 10 version 1903 build 18362.657 (exactly the same as one failing laptop). This one has had Nordic tools like nRF Connect installed. One consequence is that when I plugged in an nRF52840 it would show in Device Manager as 'nRF Connect USB CDC ACM (COMx)'. To make it like my other laptops that do not have Nordic tools, I uninstalled the driver so that the nRF52840 would show up as 'USB Serial Device (COMx)'

  • I have had the new application running on my PC with 5 parallel devices for 8 hours yesterday and 4 hours today. So far, I have not been able to reproduce the issue. I do have another PC that does not have Nordic tools installed, but this is not a Dell PC. I will try to setup the test on that one next. It's a bit harder to find equipment for testing this with the limited factors where you have managed to reproduce the issue, due most of us working from home-office due to the Corona situation. Is it still a requirement to have heavy Bluetooth activity around the device in order to reproduce the issue? This could also be a problem to facilitate during home-office.

  • Hi Paul,

    The point of reproducing it with logging enabled is not just to get similar logs to what you posted on January 24, but to add more logging to the application in order to find where the issue occurs.

    I can now consistently reproduce this issue on the same computer, with two different phones, with UART logging enabled. This means that I should be able to add more logging next week, and hopefully get some more details on why this is happening. It's a bit strange that I'm not able to reproduce with RTT logging enabled, but UART will have to do for now. Unfortunately, our SDK developers are still not able to reproduce it on their end, so, for now, we only have my setup to debug with.

    Have a nice weekend.

    Best regards,
    Jørgen

  • Hi,

    I ran the application many times over the last few days and from studying the logs it seems that this issue occurs right after the application/connectivity FW starts receiving notifications from the connected phone. It could be a coincidence, but I see the same pattern in all the log-files I have checked. Is there any way you could disable the notifications in the phone application, in order to verify if that is what's causing the issue?

    Another customer have reported similar issues with pc-ble-driver, but he was only able to reproduce this with sd_api_v3, not with 6 like you are using. However, it has some other similarities with this issue (I was not able to reproduce his issue when running J-Link RTT Viewer, same as here).

    Best regards,
    Jørgen

  • I agree it seems that the problem is related to receiving notifications. In my reply on January 22 I wrote "The failure to receive events always occurs after I have received notifications for a characteristic."

    I've done an experiment that is much simpler than altering the phone app. The notifications from the phone app are triggered by the Windows app writing to a particular characteristic on the phone. I modified the Windows app so that it (optionally) does not write to that characteristic. With that option enabled, the Windows app runs for many hours without failing (about 18 hours so far). By contrast when I start it without that option (and thus with notifications received), it fails in less than an hour each time.

  • Hi Paul,

    Today we managed to track down what may be a potential race condition in the connectivity firmware. When the scheduler queue gets close to full (above 75%), the processing of events from the softdevice is suspended in order to leave room for serial events from USB, etc. This happens in ser_conn_handlers.c:191. When the queue is processed and free space is below 50%, the processing of softdevice events will continue (main.c:376). What happens when this issue occurs seems to be that the suspend function is called twice in a row, with a call to resume right after the second suspend:

    [00:03:16.456,695] <info> app: nrf_sdh_suspend
    [00:03:16.692,565] <info> app: nrf_sdh_suspend
    [00:03:16.692,596] <info> app: nrf_sdh_resume

    The solution we now are testing is to incorporate the resume section into a critical region:

    CRITICAL_REGION_ENTER();
    if (nrf_sdh_is_suspended())
    {
        // Resume pulling new events if queue utilization drops below 50%.
        if (app_sched_queue_space_get() > (SER_CONN_SCHED_QUEUE_SIZE >> 1))
        {
            nrf_sdh_resume();
        }
    }
    CRITICAL_REGION_EXIT();

    The critical region will prevent the application from processing other application events during this region.

    I have had this workaround running with a clean connectivity FW for 3 hours without hitting the issue now (typically issue happens within 5-15 minutes on my setup).

    Best regards,
    Jørgen

Reply Children
  • After more testing on several different computers, I'm still seeing the original problem. On one computer I don't see it but on two others I do. I also see some errors from pc-ble-driver when doing write operations to the USB serial port, including:

    Error:  serial port write operation on port COM15 failed. Error: A device which does not exist was specified.[433]

    Error:  serial port write operation on port COM15 failed. Error: The device does not recognize the command.[22]

    Error:  serial port write operation on port COM15 failed. Error: A device attached to the system is not functioning.[31]

    I'll continue testing.

  • The serial port errors may indicate that the USB was detached from the computer, for instance by a device reset. I have seen this behavior with the approach used in the other customer case I linked you before. I traced this to an issue with the scheduler queue size, and increasing this seemed to solve the issue on my setup. I posted an updated hex-file in this post, which increases the scheduler queue size to 32 (from 16). If this is not enough, I also built a version with a queue size of 64: connectivity_4.1.1_usb_with_s140_6.1.1_critical_region_fix_increased_sched_queue_size64_log_enabled.hex

    This one also has UART logging enabled for errors only. I would appreciate if you could check the logs if you are still able to reproduce the issue with this new hex-file. 

    Did you always see the original problem together with this new problem? Did you manage to reproduce this on the Nordic Dongle/DK, or only on Fanstel Dongle? I received the firmware for Fanstel and have had it running without issues for 2 hours now.

  • It doesn't fail on every computer.   I will work on testing the new connectivity firmware on Fanstel and the Nordic dev-dongle and the Nordic DK.

  • Hi Jørgen,

    I'm trying to test the fix on two Nordic dev-dongles (PCA10059), programmed using nRF Connect Programmer. Ideally I'd like to make sure the dev-dongle is in a known good state so I'd like to use the "Erase & write" feature. When I Add File  connectivity_4.1.1_usb_with_s140_6.1.1_critical_region_fix_increased_sched_queue_size.hex, nRF Connect shows that it has Connectivity, SoftDevice and Master Boot record (it shows a region name "MBR or Application" with address range 0 to 0xB00 which is the Master Boot Record range). It does not show a bootloader. I'm confused about why it would have MBR, and why if it has MBR it doesn't have a bootloader.   

    I went ahead with "Erase & write" and am using those dev-dongles with my test program.

    So far, using these dev-dongles and new firmware, I have still seen the original 'no event' problem on two computers. 

    -Paul

  • Hi Paul,

    The "Erase & write" function should not be available if you program the nRF52840 Dongle (PCA10059) through the USB port of the dongle. Did you connect a J-Link debugger to program the dongle through SWD? If you did a "Erase all" operation through the SWD interface, the Bootloader that comes with the dongle will have been erased. "Erase all" operations will erase all programmable flash, including UICR registers.

    The MBR is part of the softdevice-hex. It will always be present whenever the softdevice is flashed to the chip. You can read more about that in the softdevice specifications.

    Paul Bradford said:
    So far, using these dev-dongles and new firmware, I have still seen the original 'no event' problem on two computers.

     Do you mean with the firmware with increased scheduler queue size or the one with only critical region fix?

    Best regards,
    Jørgen

Related